Data-driven methods are growing increasingly popular in practice. Most classical machine learning and statistical methods view the underlying process that generates the data as fixed: the study is primarily focused on the mapping from data distributions to classifiers. However, it is important to consider the effects in the other direction as well: how does the classifier chosen by a learner change the data distribution the learner sees? In particular, how do we close the loop around machine learning deployments in practice?
These closed-loop effects can arise in many real-world settings. One instance is strategic classification: whenever a data source has a stake in which label a classifier applies to it, it will seek cost-effective ways to manipulate its data to earn the desired label. For example, credit scoring classifiers are heavily guarded for fear of gaming (Hardt:2016we). Alternatively, deployments of the classifier can both skew future datasets and have causal influence over the real-world processes at play. For example, a classifier that predicts crime recidivism influences the opportunities available to individuals (Dressel:2018uf).
Formally, we consider this problem in the framework introduced in Perdomo:2020tz. Let $\ell(z, \theta)$ denote the loss when the learner’s decision is $\theta$ (e.g. $\theta$ can be the parameters of the chosen classifier) and the data has realized value $z$. Furthermore, let $\mathcal{D}(\theta)$ denote the data distribution when the learner’s decision is $\theta$. In this framework, the performative risk is given by:
$$PR(\theta) = \mathbb{E}_{z \sim \mathcal{D}(\theta)} \left[ \ell(z, \theta) \right] \tag{1}$$
Whereas classical machine learning results treat the distribution as fixed, the performative prediction framework models the decision-dependent distribution as a mapping $\theta \mapsto \mathcal{D}(\theta)$. However, in many real-world deployments, this decision-dependent distribution shift may not be explicitly included in the learner’s updates. This leads to algorithms based on inexact repeated minimization. Define the decoupled performative risk as:
$$DPR(\theta_1, \theta_2) = \mathbb{E}_{z \sim \mathcal{D}(\theta_2)} \left[ \ell(z, \theta_1) \right] \tag{2}$$
The decoupled performative risk separates the two ways that the decision variable affects the performative risk. Through the first argument, $\theta$ affects the classification error; through the second argument, $\theta$ causes a decision-dependent distribution shift. In this paper, we shall analyze the steady-state behavior of stochastic gradient descent algorithms:
$$\theta_{k+1} = \theta_k - \eta_k \left( \nabla_1 DPR(\theta_k, \theta_k) + w_k \right) \tag{3}$$
Here, $\eta_k$ is the step size and $w_k$ is some zero-mean noise process. Note that the gradient is evaluated only with respect to the first argument, i.e. the updates are based only on the effect of $\theta$ on the loss function, and ignore the distribution shift caused by $\theta$. In other words, the learner draws several observations from the distribution $\mathcal{D}(\theta_k)$ and, treating this distribution as fixed, updates the model parameters by stochastic gradient descent: the learner is descending the gradient of the cost function $\theta \mapsto DPR(\theta, \theta_k)$.
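The update described above can be illustrated with a minimal sketch on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: a scalar decision `theta`, the loss `theta**2 / 2 - z * theta`, a Gaussian location family with mean `mu + eps * theta`, and the helper names `sample_data`, `loss_grad_theta`, and `rrm_sgd`. At each step the learner samples from the distribution induced by its current decision, treats that sample as fixed, and takes a gradient step in the loss argument only.

```python
import random

def sample_data(theta, mu, eps, sigma, n, rng):
    """Draw n points from the assumed distribution N(mu + eps * theta, sigma^2)."""
    return [rng.gauss(mu + eps * theta, sigma) for _ in range(n)]

def loss_grad_theta(z, theta):
    """Gradient in theta of the assumed loss theta**2 / 2 - z * theta."""
    return theta - z

def rrm_sgd(theta0, mu=1.0, eps=0.3, sigma=0.5, step=0.05,
            iters=2000, batch=100, seed=0):
    """Stochastic gradient steps that ignore the dependence of the data on theta."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(iters):
        # Data are drawn under the current decision, then treated as fixed.
        batch_z = sample_data(theta, mu, eps, sigma, batch, rng)
        grad = sum(loss_grad_theta(z, theta) for z in batch_z) / batch
        theta -= step * grad
    return theta

mu, eps = 1.0, 0.3
theta_final = rrm_sgd(theta0=0.0, mu=mu, eps=eps)
# Zero of the expected decoupled gradient (1 - eps) * theta - mu.
theta_stable = mu / (1.0 - eps)
# Minimizer of the toy performative risk (1/2 - eps) * theta**2 - mu * theta.
theta_optimal = mu / (1.0 - 2.0 * eps)
print(theta_final, theta_stable, theta_optimal)
```

In this toy problem the iterates settle near `theta_stable` rather than `theta_optimal`, which is one way to see why the gap between the two kinds of points is worth bounding.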
In particular, we focus on settings where there may be multiple local equilibria, and characterize the regions of attraction of these equilibria. In such settings there may be multiple steady-state outcomes, and it is of interest to determine which outcome will be selected by the dynamics in Equation (3). As a motivating example, we consider a model of how a company’s demographics can affect the pool of applicants who apply for jobs at the company. In this model, the initial demographics of the company determine its steady-state demographics. Our results allow us to characterize which regions of the parameter space converge to which equilibria. We discuss this example in greater formal detail in Section 3.1.
Our main theoretical results can be informally summarized as follows. Theorem 1 states that trajectories of inexact repeated risk minimization will converge exponentially fast to a neighborhood of local performative risk minimizers, and stay in this neighborhood for all future time. It also provides a sufficient condition to under-approximate the regions of attraction for each local performative risk minimizer. In the special case of vanishing perturbations, these trajectories will converge to the minimizers themselves. As a corollary, this implies that performatively stable points will be near performatively optimal points, which can be seen as a continuous-time analog to results proved in Perdomo:2020tz. Theorem 2 states a geometric condition on the performative perturbation which ensures that trajectories of repeated risk minimization will converge to local performative risk minimizers, intuitively based on the idea that the perturbation does not push against convergence.
These results allow us to identify the regions of attraction for various steady-state outcomes. As observed in Miller:2021te, these various outcomes can be interpreted as different echo chambers: essentially, the decision variable can act as a sort of self-fulfilling prophecy. (Footnote 1: It is worth noting that we take a slightly different interpretation of an ‘echo chamber’ in this paper. In Miller:2021te, the echo chambers are defined as performatively stable points. In this paper, we consider the regions near each locally performatively optimal point as an echo chamber.) As we will discuss in Section 3.1, we are interested in settings where there may be many local performative risk minimizers that attract learning methods depending on initialization. In settings with multiple echo chambers, we consider the question of which echo chamber will come to dominate, based on the initialization of the learner.
The rest of the paper is organized as follows. In Section 2, we discuss the related literature. In Section 3, we introduce the problem statement and the mathematical concepts used for our results, and provide a motivating example based on job applicant pools in Section 3.1. In Section 4, we analyze the gradient flow associated with performative risk minimization, and in Section 5, we analyze the flows associated with repeated risk minimization. We demonstrate numerical results in Section 6, and provide closing remarks in Section 7.
2 Related work
There has been a great deal of interest in studying decision-dependent distributions. In the context of operations research, this has been studied under the names decision-dependent uncertainty and endogenous uncertainty. In Jonsbraten:1998wc, Jonsbraten:1998wk, and Goel:2004wb, the authors considered oil field optimization, with a framework that captures how information revelation can be affected by one’s decisions. In Peeta:2010tb, the authors consider infrastructure investment, and how investments can affect the future likelihood of disasters. For a taxonomy of the work in the operations research community, we refer the reader to Hellemo:2018tf.
Another form of decision-dependent distributions is strategic classification. In these works, the data source is seen as a utility-maximizing agent. The distribution shift resulting from the learner’s decision is modeled by a best response function. In Hardt:2016we and Bruckner:2011wy, the authors formulate the problem as a Stackelberg game where the data source responds to the announced classifier. In Dong:2018td, the authors consider when the data source’s preferences are hidden information and provide sufficient conditions for convexity of the overall strategic classification task. In Akyol:2016wu, the authors quantify the cost of strategic classification for the classifier. In Milli:2019tf and Hu:2019wu, the authors note that certain groups may be disproportionately affected as institutions incorporate methods to counter data sources gaming the classifier. In Miller:2020vy, the authors formulate strategic classification in a causal framework.
Most related to our work are recent efforts in performative prediction, introduced in Perdomo:2020tz. Rather than explicitly modeling the form of the distribution shift, this formulation analyzes the decision-dependent distribution shift in terms of general properties of the mapping $\theta \mapsto \mathcal{D}(\theta)$, where $\mathcal{D}(\theta)$ is the distribution of the data when the learner’s decision is $\theta$. In Perdomo:2020tz, the authors introduced the concepts related to performative prediction, demonstrated that neither the set of performatively stable points nor the set of performatively optimal points is a subset of the other, provided sufficient conditions for exact repeated risk minimization (defined as finding the exact minima with respect to the current distribution $\mathcal{D}(\theta_k)$ at each time step $k$) to converge, and provided conditions under which performatively stable points are near performatively optimal points. In Mendler-Dunner:2020vd, the authors analyze inexact repeated risk minimization (defined as an update step with respect to the current distribution $\mathcal{D}(\theta_k)$ at each time step) from a stochastic optimization framework. In this paper, we build on the inexact repeated risk minimization framework. Miller:2021te provided sufficient conditions for the performative risk itself to be convex. Brown:2020wg extended these results to settings where the distribution updates may have an internal state. In Drusvyatskiy:2020wk, the authors show that many inexact repeated risk minimization algorithms will also converge, due to the way in which the performative perturbation decays near the solution. This shares many ideas with our work here, but we focus on the case where there may be multiple attractive equilibria, and generalize to settings where the perturbation itself may not vanish. In contrast to previous works, which provide sufficient conditions to guarantee that an outcome is approached globally, we focus on understanding local regions of attraction for various outcomes.
This work draws on ideas from control theory; in particular, the analysis of gradient flows, Lyapunov functions, and perturbation analysis are the tools we use throughout. We refer the reader to Hirsch:2012tx and Khalil:2001wj as good references for this suite of tools.
3 Performative prediction, flows, and perturbations
In this section, we introduce the mathematical concepts used throughout this paper. As previously mentioned, the framework used throughout this paper builds on the framework of performative prediction, introduced in Perdomo:2020tz.
In Section 1, we have already defined the performative risk in Equation (1) and the decoupled performative risk in Equation (2). Furthermore, we say that $\theta^*$ is a local performative risk minimizer if $\theta^*$ is a local minimum of $PR$. We say $\theta^*$ is locally performatively stable if $\theta^*$ is a local minimum of $\theta \mapsto DPR(\theta, \theta^*)$. In general, neither implies the other (Perdomo:2020tz).
Additionally, we consider the performative risk minimizing (PRM) gradient flow, defined by the following differential equation:
$$\dot{\theta} = -\nabla PR(\theta) \tag{4}$$
This vector field can be represented by the gradient of a function, which makes the flow amenable to analysis. Under mild conditions, the trajectories of Equation (4) will converge to local minima of the performative risk.
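However, as noted in Section 1, many deployments of machine learning do not explicitly model the distribution shift, and, consequently, do not directly minimize the performative risk. We define the repeated risk minimizing (RRM) flow as solutions to the differential equation:
$$\dot{\theta} = -\nabla_1 DPR(\theta, \theta) \tag{5}$$
where $\nabla_1$ denotes the gradient with respect to the first argument of the decoupled performative risk.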
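We define the performative perturbation as the difference between the RRM and PRM vector fields:
$$g(\theta) = \nabla PR(\theta) - \nabla_1 DPR(\theta, \theta)$$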
In this paper, we view the PRM gradient flow as the nominal dynamics, and the RRM flow as the perturbed dynamics. The PRM gradient flow has nice properties arising from the fact it is a gradient flow, and, under certain conditions on the performative perturbation, we can prove properties about the RRM flow, which is the quantity of interest. In particular, we show ultimate bounds on the distance between the trajectories of RRM flow and the local performative risk minimizers. This also implies that under certain conditions on the performative risk, all performatively stable points are near performative risk minimizers, as was observed in Perdomo:2020tz.
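The relationship between the nominal and perturbed dynamics can be made explicit with the chain rule. The following is a sketch under assumed notation: the symbols $\theta$, $PR$, $DPR$, $\nabla_1$, $\nabla_2$, and $g$ are labels chosen here, the decoupled performative risk is assumed differentiable in both arguments, and the sign convention for $g$ is an assumption.

```latex
\begin{align*}
PR(\theta) &= DPR(\theta, \theta), \\
\nabla PR(\theta) &= \nabla_1 DPR(\theta, \theta) + \nabla_2 DPR(\theta, \theta), \\
\underbrace{-\nabla_1 DPR(\theta, \theta)}_{\text{RRM vector field}}
  &= \underbrace{-\nabla PR(\theta)}_{\text{PRM vector field}}
   + \underbrace{\nabla_2 DPR(\theta, \theta)}_{g(\theta)} .
\end{align*}
```

Under this convention the perturbation is exactly the part of the gradient that comes from the distribution shift, so it vanishes wherever the distribution does not respond to the decision.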
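Throughout this paper, we will be using tools from perturbation analysis in control theory. For a complete vector field $f$, let $\phi(t, \theta_0)$ denote the unique solution at time $t$ to the differential equation $\dot{\theta} = f(\theta)$ with initial condition $\theta(0) = \theta_0$. For a scalar-valued function $V$ and a vector field $f$, we can define the derivative along trajectories as $\dot{V}(\theta) = \nabla V(\theta)^\top f(\theta)$. We say a point $\theta^*$ is an equilibrium point if $f(\theta^*) = 0$. An equilibrium point $\theta^*$ is locally asymptotically stable if there exists a neighborhood $U$ of $\theta^*$ such that $\lim_{t \to \infty} \phi(t, \theta_0) = \theta^*$ for all $\theta_0 \in U$. A set $S$ is positively invariant if for all $\theta_0 \in S$ and $t \geq 0$, we have $\phi(t, \theta_0) \in S$. Additionally, given a set $S$, we say two points $\theta_1$ and $\theta_2$ are path-connected in $S$ if there exists a continuous function $\gamma : [0, 1] \to S$ such that $\gamma(0) = \theta_1$ and $\gamma(1) = \theta_2$. This forms an equivalence relation defined on $S$, and each equivalence class is a connected component of $S$.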
3.1 A motivating example: decision-dependent distribution shift in job applicant pools
Before we present our analysis of the PRM gradient flow and the RRM flow, we introduce an example that motivates the study of multiple local equilibria. This toy example considers a model for decision-dependent distribution shift in the applicant pool for a job based on the past hiring decisions of a company.
In this model, individuals are characterized by three variables: 1) their group membership, , 2) , which is their true productive capacity and independent of their group membership , and 3) , their observable productivity, which is a distorted version of their true productive capacity. The distribution of in the applicant pool will be decision-dependent, which we shall specify shortly. We let , which is viewed as observable features and output .
A learner wants to use historical hiring data to determine which individuals to hire into their company. The variable of interest is , which is the individual’s true productive capacity. The learner must make the decision based on , and can observe after the fact. We assume the learner uses a linear classifier. Let , and the individual with observable features is hired if . We also assume the learner uses a logistic loss function:
Now, let us define the decision-dependent distribution shift . In our model, if group has insufficient representation among the hired population, then future rounds will have fewer applications from this group. Furthermore, if group is well-represented in the hired population, then this encourages this group to apply in the future. We suppose there is some critical fraction : if less than of the hired population is of group 1, the next applicant pool will have reduced participation by group 1; if the hired population has group 1 represented by more than , then the next applicant pool will have increased participation up to a saturation fraction .
Formally, this means we define as follows. The distributions of and do not depend on . Let denote the fraction of the applicant pool that was group 1 in the previous iteration, and let denote the fraction of the previously accepted applicants that was group 1. (Footnote 2: We note that, technically, this formulation requires the mapping to have some notion of ‘state’, since the next distribution depends not only on but also on the fraction of the previous applicant pool and accepted applicants . Such extensions have been considered generally in Brown:2020wg, but, for this example, it suffices to add and as components of and , since both quantities are known by the learner.) The group membership , where is given by:
This update rule decreases for and increases for .
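The qualitative behavior of such an update can be sketched as follows. This is a hypothetical rule written only to match the verbal description (participation shrinks when the hired fraction is below the critical fraction and grows toward the saturation fraction otherwise); it is not the paper's Equation (6). The function names, the convex-combination form, and the numerical values of `critical`, `saturation`, and `weight` are all assumptions.

```python
def next_participation(p_prev, q_hired, critical, saturation, weight):
    """Hypothetical rule: move the group-1 applicant fraction toward 0 when the
    hired fraction is below the critical fraction, else toward saturation."""
    target = 0.0 if q_hired < critical else saturation
    return (1.0 - weight) * p_prev + weight * target

def simulate(p0, hired_fraction, critical=0.3, saturation=0.5,
             weight=0.1, iters=200):
    """Iterate the hypothetical rule; hired_fraction maps applicant share to hired share."""
    p = p0
    for _ in range(iters):
        p = next_participation(p, hired_fraction(p), critical, saturation, weight)
    return p

# Assume a group-blind classifier, so the hired fraction equals the applicant fraction.
group_blind = lambda p: p

p_low = simulate(0.2, group_blind)   # starts below the assumed critical fraction
p_high = simulate(0.4, group_blind)  # starts above the assumed critical fraction
print(p_low, p_high)
```

Even with a classifier that ignores group membership, the two initializations in this sketch end at different participation levels, which is the kind of initialization dependence the example is meant to highlight.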
We show some numerical results in Figure 1. In this example, we set the critical fraction as , the saturation fraction , and the observation variance. We initialized , and, at each iteration , the dataset was redrawn with 100 samples from . In this model, since group membership is uncorrelated with the quantity of interest, we see the classifier converges to a vertical line which ignores the group membership in both cases. However, in the initialization that is unfavorable to group 1, we see a dwindling participation by group 1, even though the classifier after 300 iterations is relatively fair. (Footnote 3: As mentioned in the previous footnote, we augment with , the probability of seeing . Thus, although the classifier parameters converge to the same point in both initializations, with this augmentation we can view these as two separate equilibria. For ease of presentation, we avoided cluttering notation with this augmentation.)
This model motivates the study of regions of attraction for different equilibria. A company, for one reason or another, may have historically hired more from one group of individuals than another. This example shows that when there are decision-dependent distribution shifts, the initial conditions can affect the final outcomes. In particular, for this setting, it is of interest to identify the region of attraction of each equilibrium, and our results provide a way to do so.
4 Analysis of performative risk minimizing gradient flow
In this section, we consider PRM gradient flow, defined by Equation (4). We observe that gradient flows provide complete vector fields, and that trajectories will converge to local performative risk minimizers under very mild conditions.
First, we state a proposition guaranteeing that the flow is well-defined. The compact sublevel sets ensure that trajectories of Equation (4) remain bounded, which is sufficient to guarantee existence and uniqueness of solutions globally. For a proof of the following proposition, we refer the reader to either Khalil:2001wj or Hirsch:2012tx.
Proposition 1 (Existence and uniqueness of gradient flows).
Suppose the performative risk $PR$ is continuously differentiable, and its sublevel sets $\{\theta : PR(\theta) \leq c\}$ are compact for every $c$. Then for any initial condition $\theta_0$, there exists a unique solution to the differential equation in Equation (4), defined for all $t \geq 0$.
Next, we note that gradient flows have nice properties from the perspective of optimization. Namely, every isolated local minimum is locally asymptotically stable, and we can provide sufficient conditions to characterize a subset of the region of convergence.
Proposition 2 (Convergence of gradient flows).
Suppose the performative risk $PR$ is twice continuously differentiable, and $\theta^*$ is an isolated local performative risk minimizer. Then $\theta^*$ is a locally asymptotically stable equilibrium of Equation (4). Furthermore, take any $c$ such that $c > PR(\theta^*)$. Let $S_c$ denote the connected component of $\{\theta : PR(\theta) \leq c\}$ that contains $\theta^*$. If $\theta^*$ is the only local performative minimizer in $S_c$, then all solutions with initial conditions in $S_c$ converge to $\theta^*$.
Since is an isolated local minimizer and the performative risk is twice continuously differentiable, there exists a neighborhood such that is non-zero for all . By continuity, there exists some constant such that the connected component of containing is contained in . Since it is a sublevel set of and on its boundary, it is positively invariant. Furthermore, since for all on this set, is locally asymptotically stable by standard Lyapunov arguments (see, e.g. Khalil:2001wj). ∎
The sublevel sets of the performative risk are positively invariant with respect to the PRM gradient flow. Furthermore, because of the continuity of trajectories, each connected component will also be positively invariant. This, in tandem with the fact that trajectories must either converge to a local minimum or go off to infinity, also implies the previous proposition.
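The role of connected components of sublevel sets can be seen in a small numerical sketch. The double-well function below is a hypothetical stand-in for a performative risk with two isolated local minimizers, and forward Euler is an assumed discretization of the gradient flow; neither comes from the paper. Any sublevel set with level below 1 splits into two components, one around each minimizer, and a trajectory started in one component stays there.

```python
def risk(x):
    """Hypothetical double-well risk with local minimizers at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def risk_grad(x):
    """Derivative of the double-well risk."""
    return 4.0 * x * (x * x - 1.0)

def gradient_flow(x0, dt=0.01, steps=5000):
    """Forward-Euler discretization of dx/dt = -risk_grad(x)."""
    x = x0
    values = [risk(x0)]
    for _ in range(steps):
        x -= dt * risk_grad(x)
        values.append(risk(x))
    return x, values

# Both starting points have risk 0.5625 < 1, but lie in different components
# of the sublevel set {x : risk(x) <= 0.5625}.
x_right, vals_right = gradient_flow(0.5)
x_left, vals_left = gradient_flow(-0.5)
print(x_right, x_left)
```

The recorded risk values do not increase along either trajectory, which is the discrete counterpart of the positive invariance of sublevel sets used in the argument above.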
With minimal assumptions, isolated local performative risk minimizers are all locally attractive in the PRM gradient flow. In Section 5, we will view the PRM gradient flow as the nominal dynamics. From this perspective, we analyze the RRM flow as a perturbation from these nominal dynamics. To be able to do any perturbation-based analysis, we will need some stronger conditions on the convergence of the gradient flow associated with performative risk minimization. We note these assumptions here.
Assumption 1 (Sufficient curvature of the performative risk).
Fix some isolated local performative risk minimizer . We assume there exist positive constants , , and such that the following holds in a neighborhood of :
We will let denote the radius of this neighborhood, so the above inequalities are valid on the set .
Assumption 1 provides conditions under which the performative risk can be used as a Lyapunov function locally.
5 Analysis of repeated risk minimizing flow
In the previous section, we considered the PRM gradient flow and showed that the trajectories converge to local performative risk minimizers in very general settings. In this section, we will consider the RRM flow, defined by Equation (5). The RRM flow is not necessarily a gradient flow, and generally will not inherit the nice properties we saw in Section 4.
The following theorem provides conditions on the transient response and steady-state behavior of the RRM flow. Prior to , the trajectories converge exponentially quickly. After , we have an ultimate bound that holds.
Theorem 1 (Ultimate bounds for RRM flow).
Suppose that there exist positive constants and such that the following holds on :
Additionally, suppose the initial condition satisfies:
Take any such that:
Then, there exists a such that:
For all :
For all :
Let . Note that on and if and only if . Furthermore, note that .
These inequalities are valid so long as stays within , which we will ensure later in the proof. Note that is sufficiently small (by assumption) to ensure that .
Let . Take any and note that:
Let . If , then:
Trajectories of Equation (5) have two stages: a transient due to the initial condition, and then an ultimate bound due to the perturbation. Let . Prior to , we have:
Note that this inequality also provides an upper bound on . Additionally, note that this implies the bound , by our assumption on the initial condition. Prior to , our trajectory stays in , where our inequalities are valid.
At time , we have . Note that this inequality implies . Since on the boundary of , we have that is a positively invariant set. So, for all , we have . Using Equation (7), we have the following for all :
The condition on ensures that this quantity is bounded by , and the trajectory stays in for . This proves our desired result. ∎
Note that, in the special case where , we have that the RRM flow converges exponentially quickly to locally. Similarly, in the special case where Assumption 1 holds everywhere (i.e. ), then there is only one minimizer , and all initial conditions converge to a neighborhood of exponentially fast.
Additionally, note that locally performatively stable points are equilibria of the RRM flow. This result provides constraints on where performatively stable points can be. Suppose again that Assumption 1 holds globally (i.e. ) and, consequently, there exists only one minimizer . In this special case, Theorem 1 shows that all performatively stable points must be close to , which is a continuous-time analog to Theorem 4.3 in Perdomo:2020tz.
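The two-stage behavior (an exponentially decaying transient followed by an ultimate bound) is the standard picture from perturbation analysis, and it can be illustrated on a scalar system. This is only a sketch under assumed values: the linear nominal dynamics, the rate `m`, the perturbation bound `delta`, and the sinusoidal perturbation are all hypothetical and are not the constants of Theorem 1.

```python
import math

def simulate_perturbed_flow(x0, m=1.0, delta=0.2, dt=0.01, steps=2000):
    """Forward-Euler simulation of dx/dt = -m * x + d(t) with |d(t)| <= delta."""
    xs = [x0]
    x = x0
    for k in range(steps):
        # Bounded perturbation that does not vanish over time.
        d = delta * math.sin(0.5 * k * dt)
        x = x + dt * (-m * x + d)
        xs.append(x)
    return xs

m, delta, dt, x0 = 1.0, 0.2, 0.01, 2.0
xs = simulate_perturbed_flow(x0, m, delta, dt)

# Transient envelope: exponential decay from the initial condition plus the
# steady-state radius delta / m induced by the perturbation.
envelope = [abs(x0) * math.exp(-m * k * dt) + delta / m for k in range(len(xs))]
within_envelope = all(abs(x) <= e + 1e-12 for x, e in zip(xs, envelope))
late_max = max(abs(x) for x in xs[-500:])
print(within_envelope, late_max)
```

Setting `delta` to zero in this sketch removes the steady-state radius, matching the remark that vanishing perturbations give convergence to the minimizer itself.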
5.1 Performative alignment
From the previous analysis, we also identify conditions on the directions of the performative perturbations that are sufficient to show the convergence of Equation (5), the RRM flow, to performative risk minimizers.
Theorem 2 (Performative alignment).
Let . Since is a locally asymptotically stable equilibrium of the PRM flow, we have: , for , and for . The performative alignment condition ensures that as well, and the desired result follows. ∎
We refer to Equation (10) as the performative alignment condition. This condition states that the performative perturbation never increases the performative risk, and the convergence of performative risk minimization is sufficient to guarantee convergence of repeated risk minimization. In other words, the perturbation is either sufficiently small or pointing in the correct direction to ensure that can still act as a Lyapunov function.
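Another perspective on performative alignment is to consider the performative risk as a bilinear form whose arguments are parameterized by $\theta$. In particular, consider the decoupled performative risk $DPR(\theta_1, \theta_2)$. Let $\ell_{\theta_1}(\cdot) = \ell(\cdot, \theta_1)$ and let $p_{\theta_2}$ denote the probability distribution associated with $\mathcal{D}(\theta_2)$. Then, we can write $DPR(\theta_1, \theta_2) = \int \ell_{\theta_1}(z) \, dp_{\theta_2}(z)$. From this perspective, $DPR$ is a bilinear form in $\ell_{\theta_1}$ and $p_{\theta_2}$. As such, the performative alignment condition becomes a condition on the way in which $\ell_\theta$ and $p_\theta$ are parameterized by $\theta$.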
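The bilinear structure is easiest to see when the data take finitely many values, so that a loss function and a distribution are both vectors and the decoupled risk is their inner product. The sketch below makes that finite-support assumption; the vectors `loss_a`, `loss_b`, `prob_a`, `prob_b` and the helper names are hypothetical.

```python
def decoupled_risk(loss_vec, prob_vec):
    """Expected loss as an inner product over a finite set of data values."""
    return sum(l * p for l, p in zip(loss_vec, prob_vec))

def combine(a, u, b, v):
    """Entrywise linear combination a * u + b * v of two vectors."""
    return [a * x + b * y for x, y in zip(u, v)]

# Hypothetical losses (one entry per data value) for two decisions.
loss_a = [1.0, 0.0, 2.0]
loss_b = [0.5, 1.5, 0.0]
# Hypothetical distributions induced by two decisions.
prob_a = [0.2, 0.5, 0.3]
prob_b = [0.6, 0.1, 0.3]

# Linearity in the loss argument.
lhs_loss = decoupled_risk(combine(2.0, loss_a, -1.0, loss_b), prob_a)
rhs_loss = 2.0 * decoupled_risk(loss_a, prob_a) - 1.0 * decoupled_risk(loss_b, prob_a)

# Linearity in the distribution argument (here a mixture of the two distributions).
lhs_prob = decoupled_risk(loss_a, combine(0.25, prob_a, 0.75, prob_b))
rhs_prob = 0.25 * decoupled_risk(loss_a, prob_a) + 0.75 * decoupled_risk(loss_a, prob_b)
print(lhs_loss, rhs_loss, lhs_prob, rhs_prob)
```

The nonlinearity of the performative risk in the decision therefore comes entirely from how the decision selects the two vectors, not from the pairing between them.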
6 Numerical results
In this section, we revisit the model introduced in Section 3.1. We again use the parameters: critical participation rate , saturation participation rate , and observational noise . The participation rate is initialized to . In lieu of explicitly calculating the expectation, we drew 100 data points at each iteration and used the empirical distribution to calculate update rules, as would typically be done in practice. All of the trials, including those in Section 3.1, were conducted with a constant step-size of , and with the weight for the update rule in Equation (6).
[Figure 2 caption, partially recovered:] This plot shows the average and standard deviation across 100 trials.
In Figure 2 (Left), we visualize the participation rate of group 1 after 600 iterations, based on the initialization of the weight vector . (In all cases, the bias term was initialized to and the initial participation was initialized to .) We can see the set of initial conditions in which group 1 continues to participate in the job application pool.
Next, we consider the effect of scaling the perturbation on the values of across time. In this model, we can scale the perturbation by changing the weight parameter in Equation (6). The bounds in Theorem 1 depend on the size of the perturbation; the region of attraction is larger for smaller perturbations. The experiment in Figure 2 (Right) was conducted with the favored initialization , and with initial probability . The visualization shows the average value and standard deviation across 100 trials. We can see that for small perturbations , the probability does not change significantly across time. For larger perturbations , the performative perturbation can be large enough to push the trajectory out of the region of attraction for the nominal dynamics.
7 Closing remarks
In this paper, we analyzed the problem of performative prediction in settings where multiple isolated equilibria may be of interest. We analyzed the gradient flow of performative risk minimization, and identified regions of attraction for various equilibria. We viewed the repeated risk minimization flow as a perturbation of the PRM gradient flow. In particular, we used a Lyapunov function for the PRM gradient flow to analyze the trajectories of the RRM flow. We found conditions under which the RRM flow will converge to the local PRM minimizers, and conditions under which it will converge to a neighborhood of PRM minimizers. Stochastic approximation results allowed us to state when repeated risk minimization will approximate the RRM flow studied.
These results provide a method to analyze the regions of attraction for various equilibria under repeated risk minimization. In real-world settings with decision-dependent distributions, we expect many situations where the initialization may have a significant effect on the trajectories and final outcomes.