It is useful to train classifiers with data-dependent constraints in order to achieve certain guarantees on the training set, such as statistical parity or other fairness guarantees, specified recall, or a desired positive classification rate (e.g. Scott and Nowak, 2005; Zafar et al., 2015; Goh et al., 2016; Woodworth et al., 2017; Narasimhan, 2018). However, a key question is whether the achieved constraints will generalize. For example: will a classifier trained to produce statistical parity on training examples still achieve statistical parity at evaluation time?
Unfortunately, the answer is “not quite.” Because such constraints are data-dependent, overfitting can occur, and constraints that were satisfied on the training set should be expected to be slightly violated on an i.i.d. test set. This is particularly problematic in the context of fairness constraints, which will typically be chosen based on real-world requirements (e.g. the rule of some US laws (Biddle, 2005; Vuolo and Levy, 2013; Zafar et al., 2015; Hardt et al., 2016)). In this paper, we investigate how well constraints generalize, and propose algorithms to improve the generalization of constraints to new examples.
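Specifically, we consider problems that minimize a loss function subject to data-dependent constraints, expressed in terms of expectations over a data distribution $\mathcal{D}$:

$$\min_{\theta \in \Theta} \; \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_0(x; \theta)\right] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_i(x; \theta)\right] \le 0 \;\; \forall i \in [m] \qquad (1)$$

where $x \in \mathcal{X}$ is a feature vector, $\mathcal{D}$ is the data distribution over $\mathcal{X}$, $\Theta$ is a space of model parameters for the function class of interest, and $\ell_0, \ell_1, \dots, \ell_m : \mathcal{X} \times \Theta \to \mathbb{R}$ are loss functions associated with the objective and the constraints (Table 5, in the appendix, summarizes our notation). We do not require these loss functions to be convex. Appendix A contains two examples of how Equation 1 can be used to express certain data-dependent constraints (see Goh et al. (2016) and Narasimhan (2018) for more).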
One typically trains a classifier on a finite training set drawn from $\mathcal{D}$, but the true goal is to satisfy constraints in expectation over $\mathcal{D}$, as in Equation 1. To this end, we build on a long line of prior work that treats constrained optimization as a two-player game (e.g. Christiano et al., 2011; Arora et al., 2012; Rakhlin and Sridharan, 2013; Kearns et al., 2017; Narasimhan, 2018; Agarwal et al., 2018). In this setting, the first player optimizes the model parameters $\theta$, and the second player enforces the constraints, e.g. using the Lagrangian formulation:

$$\mathcal{L}(\theta, \lambda) := \mathbb{E}_{x \sim \mathcal{D}}\left[\ell_0(x; \theta) + \sum_{i=1}^{m} \lambda_i \ell_i(x; \theta)\right]$$

In practice, one would approximate the Lagrangian with a finite i.i.d. training sample from $\mathcal{D}$, and the first player would minimize over the model parameters $\theta \in \Theta$ while the second player maximizes over the Lagrange multipliers $\lambda \in \mathbb{R}_+^m$.
Our key idea is to treat constrained optimization similarly to hyperparameter optimization: just as one typically chooses hyperparameters based on a validation set, instead of the training set, to improve classifier generalization, we would like to choose the Lagrange multipliers on a validation set to improve constraint generalization. In “inner” optimizations we would, given a fixed $\lambda$, minimize the empirical Lagrangian on the training set. Then, in an “outer” optimization, we would choose a $\lambda$ that results in the constraints being satisfied on the validation set. Such an approach, could it be made to work, would not eliminate the constraint generalization problem completely (hyperparameter overfitting is a real problem; e.g. Ng, 1997), but would mitigate it, since constraint generalization would no longer depend on the size of the training sample and the complexity of $\Theta$ (which could be extremely large, e.g. for a deep neural network), but rather on the size of the validation sample and the effective complexity of the space of multipliers $\lambda$, which, being $m$-dimensional, is presumably much simpler than $\Theta$.
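The inner/outer scheme described above can be pictured on a toy problem. Everything in the following sketch is an illustrative assumption rather than the method analyzed in this paper: the quadratic objective, the single linear constraint, the grid of candidate multipliers, and all function names are hypothetical. It only shows the division of labor, with the model parameter fitted on training data for a fixed multiplier, and the multiplier chosen by checking the constraint on validation data.

```python
def inner_minimize(train, lam):
    """Closed-form minimizer over theta of
    mean((theta - a)^2) + lam * mean(b - theta) on the training set.
    Setting the derivative 2*(theta - mean_a) - lam to zero gives the result."""
    mean_a = sum(a for a, _ in train) / len(train)
    return mean_a + lam / 2.0


def constraint_value(data, theta):
    """Empirical constraint mean(b) - theta, which should be <= 0."""
    return sum(b for _, b in data) / len(data) - theta


def choose_multiplier(train, val, grid):
    """Outer loop: smallest candidate multiplier whose training-set solution
    satisfies the constraint on the validation set (None if no candidate works)."""
    for lam in sorted(grid):
        theta = inner_minimize(train, lam)
        if constraint_value(val, theta) <= 0.0:
            return lam, theta
    return None


if __name__ == "__main__":
    train = [(0.0, 1.0), (2.0, 1.0)]
    val = [(0.0, 2.0), (0.0, 1.5)]
    print(choose_multiplier(train, val, [0.0, 0.5, 1.0, 1.5, 2.0]))
```

In this toy setting only the scalar multiplier is selected using validation data, so the number of quantities that can overfit the validation constraint is one, regardless of how the training-set minimizer is parameterized.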
While the above approach is intuitive, challenges arise when attempting to analyze it. The most serious is that since $\theta$ is chosen based on the training set, and $\lambda$ on the validation set, the $\theta$-player is minimizing a different function than the $\lambda$-player is maximizing, so the corresponding two-player game is non-zero-sum (the players have different cost functions). To handle this, we must depart from the typical Lagrangian formulation, but the key idea remains: improving generalization by using a separate validation set to enforce the constraints.
Fortunately, the recent work of Cotter et al. (2018) gives a strategy for dealing with a non-zero-sum game in the context of constrained supervised learning. We adapt their approach to this new setting to give bounds on constraint generalization that are agnostic to model complexity. After some preliminary definitions in Section 3, in Section 4 we present two algorithms for which we can provide theoretical bounds.
In Section 5, we perform a set of experiments demonstrating that our two-dataset approach successfully improves constraint generalization even when our theoretical results do not hold. In other words, providing independent datasets to the $\theta$- and $\lambda$-players seems to work well as a heuristic for improving constraint generalization.
2 Related Work
While several recent papers have proved generalization bounds for constrained problems (e.g. Goh et al., 2016; Agarwal et al., 2018; Donini et al., 2018), the problem of improving constraint generalization is a fairly new one, having, so far as we know, only been previously considered in the work of Woodworth et al. (2017), who handled generalization subject to “equalized odds” constraints in the setting of Hardt et al. (2016). Specifically, their approach is to first learn a predictor on $S^{(\mathrm{trn})}$, and then to learn a “correction” on $S^{(\mathrm{val})}$ to more tightly satisfy the fairness constraints. The second stage requires estimating only a constant number of parameters, and the final predictor consequently enjoys a generalization guarantee for the fairness constraints which is independent of the predictor’s complexity, with only a modest penalty to the loss. However, their approach relies heavily upon the structure of equalized odds constraints: it requires that any classifier can be modified to satisfy the fairness constraints and have low loss on a validation set by tuning only a small number of parameters.
Woodworth et al. (2017)’s overall approach can be summarized as “train a complicated model on a training set, and then a simple correction on a validation set”. If, as they show to be the case for equalized odds constraints, the “simple correction” is capable of satisfying the constraints without significantly compromising on quality, then this technique results in a well-performing model for which the validation constraint generalization depends not on the complexity of the “complicated model”, but rather on that of the “simple correction”. In this paper, we extend Woodworth et al. (2017)’s two-dataset idea to work on data-dependent constraints in general.
Our primary baseline is Agarwal et al. (2018)’s recently-proposed algorithm for fair classification using the Lagrangian formulation. Their proposal, like our Algorithm 1, uses an oracle to optimize w.r.t. $\theta$ (they use the terminology “best response”), and, like all of our algorithms, results in a stochastic classifier. However, our setting differs slightly from theirs: they focus on fair classification, while we work in the slightly more general inequality-constrained setting (Equation 1). For this reason, in Appendix D we provide an analysis of the Lagrangian formulation for inequality constrained optimization.
3 Background & Definitions
Our algorithms are based on the non-zero-sum two-player game proposed by Cotter et al. (2018), which they call the “proxy-Lagrangian” formulation. The key novelty of their approach is the use of “proxy” constraint losses, which are essentially surrogate losses that are used by only one of the two players (the $\theta$-player). It is because the two players use different losses that their proposed game is non-zero-sum. The motivation behind their work is that a surrogate might be necessary when the constraint functions are non-differentiable or discontinuous (e.g. for fairness metrics, which typically constrain proportions, i.e. linear combinations of indicators), but the overall goal is still to satisfy the original (non-surrogate) constraints. Our work differs in that we use a non-zero-sum game to provide different datasets to the two players, rather than different losses.
Despite this difference, the use of proxy-constraints is perfectly compatible with our proposal, so we permit the approximation of each of our constraint losses $\ell_i$ with a (presumably differentiable) upper-bound $\tilde{\ell}_i$. These are used only by the $\theta$-player; the $\lambda$-player uses the original constraint losses. The use of proxy constraint losses is entirely optional: one is free to choose $\tilde{\ell}_i := \ell_i$ for all $i \in [m]$.
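Definition 1. Let $S^{(\mathrm{trn})}$ and $S^{(\mathrm{val})}$ be two random datasets each drawn i.i.d. from a data distribution $\mathcal{D}$. Given proxy constraint losses $\tilde{\ell}_i(x; \theta) \ge \ell_i(x; \theta)$ for all $x \in \mathcal{X}$, $\theta \in \Theta$ and $i \in [m]$, the empirical proxy-Lagrangians $\hat{L}_\theta, \hat{L}_\lambda : \Theta \times \Lambda \to \mathbb{R}$ of Equation 1 are:

$$\hat{L}_\theta(\theta, \lambda) := \lambda_1 \frac{1}{|S^{(\mathrm{trn})}|} \sum_{x \in S^{(\mathrm{trn})}} \ell_0(x; \theta) + \sum_{i=1}^{m} \lambda_{i+1} \frac{1}{|S^{(\mathrm{trn})}|} \sum_{x \in S^{(\mathrm{trn})}} \tilde{\ell}_i(x; \theta)$$

$$\hat{L}_\lambda(\theta, \lambda) := \sum_{i=1}^{m} \lambda_{i+1} \frac{1}{|S^{(\mathrm{val})}|} \sum_{x \in S^{(\mathrm{val})}} \ell_i(x; \theta)$$

where $\Lambda := \Delta^{m+1}$ is the $(m+1)$-dimensional simplex.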
The difference between the above, and Definition 2 of Cotter et al. (2018), is that $\hat{L}_\theta$ is an empirical average over the training set, while $\hat{L}_\lambda$ is over the validation set. The $\theta$-player seeks to minimize $\hat{L}_\theta$ over $\theta$, while the $\lambda$-player seeks to maximize $\hat{L}_\lambda$ over $\lambda$. In words, the $\lambda$-player will attempt to satisfy the original constraints on the validation set by choosing how much to penalize the proxy constraints on the training set.
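As a concrete reading of the two empirical functions, the sketch below evaluates both from per-example loss values. It assumes, as the discussion of objective and constraint weights in this paper suggests, that the first coordinate of the multiplier vector weights the objective and the remaining coordinates weight the constraints; the function and argument names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def empirical_proxy_lagrangians(lam, objective_trn, proxy_trn, original_val):
    """Evaluate the theta-player's and lambda-player's functions.

    lam           -- m+1 nonnegative weights summing to one
    objective_trn -- per-example objective losses on the training set
    proxy_trn     -- m lists of per-example proxy constraint losses (training set)
    original_val  -- m lists of per-example original constraint losses (validation set)
    """
    m = len(proxy_trn)
    if len(lam) != m + 1 or len(original_val) != m:
        raise ValueError("expected m+1 weights and m validation constraint lists")
    # theta-player: weighted objective plus weighted proxy constraints, training data only.
    l_theta = lam[0] * mean(objective_trn) + sum(
        lam[i + 1] * mean(proxy_trn[i]) for i in range(m))
    # lambda-player: weighted original constraints, validation data only.
    l_lambda = sum(lam[i + 1] * mean(original_val[i]) for i in range(m))
    return l_theta, l_lambda
```

Note that the two return values are computed from disjoint data and (when proxies are used) different losses, which is what makes the resulting game non-zero-sum.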
Our ultimate interest is in generalization, and our bounds will be expressed in terms of both the training and validation generalization errors, defined as follows:
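Define the training generalization error $\tilde{G}^{(\mathrm{trn})}(\Theta)$ such that:

$$\left| \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(x; \theta)\right] - \frac{1}{|S^{(\mathrm{trn})}|} \sum_{x \in S^{(\mathrm{trn})}} \ell(x; \theta) \right| \le \tilde{G}^{(\mathrm{trn})}(\Theta)$$

for all $\theta \in \Theta$ and all $\ell \in \{\ell_0, \tilde{\ell}_1, \dots, \tilde{\ell}_m\}$ (the objective and proxy constraint losses, but not the original constraint losses).

Likewise, define the validation generalization error $\hat{G}^{(\mathrm{val})}(\hat{\Theta})$ to satisfy the analogous inequality in terms of $S^{(\mathrm{val})}$:

$$\left| \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(x; \theta)\right] - \frac{1}{|S^{(\mathrm{val})}|} \sum_{x \in S^{(\mathrm{val})}} \ell(x; \theta) \right| \le \hat{G}^{(\mathrm{val})}(\hat{\Theta})$$

for all $\theta \in \hat{\Theta}$ and all $\ell \in \{\ell_1, \dots, \ell_m\}$ (the original constraint losses, but not the objective or proxy constraint losses).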
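Throughout this paper, $\hat{\Theta} := \{\theta^{(1)}, \dots, \theta^{(T)}\} \subseteq \Theta$ is the set of iterates found by one of our proposed algorithms. Each of our guarantees will be stated for a particular stochastic model $\bar{\theta}$ supported on $\hat{\Theta}$ (i.e. $\bar{\theta}$ is a distribution over $\hat{\Theta}$), instead of for a single deterministic $\theta \in \hat{\Theta}$. Notice that the above definitions of $\tilde{G}^{(\mathrm{trn})}(\Theta)$ and $\hat{G}^{(\mathrm{val})}(\hat{\Theta})$ also apply to such stochastic models: by the triangle inequality, if every $\theta \in \hat{\Theta}$ generalizes well, then any $\bar{\theta}$ supported on $\hat{\Theta}$ generalizes equally well, in expectation.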
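The triangle-inequality step can be written out explicitly. The notation here is an assumed rendering, not taken from the original: $S$ stands for either dataset, $G$ for the corresponding generalization error, and $\bar{\theta}$ for any distribution over the iterates.

```latex
\begin{align*}
\left| \mathbb{E}_{\theta \sim \bar{\theta}} \left[ \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(x;\theta)\right]
  - \frac{1}{|S|} \sum_{x \in S} \ell(x;\theta) \right] \right|
&\le \mathbb{E}_{\theta \sim \bar{\theta}} \left| \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(x;\theta)\right]
  - \frac{1}{|S|} \sum_{x \in S} \ell(x;\theta) \right| \\
&\le \mathbb{E}_{\theta \sim \bar{\theta}} \left[ G \right] = G
\end{align*}
```

The first inequality moves the absolute value inside the expectation over iterates, and the second applies the per-iterate bound to every point in the support of $\bar{\theta}$.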
We seek a solution that (i) is nearly-optimal, (ii) is nearly-feasible, and (iii) generalizes well on the constraints. The optimality and feasibility goals were already tackled by Cotter et al. (2018) in the context of the proxy-Lagrangian formulation of Definition 1. They proposed having the $\theta$-player minimize ordinary external regret, and the $\lambda$-player minimize swap regret using an algorithm based on Gordon et al. (2008). Rather than finding a single solution (a pure equilibrium of Definition 1), they found a distribution over solutions (a mixed equilibrium). Our proposed approach follows this same pattern, but we build on top of it to address challenge (iii): generalization.
To this end, we draw inspiration from Woodworth et al. (2017) (see Section 2), and isolate the constraints from the complexity of $\Theta$ by using two independent datasets: $S^{(\mathrm{trn})}$ and $S^{(\mathrm{val})}$. The “training” dataset will be used to choose a good set of model parameters $\theta$, and the “validation” dataset to choose $\lambda$, and thereby impose the constraints. Like Woodworth et al. (2017), the resulting constraint generalization bounds will be independent of the complexity of the function class.
We’ll begin, in Section 4.1, by proposing and analyzing an oracle-based algorithm that improves generalization by discretizing the candidate set, but makes few assumptions (not even convexity). Next, in Section 4.2, we give an algorithm that is more “realistic” (there is no oracle, and no discretization) but requires stronger assumptions, including strong convexity of the objective and proxy-constraint losses (but not of the original constraint losses $\ell_i$).
In Section 5, we will present and perform experiments on simplified “practical” algorithms with no guarantees, but that incorporate our key idea: having the $\lambda$-player use an independent validation set.
4.1 Covering-based Algorithm
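|3||Let $\lambda^{(t)} = \operatorname{fix} M^{(t)}$ // Fixed point of $M^{(t)}$, i.e. a stationary distribution|
|4||Let $\tilde{\lambda}^{(t)} = \operatorname{argmin}_{\tilde{\lambda} \in C_r} \lVert \lambda^{(t)} - \tilde{\lambda} \rVert$ // Discretization to closest point in $C_r$|
|6||Let $\Delta_\lambda^{(t)}$ be a supergradient of $\hat{L}_\lambda(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\lambda$|
|7||Update $\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}\left(\eta_\lambda \Delta_\lambda^{(t)} (\lambda^{(t)})^T\right)$ // $\odot$ and $\operatorname{.exp}$ are element-wise|
|8||Project $M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} / \lVert \tilde{M}^{(t+1)}_{:,i} \rVert_1$ for $i \in [m+1]$ // Column-wise projection w.r.t. KL divergence|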
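The $\lambda$-player's steps in the listing above (stationary distribution, element-wise multiplicative update, column normalization) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the update is taken to multiply entry $(i, j)$ of the matrix by the exponential of the step size times the $i$th gradient coordinate times the $j$th multiplier, the fixed point is approximated by power iteration, and all names are hypothetical.

```python
import math


def stationary_distribution(M, iterations=200):
    """Approximate lam with M @ lam == lam for a column-stochastic matrix M
    (power iteration; assumes M has strictly positive entries)."""
    n = len(M)
    lam = [1.0 / n] * n
    for _ in range(iterations):
        lam = [sum(M[i][j] * lam[j] for j in range(n)) for i in range(n)]
        total = sum(lam)
        lam = [v / total for v in lam]
    return lam


def multiplicative_update(M, lam, grad, step_size):
    """Element-wise multiplicative update followed by renormalizing each column
    to sum to one (the column-wise projection onto the simplex)."""
    n = len(M)
    updated = [[M[i][j] * math.exp(step_size * grad[i] * lam[j]) for j in range(n)]
               for i in range(n)]
    for j in range(n):
        column_sum = sum(updated[i][j] for i in range(n))
        for i in range(n):
            updated[i][j] /= column_sum
    return updated
```

With one constraint, a gradient of `[0.0, 1.0]` (a violated constraint on the validation set) shifts weight from the objective coordinate to the constraint coordinate of the stationary distribution.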
The simplest way to attack the generalization problem, and the first that we propose, is to discretize the space of allowed $\lambda$s, and associate each $\lambda$ with a unique $\theta \in \Theta$, where this association is based only on the training set. If the set of discretized $\lambda$s is sufficiently small, then the set of discretized $\theta$s will likewise be small, and since it was chosen independently of the validation set, its validation performance will generalize well.
Specifically, we take $C_r$ to be a radius-$r$ (external) covering of $\Lambda$ w.r.t. the -norm. The set of allowed $\lambda$s is exactly the covering centers, while, following Chen et al. (2017), Agarwal et al. (2018) and Cotter et al. (2018), the associated $\theta$s are found using an approximate Bayesian optimization oracle:
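A $\rho$-approximate Bayesian optimization oracle is a function $\mathcal{O}_\rho : (\Theta \to \mathbb{R}) \to \Theta$ for which:

$$f\left(\mathcal{O}_\rho(f)\right) \le \inf_{\theta^* \in \Theta} f(\theta^*) + \rho$$

for any $f : \Theta \to \mathbb{R}$ that can be written as $f(\theta) = \hat{L}_\theta(\theta, \lambda)$ for some $\lambda \in \Lambda$. Furthermore, every time it is given the same $f$, $\mathcal{O}_\rho$ will return the same $\theta$ (i.e. it is deterministic).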
We will take the discretized set of $\theta$s to be the oracle solutions corresponding to the covering centers, i.e. $\{\mathcal{O}_\rho(\hat{L}_\theta(\cdot, \tilde{\lambda})) : \tilde{\lambda} \in C_r\}$. The proof of the upcoming theorem shows that if the radius parameter $r$ is sufficiently small, then for any achievable objective function value and corresponding constraint violations, there will be a discretized $\theta$ that is almost as good. Hence, despite the use of discretization, we will still be able to find a nearly-optimal and nearly-feasible solution. Additionally, since the set of discretized classifiers is finite, we can apply the standard generalization bound for a finite function class, which will be tightest when we take $r$ to be as large as possible while still satisfying our optimality and feasibility requirements.
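The discretization step can be illustrated with a simple grid covering of the simplex. This is only a sketch: the evenly spaced grid, the sup-norm distance, and the function names are assumptions made for the example, and the covering used in the analysis need not have this form.

```python
from itertools import product


def simplex_grid(dimension, resolution):
    """All points of the simplex whose coordinates are multiples of 1/resolution.
    These serve as the (hypothetical) covering centers."""
    return [tuple(k / resolution for k in counts)
            for counts in product(range(resolution + 1), repeat=dimension)
            if sum(counts) == resolution]


def nearest_center(lam, centers):
    """Covering center closest to lam in the sup-norm (an illustrative choice of norm)."""
    return min(centers, key=lambda c: max(abs(a - b) for a, b in zip(lam, c)))


if __name__ == "__main__":
    centers = simplex_grid(3, 2)
    print(len(centers), nearest_center((0.6, 0.3, 0.1), centers))
```

Because each center would be mapped to a single model by a deterministic oracle, the number of candidate models is at most the number of centers, which is what allows a finite-class generalization bound to be applied.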
Algorithm 1 combines our proposed discretization with the oracle-based proxy-Lagrangian optimization procedure proposed by Cotter et al. (2018). As desired, it finds a sequence of solutions $\hat{\Theta}$ for which it is possible to bound $\hat{G}^{(\mathrm{val})}(\hat{\Theta})$ independently of the complexity of the function class parameterized by $\Theta$, and finds a random parameter vector $\bar{\theta}$ supported on $\hat{\Theta}$ that is nearly-optimal and nearly-feasible.

Theorem (discrete). Given any , there exists a covering such that, if we take and , where is a bound on the gradients, then the following hold, where $\hat{\Theta}$ is the set of results of Algorithm 1.
Optimality and Feasibility: Let be a random variable taking values from , defined such that with probability , and let . Then is nearly-optimal in expectation:
Additionally, if there exists a that satisfies all of the constraints with margin (i.e. for all ), then:
where is a bound on the range of the objective loss.
Generalization: With probability over the sampling of :
where for all , and assuming that the range of each is the interval .

Proof. The particular values we choose for and come from Lemma LABEL:lem:discrete-convergence, taking , , and . The optimality and feasibility results then follow from Theorem LABEL:thm:dataset-suboptimality.
For the bound on , notice that by Lemma LABEL:lem:covering-number, there exists a radius- covering w.r.t. the -norm with . Substituting this, and the definition of , into the bound of Lemma LABEL:lem:discrete-generalization yields the claimed bound.
† This condition could be removed by defining the feasibility margin in terms of instead of , causing to depend on the particular training sample, instead of being solely a property of the constrained problem and choice of proxy-constraint losses.
|(Theorem LABEL:thm:discrete)||(Theorem LABEL:thm:continuous)|
When reading the above result, it’s natural to wonder about the role played by $\bar{\lambda}_1$. Recall that, unlike the Lagrangian formulation, the proxy-Lagrangian formulation (Definition 1) has a weight $\lambda_1$ associated with the objective, in addition to the $m$ weights $\lambda_2, \dots, \lambda_{m+1}$ associated with the constraints. When the $i$th constraint is violated, the corresponding $\lambda_{i+1}$ will grow, pushing $\lambda_1$ towards zero. Conversely, when the constraints are satisfied, $\lambda_1$ will be pushed towards one. In other words, the magnitude of $\lambda_1$ encodes the $\lambda$-player’s “belief” about the feasibility of the solution. Just as, when using the Lagrangian formulation, Lagrange multipliers will tend to be small on a feasible problem, the proxy-Lagrangian objective weight $\bar{\lambda}_1$ will tend to be large on a feasible problem, as shown by Equation 15, which guarantees that $\bar{\lambda}_1$ will be bounded away from zero provided that there exists a margin-feasible solution with a sufficiently large margin $\gamma$. In practice, of course, one need not rely on this lower bound: one can instead simply inspect the behavior of the sequence of $\lambda_1^{(t)}$’s during optimization.
Equation 15 causes our results to be gated by the feasibility margin. Specifically, it requires the training and validation datasets to generalize well enough for to stay within the feasibility margin . Past this critical threshold, can be lower-bounded by a constant, and can therefore be essentially ignored. To get an intuitive grasp of this condition, notice that it is similar to requiring -margin-feasible solutions on the training dataset to generalize well enough to also be margin-feasible (with a smaller margin) on the validation dataset, and vice-versa.
Table 1 contains a comparison of our bounds, obtained with the proxy-Lagrangian formulation and two datasets, versus bounds for the standard Lagrangian on one dataset. The “Assuming” column contains a condition resulting from the above discussion. There are two key ways in which our results improve on those for the one-dataset Lagrangian: (i) in the “Infeasibility” column, our approach depends on instead of , and (ii): as shown in Table 2, for our algorithms the generalization performance of the constraints is bounded independently of the complexity of .
It’s worth emphasizing that this generalization bound (Table 2) is distinct from the feasibility bound (the “Infeasibility” column of Table 1). When using our algorithms, testing constraint violations will always be close to the validation violations, regardless of the value of . The “Assuming” column is only needed when asking whether the validation violations are close to zero.
4.2 Gradient-based Algorithm
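|3||Let $\lambda^{(t)} = \operatorname{fix} M^{(t)}$ // fixed point of $M^{(t)}$, i.e. a stationary distribution|
|5||Initialize $\tilde{\theta}^{(t,1)} = 0$ // Assumes $0 \in \Theta$|
|6||Let $\check{\Delta}_\theta^{(t,s)}$ be a subgradient of $\hat{L}_\theta(\tilde{\theta}^{(t,s)}, \lambda^{(t)})$ w.r.t. $\theta$|
|9||Let $\Delta_\lambda^{(t)}$ be a gradient of $\hat{L}_\lambda(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\lambda$|
|10||Update $\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}\left(\eta_\lambda \Delta_\lambda^{(t)} (\lambda^{(t)})^T\right)$ // $\odot$ and $\operatorname{.exp}$ are element-wise|
|11||Project $M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} / \lVert \tilde{M}^{(t+1)}_{:,i} \rVert_1$ for $i \in [m+1]$ // Column-wise projection w.r.t. KL divergence|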
Aside from the unrealistic requirement for a Bayesian optimization oracle, the main disadvantage of Algorithm 1 is that it relies on discretization. Our next algorithm instead makes much stronger assumptions—strong convexity of the objective and proxy constraint losses, and Lipschitz continuity of the original constraint losses—enabling us to dispense with discretization entirely in both the algorithm and the corresponding theorem statement.
The proof of the upcoming theorem, however, still uses a covering. The central idea is the same as before, with one extra step: thanks to strong convexity, every (approximate) minimizer of is close to one of the discretized parameter vectors . Hence, the set of such minimizers generalizes as well as , plus an additional term measuring the cost that we pay for approximating the minimizers with elements of .
The strong convexity assumption also enables us to replace the oracle call with an explicit minimization procedure: gradient descent. The result is Algorithm 2, which, like Algorithm 1, both finds a nearly-optimal and nearly-feasible solution, and enables us to bound $\hat{G}^{(\mathrm{val})}(\hat{\Theta})$ independently of the complexity of $\Theta$. Unlike Algorithm 1, however, it is realistic enough to permit a straightforward implementation.

Theorem (continuous). Suppose that is compact and convex, and that is -strongly convex in for all . Given any , if we take , and , where is as in Theorem LABEL:thm:discrete and is a bound on the subgradients, then the following hold, where $\hat{\Theta}$ is the set of results of Algorithm 2.
Optimality and Feasibility: Let be a random variable taking values from , defined such that with probability , and let . Then is nearly-optimal in expectation:
Additionally, if there exists a that satisfies all of the constraints with margin (i.e. for all ), then:
where is as in Theorem LABEL:thm:discrete.
Generalization: If, in addition to the above requirements, is -Lipschitz continuous in for all , then with probability over the sampling of :
where and are as in Theorem LABEL:thm:discrete.

Proof. The particular values we choose for , and come from Lemma LABEL:lem:continuous-convergence, taking . The optimality and feasibility results then follow from Theorem LABEL:thm:dataset-suboptimality.
For the bound on , notice that by Lemma LABEL:lem:covering-number, there exists a radius- external covering w.r.t. the -norm with . Substituting into the bound of Lemma LABEL:lem:continuous-generalization:
Substituting the definition of then yields the claimed result. The above optimality and feasibility guarantees are very similar to those of Theorem LABEL:thm:discrete, as is shown in Table 1 (in which the only difference is the definition of ). Algorithm 2’s generalization bound (Equation 19) is more complicated than that of Algorithm 1 (Equation 16), but Table 2 shows that the two are roughly comparable. Hence, the overall theoretical performance of Algorithm 2 is very similar to that of Algorithm 1, and, while it does rely on stronger assumptions, it neither uses discretization, nor does it require an oracle.
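The inner minimization that Algorithm 2 performs in place of an oracle call can be sketched as projected subgradient descent with iterate averaging. The step size $1/(\mu s)$ used here is the standard schedule for $\mu$-strongly convex objectives and is an assumption of this example, as are the function names and the toy objective in the usage note.

```python
def projected_subgradient_descent(subgradient, project, theta_init, mu, steps):
    """Minimize a mu-strongly convex function over a convex set.

    subgradient -- maps a parameter list to a subgradient list
    project     -- maps a parameter list to the closest point of the feasible set
    Returns the average of the first `steps` iterates (starting with theta_init).
    """
    theta = list(theta_init)
    running_sum = [0.0] * len(theta)
    for s in range(1, steps + 1):
        running_sum = [acc + value for acc, value in zip(running_sum, theta)]
        g = subgradient(theta)
        # Step size 1 / (mu * s), followed by projection back onto the feasible set.
        theta = project([value - gi / (mu * s) for value, gi in zip(theta, g)])
    return [acc / steps for acc in running_sum]
```

For example, with the hypothetical objective $(\theta - 0.5)^2$ (so $\mu = 2$), feasible set $[-1, 1]$ and starting point $0$, the first step lands exactly on the minimizer and the averaged iterate approaches $0.5$ as the number of steps grows.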
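|1||Initialize $\theta^{(1)} = 0$ // Assumes $0 \in \Theta$|
|4||Let $\lambda^{(t)} = \operatorname{fix} M^{(t)}$ // fixed point of $M^{(t)}$, i.e. a stationary distribution|
|5||Let $\check{\Delta}_\theta^{(t)}$ be a stochastic subgradient of $\hat{L}_\theta(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\theta$|
|6||Let $\Delta_\lambda^{(t)}$ be a stochastic gradient of $\hat{L}_\lambda(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\lambda$|
|8||Update $\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}\left(\eta_\lambda \Delta_\lambda^{(t)} (\lambda^{(t)})^T\right)$ // $\odot$ and $\operatorname{.exp}$ are element-wise|
|9||Project $M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} / \lVert \tilde{M}^{(t+1)}_{:,i} \rVert_1$ for $i \in [m+1]$ // Column-wise projection w.r.t. KL divergence|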
|1||Initialize , // Assumes|
|3||Let be a stochastic subgradient of w.r.t.|
|4||Let be a stochastic gradient of w.r.t.|
|5||Update // Projected SGD updates …|
|6||Update // …|
While Section 4 has demonstrated the theoretical performance of Algorithms 1 and 2, we believe that our proposed two-dataset approach is useful as a heuristic for improving constraint generalization performance, even when one is not using a theoretically-justified algorithm. For this reason, we experiment with two “practical” algorithms. The first, Algorithm 3, is a bare-bones version of Algorithm 2, in which $\theta$ and $\lambda$ are updated simultaneously using stochastic updates, instead of in an inner and outer loop. This algorithm implements our central idea (imposing constraints using an independent validation dataset) without compromising on simplicity or speed. The purpose of the second, Algorithm 4, is to explore how well our two-dataset idea can be applied to the usual Lagrangian formulation. For this algorithm, proxy-constraints and the use of two independent datasets are essentially “tacked on” to the Lagrangian. Neither of these algorithms enjoys the theoretical guarantees of Section 4, but, as we will see, both are still successful at improving constraint generalization.
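A minimal sketch of the simultaneous-update idea, in the spirit of the Lagrangian-based variant, is given below. It is not the authors' implementation: the scalar model, the quadratic objective, the single linear constraint, the plain SGD steps and all names are assumptions chosen so that the example stays self-contained. The point it illustrates is that the model update draws its example from the training set while the multiplier update draws its example from the validation set.

```python
import random


def simultaneous_two_dataset_sgd(train, val, steps, eta_theta, eta_lambda, seed=0):
    """Toy problem: minimize mean((theta - a)^2) subject to mean(b) - theta <= 0.

    train, val -- lists of (a, b) pairs; only `a` is read from train and
                  only `b` is read from val in this toy setup.
    Returns the list of theta iterates and the final multiplier.
    """
    rng = random.Random(seed)
    theta, lam = 0.0, 0.0
    iterates = []
    for _ in range(steps):
        a, _ = rng.choice(train)  # training example drives the theta-update
        _, b = rng.choice(val)    # validation example drives the lambda-update
        grad_theta = 2.0 * (theta - a) - lam  # d/dtheta of (theta-a)^2 + lam*(b-theta)
        grad_lambda = b - theta               # sampled constraint value
        # Simultaneous updates: both right-hand sides use the old theta and lam.
        theta, lam = (theta - eta_theta * grad_theta,
                      max(0.0, lam + eta_lambda * grad_lambda))
        iterates.append(theta)
    return iterates, lam
```

The returned sequence of iterates is what a stochastic classifier would be supported on; a one-dataset baseline would correspond to passing the same data as both `train` and `val`.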
We present two sets of experiments, the first on simulated data, and the second on real data. In both cases, each dataset is split into three parts: training, validation and testing. We compare our proposed two-dataset approach, in which $S^{(\mathrm{trn})}$ is the training dataset and $S^{(\mathrm{val})}$ is the validation dataset, to the natural baseline one-dataset approach of using the union of the training and validation sets to define both $S^{(\mathrm{trn})}$ and $S^{(\mathrm{val})}$. Hence, both approaches “see” the same total amount of data during training.
This difference between the data provided to the two algorithms leads to a slight complication when reporting “training” error rates and constraint violations. For the two-dataset approach, the former are reported on $S^{(\mathrm{trn})}$ (used to learn $\theta$), and the latter on $S^{(\mathrm{val})}$ (used to learn $\lambda$). For the baseline one-dataset algorithm, both quantities are reported on the full dataset (i.e. the union of the training and validation sets). “Testing” numbers are always reported on the testing dataset.
Our implementation uses TensorFlow, and is based on Cotter et al. (2018)’s open-source constrained optimization library. To avoid a hyperparameter search, we replace the stochastic gradient updates of Algorithms 3 and 4 with ADAM (Kingma and Ba, 2014), using the default parameters. For both our two-dataset algorithm and the one-dataset baseline, the result of training is a sequence of iterates $\theta^{(1)}, \dots, \theta^{(T)}$, but instead of keeping track of the full sequence, we only store a total of evenly-spaced iterates for each run. Rather than using the weighted predictor of Theorems LABEL:thm:discrete and LABEL:thm:continuous, we use the “shrinking” procedure of Cotter et al. (2018) (see Appendix B) to find the best stochastic classifier supported on the sequence of iterates.
In all of our experiments, the objective and proxy constraint functions are hinge upper bounds on the quantities of interest, while the original constraint functions are precisely what we claim to constrain (in these experiments, proportions, represented as linear combinations of indicator functions).
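The relationship between a proportion and its hinge upper bound can be checked numerically. The sketch below uses a positive prediction rate as the quantity of interest; the specific hinge form $\max(0, 1 + s)$ and the function names are illustrative assumptions, not necessarily the exact losses used in the experiments.

```python
def positive_rate(scores):
    """Original (non-differentiable) quantity: fraction of examples with score > 0,
    i.e. an average of indicator functions."""
    return sum(1.0 for s in scores if s > 0) / len(scores)


def hinge_positive_rate(scores):
    """Hinge proxy: average of max(0, 1 + score), which upper-bounds the
    indicator of score > 0 for every example."""
    return sum(max(0.0, 1.0 + s) for s in scores) / len(scores)


if __name__ == "__main__":
    scores = [-2.0, -0.5, 0.5, 2.0]
    print(positive_rate(scores), hinge_positive_rate(scores))
```

Because the proxy is at least as large as the original quantity on every example, driving the proxy constraint below a threshold on a dataset also drives the original constraint below that threshold on the same dataset.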
5.1 Simulated-data Experiments
[Figure panels: 0/1 error and constraint violation, shown per model; surviving panel titles are “5 Hidden Units”, “10 Hidden Units” and “100 Hidden Units”.]
Our first experiment uses a simulated binary classification problem designed to be especially prone to overfitting. To generate the dataset, we first draw points from two overlapping Gaussians in , and another points from the same distribution. For each , we let the classification label indicate which of the two Gaussians was drawn from, and generate a feature vector such that the th feature satisfies . Our results are averaged over ten runs, with different random splits of the data into equally-sized training, validation and testing datasets.
The classification task is to learn a classifier on that determines which of the two Gaussian distributions generated the example, with the model’s recall constrained to be at least. The parameter partly controls the amount of overfitting: as , a linear classifier on approaches a 1-nearest neighbor classifier over , which one would expect to overfit badly.
We trained four sets of models using Algorithm 3: linear, and one-hidden-layer neural networks with 5, 10 and 100 hidden ReLU units. We also varied between and . Figures 1 and 2 show that our approach consistently comes closer to satisfying the constraints on the testing set, but that, as one would expect, this comes at a slight cost in testing accuracy. Unsurprisingly, our approach is most advantageous for the most complex models (100-hidden unit), and less so for the simplest (linear).
5.2 Real-data Experiments
|Dataset||Model||Training examples||Testing examples||Features|
|Communities and Crime||Linear|
|Business Entity Resolution||Lattice|