# Online Learning with an Unknown Fairness Metric

We consider the problem of online learning in the linear contextual bandits setting, but in which there are also strong individual fairness constraints governed by an unknown similarity metric. These constraints demand that we select similar actions or individuals with approximately equal probability (arXiv:1104.3913), which may be at odds with optimizing reward, thus modeling settings where profit and social policy are in tension. We assume we learn about an unknown Mahalanobis similarity metric from only weak feedback that identifies fairness violations, but does not quantify their extent. This is intended to represent the interventions of a regulator who "knows unfairness when he sees it" but nevertheless cannot enunciate a quantitative fairness metric over individuals. Our main result is an algorithm in the adversarial context setting that has a number of fairness violations that depends only logarithmically on T, while obtaining an optimal O(√(T)) regret bound to the best fair policy.


## 1 Introduction

The last several years have seen an explosion of work studying the problem of fairness in machine learning. Yet there remains little agreement about what “fairness” should mean in different contexts. In broad strokes, the literature can be divided into two families of fairness definitions: those aiming at group fairness, and those aiming at individual fairness.

Group fairness definitions are aggregate in nature: they partition individuals into some collection of protected groups (say by race or gender), specify some statistic of interest (say, positive classification rate or false positive rate), and then require that a learning algorithm equalize this quantity across the protected groups. On the other hand, individual fairness definitions ask for some constraint that binds on the individual level, rather than only over averages of people. Often, these constraints have the semantics that “similar people should be treated similarly” (Dwork et al. (2012)).

Individual fairness definitions have substantially stronger semantics and demands than group definitions of fairness. For example, Dwork et al. (2012) lay out a compendium of ways in which group fairness definitions are unsatisfying. Yet despite these weaknesses, group fairness definitions are by far the most prevalent in the literature (see e.g. Kamiran and Calders (2012); Hajian and Domingo-Ferrer (2013); Kleinberg et al. (2017); Hardt et al. (2016); Friedler et al. (2016); Zafar et al. (2017); Chouldechova (2017) and Berk et al. (2017) for a survey). This is in large part because notions of individual fairness require making stronger assumptions on the setting under consideration. In particular, the definition from Dwork et al. (2012) requires that the algorithm designer know a “task-specific fairness metric.”

Learning problems over individuals are also often implicitly accompanied by some notion of merit, embedded in the objective function of the learning problem. For example, in a lending setting we might posit that each loan applicant is either “creditworthy” and will repay a loan, or is not creditworthy and will default — which is what we are trying to predict. Joseph et al. (2016a) take the approach that this measure of merit — already present in the model, although initially unknown to the learner — can be taken to be the similarity metric in the definition of Dwork et al. (2012), requiring informally that creditworthy individuals have at least the same probability of being accepted for loans as defaulting individuals. (The implicit and coarse fairness metric here assigns distance zero between pairs of creditworthy individuals and pairs of defaulting individuals, and some non-zero distance between a creditworthy and a defaulting individual.) This resolves the problem of how one should discover the “fairness metric”, but results in a notion of fairness that is necessarily aligned with the notion of “merit” (creditworthiness) that we are trying to predict.

However, there are many settings in which the notion of merit we wish to predict may be different or even at odds with the notion of fairness we would like to enforce. For example, notions of fairness aimed at rectifying societal inequities that result from historical discrimination can aim to favor the disadvantaged population (say, in college admissions), even if the performance of the admitted members of that population can be expected to be lower than that of the advantaged population. Similarly, we might desire a fairness metric incorporating only those attributes that individuals can change in principle (and thus excluding ones like race, age and gender), and that further expresses what are and are not meaningful differences between individuals, outside the context of any particular prediction problem. These kinds of fairness desiderata can still be expressed as an instantiation of the definition from Dwork et al. (2012), but with a task-specific fairness metric separate from the notion of merit we are trying to predict.

In this paper, we revisit the individual fairness definition from Dwork et al. (2012). This definition requires that pairs of individuals who are close in the fairness metric must be treated “similarly” (e.g. in an allocation problem such as lending, served with similar probability). We investigate the extent to which it is possible to satisfy this fairness constraint while simultaneously solving an online learning problem, when the underlying fairness metric is Mahalanobis but not known to the learning algorithm, and may also be in tension with the learning problem. One conceptual problem with metric-based definitions, that we seek to address, is that it may be difficult for anyone to actually precisely express a quantitative metric over individuals — but they nevertheless might “know unfairness when they see it.” We therefore assume that the algorithm has access to an oracle that knows intuitively what it means to be fair, but cannot explicitly enunciate the fairness metric. Instead, given observed actions, the oracle can specify whether they were fair or not, and the goal is to obtain low regret in the online learning problem — measured with respect to the best fair policy — while also limiting violations of individual fairness during the learning process.

### 1.1 Our Results and Techniques

We study the standard linear contextual bandit setting. In rounds $t = 1, \ldots, T$, a learner observes arbitrary and possibly adversarially selected $d$-dimensional contexts, each corresponding to one of $k$ actions. The reward for each action is (in expectation) an unknown linear function of the contexts. The learner seeks to minimize its regret.

The learner also wishes to satisfy fairness constraints, defined with respect to an unknown distance function $d$ defined over contexts. The constraint requires that the difference between the probabilities that any two actions are taken is bounded by the distance between their contexts. The learner has no initial knowledge of the distance function. Instead, after the learner makes its decisions according to some probability distribution $\pi^t$ at round $t$, it receives feedback specifying for which pairs of contexts the fairness constraint was violated. Our goal in designing a learner is to simultaneously guarantee near-optimal regret in the contextual bandit problem (with respect to the best fair policy), while violating the fairness constraints as infrequently as possible. Our main result is a computationally efficient algorithm that guarantees this for a large class of distance functions known as Mahalanobis distances (these can be expressed as $d(x_1, x_2) = \|Ax_1 - Ax_2\|_2$ for some matrix $A$).

Theorem (Informal): There is a computationally efficient learning algorithm $L$ in our setting that guarantees that for any Mahalanobis distance, any time horizon $T$, and any error tolerance $\epsilon$:

1. (Learning) With high probability, obtains regret to the best fair policy (See Theorem 3 for a precise statement.)

2. (Fairness) With probability , violates the unknown fairness constraints by more than on at most many rounds. (Theorem 4.)

We note that the quoted regret bound requires setting , and so this implies a number of fairness violations of magnitude more than that is bounded by a function growing logarithmically in . Other tradeoffs between regret and fairness violations are possible.

These two goals, obtaining low regret and violating the unknown constraint only a small number of times, are seemingly in tension. A standard technique for obtaining a mistake bound with respect to fairness violations would be to play a “halving algorithm”, which would always act as if the unknown metric is at the center of the current version space (the set of metrics consistent with the feedback observed thus far), so that mistakes necessarily remove a non-trivial fraction of the version space, making progress. On the other hand, a standard technique for obtaining a diminishing regret bound is to play “optimistically”, i.e. to act as if the unknown metric is the point in the version space that would allow for the largest possible reward. But “optimistic” points are necessarily at the boundary of the version space, and when they are falsified, the corresponding mistakes do not necessarily reduce the version space by a constant fraction.

We prove our theorem in two steps. First, in Section 3, we consider the simpler problem in which the linear objective of the contextual bandit problem is known, and the distance function is all that is unknown. In this simpler case, we show how to obtain a bound on the number of fairness violations using a linear-programming based reduction to a recent algorithm which has a mistake bound for learning a linear function with a particularly weak form of feedback (Lobel et al. (2017)). A complication is that our algorithm does not receive all of the feedback that the algorithm of Lobel et al. (2017) expects. We need to use the structure of our linear program to argue that this does not break the mistake bound. Then, in Section 4, we give our algorithm for the complete problem, using large portions of the machinery we develop in Section 3.

We note that in a non-adversarial setting, in which contexts are drawn from a distribution, the algorithm of Lobel et al. (2017) could be more simply applied along with standard techniques for contextual bandit learning to give an explore-then-exploit style algorithm. This algorithm would obtain bounded (but suboptimal) regret, and a number of fairness violations that grows as a root of $T$. The principal advantages of our approach are that we are able to give a number of fairness violations that has only logarithmic dependence on $T$, while tolerating contexts that are chosen adversarially, all while obtaining an optimal regret bound to the best fair policy.

There are several papers in the algorithmic fairness literature that are thematically related to ours, in that they both aim to bridge the gap between group notions of fairness (which can be semantically unsatisfying) and individual notions of fairness (which require very strong assumptions). Zemel et al. (2013) attempt to automatically learn a representation for the data in a batch learning problem (and hence, implicitly, a similarity metric) that causes a classifier to label an equal proportion of two protected groups as positive. They provide a heuristic approach and an experimental evaluation. Two recent papers (Kearns et al. (2017) and Hébert-Johnson et al. (2017)) take the approach of asking for a group notion of fairness, but over exponentially many implicitly defined protected groups, thus mitigating what Kearns et al. (2017) call the “fairness gerrymandering” problem, which is one of the principal weaknesses of group fairness definitions. Both papers give polynomial time reductions which yield efficient algorithms whenever a corresponding agnostic learning problem is solvable. In contrast, in this paper, we take a different approach: we attempt to directly satisfy the original definition of individual fairness from Dwork et al. (2012), but with substantially less information about the underlying similarity metric.

Starting with Joseph et al. (2016a), several papers have studied notions of fairness in classic and contextual bandit problems. Joseph et al. (2016a) study a notion of “meritocratic” fairness in the contextual bandit setting, and prove upper and lower bounds on the regret achievable by algorithms that must be “fair” at every round. This can be viewed as a variant of the Dwork et al. (2012) notion of fairness, in which the expected reward of each action is used to define the “fairness metric”. The algorithm does not originally know this metric, but must discover it through experimentation. Joseph et al. (2016b) extend the work of Joseph et al. (2016a) to the setting in which the algorithm is faced with a continuum of options at each time step, and give improved bounds for the linear contextual bandit case. Jabbari et al. (2017) extend this line of work to the reinforcement learning setting in which the actions of the algorithm can impact its environment. Finally, Liu et al. (2017) consider a notion of fairness based on calibration in the simple stochastic bandit setting.

There is a large literature that focuses on learning Mahalanobis distances — see Kulis et al. (2013) for a survey. In this literature, the closest paper to our work focuses on online learning of Mahalanobis distances (Jain et al. (2009)). However, this result is in a very different setting from the one we consider here. In Jain et al. (2009), the algorithm is repeatedly given pairs of points, and needs to predict their distance. It then learns their true distance, and aims to minimize its squared loss. In contrast, in our paper, the main objective of the learning algorithm is orthogonal to the metric learning problem — i.e. to minimize regret in the linear contextual bandit problem, but while simultaneously learning and obeying a fairness constraint, and only from weak feedback noting violations of fairness.

## 2 Model and Preliminaries

### 2.1 Linear Contextual Bandits

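We study algorithms that operate in the linear contextual bandits setting. A linear contextual bandit problem is parameterized by an unknown vector of linear coefficients $\theta \in \mathbb{R}^d$, with $\|\theta\|_2 \le 1$. Algorithms in this setting operate in rounds $t = 1, \ldots, T$. In each round $t$, an algorithm $L$ observes $k$ contexts $x^t_1, \ldots, x^t_k \in \mathbb{R}^d$, scaled such that $\|x^t_i\|_2 \le 1$. We write $x^t = (x^t_1, \ldots, x^t_k)$ to denote the entire set of contexts observed at round $t$. After observing the contexts, the algorithm chooses an action $i_t$. After choosing an action, the algorithm obtains some stochastic reward $r^t_{i_t}$ such that $r^t_{i_t}$ is sub-gaussian and $\mathbb{E}[r^t_{i_t}] = \langle x^t_{i_t}, \theta \rangle$. (Footnote 1: a random variable $X$ with $\mu = \mathbb{E}[X]$ is sub-gaussian if, for all $s \in \mathbb{R}$, $\mathbb{E}[e^{s(X - \mu)}] \le e^{s^2/2}$.) The algorithm does not observe the reward for the actions not chosen. When the action $i_t$ is clear from context, we write $r^t$ instead of $r^t_{i_t}$.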

In this paper, we will be discussing algorithms that are necessarily randomized. To formalize this, we denote a history including everything observed by the algorithm up through but not including round $t$ as $h^t$. The space of such histories is denoted by $\mathcal{H}^t$. An algorithm $L$ is defined by a sequence of functions $f^1, \ldots, f^T$, each mapping histories and observed contexts to probability distributions over actions:

$$f^t : \mathcal{H}^t \times \mathbb{R}^{d \times k} \to \Delta[k].$$

We write $\pi^t$ to denote the probability distribution over actions that $L$ plays at round $t$: $\pi^t = f^t(h^t, x^t)$. We view $\pi^t$ as a vector over $[0,1]^k$, and so $\pi^t_i$ denotes the probability that $L$ plays action $i$ at round $t$. We denote the expected reward of the algorithm at day $t$ as $\mathbb{E}_{i \sim \pi^t}[\langle x^t_i, \theta \rangle]$. It will sometimes also be useful to refer to the vector of expected rewards across all actions on day $t$. We denote it as

$$\bar{r}^t = (\langle x^t_1, \theta \rangle, \ldots, \langle x^t_k, \theta \rangle).$$

Note that this vector is of course unknown to the algorithm.

### 2.2 Fairness Constraints and Feedback

We study algorithms that are constrained to behave fairly in some manner. We adapt the definition of fairness from Dwork et al. (2012) that asserts, informally, that “similar individuals should be treated similarly”. We imagine that the decisions that our contextual bandit algorithm makes correspond to individuals, and that the contexts correspond to features pertaining to individuals. We adopt the following (specialization of) the fairness definition from Dwork et al. (2012), which is parameterized by a distance function $d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.

###### Definition 1 (Dwork et al. (2012)).

Algorithm $L$ is Lipschitz-fair on round $t$ with respect to distance function $d$ if for all pairs of individuals $i, j$:

$$|\pi^t_i - \pi^t_j| \le d(x^t_i, x^t_j).$$

For brevity, we will often just say that the algorithm is fair at round $t$, with the understanding that we are always talking about this one particular kind of fairness.

One of the main difficulties in working with Lipschitz fairness (as discussed in Dwork et al. (2012)) is that the distance function plays a central role, but it is not clear how it should be specified. In this paper, we concern ourselves with learning from feedback. In particular, algorithms will have access to a fairness oracle.

Informally, the fairness oracle will take as input: 1) the set of choices available to $L$ at each round $t$, and 2) the probability distribution $\pi^t$ that $L$ uses to make its choices at round $t$, and returns the set of all pairs of individuals for which $L$ violates the fairness constraint.

###### Definition 2 (Fairness Oracle).

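Given a distance function $d$, a fairness oracle $O_d$ is a function $O_d : \mathbb{R}^{d \times k} \times \Delta[k] \to 2^{[k] \times [k]}$ defined such that:

$$O_d(x^t, \pi^t) = \{(i, j) : |\pi^t_i - \pi^t_j| > d(x^t_i, x^t_j)\}$$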
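To make the oracle concrete, here is a minimal Python sketch under stated assumptions: the function names `mahalanobis` and `fairness_oracle`, the example matrix `A`, and the example contexts are all hypothetical and chosen for illustration. Since the violation condition is symmetric in $i$ and $j$, the sketch reports each unordered pair once, with $i < j$.

```python
import math

def mahalanobis(A, x1, x2):
    """Distance ||A x1 - A x2||_2, with A given as a list of rows."""
    diff = [p - q for p, q in zip(x1, x2)]
    projected = [sum(row[c] * diff[c] for c in range(len(diff))) for row in A]
    return math.sqrt(sum(v * v for v in projected))

def fairness_oracle(contexts, pi, dist):
    """Return every pair (i, j), i < j, with |pi_i - pi_j| > dist(x_i, x_j)."""
    k = len(contexts)
    return {
        (i, j)
        for i in range(k)
        for j in range(i + 1, k)
        if abs(pi[i] - pi[j]) > dist(contexts[i], contexts[j])
    }

# Hypothetical instance: a rank-deficient A that only looks at the first feature.
A = [[1.0, 0.0], [0.0, 0.0]]
contexts = [[0.0, 0.0], [0.1, 0.9], [0.8, 0.0]]
pi = [0.5, 0.2, 0.3]
violations = fairness_oracle(contexts, pi, lambda u, v: mahalanobis(A, u, v))
```

In this instance only the first two contexts are close (distance 0.1) while their selection probabilities differ by 0.3, so that pair is the only one reported. Note that the oracle reveals which pairs are violated but not by how much.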

Formally, algorithms in our setting will operate in the following environment:

###### Definition 3.
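1. An adversary fixes a linear reward function $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le 1$ and a distance function $d$. $L$ is given access to the fairness oracle $O_d$.

2. In rounds $t = 1$ to $T$:

   1. The adversary chooses contexts $x^t \in \mathbb{R}^{d \times k}$ with $\|x^t_i\|_2 \le 1$ and gives them to $L$.

   2. $L$ chooses a probability distribution $\pi^t$ over actions, and chooses action $i_t \sim \pi^t$.

   3. $L$ receives reward $r^t_{i_t}$ and observes feedback $O_d(x^t, \pi^t)$ from the fairness oracle.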

Because of the power of the adversary in this setting, we cannot expect algorithms that can avoid arbitrarily small violations of the fairness constraint. Instead, we will aim to limit significant violations.

###### Definition 4.

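Algorithm $L$ is $\epsilon$-unfair on pair $(i, j)$ at round $t$ with respect to distance function $d$ if

$$|\pi^t_i - \pi^t_j| > d(x^t_i, x^t_j) + \epsilon.$$

Given a sequence of contexts and a history $h^t$ (which fixes the distribution on actions at day $t$), we write

$$\mathrm{Unfair}(L, \epsilon, h^t) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathbb{1}\left(|\pi^t_i - \pi^t_j| > d(x^t_i, x^t_j) + \epsilon\right)$$

to denote the number of pairs on which $L$ is $\epsilon$-unfair at round $t$.

Given a distance function $d$ and a history $h^{T+1}$, the $\epsilon$-fairness loss of an algorithm $L$ is the total number of pairs on which it is $\epsilon$-unfair:

$$\mathrm{FairnessLoss}(L, h^{T+1}, \epsilon) = \sum_{t=1}^{T} \mathrm{Unfair}(L, \epsilon, h^t)$$

As shorthand, we will write $\mathrm{FairnessLoss}(L, T, \epsilon)$.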
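The two counting definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names `unfair` and `fairness_loss`, the precomputed distance table `dists`, and the example distributions are assumptions, not part of the original text.

```python
def unfair(pi, dists, eps):
    """Number of pairs i < j with |pi_i - pi_j| > dists[i][j] + eps."""
    k = len(pi)
    return sum(
        1
        for i in range(k - 1)
        for j in range(i + 1, k)
        if abs(pi[i] - pi[j]) > dists[i][j] + eps
    )

def fairness_loss(pis, dists_per_round, eps):
    """Sum of per-round unfair pair counts over all rounds."""
    return sum(unfair(pi, d, eps) for pi, d in zip(pis, dists_per_round))

# Hypothetical two-round example with the same pairwise distances in both rounds.
dists = [[0.0, 0.1, 0.8],
         [0.1, 0.0, 0.7],
         [0.8, 0.7, 0.0]]
pis = [[0.5, 0.2, 0.3], [0.4, 0.3, 0.3]]
loss = fairness_loss(pis, [dists, dists], eps=0.05)
```

With tolerance 0.05, only the first round's pair (0, 1) counts, since its probability gap of 0.3 exceeds 0.1 + 0.05; raising the tolerance to 0.25 removes it, which shows how $\epsilon$ filters out small violations.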

We will aim to design algorithms that guarantee that their fairness loss is bounded with probability 1 in the worst case over the instance: i.e. in the worst case over both $\theta$ and $x^1, \ldots, x^T$, and in the worst case over the distance function $d$ (within some allowable class of distance functions; see Section 2.4).

### 2.3 Regret to the Best Fair Policy

In addition to minimizing fairness loss, we wish to design algorithms that exhibit diminishing regret to the best fair policy. We first define a linear program that we will make use of throughout the paper. Given a vector $a \in \mathbb{R}^k$ and a vector $c \in \mathbb{R}^{k^2}$, we denote by $\mathrm{LP}(a, c)$ the following linear program:

$$\begin{aligned}
\underset{\pi = \{p_1, \ldots, p_k\}}{\text{maximize}} \quad & \sum_{i=1}^{k} p_i a_i \\
\text{subject to} \quad & |p_i - p_j| \le c_{i,j}, \quad \forall (i, j) \\
& \sum_{i=1}^{k} p_i \le 1
\end{aligned}$$

We write $\pi(a, c)$ to denote an optimal solution to $\mathrm{LP}(a, c)$. Given a set of contexts $x^t$, recall that $\bar{r}^t$ is the vector representing the expected reward corresponding to each context (according to the true, unknown linear reward function $\theta$). Similarly, we write $\bar{d}^t$ to denote the vector representing the set of distances between each pair of contexts (according to the true, unknown distance function $d$): $\bar{d}^t_{i,j} = d(x^t_i, x^t_j)$.

Observe that $\pi(\bar{r}^t, \bar{d}^t)$ corresponds to the distribution over actions that maximizes expected reward at round $t$, subject to satisfying the fairness constraints, i.e. the distribution that an optimal player with advance knowledge of $\theta$ and $d$ would play, if he were not allowed to violate the fairness constraints at all. This is the benchmark with respect to which we define regret:

###### Definition 5.

Given an algorithm $L$ (defined by $f^1, \ldots, f^T$), a distance function $d$, a linear parameter vector $\theta$, and a history $h^{T+1}$ (which includes a set of contexts $x^1, \ldots, x^T$), its regret is defined to be:

$$\mathrm{Regret}(L, \theta, d, h^{T+1}) = \sum_{t=1}^{T} \mathbb{E}_{i \sim \pi(\bar{r}^t, \bar{d}^t)}\left[\bar{r}^t_i\right] - \sum_{t=1}^{T} \mathbb{E}_{i \sim f^t(h^t, x^t)}\left[\bar{r}^t_i\right]$$

As shorthand, we will write $\mathrm{Regret}(L, T)$.

Our goal will be to design algorithms for which we can bound regret with high probability over the randomness of $h^{T+1}$, in the worst case over $\theta$, $d$, and $(x^1, \ldots, x^T)$. (Footnote 2: We assume that $h^{T+1}$ is generated by algorithm $L$, meaning randomness only comes from the stochastic reward and the way in which each arm is selected according to the probability distribution calculated by the algorithm. We make no distributional assumption over $x^1, \ldots, x^T$.)

### 2.4 Mahalanobis Distance

In this paper, we will restrict our attention to a special family of distance functions which are parameterized by a matrix $A$:

###### Definition 6 (Mahalanobis distances).

A function $d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a Mahalanobis distance function if there exists a matrix $A$ such that for all $x_1, x_2 \in \mathbb{R}^d$:

$$d(x_1, x_2) = \|Ax_1 - Ax_2\|_2$$

where $\|\cdot\|_2$ denotes Euclidean distance. Note that if $A$ is not full rank, then this does not define a metric, but we will allow this case (and be able to handle it in our algorithmic results).

Mahalanobis distances will be convenient for us to work with, because squared Mahalanobis distances can be expressed as follows:

$$\begin{aligned}
d(x_1, x_2)^2 &= \|Ax_1 - Ax_2\|_2^2 \\
&= \langle A(x_1 - x_2), A(x_1 - x_2) \rangle \\
&= (x_1 - x_2)^\top A^\top A (x_1 - x_2) \\
&= \sum_{i,j=1}^{d} G_{i,j} (x_1 - x_2)_i (x_1 - x_2)_j
\end{aligned}$$

where $G = A^\top A$. Observe that when $x_1$ and $x_2$ are fixed, this is a linear function in the entries of the matrix $G$. We will use this property to reason about learning $G$, and thereby learning $d$.
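The linearity claim can be checked numerically. The following Python sketch computes a squared Mahalanobis distance in two ways: directly from the matrix, and as an inner product between the flattened Gram matrix $A^\top A$ and the flattened outer product of the context difference with itself. The helper names and the example matrix are hypothetical and used only for illustration.

```python
def flatten(M):
    """Concatenate the rows of a matrix into one list."""
    return [entry for row in M for entry in row]

def gram(A):
    """Compute A^T A for A given as a list of rows."""
    n = len(A[0])
    return [[sum(row[a] * row[b] for row in A) for b in range(n)] for a in range(n)]

def squared_distance_direct(A, x1, x2):
    """||A x1 - A x2||_2^2 computed from A itself."""
    diff = [p - q for p, q in zip(x1, x2)]
    projected = [sum(row[c] * diff[c] for c in range(len(diff))) for row in A]
    return sum(v * v for v in projected)

def squared_distance_linear(G, x1, x2):
    """The same quantity as an inner product <flatten(G), flatten(diff diff^T)>."""
    diff = [p - q for p, q in zip(x1, x2)]
    outer = [[da * db for db in diff] for da in diff]
    return sum(g * u for g, u in zip(flatten(G), flatten(outer)))

# Hypothetical example values.
A = [[2.0, 1.0], [0.0, 3.0]]
x1, x2 = [1.0, 0.5], [0.0, 1.0]
direct = squared_distance_direct(A, x1, x2)
linear = squared_distance_linear(gram(A), x1, x2)
```

Both computations agree, which is what allows an estimator for linear functions to be applied to squared distances: the unknown is the vector of Gram-matrix entries, and the features are the entries of the outer product.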

## 3 Warmup: The Known Objective Case

In this section, we consider an easier case of the problem in which the linear objective function $\theta$ is known to the algorithm, and the distance function $d$ is all that is unknown. In this case, we show via a reduction to an online learning algorithm of Lobel et al. (2017), how to simultaneously obtain a logarithmic regret bound and a logarithmic (in $T$) number of fairness violations. The analysis we do here will be useful when we solve the full version of our problem (in which $\theta$ is unknown) in Section 4.

### 3.1 Outline of the Solution

Recall that since we know $\theta$, at every round $t$ after seeing the contexts, we know the vector of expected rewards $\bar{r}^t$ that we would obtain for selecting each action. Our algorithm will play at each round $t$ the distribution $\pi(\bar{r}^t, \hat{d}^t)$ that results from solving the linear program $\mathrm{LP}(\bar{r}^t, \hat{d}^t)$, where $\hat{d}^t$ is a “guess” for the pairwise distances between each context $\bar{d}^t$. (Recall that the optimal distribution to play at each round is $\pi(\bar{r}^t, \bar{d}^t)$.)

The main engine of our reduction is an efficient online learning algorithm for linear functions recently given by Lobel et al. (2017) which is further described in Section 3.2. Their algorithm, which we refer to as DistanceEstimator, works in the following setting. There is an unknown vector of linear parameters $\alpha \in \mathbb{R}^m$. In rounds $t$, the algorithm observes a vector of features $u^t \in \mathbb{R}^m$, and produces a prediction $g^t$ for the value $\langle \alpha, u^t \rangle$. After it makes its prediction, the algorithm learns whether its guess was too large or not, but does not learn anything else about the value of $\langle \alpha, u^t \rangle$. The guarantee of the algorithm is that the number of rounds in which its prediction is off by more than $\epsilon$ is bounded by $O(m \log(m/\epsilon))$. (Footnote 3: If the algorithm also learned whether or not its guess was in error by more than $\epsilon$ at each round, variants of the classical halving algorithm could obtain this guarantee. But the algorithm does not receive this feedback, which is why the more sophisticated algorithm of Lobel et al. (2017) is needed.)

Our strategy will be to instantiate one copy of this distance estimator for each pair of actions, to produce guesses $(\hat{d}^t_{i,j})^2$ intended to approximate the squared pairwise distances $d(x^t_i, x^t_j)^2$. From this we derive estimates $\hat{d}^t_{i,j}$ of the pairwise distances $d(x^t_i, x^t_j)$. Note that this is a linear estimation problem for any Mahalanobis distance, because by our observation in Section 2.4, a squared Mahalanobis distance can be written as a linear function of the unknown entries of the matrix $G = A^\top A$ which defines the Mahalanobis distance.

The complication is that the DistanceEstimator algorithms expect feedback at every round, which we cannot always provide. This is because the fairness oracle provides feedback about the distribution $\pi(\bar{r}^t, \hat{d}^t)$ used by the algorithm, not directly about the guesses $\hat{d}^t$. These are not the same, because not all of the constraints in the linear program are necessarily tight: it may be that $|\pi^t_i - \pi^t_j| < \hat{d}^t_{i,j}$. For any copy of DistanceEstimator that does not receive feedback, we can simply “roll back” its state and continue to the next round. But we need to argue that we make progress: whenever we are $\epsilon$-unfair, or whenever we experience large per-round regret, there is at least one copy of DistanceEstimator that we can give feedback to such that the corresponding copy has made a large prediction error, and we can thus charge either our fairness loss or our regret to the mistake bound of that copy.

As we show, there are three relevant cases.

1. In any round in which we are $\epsilon$-unfair for some pair of contexts $x^t_i$ and $x^t_j$, then it must be that $\hat{d}^t_{i,j} > \bar{d}^t_{i,j} + \epsilon$, and so we can always update the $(i,j)$th copy of DistanceEstimator and charge our fairness loss to its mistake bound. We formalize this in Lemma 1.

2. For any pair of contexts $x^t_i, x^t_j$ such that we have not violated the fairness constraint, and the $(i,j)$th constraint in the linear program is tight, we can provide feedback to the $(i,j)$th copy of DistanceEstimator (its guess was not too large). There are two cases. Although the algorithm never knows which case it is in, we handle each case separately in the analysis.

   1. For every constraint $(i,j)$ in $\mathrm{LP}(\bar{r}^t, \hat{d}^t)$ that is tight in the optimal solution, $|\hat{d}^t_{i,j} - \bar{d}^t_{i,j}| \le \epsilon$. In this case, we show that our algorithm does not incur very much per round regret. We formalize this in Lemma 4.

   2. Otherwise, there is a tight constraint $(i,j)$ such that $|\hat{d}^t_{i,j} - \bar{d}^t_{i,j}| > \epsilon$. In this case, we may incur high per-round regret, but we can charge such rounds to the mistake bound of the $(i,j)$th copy of DistanceEstimator using Lemma 1.

### 3.2 The Distance Estimator

First, we fix some notation for the DistanceEstimator algorithm. We instantiate a copy of DistanceEstimator with an accuracy parameter $\epsilon$, which gives a mistake bound for $\epsilon$-misestimations. The mistake bound we state for DistanceEstimator is predicated on the assumption that the norm of the unknown linear parameter vector $\alpha$ is bounded by $B_1$, and the norms of the arriving vectors $u^t$ are bounded by $B_2$. Given an instantiation of DistanceEstimator and a new vector $u^t$ for which we would like a prediction, we write $g^t$ for its guess of the value of $\langle \alpha, u^t \rangle$. The feedback we provide to DistanceEstimator is of two kinds: if $g^t > \langle \alpha, u^t \rangle$ and we provide feedback, the feedback states that the guess was too large. Otherwise, if $g^t \le \langle \alpha, u^t \rangle$ and we give feedback, the feedback states that the guess was not too large. In some rounds, we may be unable to provide the feedback that DistanceEstimator is expecting: in these rounds, we simply “roll back” its internal state. We can do this because the mistake bound for DistanceEstimator holds for every sequence of arriving vectors $u^t$. If we give feedback to DistanceEstimator in a given round $t$, we write $v^t = 1$, and $v^t = 0$ otherwise.

###### Definition 7.

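Given an accuracy parameter $\epsilon$, a linear parameter vector $\alpha$, a sequence of vectors $u^1, \ldots, u^T$, a sequence of guesses $g^1, \ldots, g^T$ and a sequence of feedback indicators $v^1, \ldots, v^T$, the number of valid $\epsilon$-mistakes made by DistanceEstimator is:

$$\mathrm{Mistakes}(\epsilon) = \sum_{t=1}^{T} \mathbb{1}\left(v^t = 1 \wedge |g^t - \langle u^t, \alpha \rangle| > \epsilon\right)$$

In words, it is the number of $\epsilon$-mistakes made by DistanceEstimator in rounds for which we provided the algorithm feedback.

We now state a version of the main theorem from Lobel et al. (2017), adapted to our setting. (Footnote 4: In Lobel et al. (2017), the algorithm receives feedback in every round, and the scale parameters $B_1$ and $B_2$ are normalized to be 1. But the version we state is an immediate consequence.)

###### Lemma 1 (Lobel et al. (2017)).

For any $\epsilon > 0$ and any sequence of vectors $u^1, \ldots, u^T$, DistanceEstimator makes a bounded number of valid $\epsilon$-mistakes:

$$\mathrm{Mistakes}(\epsilon) = O\left(m \log\left(\frac{m \cdot B_1 \cdot B_2}{\epsilon}\right)\right)$$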

### 3.3 The Algorithm


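For each pair of arms $(i,j)$, our algorithm instantiates a copy of DistanceEstimator, which we denote by $\mathrm{DistanceEstimator}_{i,j}$; we also subscript all variables relevant to $\mathrm{DistanceEstimator}_{i,j}$ with $i,j$ (e.g. $u^t_{i,j}$). The underlying linear parameter vector we want to learn is $\alpha = \mathrm{flatten}(G) \in \mathbb{R}^{d^2}$, where $\mathrm{flatten}$ maps a $d \times d$ matrix to a vector of size $d^2$ by concatenating its rows into a vector. Similarly, given a pair of contexts $x^t_i, x^t_j$, we will define $u^t_{i,j} = \mathrm{flatten}\left((x^t_i - x^t_j)(x^t_i - x^t_j)^\top\right)$. $\mathrm{DistanceEstimator}_{i,j}$ will output guess $g^t_{i,j}$ for the value $(\bar{d}^t_{i,j})^2$, as

$$\left\langle \mathrm{flatten}(G), \mathrm{flatten}\left((x^t_i - x^t_j)(x^t_i - x^t_j)^\top\right) \right\rangle = \sum_{a,b=1}^{d} G_{a,b} (x^t_i - x^t_j)_a (x^t_i - x^t_j)_b = (\bar{d}^t_{i,j})^2$$

We take $\hat{d}^t_{i,j} = \sqrt{g^t_{i,j}}$ as our estimate for the distance between $x^t_i$ and $x^t_j$.

The algorithm then chooses an arm to pull according to the distribution $\pi(\bar{r}^t, \hat{d}^t)$, where $\bar{r}^t_i = \langle x^t_i, \theta \rangle$. The fairness oracle $O_d$ returns all pairs of arms that violate the fairness constraints. For these pairs $(i,j)$ we provide feedback to $\mathrm{DistanceEstimator}_{i,j}$: the guess was too large. For the remaining pairs of arms $(i,j)$, there are two cases. If the $(i,j)$th constraint in $\mathrm{LP}(\bar{r}^t, \hat{d}^t)$ was not tight, then we provide no feedback ($v^t_{i,j} = 0$). Otherwise, we provide feedback: the guess was not too large. We refer to this algorithm as $L_{\text{known-}\theta}$.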
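The three-way feedback rule in the paragraph above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's pseudocode: the function name `route_feedback`, the string labels, and the numerical tolerance `tol` used to decide whether a constraint is tight are all hypothetical. The distribution `pi` is assumed to be a feasible solution of the linear program built from the guesses `d_hat`.

```python
def route_feedback(pi, d_hat, violations, tol=1e-9):
    """Decide what to tell each pairwise estimator after one round.

    Returns a dict mapping (i, j), i < j, to one of:
      "too_large"     : the oracle flagged the pair, so the guess exceeded the truth
      "not_too_large" : no violation and the LP constraint was tight
      None            : constraint was slack, so no feedback (roll the estimator back)
    """
    k = len(pi)
    feedback = {}
    for i in range(k - 1):
        for j in range(i + 1, k):
            gap = abs(pi[i] - pi[j])
            if (i, j) in violations:
                feedback[(i, j)] = "too_large"
            elif abs(gap - d_hat[i][j]) <= tol:
                feedback[(i, j)] = "not_too_large"
            else:
                feedback[(i, j)] = None
    return feedback

# Hypothetical round: pair (0, 1) is flagged, pair (1, 2) is tight, pair (0, 2) is slack.
pi = [0.6, 0.3, 0.1]
d_hat = [[0.0, 0.3, 0.9],
         [0.3, 0.0, 0.2],
         [0.9, 0.2, 0.0]]
fb = route_feedback(pi, d_hat, violations={(0, 1)})
```

The reasoning behind each branch: a flagged pair satisfies $|\pi_i - \pi_j| > d$ while feasibility gives $|\pi_i - \pi_j| \le \hat{d}_{i,j}$, so the guess was too large; an unflagged tight pair has $\hat{d}_{i,j} = |\pi_i - \pi_j| \le d$, so the guess was not too large; a slack pair reveals nothing about the guess.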

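First we derive the valid mistake bound that the $\mathrm{DistanceEstimator}_{i,j}$ algorithms incur in our parameterization.

###### Lemma 2.

For pair $(i,j)$, the total number of valid $\epsilon^2$-mistakes made by $\mathrm{DistanceEstimator}_{i,j}$ is bounded as:

$$\mathrm{Mistakes}(\epsilon^2) = O\left(d^2 \log\left(\frac{d \cdot \|A^\top A\|_F}{\epsilon}\right)\right)$$

where the distance function is defined as $d(x_1, x_2) = \|Ax_1 - Ax_2\|_2$ and $\|\cdot\|_F$ denotes the Frobenius norm.

###### Proof.

This follows directly from Lemma 1, and the observations that in our setting, $m = d^2$, $B_1 = \|A^\top A\|_F$, and

$$B_2 \le \max_t \|u^t_{i,j}\|_2 \le \max_t \|x^t_i - x^t_j\|_2^2 \le 4.$$

∎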

We next observe that since we instantiate at most $k^2$ copies of DistanceEstimator in total, Lemma 2 immediately implies the following bound on the total number of rounds in which any distance estimator that receives feedback provides us with a distance estimate that differs by more than $\epsilon$ from the correct value:

###### Corollary 1.

The number of rounds where there exists a pair $(i,j)$ such that feedback is provided ($v^t_{i,j} = 1$) and its estimate is off by more than $\epsilon$ is bounded:

$$\left|\left\{t : \exists (i,j) : v^t_{i,j} = 1 \wedge |\hat{d}^t_{i,j} - \bar{d}^t_{i,j}| > \epsilon \right\}\right| \le O\left(k^2 d^2 \log\left(\frac{d \cdot \|A^\top A\|_F}{\epsilon}\right)\right)$$

###### Proof.

This follows from summing the valid mistake bounds for each copy of DistanceEstimator, and noting that an $\epsilon$ mistake in predicting the value of $\bar{d}^t_{i,j}$ implies an $\epsilon^2$ mistake in predicting the value of $(\bar{d}^t_{i,j})^2$. ∎
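The last step of this proof can be spelled out with a short derivation. It assumes that both the estimate $\hat{d}$ and the true distance $\bar{d}$ are nonnegative (the estimate is taken as a square root of a guess, so this holds whenever the guess is nonnegative). If $|\hat{d} - \bar{d}| > \epsilon$, then:

```latex
\begin{aligned}
|\hat{d}^2 - \bar{d}^2|
  &= |\hat{d} - \bar{d}| \cdot (\hat{d} + \bar{d}) \\
  &\ge |\hat{d} - \bar{d}| \cdot |\hat{d} - \bar{d}|
     && \text{since } \hat{d} + \bar{d} \ge |\hat{d} - \bar{d}| \text{ for } \hat{d}, \bar{d} \ge 0 \\
  &> \epsilon^2 .
\end{aligned}
```

So every round in which a distance estimate is off by more than $\epsilon$ is also a round in which the corresponding squared estimate is off by more than $\epsilon^2$, which is why the estimators are run with accuracy $\epsilon^2$.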

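We now have the pieces to bound the $\epsilon$-unfairness loss of our algorithm:

###### Theorem 1.

For any sequence of contexts and any Mahalanobis distance $d(x_1, x_2) = \|Ax_1 - Ax_2\|_2$:

$$\mathrm{FairnessLoss}(L_{\text{known-}\theta}, T, \epsilon) \le O\left(k^2 d^2 \log\left(\frac{d \cdot \|A^\top A\|_F}{\epsilon}\right)\right)$$

###### Proof.

$$\begin{aligned}
\mathrm{FairnessLoss}(L_{\text{known-}\theta}, T, \epsilon) &= \sum_{t=1}^{T} \mathrm{Unfair}(L_{\text{known-}\theta}, \epsilon, h^t) \\
&\le \sum_{t=1}^{T} \sum_{i,j} \mathbb{1}\left(|\pi^t_i - \pi^t_j| > \bar{d}^t_{i,j} + \epsilon\right) \\
&= \sum_{i,j} \sum_{t=1}^{T} \mathbb{1}\left(v^t_{i,j} = 1 \wedge \hat{d}^t_{i,j} > \bar{d}^t_{i,j} + \epsilon\right) \\
&\le \sum_{i,j} \sum_{t=1}^{T} \mathbb{1}\left(v^t_{i,j} = 1 \wedge |\hat{d}^t_{i,j} - \bar{d}^t_{i,j}| > \epsilon\right) \\
&= O\left(k^2 d^2 \log\left(\frac{d \cdot \|A^\top A\|_F}{\epsilon}\right)\right) && \text{(Corollary 1)}
\end{aligned}$$

∎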

We now turn our attention to bounding the regret of the algorithm. Recall from the overview in Section 3.1, that our plan will be to divide rounds into two types. In rounds of the first type, our distance estimates corresponding to every tight constraint in the linear program have only small error. We cannot bound the number of such rounds, but we can bound the regret incurred in any such rounds. In rounds of the second type, we have at least one significant error in the distance estimate corresponding to a tight constraint. We might incur significant regret in such rounds, but we can bound the number of such rounds.

The following lemma bounds the decrease in expected per-round reward that results from under-estimating a single distance constraint in our linear programming formulation.

###### Lemma 3.

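Fix any vector of distance estimates $d$ and any vector of rewards $r$. Fix a constant $\epsilon$ and any pair of coordinates $(a, b)$. Let $d'$ be the vector such that $d'_{a,b} = d_{a,b} - \epsilon$ and $d'_{i,j} = d_{i,j}$ for $(i,j) \ne (a,b)$. Then

$$\langle r, \pi(r, d) \rangle - \langle r, \pi(r, d') \rangle \le \epsilon \sum_{i=1}^{k} r_i.$$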

###### Proof.

The plan of the proof is to start with $\pi(r, d)$ and perform surgery on it to arrive at a new probability distribution $p'$ that satisfies the constraints of $\mathrm{LP}(r, d')$, and obtains objective value at least $\langle r, \pi(r, d) \rangle - \epsilon \sum_{i=1}^{k} r_i$. Because $p'$ is feasible, it lower bounds the objective value of the optimal solution $\pi(r, d')$, which yields the lemma.

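To reduce notational clutter, for the rest of the argument we write $p$ to denote $\pi(r, d)$. Without loss of generality, we assume that $p_a \ge p_b$. If $p_a - p_b \le d_{a,b} - \epsilon$, then $p$ is still a feasible solution to $\mathrm{LP}(r, d')$, and we are done. Thus, for the rest of the argument, we can assume that $p_a - p_b > d_{a,b} - \epsilon$. We write $\Delta = (p_a - p_b) - (d_{a,b} - \epsilon)$.

We now define our modified distribution $p'$:

$$p'_i = \begin{cases} p_i - \Delta & p_a \le p_i \\ p_a - \Delta & p_a - \Delta \le p_i < p_a \\ p_i & \text{otherwise} \end{cases}$$

We'll partition the coordinates of $p$ according to which of the three cases they fall into in our definition of $p'$ above: $S_1 = \{i : p_a \le p_i\}$, $S_2 = \{i : p_a - \Delta \le p_i < p_a\}$, and $S_3 = \{i : p_i < p_a - \Delta\}$. It remains to verify that $p'$ is a feasible solution to $\mathrm{LP}(r, d')$, and that it obtains the claimed objective value.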

#### Feasibility:

First, observe that . This follows because is coordinate-wise smaller than , and by assumption, was feasible. Thus, .

Next, observe that by construction, for all . To see this, first observe that where the last inequality follows because . We then consider the three cases:

1. For , because .

2. For , .

3. For , .

Finally, we verify that for all , . First, observe that , and so the inequality is satisfied for index pair . For all the other pairs , we have , so it is enough to show that . Note that for all with , if and , we have that . Therefore, it is sufficient to verify the following six cases:

1. :

2. :

3. :

4. :

5. :

6. :

Thus, we have shown that is a feasible solution to .

#### Objective Value:

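Note that for each index $i$, $p_i - p'_i \le \Delta \le \epsilon$. Therefore we have:

$$\begin{aligned}
\langle r, \pi(r, d) \rangle - \langle r, \pi(r, d') \rangle &\le \langle r, \pi(r, d) \rangle - \langle r, p' \rangle \\
&= \langle r, p - p' \rangle \\
&\le \epsilon \sum_{i=1}^{k} r_i
\end{aligned}$$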

which completes the proof. ∎
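The surgery used in this proof can be illustrated with a small Python sketch. It rests on an assumption that should be read as a reconstruction, not as the paper's exact definition: the shift amount is taken to be $\Delta = (p_a - p_b) - (d_{a,b} - \epsilon)$, the smallest shift that brings the gap between coordinates $a$ and $b$ down to $d_{a,b} - \epsilon$. The function name `shift_distribution` and the example numbers are hypothetical.

```python
def shift_distribution(p, a, b, d_ab, eps):
    """Build p' from p so that p'_a - p'_b <= d_ab - eps.

    Assumes p_a >= p_b and p_a - p_b <= d_ab (p is feasible for the
    original constraint). Delta below is an assumed definition.
    """
    delta = (p[a] - p[b]) - (d_ab - eps)
    if delta <= 0:
        return list(p)  # p already satisfies the tightened constraint
    shifted = []
    for value in p:
        if value >= p[a]:
            shifted.append(value - delta)   # at or above p_a: move down by delta
        elif value >= p[a] - delta:
            shifted.append(p[a] - delta)    # just below p_a: clamp to the new p_a
        else:
            shifted.append(value)           # far below p_a: leave unchanged
    return shifted

# Hypothetical example: tighten the (0, 1) constraint from 0.4 to 0.2.
p = [0.5, 0.1, 0.35, 0.05]
q = shift_distribution(p, a=0, b=1, d_ab=0.4, eps=0.2)
```

The map applied to each coordinate is monotone and never stretches the gap between two values, so every pairwise constraint that held for `p` still holds for `q`; no coordinate drops by more than `eps`, which is what bounds the loss in objective value.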

We now prove the main technical lemma of this section. It states that in any round in which the error of our distance estimates for tight constraints is small (even if we have high error in the distance estimates for slack constraints), then we will have low per-round regret.

###### Lemma 4.

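At round $t$, if for all pairs of indices $(i,j)$, we have either:

1. $|\hat{d}^t_{i,j} - \bar{d}^t_{i,j}| \le \epsilon$, or

2. $|\pi^t_i - \pi^t_j| < \hat{d}^t_{i,j}$ (corresponding to an LP constraint that is not tight)

then:

$$\langle r^t, \pi(r^t, \bar{d}^t) \rangle - \langle r^t, \pi(r^t, \hat{d}^t) \rangle \le \epsilon k^3$$

for any vector $r^t$ with $\|r^t\|_\infty \le 1$.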

###### Proof.

First, define to be the coordinate-wise maximum of and : i.e. the vector such that for every pair of coordinates , . To simplify notation, we will write , , and .

We make three relevant observations:

1. First, because is a relaxation of , it has only larger objective value. In other words, we have that . Thus, it suffices to prove that .

2. Second, for all pairs , . Thus, if we had , we also have .

3. Finally, by construction, for every pair