The Price of Fair PCA: One Extra Dimension

10/31/2018, by Samira Samadi et al., Georgia Institute of Technology

We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show that on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low-dimensional representation of the data which is nearly optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low-dimensional representation of the data.


1 Introduction

In recent years, the ML community has witnessed an onslaught of charges that real-world machine learning algorithms have produced "biased" outcomes. The examples come from diverse and impactful domains: Google Photos labeled African Americans as gorillas [Twitter, 2015; Simonite, 2018], image search queries for CEOs returned results that were overwhelmingly male and white [Kay et al., 2015], searches for African American names triggered arrest-record advertisements more frequently than searches for white names [Sweeney, 2013], facial recognition has wildly different accuracy for white men than for dark-skinned women [Buolamwini and Gebru, 2018], and recidivism prediction software has labeled low-risk African Americans as high-risk at higher rates than low-risk white people [Angwin et al., 2018].

The community’s work to explain these observations has roughly fallen into either “biased data” or “biased algorithm” bins. In some cases, the training data might under-represent (or over-represent) some group, or have noisier labels for one population than another, or use an imperfect proxy for the prediction label (e.g., using arrest records in lieu of whether a crime was committed). Separately, issues of imbalance and bias might occur due to an algorithm’s behavior, such as focusing on accuracy across the entire distribution rather than guaranteeing similar false positive rates across populations, or by improperly accounting for confirmation bias and feedback loops in data collection. If an algorithm fails to distribute loans or bail to a deserving population, the algorithm won’t receive additional data showing those people would have paid back the loan, but it will continue to receive more data about the populations it (correctly) believed should receive loans or bail.

Many of the proposed solutions to "biased data" problems amount to re-weighting the training set or adding noise to some of the labels; for "biased algorithms", most work has focused on maximizing accuracy subject to a constraint forbidding (or penalizing) an unfair model. Both of these concerns and approaches have significant merit, but they form an incomplete picture of the ML pipeline and where unfairness might be introduced therein. Our work takes another step in fleshing out this picture by analyzing when dimensionality reduction might inadvertently introduce bias. We focus on principal component analysis (henceforth PCA), perhaps the most fundamental dimensionality reduction technique in the sciences [Pearson, 1901; Hotelling, 1933; Jolliffe, 1986]. We show several real-world data sets for which PCA incurs much higher average reconstruction error for one population than another, even when the populations are of similar sizes. Figure 1 shows that PCA on the labeled faces in the wild (LFW) data set has higher reconstruction error for women than for men even if male and female faces are sampled with equal weight.

Figure 1: Left: Average reconstruction error of PCA on labeled faces in the wild data set (LFW), separated by gender. Right: The same, but sampling 1000 faces with men and women equiprobably (mean over 20 samples).

This work underlines the importance of considering fairness and bias at every stage of data science, not only in gathering and documenting a data set [Gebru et al., 2018] and in training a model, but also in any interim data-processing steps. Many scientific disciplines have adopted PCA as a default preprocessing step, both to avoid the curse of dimensionality and also to do exploratory/explanatory data analysis (projecting the data into a number of dimensions that humans can more easily visualize). The study of human biology, disease, and the development of health interventions all face both aforementioned difficulties, as do numerous economic and financial analyses. In such high-stakes settings, where statistical tools will help in making decisions that affect a diverse set of people, we must take particular care to ensure that we share the benefits of data science with a diverse community.

We also emphasize this work has implications for representational rather than just allocative harms, a distinction drawn by Crawford [2017] between how people are represented and what goods or opportunities they receive. Showing primates in search results for African Americans is repugnant primarily due to its representing and reaffirming a racist painting of African Americans, not because it directly reduces any one person’s access to a resource. If the default template for a data set begins with running PCA, and PCA does a better job representing men than women, or white people over minorities, the new representation of the data set itself may rightly be considered an unacceptable sketch of the world it aims to describe.

Our work proposes a different linear dimensionality reduction which aims to represent two populations A and B with similar fidelity—which we formalize in terms of reconstruction error. Given an n-dimensional data set and its d-dimensional approximation, the reconstruction error of the data with respect to its low-dimensional approximation is the sum of squared distances between the original data points and their approximated points in the d-dimensional subspace. To eliminate the effect of the size of a population, we focus on the average reconstruction error over a population. One possible objective for our goal would be to find a d-dimensional approximation of the data which minimizes the maximum average reconstruction error over the two populations. However, this objective doesn't avoid grappling with the fact that population A may embed almost perfectly into d dimensions, whereas B might require many more dimensions to have low reconstruction error. In such cases, the maximum is determined by B regardless of how A is treated, so this objective would not necessarily favor a solution that is nearly optimal for both populations over one that incurs needless extra error for A without improving the error for B.

This motivates our focus on finding a projection which minimizes the maximum additional, or marginal, reconstruction error of each population over the optimal projection into d dimensions for that population alone. This quantity captures how much a population's reconstruction error increases by including another population in the dimensionality reduction optimization. Despite this computational problem appearing more difficult than solving "vanilla" PCA, we introduce a polynomial-time algorithm which finds an embedding into d + 1 dimensions with objective value at least as good as that of any d-dimensional embedding. Furthermore, we show that optimal solutions have equal additional average error for populations A and B.

Summary of our results

We show PCA can overemphasize the reconstruction error of one population relative to another (equally sized) population, and we should therefore think carefully about dimensionality reduction in domains where we care about fair treatment of different populations. We propose a new dimensionality reduction problem which focuses on representing A and B with similar additional error over projecting A or B individually. We give a polynomial-time algorithm which finds near-optimal solutions to this problem. Our algorithm relies on solving a semidefinite program (SDP), which can be prohibitively slow for practical applications. We note that it is possible to (approximately) solve this SDP with a much faster multiplicative-weights style algorithm, whose running time in practice is equivalent to solving standard PCA at most 10-15 times. The details of the algorithm are given in the full version of this work. We then evaluate the empirical performance of this algorithm on several human-centric data sets.

2 Related work

This work contributes to the area of fairness for machine learning models, algorithms, and data representations. One interpretation of our work is that we suggest using Fair PCA, rather than PCA, when creating a lower-dimensional representation of a data set for further analysis. The two pieces of work most relevant to ours take the posture of explicitly trying to reduce the correlation between a sensitive attribute (such as race or gender) and the new representation of the data. The first is a broad line of work [Zemel et al., 2013; Beutel et al., 2017; Calmon et al., 2017; Madras et al., 2018; Zhang et al., 2018] that aims to design representations which will be conditionally independent of the protected attribute, while retaining as much information as possible (and particularly task-relevant information for some fixed classification task). The second is the work by Olfat and Aswani [2018], who also look to design PCA-like maps which reduce the projected data's dependence on a sensitive attribute. Our work has a qualitatively different goal: we aim not to hide a sensitive attribute, but instead to maintain as much information as possible about each population after projecting the data. In other words, we look for a representation with similar richness for population A as for population B, rather than making A and B indistinguishable.

Other work has developed techniques to obfuscate a sensitive attribute directly [Pedreshi et al., 2008; Kamiran et al., 2010; Calders and Verwer, 2010; Kamiran and Calders, 2011; Luong et al., 2011; Kamiran et al., 2012; Kamishima et al., 2012; Hajian and Domingo-Ferrer, 2013; Feldman et al., 2015; Zafar et al., 2015; Fish et al., 2016; Adler et al., 2016]. This line of work diverges from ours in two ways. First, these works focus on representations which obfuscate the sensitive attribute rather than a representation with high fidelity regardless of the sensitive attribute. Second, most of these works do not give formal guarantees on how much an objective will degrade after their transformations. Our work directly minimizes the amount by which each group’s marginal reconstruction error increases.

Much of the other work on fairness for learning algorithms focuses on fairness in classification or scoring [Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2016; Chouldechova, 2017], or online learning settings [Joseph et al., 2016; Kannan et al., 2017; Ensign et al., 2017b, a]. These works focus on either statistical parity of the decision rule, or equality of false positives or negatives, or an algorithm with a fair decision rule. All of these notions are driven by a single learning task rather than a generic transformation of a data set, while our work focuses on a ubiquitous, task-agnostic preprocessing step.

3 Notation and vanilla PCA

We are given n-dimensional data points represented as the rows of a matrix M ∈ R^{m×n}. We will refer to the set of points and its matrix representation interchangeably. The data consists of two subpopulations A and B corresponding to two groups with different values of a binary sensitive attribute (e.g., males and females). We denote by [M_1; M_2] the concatenation of two matrices by row. We refer to the i-th row of M as M_i, the j-th column of M as M^j, and the (i, j) element of M as M_{ij}. We denote the Frobenius norm of a matrix M by ‖M‖_F and the ℓ2-norm of a vector v by ‖v‖. For a positive integer n, we write [n] = {1, …, n}. |S| denotes the size of a set S. Given two matrices M and N of the same size, the Frobenius inner product of these matrices is defined as ⟨M, N⟩ = Σ_{i,j} M_{ij} N_{ij} = Tr(M^T N).

3.1 PCA

This section recalls useful facts about PCA that we use in later sections. We begin with a reminder of the definition of the PCA problem in terms of minimizing the reconstruction error of a data set.

Definition 3.1.

(PCA problem) Given a matrix M ∈ R^{m×n}, find a matrix M̂ of rank at most d that minimizes ‖M − M̂‖_F^2.

We will refer to M̂ as an optimal rank-d approximation of M. The following well-known fact characterizes the solutions to this classic problem [e.g., Shalev-Shwartz and Ben-David, 2014].

Fact 3.1.

If M̂ is a solution to the PCA problem, then M̂ = M V V^T for a matrix V ∈ R^{n×d} with V^T V = I_d. The columns of V are eigenvectors corresponding to the top d eigenvalues of M^T M.

The matrix V V^T is called a projection matrix.
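For concreteness, the following short Python sketch computes the projection of Fact 3.1 from the eigenvectors of M^T M and reports the average reconstruction error separately for two groups, mirroring the comparison behind Figure 1. The synthetic data, function names, and use of NumPy are our own illustrative choices, not code from this paper.

import numpy as np

def pca_projection(M, d):
    # Top-d eigenvectors of M^T M give the optimal rank-d projection (Fact 3.1).
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)   # eigenvalues in ascending order
    V = eigvecs[:, -d:]                          # n x d matrix with orthonormal columns
    return V @ V.T

def avg_reconstruction_error(X, P):
    # Average squared distance between the rows of X and their projections X P.
    return np.linalg.norm(X - X @ P, 'fro') ** 2 / len(X)

# Hypothetical centered data: rows are samples, columns are features.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 50))          # one group
B = 2.0 * rng.normal(size=(400, 50))    # the other group, with higher variance
M = np.vstack([A, B])
M = M - M.mean(axis=0)
A, B = M[:500], M[500:]

P = pca_projection(M, d=10)             # vanilla PCA on the pooled data
print(avg_reconstruction_error(A, P), avg_reconstruction_error(B, P))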

4 Fair PCA

Given the n-dimensional data with two subgroups A and B, let Â and B̂ be optimal rank-d PCA approximations for A and B, respectively. We introduce our approach to fair dimensionality reduction by giving two compelling examples of settings where dimensionality reduction inherently makes a tradeoff between groups A and B. Figure 2(a) shows a setting where projecting onto any single dimension either favors A or B (or incurs significant reconstruction error for both), while either group separately would have a high-fidelity embedding into a single dimension. This example suggests any projection will necessarily make a tradeoff between error on A and error on B.

Our second example (shown in Figure 2(b)) exhibits a setting where A and B suffer very different reconstruction error when projected onto one dimension: one group has high reconstruction error for every projection, while the other has a perfect representation in the horizontal direction. Thus, asking for a projection which minimizes the maximum reconstruction error over the two groups might require incurring additional error for the well-represented group while not improving the error for the other. So, minimizing the maximum reconstruction error over A and B fails to account for the fact that two populations might have wildly different representation error when embedded into d dimensions. Optimal solutions to such an objective might behave in a counterintuitive way, preferring to exactly optimize for the group with larger inherent representation error rather than approximately optimizing for both groups simultaneously. We find this behaviour undesirable—it requires a sacrifice in quality for one group for no improvement for the other group.

(a) The best one-dimensional PCA projection for one group is a different vector than the best projection for the other group.
(b) One group has a perfect one-dimensional projection. For the other group, any one-dimensional projection is equally bad.
Figure 2: Two settings in which a single one-dimensional projection must trade off the two groups' reconstruction errors.
Remark 4.1.

We focus on the setting where we ask for a single projection into d dimensions rather than two separate projections, because using two distinct projections (or, more generally, two models) for different populations raises legal and ethical concerns. Learning two different projections also faces no inherent tradeoff in representing A or B with those projections. (Lipton et al. [2017] ask whether equal treatment requires different models for two groups.)

We therefore turn to finding a projection which minimizes the maximum deviation of each group from its optimal projection. This optimization asks that A and B suffer a similar loss for being projected together into d dimensions compared to their individually optimal projections. We now introduce our notation for measuring a group's loss when it is projected to U rather than to its optimal rank-d representation:

Definition 4.2 (Reconstruction error).

Given two matrices M and N of the same size, the reconstruction error of M with respect to N is defined as error(M, N) = ‖M − N‖_F^2.

Definition 4.3 (Reconstruction loss).

Given a matrix M, let M̂ be the optimal rank-d approximation of M. For a matrix U with rank at most d we define loss(M, U) = ‖M − U‖_F^2 − ‖M − M̂‖_F^2.

Then, the optimization that we study asks to minimize the maximum loss suffered by either group. This captures the idea that, fixing a feasible solution, the objective will only improve if it improves the loss for the group whose current representation is worse. Furthermore, considering the reconstruction loss and not the reconstruction error prevents the optimization from incurring error for one subpopulation without improving the error for the other one, as described in Figure 2(b).

Definition 4.4 (Fair PCA).

Given m data points in R^n with subgroups A and B, we define the problem of finding a fair PCA projection into d dimensions as optimizing

(1)    min_{U ∈ R^{m×n}, rank(U) ≤ d}  max { (1/|A|) loss(A, U_A), (1/|B|) loss(B, U_B) },

where U_A and U_B are the matrices whose rows are the rows of U corresponding to groups A and B respectively.
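To make Definitions 4.2-4.4 concrete, the sketch below evaluates the per-group average marginal loss of a candidate n × n projection matrix P and the objective in (1) for approximations of the form U = M P. The helper names are hypothetical and the snippet only illustrates the definitions; it is not part of our algorithm.

import numpy as np

def optimal_rank_d_error(X, d):
    # ||X - X_hat||_F^2 for the optimal rank-d approximation X_hat of X alone:
    # the sum of all but the top d eigenvalues of X^T X.
    return float(np.linalg.eigvalsh(X.T @ X)[:-d].sum())

def avg_marginal_loss(X, P, d):
    # (1/|X|) * loss(X, X P)   (Definitions 4.2 and 4.3 with U = X P).
    err = np.linalg.norm(X - X @ P, 'fro') ** 2
    return (err - optimal_rank_d_error(X, d)) / len(X)

def fair_pca_objective(A, B, P, d):
    # Objective (1): the worse of the two groups' average marginal losses.
    return max(avg_marginal_loss(A, P, d), avg_marginal_loss(B, P, d))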

This definition does not appear to have a closed-form solution (unlike vanilla PCA—see Fact 3.1). To take a step in characterizing solutions to this optimization, Theorem 4.5 states that a fair PCA low dimensional approximation of the data results in the same loss for both groups.

Theorem 4.5.

Let U* be an optimal solution to the Fair PCA problem (1). Then (1/|A|) loss(A, U*_A) = (1/|B|) loss(B, U*_B).

Before proving Theorem 4.5, we need to state some building blocks of the proof, Lemmas 4.6, 4.7, and 4.8. For the proofs of these lemmas, please refer to Appendix B.

Lemma 4.6.

Given a matrix such that , let . Let be an orthonormal basis of the row space of and . Then

The next lemma presents some equalities that we will use frequently in the proofs.

Lemma 4.7.

Given a matrix V with orthonormal columns and any matrix M with as many columns as V has rows, we have ‖M V V^T‖_F^2 = ‖M V‖_F^2 = ⟨M^T M, V V^T⟩, and therefore ‖M − M V V^T‖_F^2 = ‖M‖_F^2 − ⟨M^T M, V V^T⟩.
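In particular, combining these identities with Definition 4.3: for a projection V V^T of rank d and the optimal rank-d approximation M̂ of M,

loss(M, M V V^T) = (‖M‖_F^2 − ⟨M^T M, V V^T⟩) − (‖M‖_F^2 − ‖M̂‖_F^2) = ‖M̂‖_F^2 − ⟨M^T M, V V^T⟩,

which is affine in the positive semidefinite matrix V V^T. This is the form exploited by the semidefinite relaxation in Section 5.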

Let the function f_M measure the reconstruction error of a fixed matrix M with respect to its orthogonal projection onto an input subspace V. The next lemma shows that the value of the function f_M at any local minimum is the same.

Lemma 4.8.

Given a matrix M and a d-dimensional subspace V, let the function f_M(V) denote the reconstruction error of M with respect to its orthogonal projection onto the subspace V, that is, f_M(V) = ‖M − M V V^T‖_F^2, where by abuse of notation we use V inside the norm to denote a matrix whose columns form an orthonormal basis of the subspace V. The value of the function f_M at any local minimum is the same.

Proof of Theorem 4.5:

Consider the functions and defined in Lemma 4.8. It follows from Lemma 4.6 and Lemma 4.7 that for with we have

(2)

Therefore, the Fair PCA problem is equivalent to

We proceed to prove the claim by contradiction. Let be a global minimum of and assume that

(3)

Hence, since is continuous, for any matrix with in a small enough neighborhood of , . Since is a global minimum of , it is a local minimum of or equivalently a local minimum of because of (2).

Let be an orthonormal basis of the eigenvectors of corresponding to eigenvalues . Let be the subspace spanned by . Note that . Since the loss is always non-negative for both and , (3) implies that . Therefore, and . By Lemma 4.8, this is in contradiction with being a global minimum and being a local minimum of .

5 Algorithm and analysis

In this section, we present a polynomial-time algorithm for solving the fair PCA problem. Our algorithm outputs a matrix of rank at most d + 1 and guarantees that it achieves a fair PCA objective value equal to the optimal d-dimensional fair PCA value. The algorithm has two steps: first, relax fair PCA to a semidefinite optimization problem and solve the SDP; second, solve an LP designed to reduce the rank of that solution. We argue using properties of extreme point solutions that the solution must satisfy a number of constraints of the LP with equality, and argue directly that this implies the solution must lie in d + 1 or fewer dimensions. We refer the reader to Lau et al. [2011] for the basics and applications of this technique in approximation algorithms.

Theorem 5.1.

There is a polynomial-time algorithm that outputs an approximation matrix of the data such that it is either of rank d and is an optimal solution to the fair PCA problem, or it is of rank d + 1, has equal losses for the two populations, and achieves the optimal fair PCA objective value for dimension d.

Input : data matrices A ∈ R^{|A|×n} and B ∈ R^{|B|×n}; target dimension d
Output : a projection matrix P̂ of rank at most d + 1
1 Find optimal rank-d approximations Â of A and B̂ of B (e.g., by singular value decomposition). Let α = ‖Â‖_F^2/|A| and β = ‖B̂‖_F^2/|B|. Let (z*, P*) be a solution to the SDP:
(4)    min z
       s.t.  α − (1/|A|)⟨A^T A, P⟩ ≤ z
             β − (1/|B|)⟨B^T B, P⟩ ≤ z
             Tr(P) ≤ d,   0 ⪯ P ⪯ I
2 Apply singular value decomposition to P*, P* = Σ_{i=1}^n λ_i v_i v_i^T.
3 Find an extreme point solution (ẑ, λ̂_1, …, λ̂_n) of the LP in the variables z and λ_1, …, λ_n (with the v_i fixed):
(5)    min z
(6)    s.t.  α − (1/|A|) Σ_i λ_i ⟨A^T A, v_i v_i^T⟩ ≤ z
(7)          β − (1/|B|) Σ_i λ_i ⟨B^T B, v_i v_i^T⟩ ≤ z
(8)          Σ_i λ_i ≤ d
(9)          0 ≤ λ_i ≤ 1 for all i ∈ [n]
Set P̂ = Σ_i λ̂_i v_i v_i^T.
return P̂
Algorithm 1 Fair PCA
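The following Python sketch mirrors the two steps of Algorithm 1 using off-the-shelf solvers. The use of CVXPY for the SDP and SciPy's simplex-based LP solver for the rounding step, as well as all variable names, are our own illustrative assumptions; this is not the implementation released with this paper (linked in Section 6).

import numpy as np
import cvxpy as cp
from scipy.optimize import linprog

def fair_pca(A, B, d):
    n = A.shape[1]
    SA, SB = A.T @ A / len(A), B.T @ B / len(B)        # per-group second-moment matrices
    # alpha, beta = ||A_hat||_F^2/|A|, ||B_hat||_F^2/|B|  (optimal rank-d values).
    alpha = np.linalg.eigvalsh(SA)[-d:].sum()
    beta = np.linalg.eigvalsh(SB)[-d:].sum()
    # Step 1: solve the SDP relaxation (4).
    P, z = cp.Variable((n, n), PSD=True), cp.Variable()
    constraints = [alpha - cp.trace(SA @ P) <= z,
                   beta - cp.trace(SB @ P) <= z,
                   cp.trace(P) <= d,
                   np.eye(n) - P >> 0]
    cp.Problem(cp.Minimize(z), constraints).solve()     # default SDP-capable solver
    # Step 2: eigendecomposition of the SDP solution.
    lam, V = np.linalg.eigh(P.value)
    # Step 3: re-solve for the eigenvalues via the LP (5)-(9) at an extreme point.
    a = np.array([V[:, i] @ SA @ V[:, i] for i in range(n)])
    b = np.array([V[:, i] @ SB @ V[:, i] for i in range(n)])
    c = np.zeros(n + 1); c[0] = 1.0                     # variables (z, lam_1..lam_n)
    A_ub = np.vstack([np.concatenate(([-1.0], -a)),     # alpha - a.lam <= z
                      np.concatenate(([-1.0], -b)),     # beta  - b.lam <= z
                      np.concatenate(([0.0], np.ones(n)))])   # sum(lam) <= d
    b_ub = np.array([-alpha, -beta, float(d)])
    bounds = [(None, None)] + [(0.0, 1.0)] * n          # 0 <= lam_i <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds,
                  method="highs-ds")                    # dual simplex -> extreme point
    lam_hat = res.x[1:]
    return V @ np.diag(lam_hat) @ V.T                   # projection of rank <= d+1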

Proof of Theorem 5.1: The algorithm to prove Theorem 5.1 is presented in Algorithm 1. Using Lemma 4.7, we can write the semi-definite relaxation of the fair PCA objective (Def. 4.4) as SDP (4). This semi-definite program can be solved in polynomial time. The system of constraints (5)-(9) is a linear program in the variables z, λ_1, …, λ_n (with the v_i's fixed). Therefore, an extreme point solution is defined by n + 1 tight constraints, at most three of which can be the constraints in (6)-(8); the rest (at least n − 2 of them) must be among the bounds λ_i = 0 or λ_i = 1 for i ∈ [n]. Given the upper bound of d on the sum of the λ_i's, this implies that at least n − 2 of them are equal to 0 or 1, i.e., at most two are fractional and they add up to 1.

Case 1.

All the eigenvalues λ_i are integral. Therefore, there are d eigenvalues equal to 1 and the rest are equal to 0. This results in an orthogonal projection onto d dimensions.

Case 2.

d − 1 of the eigenvalues are equal to 1 and two eigenvalues lie strictly in (0, 1). Since we have n + 1 tight constraints, this means that both of the first two constraints, (6) and (7), are tight. Therefore

where the inequality is by observing that is a feasible solution. Note that the loss of group given by an affine projection is

where the last inequality is by the choice of . The same equality holds true for group . Therefore, gives the equal loss of for two groups. The embedding corresponds to the affine projection of any point (row) of defined by the solution .

In both cases, the objective value is at most that of the original fairness objective.

The result of Theorem 5.1 for two groups generalizes to more than two groups as follows. Given m data points in R^n with subgroups A_1, …, A_k and the desired number d of dimensions of the projected space, we generalize Definition 4.4 of the fair PCA problem as optimizing

(10)    min_{U ∈ R^{m×n}, rank(U) ≤ d}  max_{i ∈ [k]}  (1/|A_i|) loss(A_i, U_{A_i}),

where U_{A_1}, …, U_{A_k} are the matrices whose rows are the rows of U corresponding to groups A_1, …, A_k.

Theorem 5.2.

There is a polynomial-time algorithm to find a projection such that it is of dimension at most d + k − 1 and achieves the optimal fairness objective value for dimension d.

In contrast to the case of two groups, when there are more than two groups in the data, it is possible that no optimal solution to fair PCA assigns the same loss to all groups. However, with extra dimensions, we can ensure that the loss of each group remains at most the optimal fairness objective in d dimensions. The result of Theorem 5.2 follows by extending the algorithm in Theorem 5.1, adding one linear constraint to the SDP and LP for each extra group. An extreme solution of the resulting LP contains at most k of the λ_i's that are strictly between 0 and 1. Therefore, the final projection matrix has rank at most d + k − 1.
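To sketch the counting behind this bound in the notation of the proof of Theorem 5.1: the LP for k groups has the n + 1 variables z, λ_1, …, λ_n, and besides the box constraints 0 ≤ λ_i ≤ 1 it has k group constraints plus the trace constraint. An extreme point is defined by n + 1 tight constraints, so at least (n + 1) − (k + 1) = n − k of the λ_i sit at their bounds, and at most k are fractional. Since the λ_i sum to at most d, whenever any are fractional at most d − 1 of them can equal 1, giving at most (d − 1) + k nonzero eigenvalues, i.e., rank at most d + k − 1 (and rank at most d when all λ_i are integral, recovering Theorem 5.1 for k = 2).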

Runtime

We now analyze the runtime of Algorithm 1, which consists of solving the SDP (4) and finding an extreme point solution to the LP (5)-(9). The SDP and LP can be solved up to an additive error of ε in the objective value in polynomial time, by the methods of Ben-Tal and Nemirovski [2001] and Schrijver [1998], respectively. The running time of the SDP dominates the algorithm both in theory and in practice, and is too slow for practical use even for moderately sized data.

We propose another algorithm for solving the SDP using the multiplicative weight (MW) update method. In theory, our MW method requires a number of standard PCA computations that grows with the desired accuracy (see Appendix A), which may or may not be faster than an interior-point SDP solver depending on the problem parameters. In practice, however, we observe that after appropriately tuning one parameter in MW, the MW algorithm achieves good accuracy within tens of iterations, and it is therefore used to obtain the experimental results in this paper. Our MW implementation can handle data of dimension up to a thousand with a running time of less than a minute. The details of the implementation and analysis of the MW method are in Appendix A.

6 Experiments

We use two common human-centric data sets for our experiments. The first is labeled faces in the wild (LFW) [Huang et al., 2007]; the second is the Default Credit data set [Yeh and Lien, 2009]. We preprocess all data to have its mean at the origin. For the LFW data, we normalized each pixel value. The gender information for LFW was taken from Afifi and Abdelhamed [2017], who manually verified the correctness of these labels. For the credit data, since different attributes are measurements in incomparable units, we normalized the variance of each attribute to be equal to 1. The code for all experiments is publicly available at https://github.com/samirasamadi/Fair-PCA.
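A minimal sketch of this preprocessing in Python, assuming the data is already loaded into NumPy arrays; the loader names and the pixel-scaling constant are placeholders rather than values taken from the released code.

import numpy as np

def center(X):
    # Put the mean of the data at the origin.
    return X - X.mean(axis=0)

def preprocess_lfw(images, pixel_scale):
    # pixel_scale is a placeholder for the normalization constant applied to pixels.
    X = images.reshape(len(images), -1).astype(float) / pixel_scale
    return center(X)

def preprocess_credit(X):
    # Attributes use incomparable units, so give each attribute unit variance.
    X = center(X.astype(float))
    return X / X.std(axis=0)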

Results

We focus on projections into relatively few dimensions, as those are used ubiquitously in early phases of data exploration. As we already saw in Figure 1 (left), at lower dimensions there is a noticeable gap between PCA's average reconstruction error for men and women on the LFW data set. This gap is at the scale of up to 10% of the total reconstruction error when we project to 20 dimensions. This still holds when we subsample male and female faces with equal probability from the data set, so that men and women have equal weight in the objective function of PCA (Figure 1, right).

Figure 3 shows the average reconstruction error of each population (male/female, higher/lower education) as the result of running vanilla PCA and Fair PCA on the LFW and Credit data. As we expect, as the number of dimensions increases, the average reconstruction error of every population decreases. For LFW, the original data is in 1764 dimensions (42×42 images); therefore, at 20 dimensions we still see a considerable reconstruction error. For the Credit data, we see that at 21 dimensions the average reconstruction error of both populations reaches 0, as this data originally lies in 21 dimensions. In order to see how fair each of these methods is, we need to zoom in further and look at the average loss of the populations.

Figure 4 shows the average loss of each population as the result of applying vanilla PCA and Fair PCA to both data sets. Note that at the optimal solution of Fair PCA, the average losses of the two populations are the same; therefore we show one line for "Fair loss". We observe that PCA suffers much higher average loss for female faces than for male faces. After running Fair PCA, we observe that the average loss for Fair PCA is roughly in the middle of the average losses for males and females. So, there is an improvement in the female average loss which comes at a cost in the male average loss. A similar observation holds for the Credit data set. In this context, it appears there is some cost to optimizing for the less well-represented population in terms of the better-represented population.

Figure 3: Reconstruction error of PCA/Fair PCA on LFW and the Default Credit data set.
Figure 4: Loss of PCA/Fair PCA on LFW and the Default Credit data set.

7 Future work

This work is far from a complete study of when and how dimensionality reduction might help or hurt the fair treatment of different populations. Several concrete theoretical questions remain within our framework. What is the complexity of optimizing the fairness objective? Is it NP-hard, even for two groups? Our work naturally extends to k predefined subgroups rather than just two, where the number of additional dimensions our algorithm uses is k − 1. Are these additional dimensions necessary for computational efficiency?

In a broader sense, this work aims to point out another way in which standard ML techniques might introduce unfair treatment of some subpopulation. Further work in this vein will likely prove very enlightening.

Acknowledgements

This work was supported in part by NSF awards CCF-1563838, CCF-1717349, and CCF-1717947.

References

Appendix A Improved runtime of semi-definite relaxation by multiplicative weight update method

In this section, we present the multiplicative weight (MW) algorithm and its runtime analysis for solving the fair PCA relaxation for two groups up to an additive error ε, using repeated calls to a standard PCA solver such as singular value decomposition (SVD). Because each MW iteration reduces to an SVD, the SDP relaxation (4) for two groups can be solved by a sequence of SVD computations. Compared to the runtime of an SDP solver, which is commonly implemented with the interior point method [Ben-Tal and Nemirovski, 2001], our algorithm may be faster or slower depending on the problem parameters. In practice, however, we tune the parameter of the MW algorithm much more aggressively than in theory, and often take the last-iterate solution of MW rather than the average when the last iterate performs better, which gives a much faster convergence rate. Our runs of MW show that it converges in at most 10-20 iterations. Therefore, we use MW to implement our fair PCA algorithm. We note at the conclusion of this section that the algorithm and analysis extend to solving fair PCA for k groups.

Technically, the number of iterations for k groups depends on the width ρ of the problem, as defined in Arora et al. [2012]. The width can usually be bounded by the magnitude of the input or the optimal objective value. For our purpose, the width is bounded in terms of the total variance of the input data over all dimensions. For simplicity, we assume it is at most 1 (e.g., by normalization in a preprocessing step), hence obtaining the bound on the number of iterations.

We first present an algorithmic framework and the corresponding analysis in the next two subsections, and later apply those results to our specific setting of solving the SDP (4) arising from the fair PCA problem. The previous work by Arora et al. [2012] shows how one may solve a feasibility problem of an LP using the MW technique. Our main theoretical contribution is to propose and analyze the optimization counterpart of the feasibility problem, and the MW algorithm needed to solve such a problem. The MW method we develop fits more seamlessly into our fair PCA setting and simplifies the algorithm that must be implemented for solving the SDP (4).

A.1 Problem setup and oracle access

We first formulate the feasibility problem and its optimization counterpart in this section. The previous and new MW algorithms and their analysis are presented in the following Section A.2.

A.1.1 Previous work: multiplicative weight on the feasibility problem

Problem

As in Arora et al. [2012], we are given an m × n real matrix C, a vector b ∈ R^m, and a convex set K ⊆ R^n, and the goal is to check the feasibility problem

(11)    is there an x ∈ K such that C x ≥ b?

by giving a feasible x or correctly deciding that no such x exists.

Oracle Access

We assume the existence of an oracle that, given any probability vector p over the constraints of (11), correctly answers the single-constraint problem

(12)    is there an x ∈ K such that p^T C x ≥ p^T b?

by giving a feasible x or correctly deciding that no such x exists. We may think of (12) as a weighted version of (11), with weight p_i on the i-th constraint.

As (12) consists of only one constraint, solving (12) is much easier than (11) in many problem settings. For example, in our PCA setting, solving (4) directly is non-trivial, but the weighted version (12) is a standard PCA problem: we weight each group based on p, and then apply a PCA algorithm (singular value decomposition) to the sum of the two weighted groups. The PCA solution gives an optimal x for (12). More details of the application to the fair PCA setting are in Section A.3.

A.1.2 New setting: multiplicative weight on the optimization problem

Problem

The previous work gives an MW framework for the feasibility question. Here we propose an optimization framework, which asks for the best x rather than the existence of some x. The optimization framework can be formally stated as follows: given an m × n real matrix C, a vector b ∈ R^m, and a convex set K ⊆ R^n, solve

(13)    λ* := max { λ : there exists x ∈ K with C x ≥ b + λ·1 },

where 1 denotes the vector with all entries equal to 1, and λ* denotes the optimum of (13).

With the same type of oracle access, we may run (11) repeatedly to binary search for the correct value of the optimum λ* up to an additive error ε. However, our main contribution is to modify the previous multiplicative weight algorithm and the definition of the oracle to solve (13) without guessing the optimum λ*. This improves the runtime slightly (removing the binary-search overhead) and simplifies the algorithm.

Feasibility Oracle Access

We assume the existence of an oracle that, given any probability vector p over the constraints of (13), correctly answers the single-constraint problem

(14)    find x ∈ K such that p^T C x ≥ p^T b + λ*.

There is always such an x, because multiplying (13) on the left by p^T shows that an optimal x of (13) is one such x. However, finding one may not be as trivial as asserting the problem's feasibility. In general, (14) can be tricky to solve since we do not yet know the value of λ*.

Optimization Oracle Access

We define the oracle to be one that, given p over the constraints of (13), correctly answers with a maximizer x* of

(15)    max_{x ∈ K} p^T (C x − b),

which is stronger than (14) and is sufficient to solve it: an optimal x of (13) is feasible for (15), so the optimum of (15) is at least λ*, and hence the maximizer returned for (15) is a feasible solution to (14). In many settings, because (14) has only one constraint, it is possible to solve the optimization version (15) instead. For example, in our fair PCA setting with two groups, we can solve (15) by standard PCA on the union of the two groups after an appropriate weighting of each group. More details of the application to the fair PCA setting are in Section A.3.

A.2 Algorithm and Analysis

The line of proof follows similarly to Arora et al. [2012]. We first state the technical property that the oracle satisfies in our optimization framework, then show how to use that property to bound the number of iterations. We fix an m × n real matrix C, a vector b ∈ R^m, and a convex set K ⊆ R^n.

Definition A.1.

(analogous to Arora et al. [2012]) An (ℓ, ρ)-bounded oracle for parameters 0 ≤ ℓ ≤ ρ is an algorithm which, given p, solves (14). Also, there is a fixed subset I ⊆ [m] of constraints (i.e., fixed across all possible p) such that for all x output by this algorithm,

(16)    C_i x − b_i ∈ [−ℓ, ρ] for every constraint i ∈ I,
(17)    C_i x − b_i ∈ [−ρ, ℓ] for every constraint i ∉ I.

Note that even though we do not know λ*, if we know the range of C x − b over all x ∈ K, we can bound the range of λ*. Therefore, we can still find useful parameters (ℓ, ρ) that an oracle satisfies.

Now we are ready to state the main result of this section: that we may solve the optimization version by multiplicative update as quickly as solving the feasibility version of the problem.

Theorem A.2.

Let ε > 0 be given. Suppose there exists an (ℓ, ρ)-bounded oracle for solving (14). Then there exists an algorithm that solves (13) up to additive error ε, i.e., it outputs x̄ ∈ K such that

(18)    C x̄ ≥ b + (λ* − ε)·1.

The number of oracle calls matches that of the feasibility algorithm of Arora et al. [2012], with only a small amount of additional time per call.

Proof.

The proof follows similarly as Theorem 3.3 in Arora et al. [2012], but we include details here for completeness. The algorithm is multiplicative update in nature, as in equation (2.1) of Arora et al. [2012]. The algorithm starts with uniform over constraints. Each step the algorithm asks the with input and receive . We use the loss vector to update the weight for the next step with learning rate . After iterations (which will be specified later), the algorithm outputs .

Note that using either the loss and behaves the same algorithmically due to the renormalization step on the vector . Therefore, just for analysis, we use a hypothetical loss to update (this loss can’t be used algorithmically since we do not know ). By Theorem 2.1 in Arora et al. [2012], for each constraint and all ,

(19)

By property (14) of the ,

(20)

We now split into two cases. If , then (19) and (20) imply

Multiplying the last inequality by and rearranging terms, we have

(21)

If , then (19) and (20) imply

Multiplying inequality by and rearranging terms, we have

(22)

To use (21) and (22) to show that is close to 0 simultaneously for two cases, pick (note that by requiring , so we may apply Theorem 2.1 in Arora et al. [2012]). Then for all , we have

(23)

Hence, (21) implies

(24)

and (22) implies

(25)

using the fact that . ∎

A.3 Application of the multiplicative update method to the fair PCA problem

In this section, we apply MW results for solving LP to solve the SDP relaxation (4) of fair PCA.

LP formulation of fair PCA relaxation

The SDP relaxation (4) of fair PCA can be written in the form (13) as an LP with two constraints:

(26)    max λ
(27)    s.t.  (1/|A|)⟨A^T A, P⟩ − α ≥ λ
(28)          (1/|B|)⟨B^T B, P⟩ − β ≥ λ

for the constants α = ‖Â‖_F^2/|A| and β = ‖B̂‖_F^2/|B|, where the feasible region of the variable P is the set of PSD matrices

(29)    K = { P ∈ R^{n×n} : Tr(P) ≤ d, 0 ⪯ P ⪯ I }.

We will apply the multiplicative weight algorithm to solve (26)-(28).

Oracle Access

First, we present the oracle in Algorithm 2, which is of the form (15) and therefore can be used to solve (14). As defined in (15), the optimization oracle, given a weight vector w = (w_1, w_2), should be able to solve the problem with the single weighted constraint obtained by weighting the two constraints (27) and (28) by w. However, because both constraints involve only dot products of the same variable P with the constant matrices A^T A/|A| and B^T B/|B|, which are linear functions, the weighted constraint involves the dot product of P with the weighted sum w_1·A^T A/|A| + w_2·B^T B/|B| of those constant matrices.

Input : data matrices A, B; target dimension d; weight vector w = (w_1, w_2)
Output : a rank-d projection matrix P and its weighted objective value
1 Set V to be the matrix with the top d principal components of W = w_1·A^T A/|A| + w_2·B^T B/|B| as columns;
2 return P = V V^T and the weighted value w_1((1/|A|)⟨A^T A, P⟩ − α) + w_2((1/|B|)⟨B^T B, P⟩ − β);
Algorithm 2 Fair PCA oracle (oracle to Algorithm 3)
MW Algorithm

Our multiplicative weight update algorithm for solving fair PCA relaxation (26)-(28) is presented in Algorithm 3. The algorithm follows exactly from the construction in Theorem A.2. The runtime analysis of our MW Algorithm 3 follows directly from the same theorem.

Input : data matrices A, B; target dimension d; positive integer T (number of iterations); learning rate η
Output : an (approximately optimal) solution P̄ to (26)-(28)
1 Initialize w^(1) = (1/2, 1/2);
2 for t = 1, …, T do
3       P^(t) ← output of Algorithm 2 on input (A, B, d, w^(t));
4       v_A^(t) ← (1/|A|)⟨A^T A, P^(t)⟩ − α,   v_B^(t) ← (1/|B|)⟨B^T B, P^(t)⟩ − β;
5       w_A^(t+1) ← w_A^(t)·exp(−η v_A^(t)),   w_B^(t+1) ← w_B^(t)·exp(−η v_B^(t));
6       renormalize w^(t+1) so that its entries sum to 1;
7 end for
return P̄ = (1/T) Σ_t P^(t),   w̄ = (1/T) Σ_t w^(t)
Algorithm 3 Multiplicative weight update for fair PCA
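The following Python sketch mirrors Algorithms 2 and 3, assuming the second-moment matrices A^T A/|A| and B^T B/|B| and the constants α, β from Algorithm 1 have been precomputed. The function names, the fixed learning rate, and the choice to return the averaged iterate are our own simplifications, not the tuned implementation described above.

import numpy as np

def weighted_pca_oracle(SA, SB, d, w):
    # Algorithm 2 (sketch): top-d principal components of the weighted sum.
    W = w[0] * SA + w[1] * SB
    eigvals, eigvecs = np.linalg.eigh(W)
    V = eigvecs[:, -d:]
    return V @ V.T

def mw_fair_pca(SA, SB, alpha, beta, d, T=20, eta=1.0):
    # Algorithm 3 (sketch): multiplicative weights over the two group constraints.
    w = np.array([0.5, 0.5])
    P_avg = np.zeros_like(SA)
    for _ in range(T):
        P = weighted_pca_oracle(SA, SB, d, w)
        # Constraint values (27)-(28): larger means the group is better served.
        vals = np.array([np.trace(SA @ P) - alpha, np.trace(SB @ P) - beta])
        # Up-weight the group whose constraint value is smaller.
        w = w * np.exp(-eta * vals)
        w = w / w.sum()
        P_avg += P / T
    return P_avg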
Corollary A.3.

Let ε > 0 be the target accuracy. Algorithm 3 finds a near-optimal (up to additive error ε) solution to (26)-(28) using the number of standard-PCA iterations guaranteed by Theorem A.2, and therefore runs in polynomial time.

Proof.

We first check that the oracle presented in Algorithm 2 satisfies -boundedness and find those parameters. We may normalize the data so that the variances of and are bounded by 1. Therefore, for any PSD matrix , we have . In addition, in the application to fair PCA setting, we have . Hence, for any feasible by the definition of (recall Definition 3.1). Therefore,