1 Introduction
Machine learning (ML) has recently seen an explosion of applications in settings where it guides or makes choices directly affecting people. Examples include applications in lending, marketing, education, and many more. Close on the heels of the adoption of ML methods in these everyday domains have come numerous examples of ML methods displaying unsavory behavior towards certain demographic groups. These examples have spurred the study of fairness of machine learning algorithms. Numerous mathematical formulations of fairness have been proposed for supervised learning settings, each with its strengths and shortcomings in terms of what it disallows and how difficult it may be to satisfy
(e.g., Dwork et al., 2012; Hardt et al., 2016; Chouldechova, 2017; Kleinberg et al., 2017). Somewhat more recently, the community has begun to study appropriate notions of fairness for unsupervised learning settings (e.g.,
Chierichetti et al., 2017; Celis et al., 2018b; Samadi et al., 2018; see Section 5).

In particular, the recent work of Chierichetti et al. (2017) proposes a notion of fairness for clustering: namely, that each cluster has proportional representation from different demographic groups. Their paper provides approximation algorithms for $k$-center and $k$-median clustering that incorporate this fairness notion. The follow-up work of Schmidt et al. (2018) extends this to $k$-means clustering. These papers open up an important line of work that aims at studying the following questions for clustering: (a) how can fairness constraints be incorporated into popular clustering objectives and algorithms? and (b) what is the price of fairness? For example, the results of Chierichetti et al. indicate that achieving fair clusterings comes at a significant increase in the $k$-center / $k$-median objective value. While the above works focus on clustering data sets in Euclidean or metric spaces, a large body of clustering problems involves graph data. On such data, spectral clustering (SC; von Luxburg, 2007) is the method of choice in practice. In this paper, we extend the above line of work by studying the implications of incorporating the fairness notion of Chierichetti et al. into SC.
The contributions of this paper are as follows:


We show how to incorporate into the SC framework the constraints that in each cluster, every group should be represented with the same proportion as in the original data set. For continuity with prior work (as discussed above; see also Section 5), we refer to these constraints as fairness constraints and speak of fair clusterings, although the terms proportionality and proportional would be a more accurate description of our goal. Our approach to incorporating the fairness constraints is analogous to existing versions of constrained SC that incorporate must-link constraints (see Section 5). In contrast to the work of Chierichetti et al. (2017), which always yields a fair clustering no matter how much the objective value increases compared to an unfair clustering, our approach does not guarantee that we end up with a fair clustering. Rather, it guides SC towards a good and fair clustering if such a clustering exists.

Indeed, we prove that our algorithms find a good and fair clustering in a natural variant of the famous stochastic block model that we propose. In our variant, demographic groups have strong connectivity within themselves, but the graph also exhibits a “natural” clustering structure that is fair. We provide a rigorous analysis of our algorithms showing that they can recover this fair clustering with high probability. To the best of our knowledge, such an analysis has not been done before for constrained versions of SC.

We conclude by giving experimental results on real-world data sets where proportional clustering can be a desirable goal, comparing the proportionality and objective value of standard SC to those of our methods. Our experiments confirm that our algorithms tend to find fairer clusterings than standard SC. A surprising finding is that in many real data sets, higher proportionality often comes at minimal cost: our methods produce clusterings that are fairer, but have objective values very close to those of clusterings produced by standard SC. This complements the results of Chierichetti et al. (2017), where satisfying the fairness constraints exactly comes at a significant cost in the objective value, and indicates that in some scenarios fairness and objective value need not be at odds with one another.
Notation  For $n \in \mathbb{N}$, we use $[n] = \{1, \ldots, n\}$. $I_n$ denotes the $n \times n$ identity matrix and $\mathbb{1}_n$ denotes a vector of length $n$ with all entries equaling 1. For a matrix $A$, we denote the transpose of $A$ by $A^\top$. For a square matrix $A \in \mathbb{R}^{n \times n}$, $\mathrm{Tr}(A)$ denotes the trace of $A$, that is $\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii}$. If we say that a matrix is positive (semi-)definite, this implies that the matrix is symmetric.

2 Spectral Clustering
To set the ground and introduce terminology, we review Spectral Clustering (SC). There are several versions of SC (von Luxburg, 2007). For ease of presentation, here we focus on unnormalized SC (Hagen & Kahng, 1992). In Appendix A, we adapt all findings of this section and the following Section 3 to normalized SC (Shi & Malik, 2000).
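Before the formal derivation, the two-step pipeline reviewed in this section (eigenvectors of the graph Laplacian, then $k$-means on their rows) can be sketched in a few lines. This is a hypothetical minimal implementation using numpy, scipy, and scikit-learn, not the paper's reference code; it assumes a dense symmetric weight matrix with zero diagonal:

```python
# Hypothetical minimal sketch of unnormalized spectral clustering
# (Algorithm 1); assumes a dense symmetric weight matrix W with zero
# diagonal. Not the paper's reference implementation.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_sc(W, k, seed=0):
    """Partition the n vertices of a weighted graph into k clusters."""
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # unnormalized graph Laplacian (PSD)
    # step 1: orthonormal eigenvectors of the k smallest eigenvalues
    _, H = eigh(L, subset_by_index=[0, k - 1])
    # step 2: infer a clustering by k-means on the rows of H
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(H)
```

On a graph with two well-separated components this recovers the components; in general, the $k$-means step is only a heuristic rounding of the relaxed solution discussed below.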
Let $G = (V, E)$ be an undirected graph on $V = \{v_1, \ldots, v_n\}$. We assume that each edge between two vertices $v_i$ and $v_j$ carries a positive weight $w_{ij}$ encoding the strength of similarity between the vertices. If there is no edge between $v_i$ and $v_j$, we set $w_{ij} = 0$. We assume that $w_{ii} = 0$ for all $i \in [n]$. Given $k \in \mathbb{N}$, unnormalized SC aims to partition $V$ into $k$ clusters with minimum value of the RatioCut objective function (see von Luxburg, 2007, for details): for a clustering $V = C_1 \dot\cup \cdots \dot\cup C_k$ we have
$$\mathrm{RatioCut}(C_1, \ldots, C_k) = \sum_{l=1}^{k} \frac{\mathrm{cut}(C_l, V \setminus C_l)}{|C_l|}, \qquad (1)$$
where $\mathrm{cut}(C_l, V \setminus C_l)$ denotes the weight of the cut between $C_l$ and $V \setminus C_l$, that is $\mathrm{cut}(C_l, V \setminus C_l) = \sum_{v_i \in C_l,\, v_j \in V \setminus C_l} w_{ij}$.
Let $W = (w_{ij})_{i,j \in [n]}$ be the weighted adjacency matrix of $G$ and $D$ be the degree matrix, that is, a diagonal matrix with the vertex degrees $d_i = \sum_{j=1}^{n} w_{ij}$, $i \in [n]$, on the diagonal. Let $L = D - W$ denote the unnormalized graph Laplacian matrix. Note that $L$ is positive semi-definite. A key insight is that if we encode a clustering $V = C_1 \dot\cup \cdots \dot\cup C_k$ by a matrix $H \in \mathbb{R}^{n \times k}$ with
$$H_{il} = \begin{cases} 1/\sqrt{|C_l|}, & v_i \in C_l, \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$
then $\mathrm{RatioCut}(C_1, \ldots, C_k) = \mathrm{Tr}(H^\top L H)$. Hence, in order to minimize the RatioCut function over all possible clusterings, we could instead solve
$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}(H^\top L H) \quad \text{subject to } H \text{ being of the form (2)}. \qquad (3)$$
Spectral clustering relaxes this minimization problem by replacing the requirement that $H$ has to be of the form (2) with the weaker requirement $H^\top H = I_k$; that is, it solves
$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}(H^\top L H) \quad \text{subject to } H^\top H = I_k. \qquad (4)$$
Since $L$ is symmetric, it is well known that a solution to (4) is given by a matrix $H$ that contains some orthonormal eigenvectors corresponding to the $k$ smallest eigenvalues (respecting multiplicities) of $L$ as columns (Lütkepohl, 1996, Section 5.2.2). Consequently, the first step of SC is to compute such an optimal $H$ by computing the $k$ smallest eigenvalues and corresponding eigenvectors. The second step is to infer a clustering from $H$. While there is a one-to-one correspondence between a clustering and a matrix of the form (2), this is not the case for a solution to the relaxed problem (4). Usually, a clustering of $V$ is inferred from $H$ by applying $k$-means clustering to the rows of $H$. We summarize unnormalized SC as Algorithm 1. Note that, in general, there is no guarantee on how close the RatioCut value of the clustering obtained by Algorithm 1 is to the RatioCut value of an optimal clustering solving (3).

3 Adding Fairness Constraints
We now extend the above setting to incorporate fairness constraints. Suppose that the data set contains $h$ groups such that $V = V_1 \dot\cup \cdots \dot\cup V_h$. Chierichetti et al. (2017) proposed a notion of fairness for clustering asking that every cluster contains approximately the same number of elements from each group $V_s$. For a clustering $V = C_1 \dot\cup \cdots \dot\cup C_k$, define the balance of cluster $C_l$ as
$$\mathrm{balance}(C_l) = \min_{s \neq s' \in [h]} \frac{|V_s \cap C_l|}{|V_{s'} \cap C_l|}. \qquad (5)$$
The higher the balance of each cluster, the fairer the clustering is according to the notion of Chierichetti et al. (2017). For any clustering, we have $\min_{l \in [k]} \mathrm{balance}(C_l) \leq \min_{s \neq s' \in [h]} |V_s| / |V_{s'}|$, so that this fairness notion is actually asking for a clustering in which, in every cluster, each group is (approximately) represented with the same fraction as in the whole data set $V$. The following lemma shows how to incorporate this goal into the RatioCut minimization problem (3) using a linear constraint on the matrix $H$.
Lemma 1 (Fairness constraints as a linear constraint on $H$).
For $s \in [h-1]$, let $f^{(s)} \in \{0, 1\}^n$ be the group-membership vector of $V_s$, that is $f^{(s)}_i = 1$ if $v_i \in V_s$ and $f^{(s)}_i = 0$ otherwise. Let $V = C_1 \dot\cup \cdots \dot\cup C_k$ be a clustering that is encoded as in (2). We have, for every $l \in [k]$,
$$\forall s \in [h-1]: \sum_{i=1}^{n} \Bigl(f^{(s)}_i - \frac{|V_s|}{n}\Bigr) H_{il} = 0 \quad \Longleftrightarrow \quad \forall s \in [h]: \frac{|V_s \cap C_l|}{|C_l|} = \frac{|V_s|}{n}.$$
Proof.
This simply follows from $\sum_{i=1}^{n} f^{(s)}_i H_{il} = |V_s \cap C_l| / \sqrt{|C_l|}$ and $\sum_{i=1}^{n} H_{il} = |C_l| / \sqrt{|C_l|}$. ∎
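Lemma 1 can be sanity-checked numerically on a toy example (hypothetical data: four vertices and two groups; the helper `encode` builds the matrix $H$ of (2)):

```python
# Numeric sanity check of Lemma 1 on a hypothetical toy example:
# n = 4 vertices, two groups {v_1, v_2} and {v_3, v_4}.
import numpy as np

def encode(clusters, n):
    """Matrix H of (2): H[i, l] = 1/sqrt(|C_l|) if v_i is in C_l, else 0."""
    H = np.zeros((n, len(clusters)))
    for l, C in enumerate(clusters):
        H[list(C), l] = 1.0 / np.sqrt(len(C))
    return H

n = 4
f = np.array([1.0, 1.0, 0.0, 0.0])      # group-membership vector of V_1
F = (f - f.sum() / n).reshape(n, 1)     # column f^(1) - (|V_1|/n) * 1_n

H_fair = encode([{0, 2}, {1, 3}], n)    # each cluster: one vertex per group
H_unfair = encode([{0, 1}, {2, 3}], n)  # clusters coincide with the groups

fair_holds = np.allclose(F.T @ H_fair, 0)      # True: constraint satisfied
unfair_holds = np.allclose(F.T @ H_unfair, 0)  # False: constraint violated
```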
Consequently, if we want to find a clustering that minimizes the RatioCut objective function and is as fair as possible, we have to solve
$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}(H^\top L H) \quad \text{subject to } H \text{ being of the form (2) and } F^\top H = 0, \qquad (6)$$
where $F \in \mathbb{R}^{n \times (h-1)}$ is the matrix that has the vectors $f^{(s)} - (|V_s|/n) \mathbb{1}_n$, $s \in [h-1]$, as columns. In the same way as we have relaxed (3) to (4), we may relax the minimization problem (6) to
$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}(H^\top L H) \quad \text{subject to } H^\top H = I_k \text{ and } F^\top H = 0. \qquad (7)$$
Our proposed approach to incorporating the fairness notion of Chierichetti et al. (2017) into the SC framework consists of solving (7) instead of (4) (and, as before, applying $k$-means clustering to the rows of an optimal $H$ in order to infer a clustering). Our approach is analogous to the numerous versions of constrained SC that incorporate must-link constraints (“vertices A and B should end up in the same cluster”) by putting a linear constraint on $H$ (e.g., Yu & Shi, 2004; Kawale & Boley, 2013; see Section 5).
Next, we describe a straightforward way to solve (7), which is also discussed by Yu & Shi (2004). It is easy to see that $\mathrm{rank}(F) = h - 1$. We need to assume that $k \leq n - h + 1$ since otherwise (7) does not have any solution. Let $Z \in \mathbb{R}^{n \times (n-h+1)}$ be a matrix whose columns form an orthonormal basis of the nullspace of $F^\top$. We can substitute $H = ZY$ for $Y \in \mathbb{R}^{(n-h+1) \times k}$ and then, using that $Z^\top Z = I_{n-h+1}$, problem (7) becomes
$$\min_{Y \in \mathbb{R}^{(n-h+1) \times k}} \mathrm{Tr}(Y^\top Z^\top L Z Y) \quad \text{subject to } Y^\top Y = I_k. \qquad (8)$$
Similarly to problem (4), a solution to (8) is given by a matrix $Y$ that contains some orthonormal eigenvectors corresponding to the $k$ smallest eigenvalues (respecting multiplicities) of $Z^\top L Z$ as columns. We then set $H = ZY$.
This way of solving (7) gives rise to our “fair” version of unnormalized SC as stated in Algorithm 2. Note that just as there is no guarantee on the RatioCut value of the output of Algorithm 1 or Algorithm 2 compared to an optimal clustering, in general, there is also no guarantee on how fair the output of Algorithm 2 is. We will still refer to Algorithm 2 as our fair version of SC. Similarly to how we proceeded here, in Appendix A, we incorporate the fairness constraints into normalized SC and state our fair version of normalized SC as Algorithm 3.
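As an illustration, Algorithm 2 might be sketched as follows (a hypothetical implementation assuming dense matrices and scikit-learn's KMeans for the last step; `groups` is assumed to assign each vertex a group label in $\{0, \ldots, h-1\}$):

```python
# Hypothetical sketch of unnormalized SC with fairness constraints
# (Algorithm 2). `groups` assigns each vertex a label in {0, ..., h-1}.
import numpy as np
from scipy.linalg import eigh, null_space
from sklearn.cluster import KMeans

def fair_unnormalized_sc(W, groups, k, seed=0):
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    h = groups.max() + 1
    # columns f^(s) - (|V_s|/n) * 1_n for s = 1, ..., h-1, as in Lemma 1
    F = np.stack([(groups == s).astype(float) - np.mean(groups == s)
                  for s in range(h - 1)], axis=1)
    Z = null_space(F.T)                     # orthonormal basis of null(F^T)
    # k smallest eigenvalues/eigenvectors of Z^T L Z, then substitute back
    _, Y = eigh(Z.T @ L @ Z, subset_by_index=[0, k - 1])
    H = Z @ Y
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(H)
```

On graphs whose cheapest cut separates the groups, standard SC tends to split along the groups, whereas the constrained variant is steered towards a balanced clustering.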
One might wonder why we do not simply run standard SC on each group separately in order to derive a fair version. In Appendix D we show why such an idea does not work.
Computational complexity  We provide a complete discussion in Appendix B. With the implementation as stated, the complexity of both Algorithm 2 and Algorithm 3 is $O(n^3)$ regarding time and $O(n^2)$ regarding space, which is the same as the worst-case complexity of standard SC when the number of clusters $k$ can be arbitrary. One could apply one of the techniques suggested in the existing literature on constrained spectral clustering to speed up computation (e.g., Yu & Shi, 2004, or Xu et al., 2009; see Section 5), but most of these techniques only work for $k = 2$ clusters.
4 Variant of the Stochastic Block Model
In this section, our goal is to model data sets that have two or more meaningful ground-truth clusterings, of which only one is fair, and to show that our algorithms recover the fair ground-truth clustering. If there were only one meaningful ground-truth clustering and this clustering were fair, then any clustering algorithm that is able to recover the ground-truth clustering (e.g., standard SC) would be a fair algorithm. To this end, we define a variant of the famous stochastic block model (SBM; Holland et al., 1983). The SBM is a random graph model that has been widely used to study the performance of clustering algorithms, including standard SC (see Section 5 for related work). In the traditional SBM there is a ground-truth clustering of the vertex set into $k$ clusters, and in a random graph generated from the model, two vertices $v_i$ and $v_j$ are connected with a probability that only depends on which clusters $v_i$ and $v_j$ belong to.
In our variant of the SBM we assume that $V$ comprises $h$ groups $V_1, \ldots, V_h$ and is partitioned into $k$ ground-truth clusters $C_1, \ldots, C_k$ such that $|V_s \cap C_l| / |C_l| = |V_s| / n$ for all $s \in [h]$ and $l \in [k]$. Hence, in every cluster each group is represented with the same fraction as in the whole data set and this ground-truth clustering is fair. Now we define a random graph on $V$ by connecting two vertices $v_i$ and $v_j$ with a certain probability that only depends on whether $v_i$ and $v_j$ are in the same cluster (or not) and on whether $v_i$ and $v_j$ are in the same group (or not). More specifically, we have
$$\Pr(v_i \sim v_j) = \begin{cases} a, & v_i, v_j \text{ in the same cluster and the same group}, \\ b, & v_i, v_j \text{ in different clusters, but the same group}, \\ c, & v_i, v_j \text{ in the same cluster, but different groups}, \\ d, & v_i, v_j \text{ in different clusters and different groups}, \end{cases} \qquad (9)$$
and assume that $a > b > c > d > 0$. As in the ordinary SBM, connecting $v_i$ and $v_j$ is independent of connecting $v_{i'}$ and $v_{j'}$ for $\{i, j\} \neq \{i', j'\}$. Every edge is assigned a weight of 1, that is, no two connected vertices are considered more similar to each other than any two other connected vertices.
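A sampler for this model might look as follows (a hypothetical sketch, not the code used in Section 6; the roles of the parameters `a`, `b`, `c`, `d` follow the same-cluster / same-group cases of (9)):

```python
# Hypothetical sampler for the SBM variant (9); a, b, c, d are the
# edge probabilities for the four same-cluster / same-group cases.
import numpy as np

def sample_variant_sbm(clusters, groups, a, b, c, d, rng=None):
    """Return a symmetric 0/1 adjacency matrix with zero diagonal."""
    rng = np.random.default_rng(rng)
    n = len(clusters)
    same_cl = clusters[:, None] == clusters[None, :]
    same_gr = groups[:, None] == groups[None, :]
    # edge probability for every pair of vertices
    P = np.where(same_cl, np.where(same_gr, a, c), np.where(same_gr, b, d))
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)        # keep one independent coin flip per pair
    return A + A.T
```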
An example of a graph generated from our model (with two clusters and two groups) can be seen in Figure 1. We can see that there are two meaningful ground-truth clusterings into two clusters: the fair ground-truth clustering $C_1, C_2$ and the partition into the groups $V_1, V_2$. Among these two clusterings, only $C_1, C_2$ is fair, since $\mathrm{balance}(C_l) = 1$ for $l \in \{1, 2\}$ while $\mathrm{balance}(V_s) = 0$ for $s \in \{1, 2\}$. Note that the clustering $V_1, V_2$ has a smaller RatioCut value than $C_1, C_2$ because edges within a group are more likely than edges between the groups. As we will see in the experiments in Section 6 (and as can also be seen from the proof of the following Theorem 1), for such a graph, standard SC is very likely to return the unfair clustering $V_1, V_2$ as output. In contrast, our fair versions of SC return the fair clustering $C_1, C_2$ with high probability:
Theorem 1 (SC with fairness constraints succeeds on variant of stochastic block model).
Let $V$ comprise $h$ groups and be partitioned into $k$ ground-truth clusters $C_1, \ldots, C_k$ such that
$$|V_s \cap C_l| = \frac{n}{kh} \quad \text{for all } s \in [h] \text{ and } l \in [k]. \qquad (10)$$
Let $G$ be a random graph constructed according to our variant of the stochastic block model (9) with probabilities $a$, $b$, $c$, $d$ satisfying $a > b > c > d > 0$.
Assume that we run Algorithm 2 or Algorithm 3 (stated in Appendix A) on $G$, where we apply a $\gamma$-approximation algorithm to the $k$-means problem encountered in the last step of Algorithm 2 or Algorithm 3, for some $\gamma \geq 1$. Then there exist constants such that the following is true:


Unnormalized SC with fairness constraints
If
(11) then with probability at least , the clustering returned by Algorithm 2 misclassifies at most
(12) many vertices.

Normalized SC with fairness constraints
Let . If
(13) then with probability at least , the clustering returned by Algorithm 3 misclassifies at most
(14) many vertices.
We make several remarks on Theorem 1:


By “misclassifies at most a certain number of vertices” we mean that, considering the index of the cluster that a vertex belongs to as the vertex’s class label, there exists a permutation of cluster indices such that, up to this permutation, the clustering returned by our algorithm predicts the correct class label for all but that number of vertices.
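This error measure might be computed as follows (a hypothetical helper; `linear_sum_assignment` finds the best permutation via the Hungarian method):

```python
# Hypothetical helper computing the error measure of Theorem 1:
# the number of misclassified vertices under the best permutation
# of cluster indices (found via the Hungarian method).
import numpy as np
from scipy.optimize import linear_sum_assignment

def num_misclassified(pred, truth, k):
    """Smallest number of label disagreements over all permutations."""
    # confusion[i, j] = number of vertices with predicted label i, true label j
    confusion = np.zeros((k, k), dtype=int)
    for p, t in zip(pred, truth):
        confusion[p, t] += 1
    # maximize the number of matched vertices
    row, col = linear_sum_assignment(-confusion)
    return len(pred) - confusion[row, col].sum()
```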

The condition (11) is satisfied, for sufficiently large and assuming that (see the next remark), in various regimes: assuming that for some , it is satisfied in the dense regime , but also in the sparse regime for some .
The same is true for condition (13), but here we require and . We suspect that condition (13), with respect to , is stronger than necessary. We also suspect that the error bound in (14) is not tight with respect to . Note that in (14), both in the dense and in the sparse regime, the term is dominating over the term by the factor .
Both in the dense and in the sparse regime, under these assumptions, the error bounds (12) and (14) divided by $n$, that is, the fraction of misclassified vertices, tend to zero as $n$ goes to infinity. Using the terminology prevalent in the literature on community detection in SBMs (see Section 5), we may say that our algorithms are weakly consistent or solve the almost exact recovery problem.

There are efficient approximation algorithms for the $k$-means problem. An algorithm by Ahmadian et al. (2017) achieves a constant approximation factor and has running time polynomial in the number of data points $n$, the number of clusters $k$, and the dimension. There is also the famous $(1+\varepsilon)$-approximation algorithm by Kumar et al. (2004) with running time linear in $n$ and the dimension, but exponential in $k$ and $1/\varepsilon$. The algorithm most widely used in practice (e.g., as the default method in Matlab) is $k$-means++, which is a randomized approximation algorithm (Arthur & Vassilvitskii, 2007).

We show empirically in Section 6 that our algorithms are also able to find the fair ground-truth clustering in a graph constructed according to our variant of the SBM when (10) is not satisfied, that is, when the clusters are of different size or the balance of the fair ground-truth clustering is not maximal. For Algorithm 3, the violation of (10) can be more severe than for Algorithm 2. In general, we observe Algorithm 3 to outperform Algorithm 2. This is in accordance with the behavior of standard SC, where normalized SC has been observed to outperform unnormalized SC (von Luxburg, 2007; Sarkar & Bickel, 2015).
The proof of Theorem 1 can be found in Appendix C. It involves two technical challenges (described here only for the unnormalized case). The first one is to compute the eigenvalues and eigenvectors of the matrix $Z^\top \mathcal{L} Z$, where $\mathcal{L}$ is the expected Laplacian matrix of the random graph $G$ and $Z$ is the matrix computed in Algorithm 2. Let $\mathcal{Y}$ be a matrix containing some orthonormal eigenvectors corresponding to the $k$ smallest eigenvalues of $Z^\top \mathcal{L} Z$ as columns and $Y$ be a matrix containing orthonormal eigenvectors corresponding to the $k$ smallest eigenvalues of $Z^\top L Z$, where $L$ is the observed Laplacian matrix of $G$. The second challenge is to prove that, with high probability, $Y$ is close to $\mathcal{Y}$. For doing so, we make use of the famous Davis–Kahan $\sin\Theta$ theorem (Davis & Kahan, 1970). After that, we can use existing results about $k$-means clustering of perturbed eigenvectors (Lei & Rinaldo, 2015) to derive the theorem.
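The perturbation step relies on a bound of the following flavor; we state one commonly used simplified form for symmetric matrices (the exact variant applied in the appendix may differ):

```latex
% Simplified Davis–Kahan sin-Theta bound (symmetric case).
% L and \hat{L} are symmetric n x n matrices; U and \hat{U} contain
% orthonormal eigenvectors of their k smallest eigenvalues.
% If the eigengap \delta = \lambda_{k+1}(L) - \lambda_k(L) is positive, then
\[
  \bigl\| \sin \Theta(U, \hat{U}) \bigr\|_F
  \;\leq\; \frac{2\, \| \hat{L} - L \|_F}{\delta},
\]
% so a small perturbation of the matrix moves the k-dimensional
% eigenspace only slightly, up to an orthogonal transformation.
```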
5 Related Work
Spectral clustering and stochastic block model  SC is one of the most prominent clustering techniques, with a long history and an abundance of related papers. See von Luxburg (2007) or Nascimento & de Carvalho (2011) for general introductions and an overview of the literature. There are numerous papers on constrained SC, where the goal is to incorporate prior knowledge about the target clustering (usually in the form of must-link and/or cannot-link constraints) into the SC framework (e.g., Yu & Shi, 2001, 2004; Joachims, 2003; Lu & Carreira-Perpinan, 2008; Xu et al., 2009; Wang & Davidson, 2010; Eriksson et al., 2011; Maji et al., 2011; Kawale & Boley, 2013; Khoreva et al., 2014; Wang et al., 2014; Cucuringu et al., 2016). Most of these papers are motivated by the use of SC in image or video segmentation. Closely related to our work are the papers by Yu & Shi (2004); Xu et al. (2009); Eriksson et al. (2011); Kawale & Boley (2013), which incorporate the prior knowledge by imposing a linear constraint in the RatioCut or NCut optimization problem, analogous to how we derived our fair versions of SC. These papers provide efficient algorithms to solve the resulting optimization problems. However, the iterative algorithms by Xu et al. (2009); Eriksson et al. (2011); Kawale & Boley (2013) only work for $k = 2$ clusters. The method by Yu & Shi (2004) works for arbitrary $k$ and could be used to speed up the computation of a solution of (7) or (18) compared to our straightforward approach as implemented by Algorithm 2 and Algorithm 3, respectively, but requires modifying the eigensolver in use.
The stochastic block model (SBM; Holland et al., 1983) is the canonical model to study the performance of clustering algorithms. There exist several variants of the original model such as the degreecorrected SBM or the labeled SBM. For a recent survey see Abbe (2018). In the labeled SBM, vertices can carry a label that is correlated with the groundtruth clustering. This is quite the opposite of our model, in which the groupmembership information is “orthogonal” to the groundtruth clustering. Several papers show the consistency (i.e., the capability to recover the groundtruth clustering) of different versions of SC on the SBM or the degreecorrected SBM under different assumptions (Rohe et al., 2011; Fishkind et al., 2013; Qin & Rohe, 2013; Lei & Rinaldo, 2015; Joseph & Yu, 2016; Su et al., 2017). For example, Rohe et al. (2011) show consistency of normalized SC assuming that the minimum expected vertex degree is in , while Lei & Rinaldo (2015) show that SC based on the adjacency matrix is consistent requiring only that the maximum expected degree is in . Note that these papers also make assumptions on the eigenvalues of the expected Laplacian or adjacency matrix while all assumptions and guarantees stated in our Theorem 1 directly depend on the connection probabilities of our model. We are not aware of any work providing consistency results for constrained SC methods as we do in this paper.
Fairness  By now, there is a huge body of work on fairness in machine learning. For a recent paper providing an overview of the literature on fair classification, see Donini et al. (2018). Our paper adds to the literature on fair methods for unsupervised learning tasks (Chierichetti et al., 2017; Celis et al., 2018a, b, c; Samadi et al., 2018; Schmidt et al., 2018). Note that all of these papers assume knowledge of which demographic group a data point belongs to, just as we do. We discuss the pieces of work most closely related to our paper.
Chierichetti et al. (2017) proposed the notion of fairness for clustering underlying our paper. It is based on the fairness notion of disparate impact (Feldman et al., 2015) and the $p\%$-rule (Zafar et al., 2017), respectively, which essentially say that the output of a machine learning algorithm should be independent of a sensitive attribute. In their paper, Chierichetti et al. focus on $k$-median and $k$-center clustering. For the case of a binary sensitive attribute, that is, when there are only two demographic groups, they provide approximation algorithms for the problems of finding a clustering with minimum $k$-median / $k$-center cost under the constraint that all clusters have some prespecified level of balance. Subsequently, Rösner & Schmidt (2018) provided an approximation algorithm for such a fair $k$-center problem with multiple groups. Schmidt et al. (2018) build upon the fairness notion and techniques of Chierichetti et al. and devise an approximation algorithm for the fair $k$-means problem, assuming that there are only two groups of the same size.
6 Experiments
In this section, we present a number of experiments. We first study our fair versions of spectral clustering, Algorithm 2 and Algorithm 3, on synthetic data generated according to our variant of the SBM and compare them to standard SC. We also study how robust our algorithms are with respect to a certain perturbation of our model. We then compare our algorithms to standard SC on real network data. We implemented all algorithms in Matlab. We used the built-in function for $k$-means clustering with all parameters set to their default values except for the number of replicates, which we set to 10. In the following, all plots show average results obtained from running an experiment 100 times.
6.1 Synthetic Data
We run experiments on our variant of the SBM introduced in Section 4. To assess the quality of a clustering, we measure the fraction of misclassified vertices w.r.t. the fair ground-truth clustering (cf. Section 4), which we refer to as error.
In the experiments of Figure 2, we study the performance of standard unnormalized and normalized SC and of our fair versions, Algorithm 2 and Algorithm 3, as a function of $n$. Due to the high running time of Algorithm 3 (cf. Section 3), we only run it up to moderate values of $n$. All plots show the error of the methods, except for the fourth plot in the first row, which shows their runtime. We study several parameter settings. For the plots in the first row, Assumption (10) in Theorem 1 is satisfied, that is, all clusters have the same size and every group is represented in every cluster with the same fraction. In this case, in accordance with Theorem 1, both Algorithm 2 and Algorithm 3 are able to recover the fair ground-truth clustering if $n$ is just large enough, while standard SC always fails to do so. Algorithm 3 yields significantly better results than Algorithm 2 and requires much smaller values of $n$ for achieving zero error. This comes at the cost of a higher running time of Algorithm 3 (still within the complexity bound claimed in Section 3). The runtime of Algorithm 2 is the same as the runtime of standard normalized SC. For the plots in the second row, Assumption (10) in Theorem 1 is not satisfied. We consider various scenarios of cluster sizes and group sizes (however, the ground-truth clustering is always as fair as possible). When the cluster sizes are different, but the group sizes are all equal to each other (1st plot in the 2nd row), or Assumption (10) is only slightly violated (2nd plot), both Algorithm 2 and Algorithm 3 are still able to recover the fair ground-truth clustering. Compared to the plots in the first row, Algorithm 2 requires a larger value of $n$, though. Algorithm 3 achieves (almost) zero error already for small $n$ in these scenarios. When Assumption (10) is strongly violated (3rd and 4th plot), Algorithm 2 fails to recover the fair ground-truth clustering, but Algorithm 3 still succeeds.
In the experiments shown in Figure 3, we study the error of Algorithm 2 (left plot) and Algorithm 3 (right plot) as a function of $k$ when $n$ is roughly fixed, in settings that allow for fair ground-truth clusterings satisfying (10). We consider several choices of the connection probabilities $a$, $b$, $c$, $d$. Unsurprisingly, for both Algorithm 2 and Algorithm 3 the error is monotonically increasing with $k$. The rate of increase critically depends on the connection probabilities; for Algorithm 2, this dependence is even more severe. There is only a small range in which the various curves exhibit polynomial growth, which makes it impossible to empirically evaluate whether our error guarantees (12) and (14) are tight with respect to $k$.
In the experiments of Figure 4, we consider a perturbation of our model as follows: we first generate a graph from our model just as before (Assumption (10) is satisfied; in particular, the two groups have the same size), but then we assign some of the vertices in the first group to the other group. Concretely, for a perturbation parameter $p \in [0, 1]$, each vertex in the first group is assigned to the second one with probability $p$, independently of the other vertices. The case $p = 0$ is our model without any perturbation. If $p = 1$, there is only one group and our algorithms technically coincide with standard unnormalized or normalized SC. The two plots show the error of our algorithms and standard SC as a function of $p$. Both our algorithms show the same behavior: they are robust against the perturbation up to moderate values of $p$, and as $p$ approaches 1 they yield the same error as standard SC.
6.2 Real Data
In the experiments of Figure 5, we evaluate the performance of standard unnormalized and normalized SC versus our fair versions on real network data. The quality of a clustering is measured through its “Balance” (defined as the average of the balance (5) over all clusters; shown on the left axis of the plots) and its RatioCut (1) or NCut (15) value (right axis). All networks that we work with are the largest connected component of an originally unconnected network.
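The “Balance” score used here (the average of (5) over all clusters) might be computed as follows (a hypothetical helper; `labels` and `groups` assign each vertex a cluster and a group, respectively):

```python
# Hypothetical helper for the "Balance" score of Figure 5: the average
# over clusters of (5), i.e. the smallest ratio between group counts.
import numpy as np

def average_balance(labels, groups):
    balances = []
    for c in np.unique(labels):
        counts = np.array([np.sum(groups[labels == c] == s)
                           for s in np.unique(groups)])
        # a cluster missing some group entirely has balance 0
        balances.append(counts.min() / counts.max())
    return float(np.mean(balances))
```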
The first row of Figure 5 shows the results as a function of the number of clusters $k$ for two high school friendship networks (Mastrandrea et al., 2015). Vertices correspond to students and are split into two groups of males and females. FriendshipNet has 127 vertices, and an edge between two students indicates that one of them reported friendship with the other one. FacebookNet consists of 155 vertices, and an edge between two students indicates friendship on Facebook. As we can see from the plots, compared to standard SC, our fair versions improve the output clustering’s balance on average over $k$ while leaving its RatioCut or NCut value almost unchanged.
The second row shows the results for DrugNet, a network encoding acquaintanceship between drug users in Hartford, CT (Weeks et al., 2002). In the left two plots, the network consists of 185 vertices split into two groups of males and females (we had to remove some vertices for which the gender was not known). In the right two plots, the network has 193 vertices split into three ethnic groups of African Americans, Latinos, and others. Again, our fair versions of SC quite significantly improve the balance of the output clustering over standard SC on average over $k$. However, in the right two plots we also observe a moderate increase of the RatioCut or NCut value.
7 Discussion
In this work, we presented an algorithmic approach towards incorporating fairness constraints into the SC framework. We provided a rigorous analysis of our algorithms and proved that they can recover fair groundtruth clusterings in a natural variant of the stochastic block model. Furthermore, we provided strong empirical evidence that often in real data sets, it is possible to achieve higher demographic proportionality at minimal additional cost in the clustering objective.
An important direction for future work is to understand the price of fairness in the SC framework if one needs to satisfy the fairness constraints exactly. One way to achieve this would be to run the fair $k$-means algorithm of Schmidt et al. (2018) in the last step of our Algorithm 2 or 3. We want to point out that the algorithm of Schmidt et al. currently does not extend beyond two groups of the same size. Second, our experimental results on the stochastic block model provide evidence that our algorithms are robust to moderate levels of perturbation in the group assignments. Characterizing this robustness rigorously is an intriguing open problem.
References
 Abbe (2018) Abbe, E. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research (JMLR), 18:1–86, 2018.

Ahmadian et al. (2017)
Ahmadian, S., NorouziFard, A., Svensson, O., and Ward, J.
Better guarantees for kmeans and Euclidean kmedian by primaldual algorithms.
In Symposium on Foundations of Computer Science (FOCS), 2017.  Arthur & Vassilvitskii (2007) Arthur, D. and Vassilvitskii, S. kmeans++: The advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), 2007.
 Bai et al. (2000) Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H. (eds.). Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, 2000.
 Bhatia (1997) Bhatia, R. Matrix Analysis. Springer, 1997.
 Boucheron et al. (2004) Boucheron, S., Lugosi, G., and Bousquet, O. Concentration inequalities. In Advanced Lectures on Machine Learning. Springer, 2004.

Celis et al. (2018a)
Celis, L. E., Huang, L., and Vishnoi, N. K.
Multiwinner voting with fairness constraints.
In
International Joint Conference on Artificial Intelligence (IJCAI)
, 2018a.  Celis et al. (2018b) Celis, L. E., Keswani, V., Straszak, D., Deshpande, A., Kathuria, T., and Vishnoi, N. K. Fair and diverse DPPbased data summarization. In International Conference on Machine Learning (ICML), 2018b.
 Celis et al. (2018c) Celis, L. E., Straszak, D., and Vishnoi, N. K. Ranking with fairness constraints. In International Colloquium on Automata, Languages and Programming (ICALP), 2018c.
 Chierichetti et al. (2017) Chierichetti, F., Kumar, R., Lattanzi, S., and Vassilvitskii, S. Fair clustering through fairlets. In Neural Information Processing Systems (NIPS), 2017.
 Chouldechova (2017) Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
 Cucuringu et al. (2016) Cucuringu, M., Koutis, I., Chawla, S., Miller, G., and Peng, R. Simple and scalable constrained clustering: A generalized spectral method. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Davis & Kahan (1970) Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
 Donini et al. (2018) Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., and Pontil, M. Empirical risk minimization under fairness constraints. In Neural Information Processing Systems (NeurIPS), 2018.
 Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Innovations in Theoretical Computer Science Conference (ITCS), 2012.
 Eriksson et al. (2011) Eriksson, A., Olsson, C., and Kahl, F. Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints. Journal of Mathematical Imaging and Vision, 39(1):45–61, 2011.
 Feldman et al. (2015) Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. Certifying and removing disparate impact. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2015.
 Fishkind et al. (2013) Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T., and Priebe, C. E. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34(1):23–39, 2013.
 Golub & Van Loan (2013) Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, 2013.
 Hagen & Kahng (1992) Hagen, L. and Kahng, A. B. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.
 Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NIPS), 2016.
 Holland et al. (1983) Holland, P. W., Laskey, K. B., and Leinhardt, S. Stochastic blockmodels: First steps. Social Networks, 5:109–137, 1983.
 Joachims (2003) Joachims, T. Transductive learning via spectral graph partitioning. In International Conference on Machine Learning (ICML), 2003.
 Joseph & Yu (2016) Joseph, A. and Yu, B. Impact of regularization on spectral clustering. The Annals of Statistics, 44(4):1765–1791, 2016.
 Kawale & Boley (2013) Kawale, J. and Boley, D. Constrained spectral clustering using L1 regularization. In SIAM International Conference on Data Mining (SDM), 2013.
 Khoreva et al. (2014) Khoreva, A., Galasso, F., Hein, M., and Schiele, B. Learning must-link constraints for video segmentation based on spectral clustering. In German Conference on Pattern Recognition (GCPR), 2014.
 Kleinberg et al. (2017) Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science Conference (ITCS), 2017.
 Kumar et al. (2004) Kumar, A., Sabharwal, Y., and Sen, S. A simple linear time $(1+\varepsilon)$-approximation algorithm for k-means clustering in any dimensions. In Symposium on Foundations of Computer Science (FOCS), 2004.
 Lei & Rinaldo (2015) Lei, J. and Rinaldo, A. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
 Li et al. (2011) Li, M., Lian, X.-C., Kwok, J. T.-Y., and Lu, B.-L. Time and space efficient spectral clustering via column sampling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
 Lu & Carreira-Perpinan (2008) Lu, Z. and Carreira-Perpinan, M. A. Constrained spectral clustering through affinity propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
 Lütkepohl (1996) Lütkepohl, H. Handbook of Matrices. Wiley & Sons, 1996.
 Maji et al. (2011) Maji, S., Vishnoi, N. K., and Malik, J. Biased normalized cuts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
 Mastrandrea et al. (2015) Mastrandrea, R., Fournet, J., and Barrat, A. Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE, 10(9):1–26, 2015. Data available on http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/.
 Nascimento & de Carvalho (2011) Nascimento, M. C. V. and de Carvalho, A. C. P. L. F. Spectral methods for graph clustering – A survey. European Journal of Operational Research, 211(2):221–231, 2011.
 Qin & Rohe (2013) Qin, T. and Rohe, K. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Neural Information Processing Systems (NIPS), 2013.
 Rohe et al. (2011) Rohe, K., Chatterjee, S., and Yu, B. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
 Rösner & Schmidt (2018) Rösner, C. and Schmidt, M. Privacy preserving clustering with constraints. In International Colloquium on Automata, Languages, and Programming (ICALP), 2018.
 Samadi et al. (2018) Samadi, S., Tantipongpipat, U., Morgenstern, J., Singh, M., and Vempala, S. The price of fair PCA: One extra dimension. In Neural Information Processing Systems (NeurIPS), 2018.
 Sarkar & Bickel (2015) Sarkar, P. and Bickel, P. J. Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics, 43(3):962–990, 2015.
 Schmidt et al. (2018) Schmidt, M., Schwiegelshohn, C., and Sohler, C. Fair coresets and streaming algorithms for fair k-means clustering. arXiv:1812.10854 [cs.DS], 2018.
 Shi & Malik (2000) Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
 Su et al. (2017) Su, L., Wang, W., and Zhang, Y. Strong consistency of spectral clustering for stochastic block models. arXiv:1710.06191 [stat.ME], 2017.
 von Luxburg (2007) von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 Vu & Lei (2013) Vu, V. Q. and Lei, J. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013.
 Wang & Davidson (2010) Wang, X. and Davidson, I. Flexible constrained spectral clustering. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2010.
 Wang et al. (2014) Wang, X., Qian, B., and Davidson, I. On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1):1–30, 2014.
 Weeks et al. (2002) Weeks, M. R., Clair, S., Borgatti, S. P., Radda, K., and Schensul, J. J. Social networks of drug users in high-risk sites: Finding the connections. AIDS and Behavior, 6(2):193–206, 2002. Data available on https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/drugnet.
 Xu et al. (2009) Xu, L., Li, W., and Schuurmans, D. Fast normalized cut with linear constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 Yan et al. (2009) Yan, D., Huang, L., and Jordan, M. I. Fast approximate spectral clustering. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
 Yu & Shi (2001) Yu, S. X. and Shi, J. Grouping with bias. In Neural Information Processing Systems (NIPS), 2001.
 Yu & Shi (2004) Yu, S. X. and Shi, J. Segmentation given partial grouping constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):173–183, 2004.
 Zafar et al. (2017) Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
Appendix
Appendix A Adding Fairness Constraints to Normalized Spectral Clustering
In this section we derive a fair version of normalized spectral clustering (similarly to how we proceeded for unnormalized spectral clustering in Sections 2 and 3 of the main paper).
Normalized spectral clustering aims at partitioning $V$ into $k$ clusters with minimum value of the NCut objective function as follows (see von Luxburg, 2007, for details): for a clustering $C_1,\dots,C_k$ we have
(15) $\mathrm{NCut}(C_1,\dots,C_k)=\sum_{l=1}^{k}\frac{\mathrm{cut}(C_l,V\setminus C_l)}{\mathrm{vol}(C_l)},$
where $\mathrm{vol}(C_l)=\sum_{i\in C_l} d_i$. Encoding a clustering by a matrix $H\in\mathbb{R}^{n\times k}$ with
(16) $H_{il}=\begin{cases}1/\sqrt{\mathrm{vol}(C_l)}, & i\in C_l,\\ 0, & \text{otherwise},\end{cases}$
we have $\mathrm{NCut}(C_1,\dots,C_k)=\mathrm{Tr}(H^\top L H)$. Note that any $H$ of the form (16) satisfies $H^\top D H=I_k$. Normalized spectral clustering relaxes the problem of minimizing $\mathrm{NCut}(C_1,\dots,C_k)$ over all $H$ of the form (16) to
(17) $\min_{H\in\mathbb{R}^{n\times k}} \mathrm{Tr}(H^\top L H) \quad\text{subject to}\quad H^\top D H=I_k.$
Substituting $T=D^{1/2}H$ for $H$ (we need to assume that $G$ does not contain any isolated vertices since otherwise $D^{-1/2}$ does not exist), problem (17) becomes
$\min_{T\in\mathbb{R}^{n\times k}} \mathrm{Tr}(T^\top D^{-1/2} L D^{-1/2} T) \quad\text{subject to}\quad T^\top T=I_k.$
Similarly to unnormalized spectral clustering, normalized spectral clustering computes an optimal $T$ by computing the $k$ smallest eigenvalues and some corresponding orthonormal eigenvectors of $D^{-1/2} L D^{-1/2}$ and applies $k$-means clustering to the rows of $H=D^{-1/2}T$ (in practice, $H$ can be computed directly by solving the generalized eigenproblem $Lu=\lambda Du$, $u\in\mathbb{R}^n$, $\lambda\in\mathbb{R}$; see von Luxburg, 2007).
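The standard procedure just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses SciPy's generalized symmetric eigensolver and scikit-learn's k-means, and the function name and input conventions are ours.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_sc(W, k):
    """Standard normalized spectral clustering via the generalized
    eigenproblem L u = lambda D u (von Luxburg, 2007). Sketch only;
    assumes W is a symmetric adjacency matrix with no isolated vertices."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized Laplacian
    # k smallest generalized eigenpairs; the rows of H are the embedding
    _, H = eigh(L, D, subset_by_index=[0, k - 1])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H)
```

On a graph made of two disconnected cliques, the embedding is constant on each clique, so k-means with $k=2$ recovers the two components exactly.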
Now we want to derive our fair version of normalized spectral clustering. The first step is to show that Lemma 1 holds true if we encode a clustering as in (16):
Lemma 2 (Fairness constraint as linear constraint on $H$ for normalized spectral clustering).
For $s\in[h-1]$, let $f^{(s)}\in\{0,1\}^n$ be the group-membership vector of $V_s$, that is $f^{(s)}_i=1$ if $i\in V_s$ and $f^{(s)}_i=0$ otherwise. Let $C_1,\dots,C_k$ be a clustering that is encoded as $H$ in (16), with columns $h_1,\dots,h_k$. We have, for every $l\in[k]$,
$\Big(f^{(s)}-\frac{|V_s|}{n}\mathbf{1}_n\Big)^\top h_l=0 \quad\Longleftrightarrow\quad \frac{|V_s\cap C_l|}{|C_l|}=\frac{|V_s|}{n}.$
Proof.
This simply follows from
$\big(f^{(s)}\big)^\top h_l=\frac{|V_s\cap C_l|}{\sqrt{\mathrm{vol}(C_l)}}$
and $\mathbf{1}_n^\top h_l=\frac{|C_l|}{\sqrt{\mathrm{vol}(C_l)}}$. ∎
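The equivalence in the lemma is easy to check numerically on a toy instance. The following sketch uses NumPy; the graph, the groups, and the two clusterings are made-up examples, not data from the paper.

```python
import numpy as np

# Small example graph on n = 4 vertices (hypothetical weights).
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = W.sum(axis=1)            # degrees
n = len(d)

# Two demographic groups: V_1 = {0, 1}, V_2 = {2, 3}.
f = np.array([1., 1., 0., 0.])          # group-membership vector of V_1
F = (f - f.sum() / n).reshape(-1, 1)    # column f^(s) - (|V_s|/n) 1_n

def encode(clusters):
    """Encode a clustering as the matrix H from (16)."""
    H = np.zeros((n, len(clusters)))
    for l, C in enumerate(clusters):
        vol = d[list(C)].sum()
        H[list(C), l] = 1.0 / np.sqrt(vol)
    return H

H_fair   = encode([{0, 2}, {1, 3}])   # each cluster: one vertex per group
H_unfair = encode([{0, 1}, {2, 3}])   # clusters coincide with the groups

print(np.allclose(F.T @ H_fair, 0))    # True: fair clustering gives F^T H = 0
print(np.allclose(F.T @ H_unfair, 0))  # False
```

One can also verify that both encodings satisfy the normalization $H^\top D H = I_k$ noted after (16).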
Lemma 2 suggests that in a fair version of normalized spectral clustering, rather than solving (17), we should solve
(18) $\min_{H\in\mathbb{R}^{n\times k}} \mathrm{Tr}(H^\top L H) \quad\text{subject to}\quad H^\top D H=I_k,\; F^\top H=0,$
where $F\in\mathbb{R}^{n\times(h-1)}$ is the matrix that has the vectors $f^{(s)}-\frac{|V_s|}{n}\mathbf{1}_n$, $s\in[h-1]$, as columns (just as in Section 3). It is $\mathrm{rank}(F)=h-1$ and we need to assume that $k\le n-h+1$ since otherwise (18) does not have any solution. Let $Z\in\mathbb{R}^{n\times(n-h+1)}$ be a matrix whose columns form an orthonormal basis of the nullspace of $F^\top$. We substitute $ZY$ for $H$ and then problem (18) becomes
(19) $\min_{Y\in\mathbb{R}^{(n-h+1)\times k}} \mathrm{Tr}(Y^\top Z^\top L Z Y) \quad\text{subject to}\quad Y^\top Z^\top D Z Y=I_k.$
Assuming that $G$ does not contain any isolated vertices, $Z^\top D Z$ is positive definite and hence has a positive definite square root, that is there exists a positive definite $Q\in\mathbb{R}^{(n-h+1)\times(n-h+1)}$ with $Q^2=Z^\top D Z$. We can substitute $QY$ for $Y$ and then problem (19) becomes
(20) $\min_{T\in\mathbb{R}^{(n-h+1)\times k}} \mathrm{Tr}(T^\top Q^{-1} Z^\top L Z Q^{-1} T) \quad\text{subject to}\quad T^\top T=I_k.$
A solution to (20) is given by the matrix $T$ that contains some orthonormal eigenvectors corresponding to the $k$ smallest eigenvalues (respecting multiplicities) of $Q^{-1} Z^\top L Z Q^{-1}$ as columns. This gives rise to our fair version of normalized spectral clustering as stated in Algorithm 3.
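The chain of substitutions above translates directly into code. The following is a sketch under stated assumptions (NumPy/SciPy/scikit-learn; at least two groups; no isolated vertices); the function name and input conventions are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import null_space, sqrtm, eigh
from sklearn.cluster import KMeans

def fair_normalized_sc(W, groups, k):
    """Sketch of the fair normalized SC procedure derived above.

    W:      (n, n) symmetric adjacency matrix, no isolated vertices
    groups: length-n integer array with values in {0, ..., h-1}, h >= 2
    k:      number of clusters, k <= n - h + 1
    """
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized Laplacian

    # F has the columns f^(s) - (|V_s|/n) 1_n, s = 1, ..., h-1.
    h = int(groups.max()) + 1
    F = np.column_stack([(groups == s).astype(float) - (groups == s).mean()
                         for s in range(h - 1)])

    Z = null_space(F.T)                         # orthonormal basis of nullspace of F^T
    Q = np.real(sqrtm(Z.T @ D @ Z))             # positive definite square root
    Qinv = np.linalg.inv(Q)

    M = Qinv @ Z.T @ L @ Z @ Qinv
    M = (M + M.T) / 2                           # symmetrize against round-off
    _, T = eigh(M, subset_by_index=[0, k - 1])  # k smallest eigenvectors

    H = Z @ Qinv @ T                            # undo both substitutions
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H)
```

Note that the final embedding is $H = ZQ^{-1}T$, so the returned clusters automatically satisfy the relaxed fairness constraint $F^\top H = 0$.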
Appendix B Computational Complexity of our Algorithms
The costs of standard spectral clustering (e.g., Algorithm 1) are dominated by the complexity of the eigenvector computations and are commonly stated to be, in general, $O(n^3)$ regarding time and $O(n^2)$ regarding space for an arbitrary number of clusters $k$, unless approximations are applied (Yan et al., 2009; Li et al., 2011). In addition to the computations performed in Algorithm 1, in Algorithm 2 and Algorithm 3 we have to compute an orthonormal basis of the nullspace of $F^\top$, perform some matrix multiplications, and (only for Algorithm 3) compute the square root of an $(n-h+1)\times(n-h+1)$ matrix and the inverse of this square root. All these computations can be done in $O(n^3)$ regarding time and $O(n^2)$ regarding space (an orthonormal basis of the nullspace of $F^\top$ can be computed by means of an SVD; see, e.g., Golub & Van Loan, 2013), and hence our algorithms have the same worst-case complexity as standard spectral clustering. On the other hand, if the graph $G$, and thus the Laplacian matrix $L$, is sparse or $k$ is small, then the eigenvector computations in Algorithm 1 can be done more efficiently than with cubic running time (Bai et al., 2000). This is not the case for our algorithms as stated. However, one could apply one of the techniques suggested in the existing literature on constrained spectral clustering to speed up computation (e.g., Yu & Shi, 2004, or Xu et al., 2009; see Section 5 of the main paper). With the implementation as stated, in our experiments in Section 6 of the main paper we observe that Algorithm 2 has a running time similar to that of standard normalized spectral clustering, while the running time of Algorithm 3 is significantly higher.
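The SVD-based nullspace computation mentioned above can be illustrated as follows. This is a sketch with NumPy; the helper name and the rank tolerance are illustrative choices, not part of the paper.

```python
import numpy as np

def nullspace_basis(A, tol=1e-12):
    """Orthonormal basis of the nullspace of A via the SVD
    (see Golub & Van Loan, 2013): the right singular vectors
    beyond the numerical rank span the nullspace."""
    U, s, Vt = np.linalg.svd(A)
    rank = int((s > tol * s.max()).sum()) if s.size else 0
    return Vt[rank:].T

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])    # rank 1, so the nullspace has dimension 2
Z = nullspace_basis(A)
print(Z.shape)                  # (3, 2)
print(np.allclose(A @ Z, 0))    # True
```

The columns of the returned matrix are orthonormal by construction, which is exactly what the substitution $H = ZY$ in Appendix A requires.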
Appendix C Proof of Theorem 1
We split the proof of Theorem 1 into four parts. In the first part, we analyze the eigenvalues and eigenvectors of the expected adjacency matrix $\mathbb{E}[W]$ and of the matrix $Z^\top \mathbb{E}[L] Z$, where $\mathbb{E}[L]$ is the expected Laplacian matrix and $Z$ is the matrix computed in the execution of Algorithm 2 or Algorithm 3. In the second part, we study the deviation of the observed matrix $Z^\top L Z$ from the expected matrix $Z^\top \mathbb{E}[L] Z$. In the third part, we use the results from the first and the second part to prove Theorem 1 for Algorithm 2 (unnormalized SC with fairness constraints). In the fourth part, we prove Theorem 1 for Algorithm 3 (normalized SC with fairness constraints).
Notation
For $x\in\mathbb{R}^n$, by $\|x\|$ we denote the Euclidean norm of $x$, that is $\|x\|=\sqrt{\sum_{i=1}^n x_i^2}$. For $A\in\mathbb{R}^{n\times n}$, by $\|A\|$ we denote the operator norm (also known as spectral norm) and by $\|A\|_F$ the Frobenius norm of $A$. It is
(21) $\|A\|=\max_{x\in\mathbb{R}^n:\,\|x\|=1}\|Ax\|=\sqrt{\lambda_{\max}(A^\top A)},$
where $\lambda_{\max}(A^\top A)$ is the largest eigenvalue of $A^\top A$, and
(22) $\|A\|_F=\sqrt{\sum_{i,j=1}^n A_{ij}^2}.$
Note that for a symmetric matrix $A$ with eigenvalues $\lambda_1,\dots,\lambda_n$ we have $\|A\|=\max_{i\in[n]}|\lambda_i|$. It follows from (21) and (22) that for any $A$ with rank at most $r$ we have
(23) $\|A\|\le\|A\|_F\le\sqrt{r}\,\|A\|.$
We use $C$ to denote a universal constant that may change from line to line.
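The norm chain (23) can be sanity-checked numerically. The sketch below uses NumPy; the dimensions, the rank bound, and the random matrix are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
# Random 10 x 10 matrix of rank at most r (product of thin factors).
A = rng.standard_normal((10, r)) @ rng.standard_normal((r, 10))

op = np.linalg.norm(A, 2)        # operator norm, as in (21)
fro = np.linalg.norm(A, 'fro')   # Frobenius norm, as in (22)

# The chain (23): ||A|| <= ||A||_F <= sqrt(r) ||A||.
print(op <= fro + 1e-12, fro <= np.sqrt(r) * op + 1e-12)  # True True
```

The upper bound holds because a matrix of rank at most $r$ has at most $r$ nonzero singular values, each bounded by the largest one.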
Part 1: Eigenvalues and eigenvectors of $\mathbb{E}[W]$ and of $Z^\top \mathbb{E}[L] Z$
Assuming the vertices are sorted in a way such that