Complex networks in the real world, such as social networks, communication networks and biological networks, can be modeled as graphs. Graph analysis techniques have been extensively studied to help understand the features of networks. Community detection, which aims at finding cohesive subgraph structures in networks, is a fundamental problem in graph analysis that has attracted much attention for decades [22, 30, 16]. As an elementary model, the clique has been widely used to reveal dense community structures of graphs [13, 21]. Mining cliques in a graph has a wide range of applications, including mining overlapping communities in social networks, identifying protein complexes in protein networks, and finding groups with abnormal transactions in financial networks.
Many real-life networks are attributed graphs, where vertices or edges are associated with attribute information. A number of studies focus on finding communities in attributed graphs [23, 36, 12, 18, 32, 38, 40, 39]. However, those works either require a high correlation of attributes within a community or aim to find communities satisfying some attribute constraints. None of them takes into account the fairness of attributes in the community.
Recently, the concept of fairness has been studied mainly in the machine learning community [37, 14, 10]. Many studies reveal that a ranking produced by a biased machine learning model can result in systematic discrimination and reduce visibility for an already disadvantaged group (e.g., by incorporating gender, racial and other biases) [43, 34, 5]. Therefore, many different definitions of fairness, such as individual fairness and group fairness, and related algorithms were proposed to generate a fair ranking. Some other studies focus on fairness in classification models, such as demographic parity and equality of opportunity. All these studies suggest that the concept of fairness is very important in machine learning models.
Motivated by the concept of fairness in machine learning, we introduce fairness for an important graph mining task, i.e., mining cliques in a graph. Mining fair cliques has a variety of applications. For example, consider an online social network where each user has an attribute denoting his/her gender. We may want to find a clique community in which the numbers of males and females both reach a certain threshold, or in which the numbers of males and females are exactly the same. Compared to traditional clique communities, fair clique communities can overcome gender bias. As another example, in a collaboration network, each vertex represents a researcher and has an attribute representing his/her research topic. Fair cliques can be used to identify research groups whose members work closely together and also have diverse research topics, because fair cliques already account for fairness over different research topics.
In this paper, we focus on the problem of finding fairness-aware cliques in attributed graphs where each vertex has one attribute. We propose two new models to characterize the fairness of a clique, called weak fair clique and strong fair clique respectively. A weak fair clique is a maximal subgraph which 1) is a clique, and 2) contains, for every attribute value, no fewer vertices than a given threshold $k$; thus it can guarantee the fairness over all attributes to some extent. A strong fair clique is a maximal subgraph in which 1) the vertices form a clique, and 2) the number of vertices for each attribute value is exactly the same; thus it can fully guarantee the fairness over all attributes. We show that finding all weak or strong fair cliques is NP-hard. Furthermore, enumerating all strong fair cliques is often much more challenging than enumerating all weak fair cliques. To solve our problems, we first propose a backtracking enumeration algorithm with a novel colorful $k$-core based pruning technique to enumerate all weak fair cliques. Then, we propose an algorithm that enumerates all strong fair cliques based on a new attribute-alternatively-selection search strategy. We also develop several non-trivial ordering techniques to further speed up both algorithms. Below, we summarize the main contributions of this paper.
We propose a weak fair clique model and a strong fair clique model to characterize the fairness of a cohesive subgraph. To the best of our knowledge, we are the first to introduce the concept of fairness for cohesive subgraph modeling.
We first propose a novel concept called colorful $k$-core and develop a linear-time algorithm to compute it. We show that both weak fair cliques and strong fair cliques must be contained in the colorful $k$-core, thus we can use it to prune unpromising vertices when enumerating weak or strong fair cliques. Then, we propose a backtracking algorithm to enumerate all weak fair cliques with a colorful $k$-core induced ordering. To enumerate all strong fair cliques, we further develop a novel fairness $k$-core based pruning technique which is more effective than the colorful $k$-core pruning. We also propose a backtracking algorithm with a new attribute-alternatively-selection search strategy to enumerate all strong fair cliques. In addition, a heuristic ordering method is proposed to further improve the efficiency of the strong fair clique enumeration algorithm.
We conduct extensive experiments to evaluate the efficiency and effectiveness of our algorithms using four real-world networks. The results show that the colorful $k$-core based pruning technique is very powerful and can significantly prune the original graph. The results also show that both enumeration algorithms are efficient in practice: each of them can enumerate all fair cliques on a large graph with 2,523,387 vertices and 7,918,801 edges in less than 3 hours. In addition, we conduct a case study to evaluate the effectiveness of our algorithms. The results show that both algorithms can find fair communities spanning different research areas, and that the strong fair clique algorithm can further keep the attribute values balanced in the subgraph.
The source code of this paper is released on GitHub at https://github.com/honmameiko22/fairnessclique for reproducibility.
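Let $G = (V, E, A)$ be an undirected, unweighted attributed graph with $n = |V|$ and $m = |E|$. Each vertex $u$ in $G$ has an attribute $A$ and we denote its value as $u.val$. Let $A_{val}$ be the set of all possible values of attribute $A$, namely, $A_{val} = \{u.val \mid u \in V\}$. The cardinality of $A_{val}$ is denoted by $A_n$, i.e., $A_n = |A_{val}|$. For brevity, we also represent $A_{val}$ as $A_{val} = \{a_i \mid 0 \le i < A_n\}$. We denote the set of neighbors of a vertex $u$ by $N(u)$, and the degree of $u$ by $d(u) = |N(u)|$. For a vertex subset $S \subseteq V$, the subgraph induced by $S$ is defined as $G_S = (V_S, E_S, A)$, where $V_S = S$, $E_S = \{(u, v) \mid u, v \in S, (u, v) \in E\}$, and $A$ is the vertex attribute in $S$.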
In a graph $G$, a clique $C$ is a complete subgraph in which each pair of vertices is connected. Based on the concept of clique, we present two fairness-aware clique models as follows.
Definition 1 (Weak fair clique). Given an attributed graph $G$ and an integer $k$, a clique $C$ of $G$ is a weak fair clique of $G$ if (1) for each attribute value $a_i$, the number of vertices in $C$ whose value equals $a_i$ is no less than $k$; (2) there is no clique $C' \supset C$ satisfying (1).
Consider a graph with in Fig. 1(a). Suppose that . By Definition 1, we can see that the subgraph induced by the vertex set is a weak fair clique. This is because the number of vertices with attribute value in is 4 ( ), and with attribute is 3 (). Moreover, there does not exist a subgraph that contains and also satisfies the condition (1) in Definition 1.
Clearly, by Definition 1, the weak fair clique model exhibits the fairness property over all types of vertices (with different attribute values), as it requires that the number of vertices for each attribute value in the subgraph be no less than $k$. However, the weak fair clique model does not strictly guarantee fairness for all attribute values, because the counts may still differ. Below, we propose a strong fair clique definition which strictly requires the subgraph to have the same number of vertices for each attribute value.
Definition 2 (Strong fair clique). Given an attributed graph $G$ and an integer $k$, a clique $C$ of $G$ is a strong fair clique of $G$ if (1) for each attribute value $a_i$, the number of vertices in $C$ whose value equals $a_i$ is no less than $k$; (2) the number of vertices in $C$ for each attribute value is exactly the same; (3) there is no clique $C' \supset C$ satisfying (1) and (2).
Reconsider the attributed graph in Fig. 1(a). Again, we assume that . By definition, we can easily check that the subgraph induced by is a strong fair clique. Note that the subgraph induced by is a weak fair clique, but it is not a strong fair clique, as it violates the condition (2) in Definition 2.
Problem statement. Given an attributed graph $G$ and an integer $k$, our goal is to enumerate all weak fair cliques and all strong fair cliques in $G$, respectively.
Reconsider the attributed graph in Fig. 1(a). Suppose that equals 2. We aim to find all 2-weak fair cliques and 2-strong fair cliques in . The answer of 2-weak fair clique enumeration is because it is the maximal clique satisfying Definition 1. We can also find that there are three 2-strong fair cliques in , i.e., , , and , thus they are the answers for 2-strong fair clique search. Clearly, all 2-strong fair cliques are subgraphs of the 2-weak fair clique.
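To make the two definitions concrete, the following Python sketch enumerates weak and strong fair cliques by brute force. It is only an illustration of Definitions 1 and 2, not the enumeration algorithms proposed in this paper: it tests every vertex subset and is therefore exponential. The toy graph (a 5-clique plus one pendant vertex), the attribute names "a" and "b", and all function names are hypothetical and do not correspond to Fig. 1.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy graph (not Fig. 1): a 5-clique on vertices 0-4 plus a
# pendant vertex 5 attached to vertex 0. Attribute values are "a" and "b".
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}


def build_adj(edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_clique(adj, vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def is_fair(attr, vertices, values, k, strong):
    """Condition (1) of both definitions, plus condition (2) if strong."""
    counts = Counter(attr[v] for v in vertices)
    per_value = [counts[a] for a in values]
    if min(per_value) < k:
        return False
    return not strong or len(set(per_value)) == 1


def fair_cliques(adj, attr, k, strong=False):
    """Brute force: keep the fair cliques that have no fair proper superset."""
    values = sorted(set(attr.values()))
    fair = [
        frozenset(s)
        for r in range(1, len(adj) + 1)
        for s in combinations(sorted(adj), r)
        if is_clique(adj, s) and is_fair(attr, s, values, k, strong)
    ]
    maximal = [c for c in fair if not any(c < d for d in fair)]
    return sorted(sorted(c) for c in maximal)


ADJ = build_adj(EDGES)
print("weak:", fair_cliques(ADJ, ATTR, 2))
print("strong:", fair_cliques(ADJ, ATTR, 2, strong=True))
```

On this toy graph with $k = 2$, the sketch reports one weak fair clique (the whole 5-clique) and three strong fair cliques, each obtained by dropping one "a" vertex from it, which mirrors the containment relation discussed above.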
Challenges. We first discuss the hardness of the weak fair clique enumeration problem. Consider the special case in which all vertices share a single attribute value and $k = 1$. Clearly, the weak fair clique enumeration problem then degenerates to the traditional maximal clique enumeration problem, which is NP-hard. Thus, finding all weak fair cliques is also an NP-hard problem. Enumerating strong fair cliques is more challenging than enumerating weak fair cliques for the following reasons. (1) The number of strong fair cliques is often much larger than that of weak fair cliques. By definition, a strong fair clique is always contained in a weak fair clique. Conversely, a weak fair clique is not necessarily a strong fair clique. (2) Each weak fair clique must be a traditional maximal clique, but a strong fair clique may not be a traditional maximal clique (see Example 2), which means that it is difficult to check the maximality of strong fair cliques.
Unlike traditional maximal cliques, both weak fair cliques and strong fair cliques have an additional attribute value constraint, thus a potential solution is to use attribute information to prune the search space. The challenges of our problems are (1) how to efficiently prune unpromising vertices, and (2) how to maintain the fair clique property during the search procedure. To tackle these challenges, we propose an algorithm with a new colorful $k$-core based pruning technique for weak fair clique enumeration, and an algorithm with a novel attribute-alternatively-selection strategy for enumerating all strong fair cliques. Both algorithms correctly find all fair cliques and significantly improve the efficiency compared to the baseline enumeration algorithm.
III Weak fair clique enumeration
In this section, we present our algorithm to enumerate all weak fair cliques. Its key idea is to first prune the vertices that are not contained in any weak fair clique, based on a novel concept called colorful $k$-core. Then, it performs a carefully-designed backtracking search procedure to enumerate all results. Below, we first introduce the concept of colorful $k$-core, followed by a heuristic search order and the enumeration algorithm.
III-A The colorful $k$-core pruning
Before introducing the colorful $k$-core based pruning technique, we first briefly review the problem of vertex coloring for a graph. The goal of vertex coloring is to color the vertices such that no two adjacent vertices have the same color [24, 17]. Given a graph $G$, we denote by $color(u)$ the color of a vertex $u$. Based on the vertex coloring, we define the colorful degree of a vertex as follows.
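Definition 3 (Colorful degree). Given an attributed graph $G$ and an attribute value $a_i$, the colorful degree of a vertex $u$ based on $a_i$, denoted by $D_{a_i}(u, G)$, is the number of distinct colors among $u$'s neighbors whose attribute value is $a_i$, i.e., $D_{a_i}(u, G) = |\{color(v) \mid v \in N(u), v.val = a_i\}|$.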
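Clearly, each vertex has $A_n$ colorful degrees, one per attribute value. Let $D_{\min}(u, G)$ denote the minimum colorful degree of a vertex $u$, i.e., $D_{\min}(u, G) = \min_{a_i} D_{a_i}(u, G)$. We omit the symbol $G$ in $D_{a_i}(u, G)$ and $D_{\min}(u, G)$ when the context is clear. Below, we give the definition of colorful $k$-core.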
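The colorful degree can be illustrated with a few lines of Python. The sketch below first computes a greedy coloring in non-increasing degree order and then counts, for one vertex, the distinct colors among its neighbors per attribute value. The toy graph and the function names are hypothetical; the coloring heuristic is one common greedy variant and is assumed here only for illustration.

```python
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5
# attached to vertex 0 (this is not the graph of Fig. 1).
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def greedy_coloring(adj):
    """Color vertices in non-increasing degree order with the smallest free color."""
    color = {}
    for u in sorted(adj, key=lambda v: (-len(adj[v]), v)):
        used = {color[v] for v in adj[u] if v in color}
        c = 0
        while c in used:
            c += 1
        color[u] = c
    return color


def colorful_degrees(adj, attr, color, u):
    """Map each attribute value to the number of distinct colors among
    the neighbors of u that carry this value."""
    values = sorted(set(attr.values()))
    return {a: len({color[v] for v in adj[u] if attr[v] == a}) for a in values}


def min_colorful_degree(adj, attr, color, u):
    """Minimum colorful degree of u over all attribute values."""
    return min(colorful_degrees(adj, attr, color, u).values())


COLOR = greedy_coloring(ADJ)
print(colorful_degrees(ADJ, ATTR, COLOR, 0))
print(min_colorful_degree(ADJ, ATTR, COLOR, 5))
```

Because the neighbors of a vertex inside a clique are pairwise adjacent, they all receive different colors, which is why the colorful degree upper-bounds the number of same-attribute neighbors a vertex can have within any clique.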
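Definition 4 (Colorful $k$-core). Given an attributed graph $G$ and an integer $k$, a subgraph $H$ of $G$ is a colorful $k$-core if: (1) for each vertex $u$ in $H$, $D_{\min}(u, H) \ge k$; (2) there is no subgraph $H'$ of $G$ that satisfies (1) and $H \subset H'$.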
Based on Definition 4, we have the following lemma.
Lemma 1. Given an attributed graph $G$ and a parameter $k$, any weak fair clique must be contained in the colorful $(k-1)$-core of $G$.
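Proof: Assume that $C$ is a weak fair clique and consider a vertex $u \in C$. Based on Definition 1, for each attribute value $a_i$, $u$ has at least $k-1$ neighbors in $C$ whose attribute value is $a_i$ (exactly one vertex with $u$'s own value is $u$ itself). Since vertices with the same color cannot be adjacent, the vertices of a clique have pairwise distinct colors, and we have $D_{a_i}(u, C) \ge k-1$ for each $a_i$. Thus, every vertex of $C$ has a minimum colorful degree of at least $k-1$ within $C$, and $C$ must be contained in the colorful $(k-1)$-core of $G$.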
Equipped with Lemma 1, we propose a novel algorithm to compute the colorful $k$-core of $G$, which can be used to prune unpromising vertices in the weak fair clique enumeration procedure. Its pseudo-code is shown in Algorithm 1. The algorithm is a variant of the classic core decomposition algorithm [4, 25]: it iteratively peels vertices from the remaining graph based on their colorful degrees. Specifically, it first performs greedy coloring on $G$, which colors the vertices in the order of their degrees [26, 15] (line 1). Note that finding the optimal coloring is an NP-hard problem [17, 24], thus we use a greedy algorithm to compute a heuristic coloring, which is sufficient for defining the colorful $k$-core. A priority queue is employed to maintain the vertices with small minimum colorful degrees, which will be removed during the peeling procedure (line 2). The algorithm then computes the colorful degrees of all vertices to initialize the queue (lines 3-10); for each vertex $u$, it also records, for every combination of attribute value and color, the number of $u$'s neighbors that have this attribute value and this color. After that, the algorithm computes the colorful $k$-core of $G$ by iteratively peeling vertices from the remaining graph based on their colorful degrees (lines 11-20). Finally, it returns the remaining graph as the colorful $k$-core. Below, we analyze the complexity of Algorithm 1.
Consider the graph in Fig. 1(a). Assume that we want to search all -weak fair cliques. By Lemma 1, we invoke to calculate the colorful--core of . Specifically, we first color the vertices of using the greedy method. Then, we obtain a colored graph which is illustrated in Fig. 1(b) with seven different colors. Take the vertex as an example. connects to and in and both of them have attribute value , thus and hold. Due to , is not contained in any -weak fair clique. Thus, removes from . The removal of subsequently updates the colorful-degrees of and . repeatedly removes vertices until all the remaining vertices satisfying . Finally, we can obtain a subgraph induced by the vertex set which is a colorful--core with .
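The peeling procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: it uses a plain FIFO queue instead of a priority queue, the helper names are hypothetical, and the toy graph is not the one of Fig. 1. As in the description above, a per-vertex counter records how many remaining neighbors share a given attribute value and color, so a colorful degree drops only when such a counter reaches zero.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5.
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def greedy_coloring(adj):
    """Color vertices in non-increasing degree order with the smallest free color."""
    color = {}
    for u in sorted(adj, key=lambda v: (-len(adj[v]), v)):
        used = {color[v] for v in adj[u] if v in color}
        c = 0
        while c in used:
            c += 1
        color[u] = c
    return color


def colorful_core(adj, attr, k):
    """Return the vertex set whose minimum colorful degree stays >= k
    after repeatedly peeling vertices that violate the threshold."""
    color = greedy_coloring(adj)
    values = set(attr.values())
    # cnt[u][(a, c)]: number of remaining neighbors of u with value a and color c.
    cnt = {u: defaultdict(int) for u in adj}
    for u in adj:
        for v in adj[u]:
            cnt[u][(attr[v], color[v])] += 1
    # deg[u][a]: colorful degree of u with respect to attribute value a.
    deg = {u: {a: 0 for a in values} for u in adj}
    for u in adj:
        for a, _c in cnt[u]:
            deg[u][a] += 1
    alive = set(adj)
    queue = deque(u for u in adj if min(deg[u].values()) < k)
    queued = set(queue)
    while queue:
        u = queue.popleft()
        alive.discard(u)
        key = (attr[u], color[u])
        for v in adj[u]:
            if v not in alive:
                continue
            cnt[v][key] -= 1
            if cnt[v][key] == 0:  # v lost its last neighbor of this (value, color)
                deg[v][attr[u]] -= 1
                if deg[v][attr[u]] < k and v not in queued:
                    queued.add(v)
                    queue.append(v)
    return alive


print(colorful_core(ADJ, ATTR, 1))
print(colorful_core(ADJ, ATTR, 2))
```

For pruning with respect to $k$-weak fair cliques, Lemma 1 suggests calling the function with threshold $k-1$. On the toy graph, threshold 1 removes only the pendant vertex, while threshold 2 peels the whole graph, consistent with the absence of any 3-weak fair clique in it.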
Algorithm 1 consumes time using space, where denotes the total number of colors.
Proof: In line 1, the greedy coloring procedure takes time . In lines 2-7, we can easily derive that the algorithm takes time. In lines 11-20, the algorithm can update for each in time. For each edge , the update operator only performs once, thus the total time complexity is bounded by . For the space complexity, the algorithm needs to maintain the structure for each vertex which takes at most space in total.
III-B The colorful $k$-core based ordering
Our enumeration algorithm finds all weak fair cliques by performing a backtracking search procedure. Hence, the search order of vertices is vital, as the search spaces induced by different orderings differ significantly. Below, we propose a heuristic order based on the colorful $k$-core, which can significantly improve the performance of the enumeration, as confirmed in our experiments.
Consider a vertex $u$ and a neighbor $v$ of $u$ such that the minimum colorful degree of $v$ is smaller than $k-1$ while that of $u$ is not. According to Lemma 1, $u$ may be contained in a weak fair clique but $v$ cannot. Thus, we can search weak fair cliques correctly in a smaller subgraph induced by those neighbors of $u$ whose minimum colorful degrees are large enough. Inspired by this, we design a search order and propose an algorithm to calculate it. Similar to the idea of the colorful $k$-core computation, the algorithm iteratively removes a vertex with the minimum $D_{\min}$ value from the remaining graph. The order in which the vertices are removed by this procedure is the search order.
Algorithm 2 outlines the pseudo-code of the ordering algorithm. For each vertex $u$, we record the rank of $u$ in our order. A heap-based structure is employed to maintain the vertices together with their $D_{\min}$ values; it always pops out the pair with the minimum $D_{\min}$. The algorithm first calculates $D_{\min}(u)$ for every vertex $u$ and pushes the pair into the heap (lines 3-5). Then, it iteratively pops out the vertex with the minimum $D_{\min}$ from the heap and records its rank (lines 6-15). As a vertex is removed, we maintain the $D_{\min}$ values of its neighbors and update the heap (lines 9-15). It is easy to check that the time and space complexities of Algorithm 2 are the same as those of Algorithm 1.
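The ordering procedure can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: it recomputes the minimum colorful degree of each affected neighbor from scratch instead of maintaining counters, and it handles outdated heap entries by lazy deletion. The toy graph, the hand-picked proper coloring `COLOR`, and the function names are all hypothetical.

```python
import heapq
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5.
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
COLOR = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 1}  # an assumed proper coloring

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def d_min(adj, attr, color, values, alive, u):
    """Minimum colorful degree of u, counting only neighbors still alive."""
    return min(
        len({color[v] for v in adj[u] if v in alive and attr[v] == a})
        for a in values
    )


def colorful_order(adj, attr, color):
    """Repeatedly remove a vertex with the minimum D_min; return the removal order."""
    values = sorted(set(attr.values()))
    alive = set(adj)
    cur = {u: d_min(adj, attr, color, values, alive, u) for u in adj}
    heap = [(d, u) for u, d in cur.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        d, u = heapq.heappop(heap)
        if u not in alive or d != cur[u]:
            continue  # stale entry left behind by an earlier update
        alive.remove(u)
        order.append(u)
        for v in adj[u]:
            if v in alive:
                nd = d_min(adj, attr, color, values, alive, v)
                if nd != cur[v]:
                    cur[v] = nd
                    heapq.heappush(heap, (nd, v))
    return order


ORDER = colorful_order(ADJ, ATTR, COLOR)
RANK = {u: i for i, u in enumerate(ORDER)}
print(ORDER)
```

Ties are broken by vertex identifier here because the heap compares `(D_min, vertex)` pairs; this tie-breaking rule is an arbitrary choice of the sketch.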
The reason why this ordering works is that a search beginning with vertices that have low ranks in the order is less likely to form weak fair cliques. Note that most of the running time of the enumeration algorithm is spent on vertices that have a dense and large neighborhood. The ordering guarantees that the unpromising vertices are explored first, thus reducing the number of candidates for the vertices that have a dense and large neighborhood.
III-C The weak fair clique enumeration algorithm
The main idea of our enumeration algorithm is to prune the unpromising vertices first, and then perform a backtracking procedure to find all weak fair cliques. Unlike traditional maximal clique enumeration, the algorithm is equipped with a colorful $k$-core based pruning rule and a carefully-designed ordering technique, which can significantly reduce the search space. Its pseudo-code is outlined in Algorithm 3.
The algorithm works as follows. It first initializes four sets (line 1). Among them, the set $R$ represents the currently-found clique, which may be extended to a weak fair clique. $X$ is the set of vertices that can be used to expand the current clique but have already been visited in previous search paths. $C$ is the candidate set used to extend the current clique, in which each vertex must be a neighbor of all vertices in $R$. After initialization, the algorithm invokes Algorithm 1 to prune the vertices that are definitely not contained in any weak fair clique (line 2). It then invokes the backtracking procedure to find all weak fair cliques in the pruned graph (lines 4-9). Note that the pruned graph may consist of several connected colorful $k$-cores, so the backtracking procedure should be performed on each connected component. A Boolean array is used to indicate whether a vertex has been searched, and it is initialized as false for each vertex. For an unvisited vertex $u$, the algorithm identifies the connected colorful $k$-core CC containing $u$ and marks all vertices within CC as visited, so that CC will not be searched again (line 6). It then calls Algorithm 2 to derive the search order of the vertices in CC, and performs the backtracking procedure on CC to enumerate all weak fair cliques (lines 7-8).
The workflow of the backtracking procedure is depicted in lines 10-26 of Algorithm 3. It first checks whether the current clique $R$ is a weak fair clique (line 11). $R$ is an answer if and only if both the candidate set $C$ and the visited set $X$ are empty. An empty $C$ means that no vertex can be added into $R$. In addition, $X$ must be empty; otherwise, any vertex in $X$ could be added into $R$, which makes $R$ non-maximal. If $R$ is not a weak fair clique, we add each vertex $u \in C$ into $R$ in turn and start the next iteration of the procedure (lines 12-26). Note that each candidate in $C$ is a neighbor of all vertices in $R$; therefore, after adding $u$ into $R$, $C$ must be updated to exclude the vertices that are not adjacent to $u$ (lines 15-17). Here, we only consider the vertices whose rank is larger than $u$'s rank to avoid finding the same clique repeatedly. After obtaining the updated sets $R$ and $C$, if their total size is smaller than $k \times A_n$, i.e., $k$ vertices for each attribute value, the procedure terminates as the sets cannot reach the minimum size of a weak fair clique (line 18). In addition, for each attribute value $a_i$, we maintain the number of vertices with value $a_i$ in $R$ and the number of vertices with value $a_i$ in $C$ (line 17 and line 19). By checking these counts for each $a_i$, we can quickly determine whether the current/next clique is promising. For any $a_i$, if the sum of the two counts is smaller than $k$, we cannot obtain a weak fair clique even if we add the whole set $C$ into $R$. This is because condition (1) of Definition 1 cannot be satisfied, thus the procedure terminates (lines 20-23). Otherwise, the procedure derives the updated visited set from $u$'s neighbors in $X$, and then performs the next iteration (lines 24-25). After exploring the vertex $u$, the procedure adds it into $X$, because $u$ has already been searched in the current search path and must not be processed again in the following recursions (line 26).
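The backtracking search can be illustrated by the following simplified Python sketch. It is a Bron-Kerbosch style recursion with the attribute-count pruning described above; it is not Algorithm 3, since it omits the colorful $k$-core pruning, the rank-based ordering, and the size check. It relies on the observation that a weak fair clique is a traditional maximal clique that satisfies condition (1) of Definition 1. The toy graph and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5.
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def weak_fair_cliques(adj, attr, k):
    """Enumerate weak fair cliques with R (current clique), C (candidates)
    and X (already visited vertices adjacent to all of R)."""
    values = set(attr.values())
    results = []

    def count(vertices, a):
        return sum(1 for v in vertices if attr[v] == a)

    def backtrack(R, C, X):
        # Prune: even adding all of C cannot reach k vertices of some value.
        if any(count(R, a) + count(C, a) < k for a in values):
            return
        if not C and not X:
            # R is maximal and, by the check above, has >= k vertices per value.
            results.append(sorted(R))
            return
        for u in sorted(C):
            backtrack(R | {u}, C & adj[u], X & adj[u])
            C = C - {u}   # u has been fully explored on this search path
            X = X | {u}

    backtrack(set(), set(adj), set())
    return sorted(results)


print(weak_fair_cliques(ADJ, ATTR, 2))
```

The sketch rebinds `C` and `X` inside the loop rather than mutating them, so the sets held by the caller's stack frames remain unchanged.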
IV Strong fair clique enumeration
In this section, we first develop an efficient strong fair clique enumeration algorithm with a novel pruning technique for the two-dimensional (2D) case, where the attributed graph has only two attribute values (i.e., $A_n = 2$). Then, we will show how to extend our enumeration algorithm to handle the high-dimensional case ($A_n > 2$).
IV-A The pruning technique for 2D case
Suppose that the attributed graph $G$ has two attribute values, i.e., $A_{val} = \{a, b\}$. The neighbors of a vertex $u$ can be divided into groups by coloring, where each group contains the vertices with the same color. Clearly, by the property of coloring, at most one vertex can be selected from a group to form a clique with $u$. Below, we give a new definition of the fairness degree of a vertex.
Definition 5 (Fairness degree). Given a colored attributed graph $G$ with $A_{val} = \{a, b\}$, the fairness degree of a vertex $u$, denoted by $FD(u)$, is the largest number of groups from which we select vertices so that the number of vertices with attribute $a$ is the same as the number of vertices with attribute $b$.
By Definition 5, we can easily verify that the fairness degree of a vertex $u$ is an upper bound of the size of the strong fair clique containing $u$. Therefore, for any vertex $u$, if its fairness degree is smaller than $k$, then $u$ cannot be contained in any strong fair clique, because any vertex in a strong fair clique must have a fairness degree no less than $k$ by Definition 2. As a consequence, we can safely prune every vertex whose fairness degree is less than $k$.
A remaining question is how to efficiently compute the fairness degree of a vertex $u$. Below, we develop an efficient approach to answer this question.
Based on the attribute values, the color groups can be divided into three categories: (1) : is a group that involves vertices of attribute only; (2) : is a group that contains vertices of attribute only; (3) : is a group that contains vertices of both and . Let , , and be the number of the groups, the groups, and the groups respectively. Suppose without loss of generality that . Then, if holds, we can easily derive that . Otherwise, we have . Based on these results, we can easily derive the fairness degree for each vertex by using the three quantities , , and . The pseudo-code of our algorithm to compute the fairness is given in lines 17-29 of Algorithm 4.
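The case analysis above can be illustrated by a short Python sketch. The names `c_a`, `c_b` and `c_m` (the numbers of groups containing only attribute $a$, only attribute $b$, and both) are hypothetical, as are the toy graph and its coloring. The sketch encodes one reading of Definition 5, which is an assumption on our side: it returns the largest number $t$ such that $t$ groups can contribute a vertex of attribute $a$ and $t$ other groups a vertex of attribute $b$, considering the neighbors of the vertex only. Under this reading $2t$ groups are selected in total; whether the vertex itself is also counted is not settled by the sketch.

```python
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5,
# with an assumed proper coloring.
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
COLOR = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 1}

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def group_counts(adj, attr, color, u, a="a", b="b"):
    """Classify the color groups of u's neighbors: only a, only b, or mixed."""
    values_by_color = {}
    for v in adj[u]:
        values_by_color.setdefault(color[v], set()).add(attr[v])
    c_a = sum(1 for s in values_by_color.values() if s == {a})
    c_b = sum(1 for s in values_by_color.values() if s == {b})
    c_m = sum(1 for s in values_by_color.values() if s == {a, b})
    return c_a, c_b, c_m


def balanced_selection(c_a, c_b, c_m):
    """Largest t such that t groups can supply an a-vertex and t other
    groups a b-vertex, each group being used at most once."""
    low, high = sorted((c_a, c_b))
    if low + c_m <= high:
        # All mixed groups support the scarcer attribute; it stays the bottleneck.
        return low + c_m
    # Mixed groups first close the gap, then the remainder is split evenly.
    return (c_a + c_b + c_m) // 2


print(group_counts(ADJ, ATTR, COLOR, 0))
print(balanced_selection(*group_counts(ADJ, ATTR, COLOR, 0)))
```

For vertex 0 of the toy graph, the neighbors form one group with only attribute "a", two groups with only attribute "b", and one mixed group, so at most two vertices per attribute can be selected.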
Based on the fairness degree, we can iteratively prune the vertices with fairness degrees smaller than $k$. Below, we introduce a concept called fairness $k$-core to characterize the reduced subgraph after iteratively peeling the unqualified vertices.
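Definition 6 (Fairness $k$-core). Given an attributed graph $G$ with $A_n = 2$ and an integer $k$, a subgraph $H$ of $G$ is a fairness $k$-core if: (1) for each vertex $u$ in $H$, the fairness degree of $u$ in $H$ is no less than $k$; (2) there is no subgraph $H'$ of $G$ that satisfies (1) and $H \subset H'$.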
By Definition 6, we can show that any strong fair clique must be contained in the fairness $k$-core.
Lemma 2. Given an attributed graph $G$ with $A_n = 2$ and a parameter $k$, any strong fair clique must be contained in the fairness $k$-core of $G$.
Proof: Consider a strong fair clique . According to Definition 2, assume there are vertices of attribute and vertices of attribute in . For an arbitrary vertex in , we suppose that . There are vertices of attribute and vertices of attribute in ’s neighbors. Therefore, after performing for , we have , and . Further, is equal to . Due to the arbitrariness of , the fairness degree of each vertex in must reach , too. Hence, must be contained in the fairness--core of .
Reconsider the attributed graph in Fig. 1(b). Suppose that . By Lemma 2, we consider the fairness 2-core of . For vertex , has two neighbors and , and both of them have attribute value . Clearly, we have , thus is not contained in the fairness 2-core. For vertex , the initial value of , and are . Obviously, , thus we have . Similarly, the fairness degrees of the other vertices are all equal to . Therefore, the subgraph induced by is a fairness -core. Clearly, such a subgraph contains the strong fair clique as illustrated in Example 2.
Similar to the colorful $k$-core computation algorithm, we can also devise a peeling algorithm to compute the fairness $k$-core by iteratively removing the vertices that have fairness degrees smaller than $k$. The pseudo-code of our algorithm is outlined in Algorithm 4. Note that a strong fair clique is always contained in a weak fair clique, thus we can first invoke Algorithm 1 to prune the vertices that are definitely not included in any weak fair clique before computing the fairness $k$-core of $G$ (line 1).
Algorithm 4 consumes time using space.
Proof: In line 1, Algorithm 4 invokes Algorithm 1 which takes time and space (since ). The procedure takes at most time for each vertex. Therefore, the total time overhead taken in lines 3-8 is O. In lines 9-14, for each edge , the update cost is bounded by , thus the total time complexity is . For the space complexity, the algorithm takes space to maintain the structure.
Fairness $k$-core ordering. Similar to the colorful $k$-core based ordering, we can derive an ordering based on the fairness $k$-core for strong fair clique enumeration. In particular, it is derived by iteratively removing the vertex with the minimum fairness degree, which is very similar to the computational procedure of Algorithm 2. We omit the details for brevity.
IV-B The enumeration algorithm for 2D case
Armed with the fairness $k$-core based pruning technique and the fairness $k$-core ordering, we propose an enumeration algorithm which alternately picks a vertex of a specific attribute in the backtracking procedure to enumerate all strong fair cliques. Its pseudo-code is shown in Algorithm 5. We use $R$ to represent the currently-found clique and $C$ to denote the candidate set. Similar to the weak fair clique enumeration algorithm, it first applies Algorithm 4 to prune the vertices that are definitely not contained in any strong fair clique (line 2), and then performs the backtracking procedure on each connected fairness $k$-core to find all results (lines 4-8).
The pseudo-code of the backtracking procedure is outlined in lines 10-27 of Algorithm 5. Since a strong fair clique requires that the numbers of vertices for each attribute value be exactly the same, we develop a novel attribute-alternatively-selection mechanism to select vertices in each iteration. That is, the procedure takes an input parameter, initialized at line 8, that indicates the attribute value of the vertices to be selected in the current iteration. In the next iteration, we pick the vertices with the other attribute value to construct strong fair cliques (line 27). The procedure divides the candidates in $C$ into sets such that the vertices in each set have the same attribute value (line 14). For each candidate with the attribute value to be selected, we pick one vertex at a time as a part of the currently-found clique and update the candidate set based on the ordering (lines 16-27).
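The attribute-alternating selection can be illustrated with the following simplified Python sketch for the 2D case. It is not Algorithm 5: it omits the fairness $k$-core pruning and ordering, and it uses a maximality test that we assume for illustration, namely that a balanced clique can be extended to a larger balanced clique exactly when its common neighborhood contains an adjacent pair of vertices with different attribute values. Vertices are assumed to be non-negative integers, and the toy graph and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy graph: a 5-clique on vertices 0-4 plus a pendant vertex 5.
EDGES = list(combinations(range(5), 2)) + [(0, 5)]
ATTR = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

ADJ = {}
for x, y in EDGES:
    ADJ.setdefault(x, set()).add(y)
    ADJ.setdefault(y, set()).add(x)


def strong_fair_cliques_2d(adj, attr, k):
    """Enumerate strong fair cliques when there are exactly two attribute values."""
    a, b = sorted(set(attr.values()))
    results = []

    def extendable(cand):
        # A balanced clique can grow iff one a-candidate and one b-candidate
        # in its common neighborhood are adjacent to each other.
        return any(
            y in adj[x]
            for x in cand if attr[x] == a
            for y in cand if attr[y] == b
        )

    def backtrack(R, cand, want, last):
        # cand holds the vertices adjacent to every vertex of R.
        # R is balanced exactly when the next attribute to pick is a.
        if want == a and len(R) >= 2 * k and not extendable(cand):
            results.append(sorted(R))
        for u in sorted(cand):
            # Picking each attribute in increasing vertex order generates
            # every balanced clique exactly once.
            if attr[u] == want and u > last[want]:
                nxt = dict(last)
                nxt[want] = u
                backtrack(R | {u}, cand & adj[u], b if want == a else a, nxt)

    backtrack(frozenset(), set(adj), a, {a: -1, b: -1})
    return sorted(results)


print(strong_fair_cliques_2d(ADJ, ATTR, 2))
```

Alternating between the two attribute values keeps the partial clique balanced after every second pick, so the fairness condition only needs to be tested at those points.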