1 Introduction
The ubiquity of highvelocity data streams that form graph structures gives rise to many interesting graph queries, and consequently, query processing challenges. A graph stream refers to a data stream of tuples representing graph edges that can form a graph structure. For example, in a cloudtroubleshooting application, a graph stream can be a sequence of edges representing communication logs among the cloud’s machines. Each communication log is a directed edge from a sender to a receiver, where edges have labels, e.g., the communicationprotocol used. In this paper, we focus on these labeledgraph streams that raise interesting datamanagement challenges.
More precisely, a labeled-graph stream is a graph stream where each edge is associated with a categorical attribute (label). Associating labels with edges helps in defining and evaluating constrained graph queries, where a query filters the stream using the edge labels before query evaluation. Consider the following real-world queries on labeled-graph streams:
Communication Networks: Cloudenvironment operators usually analyze the communicationlog stream to perform realtime troubleshooting. A typical communicationlog entry describes a communication between two machines, namely the source and the destination machines, as well as some communication attributes, e.g., the roundtrip time, the sender’s application id. This stream forms a labeledgraph stream, where an edge’s label is the sender application id. A cloudtroubleshooting application, say , may issue a constrained reachabilityquery to detect if messages created by Application from a source machine reach a destination machine. This reachability query is constrained to use only edges that represent messages created by Application (i.e., Label ).
Social Networks: A social network may need to detect trending activities of a given object type (e.g., picture, video, status). A graph stream may describe user activities w.r.t. these socialnetwork objects, e.g., User shares Post . The edge labels represent activity types, e.g., comment, share, or like. The social network may detect trending posts or objects w.r.t. specific activity types, e.g., find the most reshared post.
Labeledgraph streams in applications like the aforementioned ones are usually of large volumes and high velocities. For example, a cloudtroubleshooting application of a commercial cloudservice receives a labeledgraph stream at the rate of million edges per second (i.e., a stream query acting on a oneminute window has billion edges to process). Moreover, lowlatency for processing queries becomes necessary in many applications, e.g., when detecting securitythreats in realtime. Hence, it is practical to summarize a graph stream by incrementally building a smaller streamsynopsis. This addresses the datavolume challenge, where boundedmemory is allocated to summarize the continuouslyarriving edges of a highvolume graph stream. In addition, the lowlatency requirement may be addressed by approximating the query results instead of producing exact answers.
Summarizing labeled-graph streams has additional challenges. The labels of the streamed edges are unevenly distributed, so these streams commonly contain both frequent and infrequent labels. The uneven distribution of edge labels may not be known beforehand and may change over time. For example, a cloud-troubleshooting application will have a communication-log graph stream in which some communication types are more frequent than others (e.g., HTTP communication-log entries may dominate). This imbalance raises a challenge in summarizing such a stream: no edge type (label) should be penalized w.r.t. summarization accuracy because its label is either rare or relatively frequent.
In this paper, we present SBGSketch, a graphical sketching technique that automatically balances the sketch load according to the relative frequency of edge labels without penalizing edges with rare labels. Given a labeledgraph stream, say , and a fixed memorysize, say , SBGSketch uses memory to incrementally summarize both the attributes of the edges of as well as the topology of the graph formed by Stream .
The main idea of SBGSketch is to allow edges of high-frequency labels to automatically leverage unused memory previously assigned to low-frequency labels, with a guarantee that edges of low-frequency labels can use that memory whenever needed in the future. Notice that bounding the memory allocated to handle a graph stream is important. The benefit of this memory bounding is twofold. First, in many applications edges arrive continuously with large volume and high velocity, so storing all edges is impractical. Second, query time-efficiency improves because queries process a bounded synopsis that is much smaller than the raw graph stream.
The contributions of this paper are as follows:

We introduce the design of SBGSketch. SBGSketch efficiently summarizes labeledgraph streams and automatically balances sketch load in streams with unpredictable and highly imbalanced edgelabel frequencies, all without penalizing edges with rare labels.

The design of SBGSketch is of interest on its own as it represents a departure from the count-min design of Cormode and Muthukrishnan [7] used in some of the most recent works in graphical sketching [23, 22], thus enabling new applications. The main focus of this paper is to enable applications that call for labeled-graph stream sketching for skewed label distributions.

We show that SBGSketch can give an approximate answer to graph queries constrained to sets of edge labels. We demonstrate the use of SBGSketch to compute a variety of approximate queries: reachability queries (with no false negatives), edge-count queries, and subgraph queries. These queries can serve a wide spectrum of applications.

We conduct extensive experiments using three realworld datasets from different domains. Results demonstrate that SBGSketch can effectively summarize labeledgraph streams, and effectively estimate constrained graphqueries. Moreover, we show that SBGSketch significantly outperforms the stateoftheart graph sketch method [22] w.r.t. estimation accuracy.
The rest of this paper proceeds as follows. Section 2.1 defines the model we follow for graph streams of labeled edges. Section 2.2 identifies the requirements that an effective sketching method should satisfy in order to handle labeledgraph streams, while Section 3 introduces our approach. Section 4 presents the structure of SBGSketch as well as the general logic for updating the sketch upon edge arrivals. Section 5 demonstrates how SBGSketch can estimate some important constrainedqueries on graph streams. Section 6 presents the experimental evaluation of SBGSketch. The related work is discussed in Section 7. Finally, Section 8 contains concluding remarks.
2 Graph Modeling and Problem Definition
2.1 The GraphStream Model
We model a labeledgraph stream, say , as a data stream of labeled edges . This graph stream forms Graph , where is the vertex set of , is the edge set formed by the streamed edges, and is the set of distinct edge labels. A streamed edge, say , is defined as , where , , , and , are the source vertex, the destination vertex, the label, and the weight (real number) of Edge , respectively. For simplicity, we assume that the graph edges are directed. However, all the techniques presented in this paper can be applied to undirected graphs. Figure 1 gives a sample graphstream of nine labelededges being streamed. For example, the edge from Vertex to Vertex is the result of receiving the following stream element .
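As a minimal sketch of the stream model above, a streamed edge can be represented as a four-field record (source vertex, destination vertex, label, weight). The class name `StreamedEdge`, the helper names, and the sample records are hypothetical and are not taken from Figure 1; they only illustrate how the vertex set and the label set emerge from the streamed edges.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set, Tuple


@dataclass(frozen=True)
class StreamedEdge:
    src: str       # source vertex identifier
    dst: str       # destination vertex identifier
    label: str     # categorical edge label
    weight: float  # real-valued edge weight


def read_stream(records: Iterable[Tuple[str, str, str, float]]) -> Iterator[StreamedEdge]:
    """Turn raw (src, dst, label, weight) tuples into StreamedEdge objects."""
    for src, dst, label, weight in records:
        yield StreamedEdge(src, dst, label, float(weight))


def graph_sets(edges: Iterable[StreamedEdge]) -> Tuple[Set[str], Set[str]]:
    """Collect the vertex set and the set of distinct labels formed by the edges."""
    vertices: Set[str] = set()
    labels: Set[str] = set()
    for e in edges:
        vertices.update((e.src, e.dst))
        labels.add(e.label)
    return vertices, labels


# Hypothetical sample stream (machine names and labels are made up).
sample = list(read_stream([
    ("m1", "m2", "http", 1.0),
    ("m2", "m3", "ssh", 2.0),
    ("m1", "m3", "http", 1.0),
]))
vertices, labels = graph_sets(sample)
```

Materializing the vertex and label sets is done here only for illustration; a bounded-memory sketch would not store them explicitly.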
2.2 Problem Definition and Solution Requirements
Given a labeledgraph stream, say , where , the number of distinct edgelabels, say , and a memorysize upperbound, say , we need to construct a graphical sketch, say , that satisfies the following requirements:
Construct an inmemory synopsis that does not exceed of memory.
Summarize the edge weights of Stream using an aggregate function defined by the application.
Summarize the topology of Stream to support graphtraversal queries (e.g., reachability estimation).
Consider the edge labels in the summary to support constrained graphqueries.
Consider the imbalance in the distribution of edges w.r.t. labels. An edge label should be allowed to use a larger quota of the allocated fixed memory if its edges are more frequent than those corresponding to other labels. This requirement aims at achieving high summarization accuracy for all edges, regardless of how rare or popular their labels are.
The last requirement is important in realworld scenarios, where the edges are unevenly distributed w.r.t. their labels. For instance, consider a cloudtroubleshooting application, where the streamed edges are labeled by application identifiers. The messaging frequency of some applications can be much higher than those of other applications. Hence, a sketching method handling this graphstream model should not penalize the accuracy of summarizing edges due to the rareness of some labels. Moreover, avoiding this penalization should consider using memory wisely (e.g., avoid allocating exclusive largememory shares to lessrepresentative labels).
3 Overview of SBGSketch
Given a graph stream as defined in Section 2.1, our approach is to build a sketch that satisfies the requirements stated in Section 2.2. To illustrate, consider the sample graph stream in Figure 1. The proposed SBGSketch graphical sketch follows the structure that Figure 2 illustrates, where we assume that the sketch is built to aggregate the weights of the received edges by summing them (other aggregates are possible). For each distinct edgelabel, say , we allocate a subsketch that summarizes the subgraph of all the edges of Label . For instance, Figure 2 gives two subsketches, namely and , that summarize the , and the edges, respectively. This allows the graphical sketch to evaluate labelconstrained queries. For instance, a query allowing only edges will only consult Subsketch . Notice that the total size of the sketch is upperbounded by the maximum memorysize defined by the user that affects the size of each subsketch.
The idea of the graphical sketch in Figure 2 is to build a subgraph for each edge label, say , by compressing the edges of Label . In particular, each subsketch, say , has a maximum number of vertexes, say , that is smaller than the number of vertexes of the original graph stream. The graphical sketch uses a hash function to project the vertexes of the original graphstream to the vertexes in a subsketch. For instance, the hash function groups Vertexes and in both and (assuming that both subsketches use the same hash function). To illustrate how the edge weights are aggregated, consider the arrival of Edge , and Edge , where they affect only Subsketch as they are both red edges. Assume that the vertexmapping hash function groups Vertex and Vertex into one bucket, and Vertex and Vertex in another bucket (i.e., the same subsketch vertex). When processing Edge , an edge of weight will be created in Subsketch between Vertexes and . Then, when inserting Edge , the subsketch edge that has been created by will have its weight incremented by one (i.e., accumulating the weight of ). The reason is that the start vertexes of both and are mapped together, and similarly for their end vertexes.
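The per-label compression described above can be sketched as follows, under stated assumptions: a single vertex-hashing function shared by all subsketches, sum aggregation, and a dictionary keyed by (label, source bucket, destination bucket) standing in for the subsketch edges. The names `bucket` and `PerLabelSubsketches` are hypothetical. This is the plain per-label layout of this overview only; it does not include the ranking logic introduced later.

```python
import hashlib
from collections import defaultdict


def bucket(vertex: str, num_buckets: int) -> int:
    """Deterministically map a vertex identifier to a subsketch vertex."""
    digest = hashlib.blake2b(vertex.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


class PerLabelSubsketches:
    """One logical subsketch per label; colliding edges have their weights summed."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        self.weights = defaultdict(float)  # (label, src bucket, dst bucket) -> sum

    def _key(self, src: str, dst: str, label: str):
        return (label, bucket(src, self.num_buckets), bucket(dst, self.num_buckets))

    def insert(self, src: str, dst: str, label: str, weight: float = 1.0) -> None:
        self.weights[self._key(src, dst, label)] += weight

    def edge_weight(self, src: str, dst: str, label: str) -> float:
        return self.weights.get(self._key(src, dst, label), 0.0)


# With a single bucket every vertex collides, so same-label edges accumulate.
demo = PerLabelSubsketches(num_buckets=1)
demo.insert("a", "b", "red")
demo.insert("c", "d", "red")
demo.insert("a", "b", "blue", 5.0)
```

Because colliding edges share a subsketch edge, a retrieved weight can only overestimate the true weight of the queried edge, never underestimate it.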
Observe that the graphical sketch in Figure 2 summarizes both the edge weights and the graph-stream topology. For instance, a query asking for the weight of can be evaluated by consulting Subsketch by hashing the endpoint vertexes of the query, mapping them to vertexes in , and retrieving the weight of the corresponding edge in . Also, the graphical sketch summarizes the topology of the graph stream to allow graph-traversal queries. For instance, a reachability query inquiring if Vertex is reachable from Vertex using only edges evaluates to true because there is a path connecting the two vertexes in Subsketch (i.e., ). Another example is a pattern query that estimates if there is a path of two edges from Vertex to Vertex , where the first edge is of Label , and the second is of Label . This is possible by expanding Edge by checking the outgoing edges from Vertex in the subsketch and discovering Edge . This forms a positive answer because a path of two valid connected edges from the two subsketches satisfies the query.
4 The Design of SBGSketch
In this section, we highlight the general structure of SBGSketch. Given a labeled graph, say , we create an SBGSketch instance, say , that summarizes Graph . Sketch considers the topology of so that approximating graph-traversal queries becomes possible. Assume that has vertexes, edges, and distinct edge labels. Then, we create Sketch that has matrices, where each matrix is a two-dimensional matrix. Notice that is much smaller than , the number of vertexes in the graph stream. Also, is much smaller than and in real labeled graphs, e.g., the number of interaction types in a protein-interaction network is much smaller than the number of proteins. This forms a three-dimensional matrix of dimensions . Notice that it is possible to create multiple independent sketches to summarize Graph for better accuracy (see Section 6.2.2).
Figure 3 illustrates the general structure of SBGSketch, where independent sketches can be created to summarize a graph. For illustration, assume that (i.e., we have only one sketch).
Consider SBGSketch for Graph . Each cell in maintains an aggregate for a set of streamed edges as Figure 3 illustrates. The maintenance of this aggregate may differ based on the query type that is supposed to answer (e.g., a counter to answer edgefrequency queries). An incoming edge is hashed into one of the cells in as explained in Section 4.1. Notice that if multiple sketches are used, each sketch will have a different hash function to hash the vertexes. Observe that each cell holds a pair of values, namely rank and aggregate. Section 4.2 elaborates on how the rank values are used, while Section 5 shows how the aggregate values are maintained for various query types. In the next section, we focus on how the streamed edges are mapped to sketching cells.
4.1 Mapping Streamed Edges to Sketching Cells
Mapping a streamed edge to a sketching cell is a fundamental operation to update the sketch. Mapping an edge to a cell is orthogonal to the sketch update logic that depends on the query type supported by the sketch. To illustrate how streamed edges are mapped to cells in SBGSketch, refer to Figure 4 that shows a single sketch . SBGSketch generates and uses a set of hash functions. One of these hash functions, namely , maps each vertex identifier to an integral value in the range [0, d1], i.e., can map any vertex to a row or column of any matrix of the matrices of Sketch . If multiple sketches, say sketches, are used, then different pairwiseindependent hash functions are generated and are used (i.e., one hash function per sketch). To allow traversing the matrices of a single sketch efficiently, SBGSketch uses the same hash function in all the matrices.
Recall that the number of matrices in a sketch is equal to the number of distinct labels of a graph, and we assume that the distinct edgelabels are known beforehand (e.g., the different types of social relationships in a social network). Refer to Figure 4. An incoming stream Edge = is mapped as follows. First, SBGSketch has a static onetoonemapping for each label to a corresponding matrix in the Sketch. So, Edge will be mapped to one of the cells in the matrix corresponding to Label , say . Hashfunction maps Sourcevertex , and Destinationvertex to a row, and a column in Matrix , respectively. So, the cell corresponding to Edge is Cell (, ) in Matrix , or Cell (, , ), for short. Notice that using the same hash function in all the rows and columns of all the matrices allows traversing the matrices of the sketch efficiently, otherwise, materializing the hash functions as in [22] would be necessary and additional memory would be consumed from the allocated memory. For example, to traverse the outgoing edges of the endvertex of Edge (i.e., Vertex ), where the edges are labeled by Label , we can check the second row of Matrix . The reason is that Vertex has been mapped by Function to the second column as Figure 4 illustrates, and that all the matrices are adjacency matrices using the same function.
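The mapping described above can be sketched as follows. The hash construction (a seeded BLAKE2b digest reduced modulo d), the function names, and the sample labels are assumptions made for illustration; the source only requires that the vertex hash function map identifiers to the range [0, d-1] and be shared by all matrices of a sketch.

```python
import hashlib
from typing import Dict, Tuple


def h(vertex: str, d: int, seed: int = 0) -> int:
    """Vertex hash into [0, d-1]; a different seed would be used per sketch."""
    data = f"{seed}:{vertex}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % d


def cell_of(src: str, dst: str, label: str,
            label_index: Dict[str, int], d: int) -> Tuple[int, int, int]:
    """Return (matrix index, row, column) for a streamed edge."""
    return label_index[label], h(src, d), h(dst, d)


def row_to_scan(vertex: str, d: int) -> int:
    """Row holding the outgoing edges of `vertex` in every matrix of the sketch."""
    return h(vertex, d)


# Hypothetical static one-to-one mapping from labels to matrices.
label_index = {"comment": 0, "share": 1, "like": 2}
d = 8
m, row, col = cell_of("u1", "u2", "share", label_index, d)
```

Since rows and columns use the same function, the column of an edge's destination equals the row to scan for that vertex's outgoing edges in any matrix, which is what makes traversal possible without materializing extra hash tables.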
4.2 The Ranking Logic in SBGSketch
Usually, edge-labeled graphs are skewed and unbalanced w.r.t. the frequency of edges per distinct edge label. For instance, in a social network, the number of family-type relationships may be much smaller than the number of friend-type relationships. This adds a challenge when building graph sketches for edge-labeled graphs. In particular, memory for summarizing edges of a specific label, say , should be proportional to the frequency of receiving edges of Label . Otherwise, precious parts of the sketch would be wasted (i.e., those matrices corresponding to low-frequency labels would have wasted cells). The challenge becomes more difficult when the frequency of labels is not known beforehand. In this case, initializing the matrices with different dimensions may become difficult or inaccurate.
SBGSketch addresses this challenge without requiring knowledge of the relative edge-label frequencies beforehand. The main idea of SBGSketch is to allocate matrices of the same dimensions to all the labels, and to allow an edge, say , of Label to use cells in matrices that do not correspond to Label . The intuition behind this approach is to allow high-frequency labels (e.g., ) to use other matrices corresponding to low-frequency labels. However, SBGSketch guarantees that the low-frequency labels can reclaim their cells that were occupied by high-frequency labels whenever needed.
To illustrate, consider an edge-labeled graph, say , with a total of five labels, i.e., L = 5. Let be SBGSketch for Graph , where consists of five matrices, namely, , , , , and as in Figure 5. Upon receiving Edge , SBGSketch assigns a rank vector to Edge before updating the sketch. In Section 4.3, we discuss how the rank vectors are generated and are assigned to edges. For now, it is sufficient to know that the values of a rank vector are a permutation of the values , where is the number of matrices in the sketch (i.e., the number of labels). For example, Figure 5 illustrates that Edge is assigned a ranking Vector, say , of values [0, 2, 1, 4, 3], where 0 is the highest rank, and 4 is the lowest rank. An element, say , of a rank vector for an edge determines the rank of the edge in Matrix . For instance, Figure 5 illustrates that the rank of Edge in Matrix , i.e., , is equal to 2.
Notice that each cell, say , in the sketch stores a rank value, say . Rank value represents the rank of the last edge that has updated Cell . For example, in Figure 5, the yellow cell in the top-left Matrix has a rank value of 1, which means that the last edge that has updated this cell has a rank value of 1 in Matrix . Whenever an edge, say , is hashed into a cell, say , the rank of Cell as well as the rank of Edge in the matrix hosting Cell determine if Edge can use Cell . In particular, a streamed edge can use and affect Cell if and only if the edge’s rank is higher than or equal to the current rank of Cell . Comparing rank value (i.e., the rank for Edge in Matrix ) to the rank value of a cell in Matrix , say , leads to the following three cases (notice that zero is the highest rank; refer to Figure 5 for illustration):
Evict and Occupy: If > , then evict the effect of all the edges that have affected Cell , use to update Matrix by the arrival of Edge , and set the rank of to the value of . This prevents any edge of rank lower than from evicting Edge from Cell . For instance, in Figure 5, Edge has a higher rank in Matrix than the last edge that has contributed to Cell in Matrix . Thus, the aggregate value in Cell is replaced by the value associated with Edge (e.g., may be set to 1 if the sketch is counting the frequency of receiving Edge ), and the rank of Cell is set to , which is zero in this example.
Update the Aggregate: If = , then update the aggregate of Cell , and leave the rank of Cell unchanged. This preserves the aggregation of the previous instances of this edge or other edges of the same rank that collide with Edge in Cell . For example, in Figure 5, Edge has the same rank in Matrix as the rank of the last edge that has contributed to Cell . Thus, the aggregate value in Cell is updated by the value associated with Edge , e.g., may be incremented by one if the sketch is counting the frequency of receiving Edge .
Do Nothing: If < , then do nothing to Cell . This means that the last edge, say , that has contributed to Cell has a higher rank that prevents Edge from evicting Edge or even contributing to Edge ’s aggregate value. For example, in Figure 5, Edge has a lower rank in Matrix than that of the last edge that has contributed to Cell . Thus, the value in Cell is kept unaffected.
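The three cases above can be sketched for a single cell as follows. Two conventions here are assumptions: ranks are compared numerically with smaller values meaning higher priority (zero being the highest rank, as stated in the text), and an empty cell is initialized with a rank value equal to the number of labels so that any edge can occupy it. The names `Cell` and `apply_edge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    rank: int               # rank of the last edge that updated the cell
    aggregate: float = 0.0  # aggregate maintained for the occupying edges


def apply_edge(cell: Cell, edge_rank: int, weight: float = 1.0) -> str:
    """Apply one edge arrival to one cell and report which case fired."""
    if edge_rank < cell.rank:
        # Evict and occupy: the edge outranks the current occupant.
        cell.aggregate = weight
        cell.rank = edge_rank
        return "evict"
    if edge_rank == cell.rank:
        # Update the aggregate: same rank, so the contributions are merged.
        cell.aggregate += weight
        return "update"
    # Do nothing: the occupant has a higher rank than the edge.
    return "nothing"


# A sketch with five labels: the sentinel rank 5 marks an empty cell.
cell = Cell(rank=5)
outcomes = [apply_edge(cell, r) for r in (2, 2, 3, 0)]
```

Note that the stored rank value never increases numerically, so a cell claimed with rank zero can no longer be evicted.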
SBGSketch guarantees that all edges of Label have the highest priority in the matrix corresponding to Label , i.e., Matrix . This guarantees that all the edges with Label have the highest rank of zero in their “home” matrix . Hence, an edge of Label , where , can possibly use a cell, say , in Matrix , if has never been used by an edge of Label . Moreover, edges with Label are given the privilege to evict lowerranked edges in Cell , use the cell, and disallow any edge not labeled by Label to use Cell . This is achieved by updating the rank of Cell with zero, the highest rank that cannot be evicted.
Notice that any query, say , processing Edge , should consult only the cells that hold the ranks of Edge . For example, Query , regardless of its type, when retrieving Edge from SBGSketch, will consult only matrices , , and in Figure 5. The reason is that the values at these cells may represent contributions by Edge . However, Matrices and should not be considered when querying Edge as their ranks guarantee that Edge has not contributed to their current aggregate values, otherwise, they would hold the ranks corresponding to the ranks of Edge . Notice that SBGSketch does not allow an edge to use more than one cell per matrix. Using more than one cell per matrix would increase the processing time as well as the collision rate, which may decrease the approximation accuracy. However, using one cell per matrix gives each edge a chance to use a cell that might be unoccupied in each matrix.
4.3 Generation and Mapping of Rank Vectors
In this section, we describe how SBGSketch generates the rank vectors as well as how a rank vector is assigned to an edge. Recall that a rank vector, for a given Graph , is a permutation of the integer values , where is the number of distinct edgelabels of Graph . Also, recall that zero is the highest rank. SBGSketch accepts an initialization parameter, namely , that corresponds to the number of distinct random rankvectors that SBGSketch generates and uses. SBGSketch restricts that the number of rank vectors is upperbounded by the factorial of , i.e., . This restriction makes it possible to generate rank vectors that are all unique.
For illustration, refer to an example for generating rankvectors in Figure 6. In the figure, we assume that SBGSketch is initialized for Graph that has five distinct edgelabels (i.e., = 5), and that the number of the rank vectors to generate is four (i.e., = 4). Using these parameters, SBGSketch generates and materializes four random rankvectors that are different permutations of the values . Notice that zero is not considered in the materialized rankvectors as in Figure 6. Rank zero is injected on the fly into a rank vector when that vector is selected for an incoming edge, where the injection position is controlled by the edge’s label.
To illustrate how a rank vector including zero is assigned to a streamed edge, assume that SBGSketch receives Edge as in Figure 6. SBGSketch uses a hash function, namely , that hashes Edge into a value in the integral range . In the example in Figure 6, Function accepts the source vertex, the destination vertex, and the label of Edge as inputs to hash Edge into either 0, 1, 2, or 3. In this example, Edge is assigned as its rank vector. However, does not include Rank zero that defines the matrix where Edge has the highest rank. SBGSketch uses the label of Edge to augment the selected rankvector with the zero rankvalue. This augmentation assures that Edge has the highest rank in the matrix corresponding to the label of Edge . For example, as the label of Edge is , SBGSketch injects zero into the first element in the generated rankvector, i.e., the assigned rankvector becomes . Notice that if Edge had another label, say , then Rank zero would be injected in the location corresponding to Matrix .
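A minimal sketch of rank-vector generation and assignment, assuming rejection sampling for uniqueness and a BLAKE2b-based edge hash; neither choice is prescribed by the source. The uniqueness check below uses the bound (L-1)!, which is simply the number of distinct permutations of the materialized values 1..L-1 and is an assumption of this sketch. All function and label names are hypothetical.

```python
import hashlib
import math
import random
from typing import Dict, List, Tuple


def generate_rank_vectors(num_labels: int, p: int, seed: int = 0) -> List[Tuple[int, ...]]:
    """Materialize p distinct random permutations of 1..num_labels-1 (no zero)."""
    if p > math.factorial(num_labels - 1):
        raise ValueError("not enough distinct permutations for the requested p")
    rng = random.Random(seed)
    base = list(range(1, num_labels))
    vectors: List[Tuple[int, ...]] = []
    seen = set()
    while len(vectors) < p:
        perm = base[:]
        rng.shuffle(perm)
        if tuple(perm) not in seen:
            seen.add(tuple(perm))
            vectors.append(tuple(perm))
    return vectors


def assign_rank_vector(src: str, dst: str, label: str,
                       label_index: Dict[str, int],
                       vectors: List[Tuple[int, ...]]) -> List[int]:
    """Pick a materialized vector by hashing the edge, then inject rank zero."""
    data = f"{src}|{dst}|{label}".encode("utf-8")
    g = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % len(vectors)
    ranks = list(vectors[g])
    ranks.insert(label_index[label], 0)  # zero lands at the label's home matrix
    return ranks


label_index = {f"l{i}": i for i in range(5)}  # five hypothetical labels
vectors = generate_rank_vectors(num_labels=5, p=4)
rv = assign_rank_vector("v1", "v2", "l3", label_index, vectors)
```

Hashing on the source vertex, destination vertex, and label means every arrival of the same edge receives the same rank vector, which the query logic relies on.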
5 Query Estimation
In this section, we describe how SBGSketch estimates the results of various query types. In particular, Section 5.1 elaborates on frequencybased queries (e.g., the constrained edgefrequency query), and Section 5.2 describes how SBGSketch estimates graphtraversal queries (e.g., the constrained reachability query). For each query type, we demonstrate how the sketch is updated when receiving a streamed edge as well as how the sketch is queried to evaluate the query approximately.
5.1 FrequencyBased Queries
For frequencybased queries, we assume that a streamed edge is associated with an attribute, say , of a numerical type that can be aggregated. Without loss of generality, we term this query type a frequencybased query. Applications usually use this query type to estimate the occurrence frequency of a given edge or subgraph in a stream.
5.1.1 Edge Queries
Given two vertex identifiers and and an edge label , let be the exact aggregated edgeweight from Vertex to Vertex , where the edge is labeled by Label . Furthermore, let be the estimated weight of this edge.
Query represents an edgequery. For instance, in a social network, one may estimate the number of comments from User on a post by User , where a comment is represented as a directed edge from User to User with an edgelabel “comment” (other interactions could be represented by other edge labels).
Insertion of Edges: Algorithm 1 depicts how SBGSketch is maintained when inserting an edge so that edge queries can be estimated later. Refer to Figure 8 for illustration. Assume that an instance of SBGSketch is built for processing a graph of five labels, and is receiving Edge with Rank vector . Assume further that the sketch is built to perform a sum aggregation on the weight attribute that is set to one for Edge . Figure 8(a) gives SBGSketch just before receiving Edge , where each cell, say , holds an aggregate of the weights of the edges that have contributed to it, and the rank of the last edge that has contributed to Cell . For instance, in Figure 8(a), the highlighted cell in Matrix illustrates that the aggregated sum in the cell is 4, and that the last edge with the highest rank that has contributed to this cell has Rank 1 for Matrix . To update SBGSketch with Edge , the corresponding rank vector of Edge is computed as illustrated in Section 4.3. Then, the cells that are potential candidates for use by Edge are selected. In particular, the cell corresponding to in each matrix is a potential candidate (Figure 8(a) highlights these cells in yellow).
As an optimization, the cells corresponding to in each matrix are physically stored in contiguous memory, thus exhibiting high locality of memory access (i.e., the matrices given in Figure 8(a) are a logical representation of a single larger physicalmatrix). As explained in Section 4.2, according to the rank values in the potential candidate cells and the ranking vector of Edge , only a subset of these potential cells may be updated by Edge (we term these cells candidate cells). Figure 8(b) illustrates that Edge evicts the value at Matrix because Edge has the highest rank in Matrix (Lines in Algorithm 1). Eviction also happens in Matrix . However, in Matrix , the ranks are equal, so Edge increments the aggregate value of the corresponding cell (Lines in Algorithm 1). For the last two matrices, the cells are occupied by edges with higher ranks, so Edge is prevented from using these cells.
Notice that we update the cell of Matrix in Figure 8(b) for illustration purposes only. However, as an accuracy optimization, SBGSketch does not need to update that cell and will leave its value to be 4. The reason is that, in this case, SBGSketch can guarantee that Edge has never been received before. Otherwise, the candidate cell in Matrix of Figure 8(a) would have Rank zero if Edge had been encountered before. Hence, the cell in Matrix does not need to be incremented. After updating the sketch with ’s arrival, the value 4 in Matrix will implicitly count the arrival of without any changes to Matrix . Thus, the value of the candidate cell in Matrix of Figure 8(b) will be kept unchanged (i.e., with value 4) to help increase the estimation accuracy.
Complexity: Updating the sketch with an incoming edge takes time, where is the number of distinct labels.
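The insertion procedure can be sketched as one pass over the matrices, which is where the linear dependence on the number of labels comes from. This is not a transcription of Algorithm 1: the nested-list layout, the empty-cell sentinel rank (equal to the number of labels), and the function names are assumptions, the row, column, and rank vector are taken as already computed, and the accuracy optimization discussed above is omitted. The three-label example is made up and does not reproduce Figure 8.

```python
from typing import List

Sketch = List[List[List[List[float]]]]  # sketch[matrix][row][col] = [rank, aggregate]


def new_sketch(num_labels: int, d: int) -> Sketch:
    """Create num_labels d-by-d matrices whose cells start empty."""
    return [[[[num_labels, 0.0] for _ in range(d)] for _ in range(d)]
            for _ in range(num_labels)]


def insert_edge(sketch: Sketch, row: int, col: int,
                rank_vector: List[int], weight: float = 1.0) -> None:
    """Visit the candidate cell of every matrix once and apply the rank rules."""
    for matrix, edge_rank in zip(sketch, rank_vector):
        cell = matrix[row][col]
        if edge_rank < cell[0]:      # evict and occupy
            cell[0], cell[1] = edge_rank, weight
        elif edge_rank == cell[0]:   # same rank: update the aggregate
            cell[1] += weight
        # otherwise the occupant outranks the edge: do nothing


sketch = new_sketch(num_labels=3, d=2)
e1 = [0, 2, 1]  # hypothetical rank vector of an edge whose home is matrix 0
e2 = [1, 0, 2]  # hypothetical rank vector of an edge whose home is matrix 1
for rv in (e1, e2, e1):  # both edges hash to cell (0, 1) in this example
    insert_edge(sketch, 0, 1, rv)
cells = [tuple(m[0][1]) for m in sketch]
```

In the example, the second edge reclaims its home cell in matrix 1 from the first edge, while the first edge keeps accumulating in matrices 0 and 2.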
Edge-Query Estimation: Algorithm 2 depicts how SBGSketch estimates the answer to an edge-frequency query. Figure 9 illustrates how Query is evaluated. First, the endpoint vertexes are hashed to determine the candidate cells to check at each matrix. Then, only the candidate cells with ranks equal to those of the queried edge (i.e., ) are considered by computing the minimum value of these cells (Lines of Algorithm 2). This guarantees that the estimate might be an overestimate of the actual answer, but can never be an underestimate (as each edge is guaranteed to have the highest rank in one matrix). If multiple sketches are used, then the minimum value of the results from all the sketches will form the final answer. Notice that if any one of the candidate cells has a rank higher than the corresponding rank of the edge query, then SBGSketch returns zero, indicating with certainty that the edge has never been encountered before (see Theorem 5.1.1).
For the sake of completeness, we provide a theoretical errorestimate of SBGSketch’s error distribution.
Complexity: Approximating the aggregate weight of an edge takes time, where is the number of distinct labels.
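The estimation step can be sketched as follows, again as an illustration rather than a transcription of Algorithm 2. Ranks are compared numerically (zero is the highest rank), and the zero-return condition is read here as "a candidate cell stores a numerically larger rank value than the queried edge has in that matrix"; this reading is an assumption, chosen because an arrival of the edge would have lowered the stored value to at most the edge's own rank. The layout, sentinel, and names are hypothetical, and the small insertion helper is included only to build an example.

```python
from typing import List

Sketch = List[List[List[List[float]]]]  # sketch[matrix][row][col] = [rank, aggregate]


def new_sketch(num_labels: int, d: int) -> Sketch:
    return [[[[num_labels, 0.0] for _ in range(d)] for _ in range(d)]
            for _ in range(num_labels)]


def insert_edge(sketch: Sketch, row: int, col: int,
                rank_vector: List[int], weight: float = 1.0) -> None:
    for matrix, edge_rank in zip(sketch, rank_vector):
        cell = matrix[row][col]
        if edge_rank < cell[0]:
            cell[0], cell[1] = edge_rank, weight
        elif edge_rank == cell[0]:
            cell[1] += weight


def estimate_edge(sketch: Sketch, row: int, col: int, rank_vector: List[int]) -> float:
    """Minimum aggregate over candidate cells whose rank equals the edge's rank."""
    matching = []
    for matrix, edge_rank in zip(sketch, rank_vector):
        cell_rank, aggregate = matrix[row][col]
        if cell_rank > edge_rank:
            return 0.0  # the edge would have claimed this cell had it ever arrived
        if cell_rank == edge_rank:
            matching.append(aggregate)
    return min(matching) if matching else 0.0


sketch = new_sketch(num_labels=3, d=2)
e1, e2, e3 = [0, 2, 1], [1, 0, 2], [2, 1, 0]  # hypothetical rank vectors
for rv in (e1, e2, e1):                        # e3 is never inserted
    insert_edge(sketch, 0, 1, rv)
estimates = [estimate_edge(sketch, 0, 1, rv) for rv in (e1, e2, e3)]
```

With several independent sketches, the same function would be evaluated on each sketch and the minimum of the results returned.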
Theorem (Informal)
Let be the number of priorities and be the number of arrivals of edge during an observation window, where is the set of vertexes in the graph stream. Let be the upper bound on given by SBGSketch and let be the absoluteerror distribution given by TCM with the same number of sketch counter cells. Let be the number of independent hash functions used in SBGSketch and TCM. Then, the distribution of the absolute error is
where
is a lower bound on the probability that
is hashed into one of the lower-priority sketch matrices and survives eviction ( is given in Appendix A.1).
A formal version of Theorem 5.1.1 and its proof are left to Appendix A.1. Theorem 5.1.1 shows that, for the same number of sketch cells, the absolute error of SBGSketch is smaller than that of TCM. For values of , the value of tends to increase quickly with until it reaches the probability that the counter related to a given label is evicted in the sketch matrices of other labels. As an illustrative example, we use the equations in Theorem 5.1.1 to plot the curves in Figure 7. Figure 7 gives the complementary cumulative distribution of the absolute error of edges of the most frequent label, say Label , out of 100 distinct labels in SBGSketch against that of TCM when taking into account a 10% decrease in the sketch matrix size due to the ranking data structure. We set the edge arrival rate of Label to be such that there is an average of edge collisions per sketch counter, which we define as that of other labels; we consider only one hash function for simplicity (). Note that SBGSketch gets absolute error of zero with probability while TCM with the same probability gets absolute error around 40. This happens because if SBGSketch is not able to have the Label counter evicted from the sketches of other labels, it will very likely contain the correct number of edge arrivals of edges of Label .
Theorem
Using SBGSketch, .
Proof. Refer to Appendix A.2.
5.1.2 SubGraph Queries
Aggregating edgeweights of a subgraph is considered in both gSketch [23] and TCM [22]. However, SBGSketch expands the semantics of subgraph aggregate queries to allow restricting the subgraph query by the edge labels. Given a subgraph, say , identified by a set of labeled edges, say , an exact subgraph aggregation query returns the minimum of the weights of all the edges listed by . We denote the estimate of a subgraph aggregation query by , and we adopt the semantics that if the estimated frequency of any edge in is , we estimate to be . The reason is that the subgraph identified by does not have an exact match in the stream. Notice that inserting edges in a sketch that supports subgraph queries follows the exact logic of Algorithm 1. Also, evaluating a subgraph query depends on evaluating the edgeweight estimate of each edge forming the subgraph (by directly referencing Algorithm 2).
Complexity: Approximating a subgraph query, say , takes time, where is the number of distinct labels, and is the number of edges of Query .
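The subgraph semantics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `estimate_subgraph`, the `(source, destination, label)` edge tuple, and the `estimate_edge` callback (standing in for the edge-weight estimator of Algorithm 2) are all hypothetical, and we assume that the elided estimate for a missing edge is zero.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical edge representation: (source, destination, label).
Edge = Tuple[str, str, str]


def estimate_subgraph(edges: Iterable[Edge],
                      estimate_edge: Callable[[Edge], int]) -> int:
    """Estimate a constrained subgraph query as the minimum edge estimate.

    `estimate_edge` stands in for the per-edge estimator; if any edge is
    estimated as absent (weight 0, an assumption), the whole subgraph is
    estimated as 0 because it has no exact match in the stream.
    """
    smallest = None
    for edge in edges:
        weight = estimate_edge(edge)
        if weight == 0:
            return 0  # one missing edge means the subgraph does not match
        if smallest is None or weight < smallest:
            smallest = weight
    # An empty query is treated as having no match (an assumption).
    return 0 if smallest is None else smallest
```

The early return only short-circuits the scan; taking the minimum over non-negative estimates would give the same answer.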
5.2 Graph Traversal Queries
Graph traversal queries on labeled graphs arise in many application domains. For example, reachability queries are used in communication-network troubleshooting, and random walking on edge-labeled graphs is a primitive operation in many machine learning techniques (e.g., [13, 14, 20]). In this section, we demonstrate how SBGSketch is useful in estimating constrained-reachability queries.
Given two vertexes, say and , and a set of allowed labels, say , a constrained reachability query returns true if and only if there is a path, say , from Vertex to Vertex such that each edge of Path is labeled by any of the ’s labels. We denote the estimate of a reachability Query as .
Constrained reachability queries are important primitive operations in many application domains. For instance, in a proteininteraction network, where a vertex represents a protein and a labeled edge represents an interaction type between two proteins, a user may need to estimate if two proteins interact directly or indirectly through covalent or stable interaction types, i.e., evaluating . In machinelearning applications, one can use constrainedreachability queries as a way to construct feature vectors, indicating whether or not Vertex can reach Vertex using only edges of certain labels. These features can be used for link prediction tasks (similar to Sun et al. [20]), among other applications where the learning algorithm can tolerate reachability approximation (i.e., false positives).
Insertion of Edges: To support reachability queries, the same logic used to insert edges for edge queries could be applied (see Algorithm 1). However, it is sufficient to use edge weights of one to indicate edge existence between two vertexes, where an edge weight of zero in the adjacency matrices of the sketch flags that no edge exists.
Reachability-Query Estimation: Any traversal-based reachability algorithm (e.g., depth-first search) can be used to traverse the adjacency matrices of SBGSketch to evaluate a constrained-reachability query. However, the algorithm should check only the edges labeled by any of the labels allowed by the query. Notice that if multiple sketches are used, then each sketch evaluates the query independently, and the independent results are combined with a logical AND to form the final answer. The time complexity of the evaluation is determined by the third-party traversal algorithm used on the summarized topology of the sketch.
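As a rough illustration of this evaluation, the following Python sketch traverses per-label 0/1 adjacency matrices and combines the per-sketch answers with a logical AND. It is a simplification under stated assumptions: each sketch is modeled as one square matrix per label plus a vertex-to-bucket hash function, the ranking and eviction logic of SBGSketch is omitted, breadth-first search is used as the traversal, and all names (`reachable_in_sketch`, `estimate_reachability`, `bucket_of`) are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

Matrix = List[List[int]]
# Hypothetical sketch model: (label -> square 0/1 matrix, vertex -> bucket).
Sketch = Tuple[Dict[str, Matrix], Callable[[str], int]]


def reachable_in_sketch(matrices: Dict[str, Matrix],
                        bucket_of: Callable[[str], int],
                        src: str, dst: str,
                        allowed_labels: Iterable[str]) -> bool:
    """Breadth-first search over the matrices of the allowed labels only."""
    start, goal = bucket_of(src), bucket_of(dst)
    allowed = [matrices[l] for l in allowed_labels if l in matrices]
    seen = {start}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        if u == goal:
            return True
        for matrix in allowed:
            for v, weight in enumerate(matrix[u]):
                if weight != 0 and v not in seen:  # non-zero cell: edge exists
                    seen.add(v)
                    frontier.append(v)
    return False


def estimate_reachability(sketches: Iterable[Sketch], src: str, dst: str,
                          allowed_labels: Iterable[str]) -> bool:
    """AND the independent per-sketch answers to form the final estimate."""
    labels = list(allowed_labels)
    return all(reachable_in_sketch(m, h, src, dst, labels)
               for m, h in sketches)
```

Because a hash collision can only add spurious edges, a single sketch may report a false positive; requiring every sketch to agree lets one collision-free sketch veto it.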
6 Experimental Evaluation
In this section, we experimentally evaluate the accuracy and the performance of SBGSketch against TCM [22], the only state-of-the-art method that is comparable to SBGSketch w.r.t. query expressiveness. We measure the processing time and the estimation error using various types of queries on real datasets from different domains. For measuring the estimation errors, we use the average relative-error metric (ARE, for short). As defined in [23, 22], given Query , the relative error of , say , is defined as:
where is the actual result of Query , and is its estimated value. The average relativeerror is computed over a set of queries, say , as follows:
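As a minimal sketch of these two definitions, the Python below assumes the relative error of a query is the estimated value divided by the actual value, minus one (equivalently, the overestimation as a fraction of the actual value), and that ARE is its arithmetic mean over the query set. The function names and the `(estimated, actual)` pair layout are hypothetical.

```python
from typing import Iterable, Tuple


def relative_error(estimated: float, actual: float) -> float:
    """Relative error of one query; `actual` must be non-zero."""
    return estimated / actual - 1.0


def average_relative_error(results: Iterable[Tuple[float, float]]) -> float:
    """Mean relative error over (estimated, actual) pairs of a query set."""
    pairs = list(results)
    return sum(relative_error(e, a) for e, a in pairs) / len(pairs)
```

For instance, a query set with one estimate of 12 against an actual value of 10 and one exact estimate has an ARE of about 0.1, i.e., a 10% overestimation on average.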
6.1 Datasets and the Experimental Setup
We use real datasets of labeled graphs from three different domains (communication network, social network, and biological network). We use IPFlow [2], Youtube [21], and String [1]. Table 1 summarizes the properties of the aforementioned datasets, and gives the number of distinct labels of each dataset. To verify the label-skewness in real datasets, we checked all the datasets in Table 1 and found that of the edges are labeled by only of the labels (i.e., frequent labels). The IPFlow dataset is a collection of anonymized communication-traces from CAIDA’s equinix-Chicago monitor, where the edge labels represent the communication protocol used (e.g., HTTPS, SMTP, Telnet). The Youtube dataset is a subset of the popular video-sharing service, where the vertexes represent users, the edges represent user interactions, and the edges are labeled by user-interaction types as described in [21]. The String dataset is a protein-interaction network, where the vertexes represent proteins, the edges represent interactions among the proteins, and the labels represent the protein-protein interaction types. Notice that the number of labels in all the real labeled datasets is much smaller than the number of the vertexes and the edges.
Both SBGSketch and TCM [22] are implemented as C++ libraries that can be used as components by any server. Our experiments are conducted on a machine running Windows 10 on 4 cores of Intel i7 3.40 GHz and 16 GB of main memory. Notice that TCM is not designed to deal with labeled graphs; however, we follow the suggestion of the original paper [22] by creating a matrix for each label. Hence, SBGSketch becomes TCM if the ranking logic is removed. For fairness, we use the same memory sizes for both SBGSketch and TCM. Also note that if the rank-logic data structures of SBGSketch do not reduce the number of sketch counters, the accuracy of SBGSketch is lower-bounded by the accuracy of TCM. The reason is that an edge in SBGSketch is given the highest priority in the matrix corresponding to its label. Hence, it is guaranteed that an edge will experience the same hash-collision rate in the matrix corresponding to its label in both SBGSketch and TCM. For this reason, we focus on queries constrained with labels of high frequency as they show the power of SBGSketch to leverage unused cells in the matrices corresponding to less-frequent labels.
Table 1: Properties of the datasets.
| Dataset | # Vertexes | # Edges | # Labels |
|---|---|---|---|
| IP Flow | 237,022 | 22,497,005 | 39 |
| Youtube | 15,088 | 13,628,895 | 5 |
| String Protein Network | 1,520,673 | 348,473,440 | 45 |
6.2 Constrained EdgeQueries
6.2.1 Varying the Sketch Size
In this set of experiments, we study the accuracy of approximating constrained edge-queries using SBGSketch and TCM, the state-of-the-art. We fix the number of hash functions to two (i.e., we set in Figure 3), and we measure the estimation accuracy for various sketch sizes. A sketch size is determined using a sketch-size factor. A sketch-size factor, say , is a value between and exclusive that defines the memory size of the sketch w.r.t. the size of the original graph dataset. For instance, if the size of the original graph dataset is MBs, and the sketch-size factor is , then the total memory size of a single sketch will be upper bounded by MBs. We apply this for each experiment so that both SBGSketch and TCM are assigned the same memory size for fair comparisons.
We generate 10,000 constrained edge-queries, say , randomly for each dataset, say , in Table 1. Then, we run Queryset on SBGSketch and TCM with different sketch-size factors, specifically, , , , , , , and . We stream all the edges of each dataset in Table 1 before running the query sets (e.g., for the String dataset, the sketch receives 348 million edges and then we run the 10,000 queries). Figure 10 gives the average relative-error when running as formerly described using each dataset of Table 1. As expected, the average relative-error decreases when increasing the sketch size for both SBGSketch and TCM. The reason is that the number of collisions decreases as the sketch size increases, and the average relative-error decreases accordingly. For all the datasets, the ARE of SBGSketch is less than that of TCM. We attribute this to the rank-vectors and the ranking logic of SBGSketch, which are the main differences between SBGSketch and TCM (i.e., removing the ranking logic turns SBGSketch into TCM). Notice that for the same memory size, the number of cells allocated to TCM is higher than that allocated to SBGSketch (a cell in SBGSketch uses an additional byte for the rank). Although the number of cells in TCM is higher, SBGSketch achieves a better average relative-error as the ranking logic automatically handles label skew, and leverages the cells that may not be used by low-frequency labels. In contrast, an edge in TCM of Label assigned to Matrix can never use a cell in another matrix, say Matrix , even if is not fully occupied by edges of Label .
Notice that the accuracy of SBGSketch relative to that of TCM increases as the graph size increases (which is used also to define the sketch size). To illustrate, we measure the TCM error that SBGSketch reduces (e.g., a % error reduction means that the average relative-error of SBGSketch is only % of the error in TCM). Figure 11 gives the error reduction achieved by SBGSketch compared to TCM for all the datasets. SBGSketch significantly reduces the error of TCM, where the error reduction reaches % for the large String dataset, i.e., the estimation error of SBGSketch is just % of the TCM error on the same setup. Notice that the error reduction increases as the graph size increases. For example, the reduction reaches % for the Youtube dataset, where the size of the Youtube dataset is relatively smaller than that of the IPFlow dataset (whose error reduction reaches %). We attribute this to the cell-utilization effectiveness of the ranking module of SBGSketch. In contrast, TCM is vulnerable to wasting more cells if they are assigned to larger matrices of labels that are sparsely represented by graph edges.
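The error-reduction measure can be written down directly from its description: a reduction of x% means the ARE of SBGSketch is (100 − x)% of the ARE of TCM. The helper below is a hypothetical sketch of that arithmetic; the function name and the example values in the comment are illustrative and are not results from the experiments.

```python
def error_reduction_percent(are_sbg: float, are_tcm: float) -> float:
    """Percentage of the TCM error removed by SBGSketch.

    `are_tcm` must be non-zero. With made-up inputs are_sbg=0.2 and
    are_tcm=1.0, SBGSketch keeps 20% of the TCM error, so the reduction
    is about 80%.
    """
    return 100.0 * (1.0 - are_sbg / are_tcm)
```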
The results in Figure 10 illustrate that the accuracy of SBGSketch is significantly and consistently better than that of TCM over real data from different domains. For example, consider the String protein-interaction network in Figure 10(c). When setting the sketch size to of the String dataset size, the average relative-error of SBGSketch is , which is significantly better than , the average relative-error of TCM for the same setup (i.e., TCM overestimates the queries by of the actual values on average, while the overestimation by SBGSketch is just ). Moreover, the accuracy of SBGSketch increases when increasing the number of the pairwise-independent hash functions as demonstrated in Section 6.2.2.
6.2.2 Using Multiple Hash Functions
In this experiment, we measure the effect of using multiple sketches. Each sketch uses a different hash function to hash the vertexes into rows and columns of its matrices. The hash functions form a set of pairwise-independent hash functions. In particular, we vary , the number of hash functions (as in Figure 3), while fixing the memory size of each sketch. Hence, the total memory size increases with . There are two reasons behind the setup of this experiment. First, we need to study the effect of increasing the number of hash functions while fixing all the other parameters including the dimensions of each matrix. Second, the setup is consistent with the experimental setup reported by TCM [22].
In this experiment, we use the same set of queries described in Section 6.2, namely . The query set executes after inserting into the sketch the entire datasets described in Table 1. We fix the size of a single sketch to be of the size of each queried dataset, and we vary the number of the hash functions from to . Figure 12 gives the average relative-error when running as formerly described using each dataset of Table 1. The average relative-error decreases as the number of pairwise-independent hash functions increases for both methods. The reason is that each sketch hashes the same edges differently, and this allows the edges to collide differently in each sketch. At query time, the results from all the sketches are used, and the most accurate one dominates as the final result (as explained in Section 5). Throughout, SBGSketch consistently provides better accuracy than TCM. Notice that each sketch is updated and is queried independently. The advantage of processing the sketches independently is twofold. First, the accuracy is enhanced as each sketch summarizes the graph stream differently. Second, updating and querying the sketches can be performed in parallel, which allows performance gains.
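The effect of several independently hashed sketches can be illustrated with a simplified Python model. The sketch below keeps one counter matrix per label in each of several sketches and answers an edge query with the minimum counter across sketches, since collisions can only inflate a counter. It deliberately omits the ranking and eviction logic of SBGSketch, so it behaves like the per-label baseline; the class name `MultiHashSketch` is hypothetical, and the seeded `blake2b` digest is only a stand-in for a pairwise-independent hash family.

```python
import hashlib
from typing import Iterable


class MultiHashSketch:
    """Per-label counter matrices replicated under several hash functions."""

    def __init__(self, width: int, num_hashes: int, labels: Iterable[str]):
        self.width = width
        self.num_hashes = num_hashes
        label_list = list(labels)
        # counters[i][label] is a width x width matrix for hash function i.
        self.counters = [
            {l: [[0] * width for _ in range(width)] for l in label_list}
            for _ in range(num_hashes)
        ]

    def _bucket(self, vertex: str, i: int) -> int:
        # Seeded digest as a stand-in for the i-th independent hash function.
        digest = hashlib.blake2b(f"{i}:{vertex}".encode(),
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def insert(self, src: str, dst: str, label: str, weight: int = 1) -> None:
        for i in range(self.num_hashes):
            row, col = self._bucket(src, i), self._bucket(dst, i)
            self.counters[i][label][row][col] += weight

    def estimate(self, src: str, dst: str, label: str) -> int:
        # Every sketch overestimates, so the smallest counter is the best one.
        return min(
            self.counters[i][label][self._bucket(src, i)][self._bucket(dst, i)]
            for i in range(self.num_hashes)
        )
```

Each sketch is written and read independently of the others, which is what makes parallel updates and queries possible.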
6.3 Constrained SubGraph Queries
In this set of experiments, we use all the datasets in Table 1 to evaluate the accuracy of estimating constrained subgraph queries. We generate constrained subgraph queries randomly with different variations (triangle queries, paths of different lengths, and connected subgraphs). We measure the accuracy of SBGSketch w.r.t. TCM for various sketch sizes while fixing the number of hash functions to two (i.e., = 2). We expect results that agree with those of the edge-queries in Figure 10 as the edge-query logic is used as a primitive to evaluate subgraph queries. This set of experiments confirms this expectation as illustrated in Figure 13. However, the average relative-error is lower for both SBGSketch and TCM than for edge queries. We attribute this reduction to the conceptual evaluation of the subgraph queries (see Section 5.1.2). In particular, the query result is dominated by the query edge of least frequency. Hence, the relative error on average decreases in contrast to the error in estimating individual edge queries. Notice that SBGSketch always reduces the estimation error over TCM (refer to Section 6.2), where the ranking logic of SBGSketch leverages more cells than TCM to reduce collisions in the presence of skewed-label distributions. Notice that SBGSketch handles this skew without any advance knowledge of the label distribution.
6.4 Constrained Reachability Queries
In this set of experiments, we measure the effectiveness of SBGSketch in estimating constrained reachability queries. Notice that a reachability query that evaluates to true on the original graph will always evaluate to true using a sketch of that graph. This is true for both SBGSketch and TCM as they both preserve all the connectivity of the input graph streams. However, due to edge collisions, both methods are vulnerable to false positives, i.e., a reachability query that evaluates to false on the original graph might be estimated as true using a sketch of the original graph. Hence, in this set of experiments, we generate random constrained reachability queries with actual results of on the original graphs (i.e., not reachable), and we measure how many of them are detected as unreachable by both SBGSketch and TCM, i.e., we measure the recall of the true-negatives, which is similar to the metric used in [22] to evaluate the effectiveness of TCM on estimating reachability queries.
We evaluate the true-negative recall of constrained-reachability queries using all the datasets listed in Table 1. We generate random reachability queries, say , where each query is constrained to use up to half the labels of the queried graphs. We ensure that all the generated queries are not reachable in the original graphs. We run Queryset with different sketch-size factors on the x-axis of Figure 14 while fixing the number of hash functions to two. The y-axis gives the percentage of the true-negatives recall (the higher the better). Figure 14 illustrates that SBGSketch recalls true-negatives effectively even for small sketch sizes. Figure 14(a) illustrates that SBGSketch and TCM have accuracy of , and , respectively, when fixing the sketch size to only of the graph stream size (i.e., SBGSketch is more accurate than TCM). SBGSketch estimates correctly over % of the queries when setting the sketch size to of the graph size for the IPFlow and the String datasets (see Figure 14(a) and Figure 14(c)), where the accuracy reaches up to %. We attribute this gain in accuracy to the ranking logic of SBGSketch that automatically balances the filling of the sketch matrices and overcomes the issue of skewed labels, thereby decreasing the overall hash-collision rates.
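Since every generated query is unreachable in the original graph, the true-negative recall reduces to the percentage of queries that the sketch also reports as unreachable. A minimal sketch of that computation follows; the function name and the list-of-booleans input are hypothetical.

```python
from typing import Iterable


def true_negative_recall(estimated_answers: Iterable[bool]) -> float:
    """Percentage of truly unreachable queries estimated as unreachable.

    Each element is the sketch's answer (True = reachable) for a query
    whose actual answer is known to be unreachable; the input must be
    non-empty.
    """
    answers = list(estimated_answers)
    correct = sum(1 for reachable in answers if not reachable)
    return 100.0 * correct / len(answers)
```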
6.5 ProcessingTime Efficiency
In this set of experiments, we measure the time of constructing SBGSketch for each dataset in Table 1. Notice that inserting edges into SBGSketch during sketch construction has the same time complexity as evaluating edge queries, and the same holds for TCM [22]. As SBGSketch performs more logic related to rank-values maintenance, we expect SBGSketch to take additional construction time compared to TCM. In this experiment, we fix the sketch size to be of the graph size and compare the construction time of both SBGSketch and TCM. The y-axis in Figure 15 gives the construction time in milliseconds for the three datasets listed in Table 1. Notice that the construction time of SBGSketch is comparable to that of the simpler construction logic of TCM. We observe an average of % time increase over all datasets. This construction-time increase is acceptable given the significant gain in accuracy in SBGSketch.
7 Related Work
The related work to SBGSketch can be divided into two categories: (1) sketches for general streams, and (2) sketches for graph streams. In the first category, various methods have been proposed, e.g., Adasketch [18], Count-Min [7], AMS [4], Bottom-k [6], and LossyCounting [15]. However, the research efforts of the first category are not optimized for graph streams (see [23]). SBGSketch, our proposed method, is designed to summarize labeled-graph streams effectively. It is important to note that the eviction ranking mechanism of SBGSketch is not related to set-membership sketches, e.g., Bloom filters [12]. Bloom filters do not perform evictions or have rankings.
In the second category, the research efforts focus on processing graph queries over data streams that form graph structures (e.g., [5, 10, 23, 22]). In [5], graph queries that count the number of triangles are addressed, and [10] supports shortestpath queries. However, both [10, 5] and similar theoretical work (e.g., [8]) focus on providing theoretical bounds that may not scale for large graphs. gSketch [23] extends the idea of the CountMin sketch [7] to compute edgefrequency queries. To construct a sketch, gSketch requires either a sample of the graph stream or both a graphstream sample and a queryworkload sample. gSketch considers only unlabeledgraph streams. In contrast, SBGSketch neither requires edge samples nor queryworkload samples to summarize labeledgraph streams. In addition, SBGSketch supports graphtraversal queries that are not considered by gSketch for its supported graph model. Notice that this category does not consider the graph summarization techniques that are not designed for streaming scenarios (e.g., [16, 3, 19, 17, 11, 9]). The reason is that these techniques do not support the continuous arrival of edges in streaming applications as discussed in [22].
The most related work to SBGSketch is TCM [22]. The main motivation of TCM is to support graph-traversal queries. TCM builds independent matrices, where each matrix has two dimensions. Each matrix uses an independent hash function to summarize the graph stream (i.e., the graph summary is created times with different hash functions). A cell in a TCM sketch is addressed by the endpoints of a given edge to update the sketch on edge arrivals to summarize the graph topology along with an edge attribute. However, TCM is not optimized to handle labeled-graph streams. [22] describes without evaluation how TCM can handle graphs with different types of edges (i.e., labeled-graph streams). In particular, [22] suggests creating a matrix for each edge type. However, this approach does not handle the common edge-skewness w.r.t. the edge labels. Moreover, the edge-skewness may not be known beforehand, and may change with time, making the allocation of different memory sizes for each label impractical. In contrast, SBGSketch handles labeled-graph streams efficiently by reducing the error rate of TCM by up to %. Moreover, SBGSketch does not require any prior knowledge about the edge distribution.
8 Conclusion
SBGSketch is a graphical sketching method that summarizes labeled-graph streams, where the graph topology is considered in the summary. It assumes a stream where each edge has one label. SBGSketch addresses the consequences of having an unbalanced edge distribution w.r.t. the edge labels. This is achieved by presenting and evaluating a ranking technique. Given a fixed sketch size, the proposed ranking technique allows SBGSketch to automatically adapt to the unbalanced labels of the streamed edges by allowing an edge to use more than one matrix based on its ranks. Moreover, it guarantees that all the edges gain in summarization accuracy even if their labels are relatively rare. We demonstrate how SBGSketch can be used to approximate several graph-query types that depend on an aggregation of an edge attribute and/or the topology of the graph. The experimental study over three real labeled graphs spanning different domains shows that SBGSketch reduces the estimation error of the state-of-the-art by up to %.
References
 [1] http://stringdb.org/.
 [2] http://www.caida.org/data/passive/.
 [3] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In Proceedings of the Data Compression Conference, DCC ’01, pages 203–, 2001.
 [4] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. STOC ’96, pages 20–29, 1996.
 [5] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. SODA ’02, pages 623–632, 2002.
 [6] E. Cohen and H. Kaplan. Tighter estimation using bottom k sketches. Proc. VLDB Endow., 1(1):213–224, Aug. 2008.
 [7] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1).
 [8] G. Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. PODS ’05, pages 271–282, 2005.
 [9] W. Fan, J. Li, X. Wang, and Y. Wu. SIGMOD ’12, pages 157–168, 2012.
 [10] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2):207–216, Dec. 2005.
 [11] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
 [12] A. Kirsch and M. Mitzenmacher. Less hashing, same performance: Building a better bloom filter. Random Structures & Algorithms, 33(2):187–218, 2008.
 [13] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67, Oct. 2010.
 [14] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. EMNLP, pages 529–539, 2011.
 [15] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. VLDB ’02, pages 346–357, 2002.
 [16] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. SIGMOD ’08, pages 419–432, 2008.
 [17] S. Raghavan and H. Garcia-Molina. Representing web graphs. ICDE ’03, pages 405–416, 2003.
 [18] A. Shrivastava, A. C. Konig, and M. Bilenko. Time adaptive sketches (adasketches) for summarizing data streams. SIGMOD ’16, pages 1417–1432, 2016.
 [19] T. Suel and J. Yuan. Compressing the graph structure of the web. DCC ’01, pages 213–, 2001.
 [20] Y. Sun, J. Han, C. C. Aggarwal, and N. V. Chawla. When will it happen?: Relationship prediction in heterogeneous information networks. In WSDM, pages 663–672, 2012.
 [21] L. Tang, X. Wang, and H. Liu. Community detection via heterogeneous interaction analysis. Data Min. Knowl. Discov., 25(1):1–33, July 2012.
 [22] N. Tang, Q. Chen, and P. Mitra. Graph stream summarization: From big bang to big crunch. SIGMOD ’16, pages 1481–1496, 2016.
 [23] P. Zhao, C. C. Aggarwal, and M. Wang. gsketch: On query estimation in graph streams. Proc. VLDB Endow., 5(3), Nov. 2011.
Appendix A Proofs
A.1 Proof of Theorem 5.1.1
Theorem
Let be the number of priorities and be the number of arrivals of edge during an observation window, where is the set of vertexes in the graph stream. Let be the number of independent hash functions used in SBGSketch. Let be the percentage reduction in the number of sketch counter cells due to the inclusion of SBGSketch priority counters. Furthermore, let be the upper bound on given by SBGSketch and let be absolute error distribution given by TCM with the same number of sketch counter cells. Assume that edges arrive according to a nonhomogeneous Poisson process (a Poisson process that varies over time) with average rate over the observation window. The observation window is defined to be of length one by an appropriate change of units. Then, the distribution of the absolute error is
where
and , , is an arbitrary constant, and 1 is the Kronecker delta function.
Proof
In what follows we say an edge has “higher priority” than an edge at sketch matrix if the priority number of is smaller than that of in . We start with the case , one hash function. The number of arrivals of an edge in the observed time window is a Poisson distributed random variable . Let be the upper bound on returned by SBGSketch. In what follows we condition on edge having at least one arrival in the observation time window. Note that the difference is due to the collision between and other edges. Without loss of generality we define to be the matrix in which edge has priority 0, i.e., the matrix of the label of edge . Let be the upper bound on returned by TCM assuming a increase in counter load:
Note that the probability is the probability that the arrivals in are at most , , or there are more than arrivals and these extra edge arrivals are distributed into the matrices of other labels . Without loss of generality let be the sketch where edge has priority . The probability that the counter values will have values less than in , from arrivals at is
(1) 
The probability that some will have less than collision is then . The probability the counter containing survives an eviction from higher priority edges is
(2) 
where is the rate of all edge arrivals except edges with the same label as edge . A same priority edge can also collide with at . While this does not mean there will be more than collisions with , we just assume we do not want any further collisions to get a lower bound, multiplying the above by . Collecting all the terms we get the equation for hash functions.
To consider hash functions, we observe that having hash functions also increases the arrival rate per counter, multiplying it by . On the other hand, because we assume the hash functions are independent and because is the minimum value over all the sketches of the independent hash functions, the probability that for all the hash functions we have is , which concludes our proof.
A.2 Proof of Theorem 5.1.1
Proof
The proof is by contradiction. Assume that Edge was inserted into SBGSketch. Then, all the candidate cells of Edge have ranks that are either equal to or higher than the corresponding ranks of Edge (see Lines of Algorithm 1). So, if the edge-query estimator hits a cell with a rank that is lower than the corresponding rank of Edge , this contradicts the assumption that Edge was received before.