
Computing Graph Descriptors on Edge Streams

by Zohair Raza Hassan, et al.

Graph feature extraction is a fundamental task in graph analytics. Using feature vectors (graph descriptors) in tandem with data mining algorithms that operate on Euclidean data, one can solve problems such as classification, clustering, and anomaly detection on graph-structured data. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy on benchmark datasets. However, these algorithms do not scale to large graphs because (1) they require storing the entire graph in memory, and (2) the end-user has no control over the algorithm's runtime. In this paper, we present single-pass streaming algorithms to approximate structural features of graphs (counts of subgraphs of order k ≤ 4). Operating on edge streams allows us to avoid keeping the entire graph in memory, and controlling the sample size enables us to control the time taken by the algorithm. We demonstrate the efficacy of our descriptors by analyzing the approximation error, classification accuracy, and scalability to massive graphs. Our experiments showcase the effect of the sample size on approximation error and predictive accuracy. The proposed descriptors can be computed on graphs with millions of edges within minutes and outperform state-of-the-art descriptors in classification accuracy.
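
To make the idea of a single-pass, fixed-budget estimator concrete, the following is a minimal sketch and not the paper's algorithm: it uses standard reservoir sampling over the edge stream, with the scaling used by TRIEST-style estimators, to approximate the number of triangles, the simplest nontrivial subgraph count. The function name `estimate_triangles`, the `budget` parameter, and the assumption that the stream contains no duplicate edges or self-loops are all illustrative choices made here.

```python
import random
from collections import defaultdict


def estimate_triangles(edge_stream, budget, seed=0):
    """Estimate the triangle count in one pass, storing at most `budget` edges.

    Assumes (for this sketch) an undirected stream with no duplicate
    edges and no self-loops.
    """
    if budget < 3:
        raise ValueError("budget must be at least 3 to hold a triangle")

    rng = random.Random(seed)
    sample = []                # reservoir of sampled edges
    adj = defaultdict(set)     # adjacency sets of the sampled subgraph
    in_sample = 0              # triangles currently present in the sample
    t = 0                      # number of edges seen so far

    for u, v in edge_stream:
        t += 1
        if t > budget:
            # Keep the new edge with probability budget / t.
            if rng.random() >= budget / t:
                continue
            # Evict a uniformly random sampled edge and remove the
            # triangles it participated in.
            idx = rng.randrange(len(sample))
            a, b = sample[idx]
            adj[a].discard(b)
            adj[b].discard(a)
            in_sample -= len(adj[a] & adj[b])
            sample[idx] = sample[-1]
            sample.pop()
        # Insert the new edge; each common neighbour closes one triangle.
        in_sample += len(adj[u] & adj[v])
        adj[u].add(v)
        adj[v].add(u)
        sample.append((u, v))

    # Scale by the inverse probability that all three edges of a
    # triangle are in the sample; no scaling if nothing was discarded.
    m = budget
    scale = max(1.0, (t * (t - 1) * (t - 2)) / (m * (m - 1) * (m - 2)))
    return in_sample * scale
```

When `budget` is at least the number of edges in the stream, nothing is discarded and the count is exact; a smaller budget lowers memory use and running time at the cost of a higher-variance estimate, which mirrors the sample-size trade-off the abstract describes.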




Estimating Descriptors for Large Graphs

Embedding networks into a fixed dimensional Euclidean feature space, whi...

Just SLaQ When You Approximate: Accurate Spectral Distances for Web-Scale Graphs

Graph comparison is a fundamental operation in data mining and informati...

Anonymous Walk Embeddings

The task of representing entire graphs has seen a surge of prominent res...

When VLAD met Hilbert

Vectors of Locally Aggregated Descriptors (VLAD) have emerged as powerfu...

GraphZeppelin: Storage-Friendly Sketching for Connected Components on Dynamic Graph Streams

Finding the connected components of a graph is a fundamental problem wit...

CADDeLaG: Framework for distributed anomaly detection in large dense graph sequences

Random walk based distance measures for graphs such as commute-time dist...

Efficient SVDD Sampling with Approximation Guarantees for the Decision Boundary

Support Vector Data Description (SVDD) is a popular one-class classifier...