Sequential Stratified Regeneration: MCMC for Large State Spaces with an Application to Subgraph Counting Estimation
This work considers the general task of estimating the sum of a bounded function over the edges of a graph that is unknown a priori, where graph vertices and edges are built on-the-fly by an algorithm and the resulting graph is too large to be kept in memory or on disk. Prior work proposes Markov Chain Monte Carlo (MCMC) methods that simultaneously sample and generate the graph, eliminating the need for storage. Unfortunately, these existing methods do not scale to massive real-world graphs. In this paper, we introduce Ripple, an MCMC-based estimator that achieves unprecedented scalability in this task by stratifying the Markov chain state space with a new technique that we call ordered sequential stratified Markov regenerations. We show that the Ripple estimator is consistent, highly parallelizable, and scales well. In particular, applying Ripple to the task of estimating connected induced subgraph counts on large graphs, we empirically demonstrate that Ripple is accurate and can estimate counts of subgraphs with up to 12 nodes, a scale that has been considered unreachable not only by prior MCMC-based methods but also by other sampling approaches. For instance, in this target application, we present results where the Markov chain state space has as many as 10^43 states, for which Ripple computes estimates in less than 4 hours on average.
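To make the estimation task concrete, the following is a minimal sketch of a classical regeneration-based baseline: a simple random walk whose tours start and end at a fixed vertex. It is not Ripple and does not include the stratification described above; it only illustrates how the sum of an edge function can be estimated while the graph is generated on the fly through a neighbor function. The names `hypercube_neighbors` and `tour_estimate` are hypothetical, and the sketch assumes a finite, connected, undirected graph and a symmetric edge function `f`.

```python
import random


def hypercube_neighbors(v, d=4):
    """Generate the neighbors of vertex v on the fly (d-dimensional hypercube).

    The graph is never stored: neighbors are computed by flipping one bit of v.
    """
    return [v ^ (1 << b) for b in range(d)]


def tour_estimate(neighbors, f, start, num_tours, seed=0):
    """Estimate the sum of f over undirected edges using random-walk tours.

    A tour is a walk that leaves `start` and stops at its first return.
    In expectation, each directed edge is crossed 1/deg(start) times per tour,
    so the mean tour sum equals 2 * F / deg(start) for symmetric f, where F is
    the target sum over undirected edges.
    """
    rng = random.Random(seed)
    deg_start = len(neighbors(start))
    total = 0.0
    for _ in range(num_tours):
        u = start
        while True:
            v = rng.choice(neighbors(u))  # one random-walk step
            total += f(u, v)              # accumulate f on the traversed edge
            u = v
            if u == start:                # regeneration: the tour ends
                break
    # Rescale the mean tour sum to recover the sum over undirected edges.
    return deg_start / 2.0 * total / num_tours


if __name__ == "__main__":
    # With f = 1 the target is the edge count; a 4-dimensional hypercube
    # has exactly 4 * 2**3 = 32 edges.
    print(tour_estimate(hypercube_neighbors, lambda u, v: 1.0, start=0, num_tours=20000))
```

Because tours are independent, they can be run in parallel and averaged. On very large state spaces, however, a tour may take impractically long to return to its start vertex, which is the kind of scalability limit that motivates stratifying the state space.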