Modelling Correlated Bernoulli Data Part I: Theory and Run Lengths
Binary data are common in many applications and are typically simulated as independent draws from a Bernoulli distribution with a single probability of success. However, independence does not always hold in practice: the probability of a success can depend on the outcomes of past events. Presented here is a novel approach for simulating binary data where, for a chain of events, successes (1) and failures (0) cluster together according to a distance correlation. The structure is derived from de Bruijn graphs: directed graphs in which, given a set of symbols, V, and a 'word' length, m, the nodes consist of all possible sequences of length m over V. De Bruijn graphs are a generalisation of Markov chains, where the 'word' length controls the number of states on which each individual state depends. Increasing the word length extends the correlation over a wider area. To quantify how clustered a sequence generated from a de Bruijn process is, the run lengths of letters are examined, along with properties of those run lengths.
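As a rough illustration of the ideas in the abstract, the following Python sketch builds the nodes and edges of a de Bruijn graph, simulates a binary sequence in which the next letter depends on the current word of length m, and extracts run lengths. It is a minimal sketch under stated assumptions, not the paper's method: the function names and the transition probabilities in `p_one` are hypothetical and chosen only to show clustering.

```python
import itertools
import random


def de_bruijn_nodes(symbols, m):
    """Return all words of length m over `symbols` (the graph's nodes)."""
    return ["".join(w) for w in itertools.product(symbols, repeat=m)]


def de_bruijn_edges(symbols, m):
    """Map each word to its successors: drop the first letter, append a new one."""
    return {w: [w[1:] + s for s in symbols] for w in de_bruijn_nodes(symbols, m)}


def simulate(p_one, m, n, seed=None):
    """Simulate n binary letters.

    p_one[word] is an ASSUMED probability that the next letter is '1'
    given that the last m letters form `word`.
    """
    rng = random.Random(seed)
    # Arbitrary starting word, drawn uniformly (an assumption of this sketch).
    word = "".join(rng.choice("01") for _ in range(m))
    out = list(word)
    while len(out) < n:
        letter = "1" if rng.random() < p_one[word] else "0"
        out.append(letter)
        word = word[1:] + letter  # slide the window along the sequence
    return "".join(out[:n])


def run_lengths(seq):
    """Return (letter, run length) pairs for consecutive identical letters."""
    return [(k, len(list(g))) for k, g in itertools.groupby(seq)]


# Hypothetical probabilities for m = 2 that favour repeating the recent letters.
p_one = {"00": 0.1, "01": 0.5, "10": 0.5, "11": 0.9}
sequence = simulate(p_one, m=2, n=50, seed=1)
runs = run_lengths(sequence)
```

With probabilities such as these, words made of repeated letters tend to continue, so the simulated sequence should show longer runs of 0s and 1s than independent Bernoulli draws with the same overall success rate.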