Toward a Theory of Markov Influence Systems and their Renormalization

02/04/2018 ∙ by Bernard Chazelle, et al. ∙ 0

Nonlinear Markov chains are probabilistic models commonly used in physics, biology, and the social sciences. In "Markov influence systems" (MIS), the transition probabilities of the chains change as a function of the current state distribution. This work introduces a renormalization framework for analyzing the dynamics of MIS. It comes in two independent parts: first, we generalize the standard classification of Markov chain states to the dynamic case by showing how to parse graph sequences. We then use this framework to carry out the bifurcation analysis of a few important MIS families. We show that, in general, these systems can be chaotic but that irreducible MIS are almost always asymptotically periodic. We also give an example of "hyper-torpid" mixing, where a stationary distribution is reached in super-exponential time, a timescale that cannot be achieved by any Markov chain.


1 Introduction

Nonlinear Markov chains are popular probabilistic models in the natural and social sciences. They are commonly used in interacting particle systems, epidemic models, replicator dynamics, mean-field games, etc. [9, 13, 14, 16, 19]. They differ from the linear kind by allowing transition probabilities to vary as a function of the current state distribution. (Footnote 1: The systems are Markovian in that the future depends only on the present: in this case the present refers to the current state distribution rather than the single state presently visited.) For example, a traffic network might update its topology and edge transition rates adaptively to alleviate congestion. The traditional formulation of these models comes from physics and relies on the classic tools of the trade: stochastic differential calculus, McKean interpretations, Feynman-Kac models, Fokker-Planck PDEs, etc. [4, 6, 14, 19]. These techniques assume all sorts of symmetries that are typically absent from the “mesoscopic” scales of natural algorithms. They also tend to operate at the thermodynamic limit, which rules out genuine agent-based modeling. Our goal is to initiate a theory of discrete-time Markov chains whose topologies vary as a function of the current probability distribution. Of course, the entire theory of finite Markov chains should be recoverable as a special case. Our contribution comes in two parts (of independent interest), which we discuss informally in this introduction.

Renormalization.

The term refers to a wide-reaching approach to complex systems that originated in quantum field theory and later expanded into statistical mechanics and dynamics. Whether in its exact or coarse-grained form, the basic idea is intuitively appealing: break down a complex system into a hierarchy of simpler parts. The concept seems so simple—isn’t it what divide-and-conquer is all about?—one can easily be deceived and miss the point. When we slap a dynamics on top of the system (think of interacting particles moving about) then the hierarchy itself creates its own dynamics between the layers. This new “renormalized” dynamics can be entirely different from the original one. Crucially, it can be both easier to analyze and more readily expressive of global properties. For example, second-order phase transitions in the Ising model correspond to fixed points of the renormalized dynamics.

What is the relation to Markov chains? You may have noticed how texts on the subject often dispatch absorbing chains quickly before announcing that from then on all chains will be assumed to be irreducible (and then, usually a few pages later, ergodic). This is renormalization at work! Indeed, although rarely so stated, the standard classification of the states of a Markov chain is a prime example of exact renormalization. Recall that the main idea behind the classification is to express the chain as an acyclic directed graph, its condensation, whose vertices correspond to the strongly connected components. This creates a two-level hierarchy (fig.1): a tree with a root (the condensation) and its children (the strongly connected components). Now, start the chain and watch what happens at the root: the probability mass flows entirely into the sinks of the condensation. Check the leaves of the tree for a detailed understanding of the motion. The renormalized dynamics (visible only in the condensation) has an attracting manifold that tells much of the story. If the story lacks excitement it is partly because the hierarchy is flattish: only two levels. Time-varying Markov chains, on the other hand, can have deep hierarchies.

Figure 1: The condensation of a graph and its renormalization.
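The two-level hierarchy described above can be made concrete with a short sketch. It assumes a chain given as a list of directed edges (one per positive transition probability) and computes the strongly connected components and the sinks of the condensation, which is where the probability mass ends up. The function names and the sample chain are illustrative and do not come from the paper.

```python
from collections import defaultdict

def strongly_connected_components(n, edges):
    """Kosaraju's algorithm; returns comp[v] = component id of vertex v."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    order, seen = [], [False] * n
    for s in range(n):                      # first pass: finishing order
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(fwd[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, iter(fwd[v])))
                    break
            else:                           # all neighbors explored
                order.append(u)
                stack.pop()
    comp, c = [-1] * n, 0
    for s in reversed(order):               # second pass on reversed graph
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rev[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1
    return comp

def condensation_sinks(n, edges):
    """Vertex sets of the components with no edge leaving them."""
    comp = strongly_connected_components(n, edges)
    has_exit = {comp[u] for u, v in edges if comp[u] != comp[v]}
    sinks = defaultdict(list)
    for v in range(n):
        if comp[v] not in has_exit:
            sinks[comp[v]].append(v)
    return sorted(sinks.values())

# Hypothetical 5-state chain: {0,1} is transient, {2,3} and {4} are absorbing classes.
example_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2), (0, 4), (4, 4)]
example_sinks = condensation_sinks(5, example_edges)
```

In this example the mass starting in the transient class {0, 1} eventually drains into the two sink classes, which is the renormalized dynamics visible at the root of the hierarchy.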

Consider an infinite sequence $(g_k)_{k>0}$ of digraphs over a fixed set of vertices. A temporal random walk is defined in the obvious way by picking a starting vertex, moving to a random neighbor in $g_1$, then a random neighbor in $g_2$, and so on, forever [7, 8, 15, 20]. The walk is called temporal because it traverses one edge from $g_k$ at time $k$. How might one go about classifying the states of this “dynamic” Markov chain? Repeating the condensation decomposition at each step makes little sense, as it carries no information about the temporal walks. The key insight is to monitor when and where temporal walks are extended. The cumulant graph collects all extensions and, when this process stalls, reboots the process while triggering a deepening of the hierarchy. To streamline this process, we define a grammar with which we can parse the sequence $(g_k)_{k>0}$. The (exact) renormalization framework introduced in this work operates along two tracks: time and network. The first track summarizes the past to anticipate the future while the second one clusters the graphs hierarchically in dynamic fashion. The method, explained in detail in the next section, is very general and likely to be useful elsewhere.

Markov influence systems.

All finite Markov chains oscillate periodically or mix to a stationary distribution. The key fact about their dynamics is that the timescales never exceed a single exponential in the number of states. Allowing the transition probabilities to fluctuate over time at random does not change that basic fact [2, 10, 11]. Markov influence systems are quite different in that regard. Postponing formal definitions, let us think of an MIS for now as a dynamical system defined by iterating the map $f: x^\top \mapsto x^\top S(x)$, where $x$ is a probability distribution represented as a column vector in $\mathbb{R}^n$ and $S(x)$ is a stochastic matrix that is piecewise-constant as a function of $x$. We assume that the discontinuities are linear (ie, flats). The assumption is not as restrictive as it appears, as we explain with a simple example.

Consider a random variable over the distribution and fix two -by- stochastic matrices and . Define (resp. ) if (resp. else); in other words, the Markov chain picks one of two stochastic matrices at each step depending on the variance of with respect to the current state distribution . This clearly violates our assumption because the discontinuity is quadratic in ; hence nonlinear. This is not an issue because we can linearize the variance: here, we begin with the identity and the fact that is a probability distribution. We form the Kronecker square and lift the system into the -dimensional unit simplex to get a brand-new MIS defined by the map . We now have linear discontinuities. This same type of tensor lift can be used to linearize any algebraic constraints. (Footnote 2: This requires making the polynomials over homogeneous, which we can do by using the identity .) Using ideas from [5], one can go much further than that and base the step-by-step Markov chain selection on the outcome of any first-order logical formula we may fancy (with the ’s as free variables). (Footnote 3: The key fact behind this result is that the first-order theory of the reals is decidable by quantifier elimination. This allows us to pick the next stochastic matrix at each time step on the basis of the truth value of a Boolean logic formula with arbitrarily many quantifiers. See [5] for details.) What all of this shows is that the assumption of linear discontinuities is not restrictive.
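The tensor lift can be checked numerically. The sketch below assumes a random variable with values `xi[i]` on state `i` and a distribution `x`. It uses the elementary fact that, because the entries of `x` sum to 1, the variance equals the sum over all pairs (i, j) of (xi_i^2 - xi_i xi_j) x_i x_j, which is linear in the Kronecker square of `x`. The names and sample values are illustrative, and the particular form of the identity is an assumption rather than a quotation from the paper.

```python
from fractions import Fraction as F

def variance(xi, x):
    """Variance of xi under the distribution x (quadratic in x)."""
    mean = sum(a * p for a, p in zip(xi, x))
    return sum(a * a * p for a, p in zip(xi, x)) - mean * mean

def kronecker_square(x):
    """The n^2-dimensional vector y with entries x_i * x_j."""
    return [p * q for p in x for q in x]

def lifted_coefficients(xi):
    """Coefficient of y_(i,j) in the linearized variance: xi_i^2 - xi_i*xi_j."""
    return [a * a - a * b for a in xi for b in xi]

def lifted_variance(xi, x):
    """Same variance, now written as a linear form in y = x (x) x."""
    y = kronecker_square(x)
    return sum(c * v for c, v in zip(lifted_coefficients(xi), y))

# Hypothetical values on a 3-state chain.
xi = [F(0), F(1), F(3)]
x = [F(1, 2), F(1, 3), F(1, 6)]
y = kronecker_square(x)
```

A threshold test on the variance thus becomes a test on which side of a hyperplane the lifted point `y` lies, and `y` is itself a probability distribution over pairs of states.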

We prove in this article that irreducible MIS are almost always asymptotically periodic. (Footnote 4: Irreducible means that forms an irreducible chain for each .) We extend this result to larger families of Markov influence systems. We also give an example of “hyper-torpid” mixing: an MIS that converges to a stationary distribution in time equal to a tower-of-twos in the size of the chain. This bound also applies to the period of certain periodic MIS. The emergence of timescales far beyond the reach of standard Markov chains is a distinctive feature of Markov influence systems. We note that the long-time horizon analysis of general systems is still open.

Some intuition.

The need for some form of dimension reduction mechanism (ie, renormalization) is easy to grasp. The first hurdle is that, unlike a standard Markov chain, an MIS is noncontractive over an eigenspace whose dimension can vary over time. It is this spectral incoherence that renormalization attempts to “tame.” To see why this has a strong graph-theoretic flavor, observe that at each time step the support of the stationary distribution can be read off the topology of the current graph: for example, the number of sinks in the condensation is equal to the dimension of the principal eigenspace plus one. Renormalization can thus be seen as an attempt to restore coherence to an ever-changing spectral landscape via a dynamic hierarchy of graphs, subgraphs, and homomorphs.

The bifurcation analysis at the heart of our investigation entails the design of a notion of “general position” aimed at bounding the growth rate of the induced symbolic dynamics [3, 17, 22]. The root of the problem is a clash between order and randomness. (This is the same conflict that arises between entropy and energy in statistical mechanics.) All Markov chains are attracted to a limit cycle (ie, order). Changing the chain at each step introduces pseudorandomness into the process (ie, disorder). The question is then to assess under what conditions order prevails over disorder. The tension between the two “forces” is mediated by introducing a perturbation parameter and locating its critical values. We show that, in this case, the critical region forms a Cantor set of Hausdorff dimension strictly less than 1.

Previous work.

There is a growing body of literature on dynamic graphs [4, 15, 18, 20] and their random walks [2, 7, 8, 9, 13, 10, 11, 16]. By contrast, as mentioned earlier, most of the research on nonlinear Markov chains has been done within the framework of stochastic differential calculus. The closest analog to the MIS model is the family of diffusive influence systems we introduced in [5]. Random walks and diffusion are dual processes that coincide when the underlying operator is self-adjoint (which is not the case here). As a rule of thumb, diffusion is easier to analyze because even in a changing medium the constant function is always a principal eigenfunction. As a result, a diffusion model can converge to a fixed point while its dual Markov process does not. Indeed, as is well known [21], multiplying stochastic matrices from the right is less “stable” than from the left. (Footnote 5: For example, consider the product of two stochastic matrices and , with rank. Multiplying by from the left gives us , whereas can be any old stochastic matrix of rank 1.) Our renormalization scheme is new, but the idea of parsing graph sequences is one we introduced in [5] as a way of tracking the flow of information across changing graphs. The parsing method we discuss here is entirely different, however: being topological rather than informational, it is far more general and, we believe, likely to be useful in other applications of dynamic networks.

2 How to Parse a Graph Sequence

Throughout this work, a digraph refers to a directed graph with vertices in and a self-loop at each vertex. (Footnote 6: The graphs and digraphs (words we use interchangeably) have no multiple edges and .) We denote digraphs by lower-case letters (, etc) and use boldface symbols for sequences. A digraph sequence is an ordered, finite or infinite, list of digraphs over the vertex set . The digraph consists of all the edges such that there exists at least one edge in and another one in . The operation is associative. (Footnote 7: The sign is meant to highlight the connection with the multiplication of incidence matrices: indeed, is the digraph specified by the nonzero entries of the product of the two corresponding incidence matrices.) We define the cumulant and write for finite . The cumulant indicates all the pairs of vertices that can be joined by a temporal walk of a given length. We need additional terminology:

  • Transitive front of : An edge of a digraph is leading if there is such that is an edge of but is not. The non-leading edges form a subgraph , called the transitive front of . For example, is the graph over with the single edge (and the three self-loops.) If is transitive, then . The transitive front of a directed cycle has no edges besides the self-loops. We omit the (easy) proof that the transitive front is indeed transitive, ie, if and are edges of then so is . Given two graphs over the same vertex set, we write if all the edges of are in (with strict inclusion denoted by the symbol ). Because of the self-loops, . We easily check that the transitive front of is the (unique) densest graph such that .

  • Subgraphs and contractions: Given two digraphs with vertex sets , we denote by the subgraph of induced by . Pick and contract all these vertices into a single one (while pruning multiple edges). By abuse of notation, we still designate by the graph derived from by first taking the subgraph induced by and then contracting the vertices of ; note that the notation would be more accurate but it will not be needed. Given a sequence , we use the shorthand for . Finally, denotes the set of all complete digraphs (of any size) with self-loops, while consists of the complete digraphs with an extra vertex pointing to all the others unidirectionally. (Footnote 8: For example, ignoring self-loops, and .)

  • Stem decomposition of : The strongly connected components of a graph form, by contraction, an acyclic digraph called its condensation. Let be the vertex sets from corresponding to the sinks of the condensation. (Footnote 9: These are the vertices with no outgoing edges: there is at least one of them; hence .) The remaining vertices of induce a subgraph called the stem of . For each , the petal is the subgraph induced by if no vertex outside links to it; else is the subgraph induced by and , with all the vertices of subsequently contracted into a single vertex and the multiple edges removed (fig.2).
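The digraph product and the transitive front lend themselves to a small executable sketch. Digraphs are represented as sets of edges (i, j) over vertices 0, ..., n-1, always including the self-loops. The definition of a leading edge used here is an assumption: (i, j) is taken to be leading if some u has (u, i) in the graph but not (u, j). Under that reading, the product of a graph with its transitive front gives back the graph, consistent with the "densest graph" characterization stated above. All names are illustrative.

```python
def with_loops(n, edges):
    """Digraph on vertices 0..n-1 as a frozenset of edges, self-loops included."""
    return frozenset(edges) | {(i, i) for i in range(n)}

def product(g, h):
    """Edges (i, j) such that (i, k) is in g and (k, j) is in h for some k.
    This is the support of the product of the two incidence matrices."""
    return frozenset((i, j) for (i, k) in g for (k2, j) in h if k == k2)

def transitive_front(g):
    """Subgraph of the non-leading edges of g.
    ASSUMPTION: (i, j) is leading if some u has (u, i) in g but (u, j) not in g."""
    def leading(i, j):
        return any((u, j) not in g for (u, v) in g if v == i)
    return frozenset((i, j) for (i, j) in g if not leading(i, j))

def is_transitive(g):
    """True if (i, j) and (j, k) in g always imply (i, k) in g."""
    return product(g, g) <= g

# A directed 3-cycle: every non-loop edge is leading.
cycle = with_loops(3, [(0, 1), (1, 2), (2, 0)])
# A transitive graph: it is its own transitive front.
chain = with_loops(3, [(0, 1), (0, 2), (1, 2)])
# A path 0 -> 1 -> 2: only the edge (0, 1) survives, besides the self-loops.
path = with_loops(3, [(0, 1), (1, 2)])
```

The three sample graphs reproduce the behavior stated in the text: the transitive front of a directed cycle has only self-loops, and a transitive graph is left unchanged.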

Figure 2: The decomposition of into its stem and its petals .

The parser.

The parse tree of a (finite or infinite) graph sequence is a rooted tree whose leaves are associated with from left to right; each internal node assigns a syntactical label to the subsequence formed by the leaves of its subtree. The purpose of the parse tree is to monitor the formation of new temporal walks as time progresses. How to do that begins with the observation that, because of the self-loops, the cumulant is monotonically nondecreasing. (Footnote 10: All references to graph ordering are relative to .) If the increase were strict at each step then the parse tree would be trivial: each graph of would appear as a separate leaf with the root as its parent. Of course, the increase cannot go on forever. How to deal with time intervals within which the cumulant is “stuck” is the whole point of parsing. The answer is to define a grammar and proceed recursively. A simple approach would be to rely on a production of the form , where is the smallest index at which achieves its maximal value. While this would “renormalize” the sequence along the temporal axis, it would do nothing to cluster the graphs themselves. Instead, we use a grammar consisting of two pairs of productions, (1a, 1b) and (2a, 2b).

  1. Time renormalization     Let be the smallest index at which achieves its maximal value; write , , and . The two productions below cluster the time axis into the relevant intervals.

    1. Transitivization. The first production supplies the root of the parse tree with at most three children:

      (1a)

      where or (or both) may be the empty sequence . If , then . The right sibling of is the terminal symbol (a leaf of the parse tree) followed by . The annotation indicates that and that the transitive graph is ready to “guide” the parsing of . (Footnote 11: By definition of , no temporal walk from can extend one from . This shows that ; hence .) Observe that can be parsed before is known; the parsing is of the LR type, meaning that it can be carried out bottom-up in a single left-to-right scan.

    2. Cumulant completion. We show how to parse in the special case where is in or . Recall that the notation implies that . Partition the sequence into minimal subsequences such that :

      (1b)

      The list on the right-hand side could be finite or infinite; if finite, it could be missing the final . (Footnote 12: In fact, the production could be as simple as , which would happen if .) This production is the one doing the heavy lifting in that it establishes a bridge between renormalization and contractivity; see Section 4.

  2. Network renormalization     Two productions parse the rightmost term in (1a) by recursively breaking down the graph into clusters. This is done either by carving out subgraphs or taking homomorphs. In both cases, it is assumed that and that is transitive but not in or , these two cases being handled by (1b).

    1. Decoupling. If the number of connected components of exceeds one, then (Footnote 13: This refers to the subgraphs of induced by each one of the vertex subsets of the connected components of the undirected version of .)

      (2a)

      In terms of the parse tree, the node has children that model processes operating in parallel. Intuitively, the production breaks the system into decoupled subsystems. (Footnote 14: This may or may not imply the independence of the subsystems’ dynamics.)

    2. One-way coupling. If the undirected version of has a single connected component, we use its stem decomposition to cluster the digraphs of :

      (2b)

      Since is neither in nor in , its stem and petals both exist (with ). The assumed transitivity of implies that each . We iterate the production if is neither in nor in . System-wise, the symbol indicates the direction of the information flow. None flows into , so its dynamics is decoupled from the rest. Such decoupling does not hold for the petals, so it is one-way. This allows us to renormalize the stem into a single vertex for the purposes of the petals: the common in all the instances of . In terms of the parse tree, the node has children that operate in parallel, with the last of them collecting information from the first one.

Network renormalization exploits the fact that the information flowing across the system might get stuck in portions of the graph for some period of time: we cluster the graph when and where this happens. Sometimes only time renormalization is possible. Consider the infinite sequence , where and, for , consists of the -vertex digraph with edges from vertex to all the others (plus self-loops): the cumulant never ceases to grow until it reaches , at which point the process repeats itself; the parsing involves infinitely many applications of (1a) and (1b), so there is no network renormalization. Quite the opposite, the case of an infinite single-graph sequence features abundant network renormalization (fig.3).
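The time-renormalization step relies on tracking the cumulant and finding the first index at which it stops growing. The sketch below does this for a finite sequence, with digraphs represented as sets of edges including self-loops. The helper names and the sample sequence are illustrative; for an infinite sequence the saturation index would have to be determined by other means.

```python
def with_loops(n, edges):
    """Digraph on vertices 0..n-1 as a frozenset of edges, self-loops included."""
    return frozenset(edges) | {(i, i) for i in range(n)}

def product(g, h):
    """Edges (i, j) such that (i, k) is in g and (k, j) is in h for some k."""
    return frozenset((i, j) for (i, k) in g for (k2, j) in h if k == k2)

def cumulant_prefixes(n, seq):
    """List of cumulants g_1, g_1 x g_2, g_1 x g_2 x g_3, ... for a finite seq."""
    cum = with_loops(n, [])          # self-loops only: the identity for the product
    prefixes = []
    for g in seq:
        cum = product(cum, g)
        prefixes.append(cum)
    return prefixes

def saturation_index(n, seq):
    """Smallest 1-based index at which the cumulant reaches its final value."""
    prefixes = cumulant_prefixes(n, seq)
    final = prefixes[-1]
    return next(k for k, c in enumerate(prefixes, start=1) if c == final)

# Hypothetical sequence on 3 vertices: the third graph adds no new temporal walk.
seq = [with_loops(3, [(0, 1)]),
       with_loops(3, [(1, 2)]),
       with_loops(3, [(0, 1)])]
prefixes = cumulant_prefixes(3, seq)
```

Because every graph carries self-loops, each prefix contains the previous one, so the cumulant can only grow until it saturates.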

Figure 3: The parse tree of an infinite sequence consisting of the same graph .

The depth of the parse tree.

It is easily verified that cumulants lose at least one edge from parent to child, which puts an obvious upper bound on the maximal height of the parse tree. This quadratic bound is tight. Indeed, consider the sequence , where for , and (besides self-loops) consists of the single edge for . The -th copy of adds to the cumulant the new edge , which creates, in total, a quadratic number of increments. The bounded depth implies that the parse tree for an infinite sequence includes exactly one node with an infinite number of children. That node is expressed by a production of type (1b).

Undirected graphs.

Note that the cumulant of a sequence of undirected graphs might itself be directed. (Footnote 15: The product has a directed edge from to but not from to .) An edge is undirected if both of its directed versions are present in the graph. Recall that the transitive front of the cumulant collects all the edges which might be encountered in the future that do not extend any edge of the cumulant into a new temporal path. All such future edges are undirected; obviously, they must already be present in the cumulant. We can retool our earlier argument to show that these edges form a transitive subgraph: being undirected, it consists of disjoint (undirected) cliques. This simplifies the parsing since the condensation is trivial and the parsing tree has no nodes of type (2b). The complexity of the parse tree can still be as high as quadratic, however. To see why, consider the following recursive construction. Given a clique over vertices at time , attach to it, at time , the edge . The cumulant gains the undirected edges and the directed edges for . At time , visit each one of the undirected edges for , using single-edge undirected graphs. Each such step will see the addition of a new directed edge to the cumulant, until it becomes the undirected clique . The quadratic lower bound on the tree depth follows immediately.

Backward parsing.

The sequence of graphs leads to products where each new graph is multiplied to the right, as would happen in a time-varying Markov chain. Algebraically, the matrices are multiplied from left to right. In diffusive systems (eg, multiagent agreement systems, Hegselmann-Krause models, Deffuant systems, voter models), however, matrices are multiplied from right to left. Although the dynamics can be quite different, the same parsing algorithm can be used. Given a sequence , its backward parsing is formed by applying the parser to the sequence , where is derived from by reversing the direction of every edge, ie, becomes . Once the parse tree for has been built, we simply restore each edge to its proper direction to produce the backward parse tree of .
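Why edge reversal turns right-to-left products into left-to-right ones can be checked on a small case. Reversing every edge corresponds to transposing the incidence matrix, so the reversal of a product is the product of the reversals in the opposite order. The sketch below verifies this for two sample digraphs; all names and graphs are illustrative.

```python
def with_loops(n, edges):
    """Digraph on vertices 0..n-1 as a frozenset of edges, self-loops included."""
    return frozenset(edges) | {(i, i) for i in range(n)}

def product(g, h):
    """Edges (i, j) such that (i, k) is in g and (k, j) is in h for some k."""
    return frozenset((i, j) for (i, k) in g for (k2, j) in h if k == k2)

def reverse(g):
    """Reverse the direction of every edge (transpose of the incidence matrix)."""
    return frozenset((j, i) for (i, j) in g)

g1 = with_loops(3, [(0, 1)])
g2 = with_loops(3, [(1, 2)])

forward = product(g1, g2)                       # Markov-style order: g1 then g2
backward = product(g2, g1)                      # diffusion-style order: g2 then g1
via_reversal = reverse(product(reverse(g1), reverse(g2)))
```

Here `forward` contains the two-step edge (0, 2) while `backward` does not, and `via_reversal` coincides with `backward`: running the forward machinery on the reversed graphs and then restoring the edge directions reproduces the right-to-left product.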

3 The Markov Influence Model

Let (or when the dimension is understood) be the standard simplex and let denote the set of all -by- rational stochastic matrices. A Markov influence system (MIS) is a discrete-time dynamical system with phase space , which is defined by the map , where is a function that is constant over the pieces of a finite polyhedral partition of (fig.4). (Footnote 16: How is defined on the discontinuities of the partition is immaterial.) We define the digraph (and its corresponding Markov chain) formed by the positive entries of . To avoid inessential technicalities, we assume that the diagonal of each is strictly positive (ie, has self-loops). In this way, any orbit of an MIS corresponds to a lazy, time-varying random walk with transitions defined endogenously. (Footnote 17: As discussed in the introduction, to access the full power of first-order logic in the stepwise choice of digraphs requires nonlinear partitions, but these can be linearized by a suitable tensor construction.) We recall some basic terminology. The orbit of is the infinite sequence and its itinerary is the corresponding sequence of cells visited in the process. The orbit is periodic if for any modulo a fixed integer. It is asymptotically periodic if it gets arbitrarily close to a periodic orbit over time.

Figure 4: A Markov influence system: each region is associated with a single stochastic matrix. The first five steps of an orbit are shown to visit regions in this order.
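A toy instance may help fix ideas. The sketch below iterates a two-state system in which the simplex is cut by a single hyperplane (the first coordinate compared with 1/2) and each side is assigned its own rational stochastic matrix with a positive diagonal. The matrices `A` and `B`, the threshold, and the function names are hypothetical choices made for illustration, and the distribution is updated as a row vector times the selected matrix, which is an assumption about the convention.

```python
from fractions import Fraction as F

# Two rational stochastic matrices with positive diagonals (hypothetical).
A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 2), F(1, 2)]]

def S(x):
    """Piecewise-constant choice of matrix: one cell on each side of x[0] = 1/2."""
    return A if x[0] > F(1, 2) else B

def step(x):
    """One application of the map: row vector x times the matrix S(x)."""
    M = S(x)
    n = len(x)
    return [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]

def orbit(x, steps):
    """Return the visited points and the itinerary (label of each cell visited)."""
    points, itinerary = [x], []
    for _ in range(steps):
        itinerary.append("A" if S(x) is A else "B")
        x = step(x)
        points.append(x)
    return points, itinerary

points, itinerary = orbit([F(1), F(0)], 20)
```

Exact rational arithmetic avoids any ambiguity about which side of the discontinuity a point falls on, which matters because the itinerary is the object of study.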

For convenience, we assume a representation of the discontinuities induced by as hyperplanes in of the form , where (for concreteness). (Footnote 18: There is nothing special about the choice of ; in particular, we could pick an arbitrarily small interval around .) Note that the polyhedral partition is invariant up to scaling for all values of the bifurcation parameter, so the MIS remains well-defined as we vary . The parameter is necessary for the analysis: indeed, as we explain below in Section 5, chaos cannot be avoided without it. The coefficient of ergodicity of a matrix is defined as half the maximum -distance between any two of its rows [21]. It is submultiplicative for stochastic matrices, a direct consequence of the identity

Given , let denote the set of -long prefixes of any itinerary for any starting position and any . We define the ergodic renormalizer as the smallest integer such that, for any and any matrix sequence matching an element of , the product is primitive (ie, some high enough power is a positive matrix) and its coefficient of ergodicity is less than . We assume in this section that and discuss in Section 4 how to relax this assumption via renormalization. Let be the union of the hyperplanes from in (where is understood). We define and . Remarkably, for almost all , becomes strictly equal to in a finite number of steps. (Footnote 19: Recall that both and depend on .)
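The coefficient of ergodicity is easy to compute from its definition, and its submultiplicativity can be observed on examples. The sketch below assumes the distance between rows is the sum of absolute differences of their entries; the name `tau` and the sample matrices are illustrative.

```python
from fractions import Fraction as F

def tau(M):
    """Half the maximum distance (sum of absolute differences) between two rows."""
    rows = range(len(M))
    cols = range(len(M[0]))
    return max(sum(abs(M[i][k] - M[j][k]) for k in cols)
               for i in rows for j in rows) / 2

def matmul(P, Q):
    """Plain matrix product of two square matrices of equal size."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical stochastic matrices with positive diagonals.
A = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
B = [[F(3, 4), F(1, 4)], [F(1, 2), F(1, 2)]]
I2 = [[F(1), F(0)], [F(0), F(1)]]
AB = matmul(A, B)
```

A value of `tau` below 1 means the matrix strictly contracts the distance between any two distributions, and the identity matrix attains the extreme value 1. A long enough product of contracting matrices drives the coefficient below any fixed threshold, which is the role played by the ergodic renormalizer.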

Lemma 3.1

Given any , there exists an integer and a finite union of intervals of total length less than such that , for any .

Corollary 3.2

For almost everywhere in , every orbit is asymptotically periodic. (Footnote 20: Meaning outside a subset of of Lebesgue measure zero.)

Proof. The equality implies the eventual periodicity of the symbolic dynamics. The period cannot exceed the number of connected components in the complement of . Once an itinerary becomes periodic at time with period , the map can be expressed locally by matrix powers. Indeed, divide by and let be the quotient and the remainder; then, locally, , where is specified by a stochastic matrix with a positive diagonal, which implies convergence to a periodic point at an exponential rate. Finally, apply Lemma 3.1 repeatedly, with for , and denote by the corresponding union of “forbidden” intervals. Define and ; then Leb and hence Leb. The corollary follows from the fact that any outside of lies outside of for some .

The corollary states that the set of “nonperiodic” values of has measure zero in parameter space. Our result is actually stronger than that. We prove that the nonperiodic set can be covered by a Cantor set of Hausdorff dimension strictly less than 1. The remainder of this section is devoted to a proof of Lemma 3.1.

3.1 Shift spaces and growth rates

The growth exponent of a language is defined as , where is the number of words of length ; for example, the growth exponent of is 1. The language consisting of all the itineraries of a Markov influence system forms a shift space and its growth exponent is the topological entropy of its symbolic dynamics [22]. (Footnote 21: Which should not be confused with the topological entropy of the MIS itself.) It can be strictly positive, which is a sign of chaos. We show that, for a typical system, it is zero, the key fact driving periodicity. Let be -by- matrices from a fixed set of primitive stochastic rational matrices with positive diagonals, and assume that for ; hence . Because each product is a primitive matrix, it can be expressed as (by Perron-Frobenius), where is its (unique) stationary distribution. (Footnote 22: Positive diagonals play a key role here because primitiveness is not closed under multiplication: for example, and are both primitive but their product is not.) If is a stationary distribution for a stochastic matrix , then its -th row satisfies ; hence, by the triangle inequality, . This implies that

(1)
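The remark that primitiveness is not closed under multiplication can be checked on a small case. The sketch below tests primitivity by examining Boolean powers of the zero pattern up to Wielandt's bound, (n-1)^2 + 1, and exhibits two primitive stochastic matrices whose product has an absorbing state. The pair `A`, `B` is an illustrative choice and is not claimed to be the example used in the paper.

```python
from fractions import Fraction as F

def pattern(M):
    """Boolean zero pattern of a nonnegative matrix."""
    return [[entry > 0 for entry in row] for row in M]

def bool_mul(P, Q):
    """Boolean matrix product of two square patterns."""
    n = len(P)
    return [[any(P[i][k] and Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_primitive(M):
    """True if some power of M is entrywise positive.
    By Wielandt's bound it suffices to check powers 1 .. (n-1)^2 + 1."""
    n = len(M)
    P = pattern(M)
    Q = P
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in Q):
            return True
        Q = bool_mul(Q, P)
    return False

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def lazy(M):
    """Average M with the identity, which makes the diagonal positive."""
    n = len(M)
    return [[(M[i][j] + (1 if i == j else 0)) / 2 for j in range(n)]
            for i in range(n)]

# Hypothetical pair: each matrix has one zero diagonal entry.
A = [[F(1, 2), F(1, 2)], [F(1), F(0)]]
B = [[F(0), F(1)], [F(1, 2), F(1, 2)]]
AB = matmul(A, B)                    # second state becomes absorbing
lazy_AB = matmul(lazy(A), lazy(B))   # positive diagonals restore primitivity
```

With positive diagonals, the zero pattern of a product contains the patterns of both factors, so the product of primitive matrices stays primitive; this is the reason the assumption is imposed on the matrices of an MIS.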

Property .

Fix a vector , and denote by the -by- matrix with the column vectors , where is an increasing sequence of nonnegative integers in . We say that property holds if there exists a vector such that and does not depend on the variable . (Footnote 23: Because is a probability distribution, property does not imply ; for example, we have , for .) Intuitively, property is a quantifier elimination device for expressing “general position” for MIS. To see the connection, consider a simple statement such as “the three points , , and cannot be collinear for any value of .” This can be expressed by saying that a certain determinant polynomial in is constant. Likewise, the vector manufactures a quantity, , that “eliminates” the variable . Note that some condition on is obviously needed since we could pick . We explain below why is the right condition.

To see the relevance of the concept of general position, consider the iterates of a small ball through the map . To avoid chaos, it is intuitively obvious that these iterated images should not fall across discontinuities too often. Fix such a discontinuity: if we think of the ball as being so small it looks like a point, then the case we are trying to avoid consists of many points (the ball’s iterates) lying on or near a given hyperplane. But that is precisely what general position rules out.

Lemma 3.3

There exists a constant (linear in ) such that, given any integer and any increasing sequence in of length at least , property holds, where and is the number of bits needed to encode any entry of for any .

Proof. By choosing large enough, we can automatically ensure that is as big as we want. (Footnote 24: All the constants in this work may depend on the input parameters such as , , etc. Dependency on other parameters is indicated by a subscript.) The proof is a mixture of algebraic and combinatorial arguments. We begin with a Ramsey-like statement about stochastic matrices.

Fact 3.4

There is a constant such that, if the sequence contains with for each , then property holds.

Proof. By (1), for constant . Note that has rational entries over bits (with the constant factor depending on ). We write , where and are the -by- matrices formed by the column vectors and , respectively, for ; recall that . The key fact is that the dependency on is confined to the term : indeed,

(2)

This shows that, in order to satisfy property , it is enough to ensure that has a solution such that . Let . If is nonsingular then, because each one of its entries is a rational over bits, we have , for constant . Let be the -by- matrix derived from by adding the column to its right and then adding a row of ones at the bottom. If is nonsingular, then has a (unique) solution in and property holds (after padding with zeroes). Otherwise, we expand the determinant of along the last column. Suppose that . By Hadamard’s inequality, all the cofactors are at most a constant in absolute value; hence, for large enough,

This contradiction implies that is singular, so (at least) one of its rows can be expressed as a linear combination of the others. We form the -by- matrix by removing that row from , together with the last column, and setting to rewrite as , where is the restriction of to the columns indexed by . Having reduced the dimension of the system by one variable, we can proceed inductively in the same way; either we terminate with the discovery of a solution or the induction runs its course until and the corresponding 1-by-1 matrix is null, so that the solution 1 works. Note that has rational coordinates over bits.

Let be the largest sequence in such that property does not hold. Divide into bins for . By Fact 3.4, the sequence can intersect at most of them, so, if , for some large enough , there is at least one empty interval in of length . This gives us the recurrence for and , where , for a positive constant . The recursion to the right of the empty interval, say, , warrants a brief discussion. The issue is that the proof of Fact 3.4 relies crucially on the property that has rational entries over bits—this is needed to lower-bound when it is not 0. But this is not true any more, because, after the recursion, the columns of the matrix are of the form , for , where is the length of the empty interval and . Left as such, the matrices use too many bits for the recursion to go through. To overcome this obstacle, we observe that the recursively transformed can be factored as , where and consists of the column vectors . The key observation now is that, if does not depend on , then neither does , since it can be written as where . In this way, we can enforce property while having restored the proper encoding length for the entries of .

Plugging in the ansatz , for some unknown positive , we find by Jensen’s inequality that, for all , . For the ansatz to hold true, we need to ensure that . Setting completes the proof of Lemma 3.3.

Define and let be some hyperplane in . We consider a set of canonical intervals of length (or less): , where is a small positive real to be specified below and, as we recall, , with . Roughly, the “general position” lemma below says that, for most , the -images of any -wide cube centered in the simplex cannot near-collide with the hyperplane for most values of . This may sound rather counterintuitive. After all, if the stochastic matrices are the identity, the images stay put, so if the initial cube collides then all of the images will! The point is that is primitive so it cannot be the identity. The low coefficients of ergodicity will also play a key role. Notation: refers to its use in Lemma 3.3.
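The remark that a primitive matrix cannot be the identity can be made concrete. The sketch below tests primitivity from the zero pattern of a nonnegative matrix using Wielandt's classical bound (an n-by-n nonnegative matrix is primitive if and only if its ((n-1)^2 + 1)-th power is entrywise positive). This test is a standard substitute chosen for illustration and is not a procedure from the paper; the function name and sample matrices are hypothetical.

```python
def is_primitive(matrix):
    """Check primitivity of a nonnegative square matrix via its zero pattern.

    Relies on Wielandt's bound: the matrix is primitive iff its
    ((n-1)**2 + 1)-th power has no zero entry.
    """
    n = len(matrix)
    pattern = [[x > 0 for x in row] for row in matrix]  # Boolean support
    power = pattern
    # After the loop, `power` is the support of matrix**((n-1)**2 + 1).
    for _ in range((n - 1) ** 2):
        power = [
            [any(power[i][k] and pattern[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return all(all(row) for row in power)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
lazy_cycle = [[0.5, 0.5, 0], [0, 0.5, 0.5], [0.5, 0, 0.5]]  # irreducible, aperiodic
swap = [[0, 1], [1, 0]]  # irreducible but periodic

assert not is_primitive(identity)
assert is_primitive(lazy_cycle)
assert not is_primitive(swap)
```

Only the support of the matrix matters for primitivity, so the check works on Booleans and avoids any floating-point underflow in high powers.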

Lemma 3.5

For any real and any integer , there exists of size , where is independent of , such that, for any and , there are at most integers such that , where and .

Proof. In what follows, refer to suitably large positive constants (which may depend on , etc.). We assume the existence of more than integers such that and draw the consequences: in particular, we infer certain linear constraints on ; by negating them, we define the forbidden set and ensure the conclusion of the lemma. Let be the integers in question, where . For each , there exists such that . By the stochasticity of the matrices, ; hence . By Lemma 3.3, there is a rational vector such that and does not depend on the variable ; on the other hand, . Two quick remarks: (i) the term is derived from ; (ii) , where is a rational over bits. We invalidate the condition on by keeping outside the interval , which rules out at most intervals from . Repeating this for all sequences raises the number of forbidden intervals, i.e., the size of , to .

Topological entropy.

We identify the family with the set of all matrices of the form , for (), where the matrix sequence matches some element of . By definition of the ergodic renormalizer, any is primitive and ; furthermore, both and are in . Our next result shows that the topological entropy of the shift space of itineraries vanishes.

Lemma 3.6

For any real and any integer , there exists and of size such that, for any , any integer , and any , , for constant .

Proof. In the lemma, (resp. ) is independent of (resp. ). The main point is that the exponent of is bounded away from 1. We define the set as the union of the sets formed by applying Lemma 3.5 to each one of the hyperplanes involved in and every possible sequence of matrices in . This increases to . Fix and consider the (lifted) phase space for the dynamical system induced by the map . The system is piecewise-linear with respect to the polyhedral partition of formed by treating as a variable in . Let be a continuity piece for , i.e., a maximal region of over which the -th iterate of is linear. Reprising the argument leading to (1), any matrix sequence matching an element of is such that , where ; hence there exists such that, for any , , for some .

Consider a nested sequence . [Footnote 25: Note that is a cell of , , and is the stochastic matrix used to map to (ignoring the dimension ).] We say there is a split at if , and we show that, given any , there are only splits between and , where , for constant . [Footnote 26: We may have to scale up by a constant factor since and, by Lemma 3.3, .] We may confine our attention to splits caused by the same hyperplane since features only a constant number of them. Arguing by contradiction, we assume the presence of at least splits, which implies that at least of those splits occur for values of at least apart. This is best seen by binning into intervals of length and observing that at least intervals must feature splits. In fact, this proves the existence of splits at positions separated by at least two consecutive bins. Next, we use the same binning to produce the matrices , where .

Suppose that all of the splits occur for values of the form . In this case, a straightforward application of Lemma 3.5 is possible: we set and note that the functions are all products of matrices from the family , which happen to be -long products. The number of splits, , exceeds the number allowed by the lemma and we have a contradiction. If the splits do not fall neatly at endpoints of the bins, we use the fact that includes matrix products of any length between and . This allows us to reconfigure the bins so as to form a sequence with the splits occurring at the endpoints: for each split, merge its own bin with the one to its left and the one to its right (neither of which contains a split) and use the split’s position to subdivide the resulting interval into two new bins; we leave all the other bins alone. [Footnote 27: We note the possibility of an inconsequential decrease in caused by the merges. Also, we can now see clearly why Lemma 3.5 is stated in terms of the slab and not the hyperplane . This allows us to express splitting caused by the hyperplane in lifted space .] This leads to the same contradiction, which implies the existence of fewer than splits at ; hence the same bound on the number of strict inclusions in the nested sequence . The set of all such sequences forms a tree of depth , where each node has at most a constant number of children and any path from the root has nodes with more than one child. Rescaling to and raising completes the proof.
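The closing tree argument can be illustrated with a small computation. The sketch below assumes a tree in which every node has at most `max_children` children and every root-to-leaf path meets at most `max_branchings` nodes with more than one child; these parameter names are hypothetical stand-ins for the constant fan-out and the split bound of the proof. Under those assumptions the number of leaves is at most `max_children ** max_branchings`, regardless of depth, so its logarithm grows sublinearly in the depth whenever the branching budget does, which is the sense in which the entropy vanishes.

```python
from functools import lru_cache


def max_leaves(depth, max_children, max_branchings):
    """Largest possible number of leaves at the given depth in a tree where
    each node has at most `max_children` children and each root-to-leaf path
    contains at most `max_branchings` nodes with more than one child."""

    @lru_cache(maxsize=None)
    def best(d, s):
        if d == 0:
            return 1  # a single leaf
        single = best(d - 1, s)  # the root has exactly one child
        if s == 0 or max_children < 2:
            return single
        # The root branches: one unit of budget is spent on every path below.
        return max(single, max_children * best(d - 1, s - 1))

    return best(depth, max_branchings)


# Hypothetical numbers: depth 20, fan-out 3, at most 4 branchings per path.
assert max_leaves(20, 3, 4) == 3 ** 4
```

The recursion explores the extremal tree shape directly instead of trusting the closed form, so the assertion is a genuine check that depth does not enter the bound once it exceeds the branching budget.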

3.2 Proof of Lemma 3.1

We show that the nonperiodic -intervals can be covered by a Cantor set of Hausdorff dimension less than one. All the parameters below refer to Lemma 3.6 and are set in this order: , , and . The details follow. Let such that . Given a continuity piece for , the -th iterate of induces a partition of into a finite number of continuity pieces , so we can define . As was observed in the proof of Lemma 3.6, . That same lemma shows that if we pick , for large enough then, for any ,

(3)

Next, we set so that the intervals of cover a length of at most . This gives us an extra length of worth of forbidden intervals at our disposal. For any