# Detecting Renewal States in Chains of Variable Length via Intrinsic Bayes Factors

Markov chains with variable length are useful parsimonious stochastic models able to generate most stationary sequences of discrete symbols. The idea is to identify the suffixes of the past, called contexts, that are relevant for predicting the next symbol. Sometimes a single state is a context, and finding this state in the past makes the further past irrelevant. Such states are called renewal states, and they split the chain into independent blocks. In order to identify renewal states in chains with variable length, we propose the use of Intrinsic Bayes Factors to evaluate the plausibility of each set of renewal states. In this case, the difficulty lies in finding the marginal posterior distribution of the random context trees for a general prior distribution on the space of context trees and Dirichlet priors for the transition probabilities. To show the strength of our method, we analyze artificial datasets generated from two binary models and a real example from the field of Linguistics.


## 1 Introduction

Markov Chains with variable length are useful stochastic models that provide a powerful framework for describing transition probabilities of finite-valued sequences, due to the possibility of capturing long-range interactions while keeping parsimony in the number of free parameters. These models were introduced in the seminal paper of rissanen1983universal for data compression and became known in the statistics literature as Variable Length Markov Chains (VLMC) after buhlmann1999variable, and as Probabilistic Suffix Trees (PST) in the machine learning literature (ron1996power). The idea is that, for each past, only a finite suffix of the past is enough to predict the next symbol. Rissanen called this relevant ending string of the past a context. The set of all contexts can be represented by the set of leaves of a rooted tree if we require that no context is a proper suffix of another context. For a fixed set of contexts, estimation of the transition probabilities can be easily achieved; the problem lies in estimating the contexts from the available data. In his seminal 1983 paper, Rissanen introduced the Context algorithm, which estimates the context tree by aggregating irrelevant states in the history of the process using a sequential procedure. A nice introductory guide to this type of model, and particularly to the Context algorithm, can be found in galves2008stochastic.

Many of the tree model methods related to data compression tasks involve obtaining better predictions based on weighting over multiple models. A classical example is the Context-Tree Weighting (CTW) algorithm (willems1995context), which computes the marginal probability of a sequence by weighting over all context trees and all probability vectors using computationally convenient weights. Using CTW, csiszar2006context showed that context trees can be consistently estimated in linear time using the Bayesian Information Criterion (BIC). Likewise, in Bayesian statistics, unobserved parameters of a probabilistic system are treated as additional random components with a given prior distribution, and inference is based on integrating over the nuisance parameters, which is a form of weighting over these quantities based on the prior distribution. Therefore, these weighting strategies can be translated to a Bayesian context where the chosen weights are directly related to a specific prior distribution. Nonetheless, inference for VLMC models under the Bayesian paradigm is a relatively recent topic of research. Some works that explicitly use Bayesian statistics in combination with VLMC models are dimitrakakis2010bayesian, which introduced an online prediction scheme by adding a prior, conditioned on context, on the Markov order of the chain, and kontoyiannis2020bayesian, which provided more general tools such as posterior sampling through the Metropolis-Hastings algorithm and Maximum a Posteriori context tree estimation, focusing on model selection, estimation, and sequential prediction. A Bayesian approach for model selection in high-order Markov chains, allowing conditional probabilities to be partitioned into more general structures than the tree-based structures of VLMC models, is also proposed in xiong2016recursive.

As mentioned above, most of the effort has been concentrated on estimating the context tree structure. On the other hand, hypothesis testing for VLMC is a difficult topic, first tackled by balding2009limit and pursued further by busch2009testing using a Kolmogorov-Smirnov-type goodness-of-fit test to compare whether two samples come from the same distribution. Under the Bayesian paradigm, hypothesis testing is done through Bayes Factors. In this work, we focus on one characteristic of interest in a context tree: the presence of a renewal state or a renewal context. Renewal states play an important role in some computational methods frequently used in statistical analysis, such as designing Bootstrap schemes and defining proper cross-validation strategies based on blocks. Therefore, having a methodology not only to detect renewal states, but also to quantify how plausible these assumptions are, can improve the robustness of analyses at the cost of some pre-processing. For example, galves2012context proposed a constant-free algorithm (Smallest Maximizer Criterion) to find the tree that maximizes the BIC, based on a Bootstrap scheme that uses the renewal property of one of the states. To the best of our knowledge, using Bayes Factors for evaluating hypotheses involving probabilistic context trees is a topic that has not been explored.

## 2 Variable-Length Markov Chains

### 2.1 Model Description

Let $A$ be an alphabet of $m$ symbols and, without loss of generality, consider $A = \{0, 1, \dots, m-1\}$ for simplicity. For $j \le k$, a string $(z_j, z_{j+1}, \dots, z_k)$ will be denoted by $z_j^k$ and its length by $\ell(z_j^k) = k - j + 1$. A sequence $s = s_1^r$ is a suffix of a string $z_j^k$ if $r \le \ell(z_j^k)$ and $s_{r-i} = z_{k-i}$ for all $0 \le i \le r - 1$. If $r < \ell(z_j^k)$ we say that $s$ is a proper suffix of the string $z_j^k$.

###### Definition 1

Let $\tau$ be a set of strings formed by symbols in $A$. We say that $\tau$ satisfies the suffix property if, for every string $s \in \tau$, no proper suffix of $s$ belongs to $\tau$.

###### Definition 2

Let $\tau$ be a set of strings formed by symbols in $A$ satisfying the suffix property. We say that $\tau$ is an irreducible tree if no string belonging to $\tau$ can be replaced by a proper suffix without violating the suffix property.

###### Definition 3

Let $\tau$ be an irreducible tree. We say that $\tau$ is full if, for each string $s \in \tau$, any concatenation of a symbol of $A$ and a suffix of $s$ is the suffix of a string in $\tau$.

#### Examples

Suppose that we have a binary alphabet , then:

• does not satisfy the suffix property because it contains both the strings and .

• is not an irreducible tree, because the string can be replaced by its suffix, , without violating the suffix property, as the resulting set satisfies the suffix property.

• is irreducible, but it is not full because is a suffix of a string in (either or ), but (the concatenation of and ) is not.

• , and are full irreducible trees.

A full irreducible tree can be represented by the set of leaves of a rooted tree with a finite set of labeled branches such that

1. The root node has no label,

2. each node has either zero or $m$ children (fullness), and

3. when a node has children, each child has a symbol of the alphabet as its label.

The elements of $\tau$ will be called contexts, and we will refer to full irreducible trees as context trees henceforth. Figure 1 presents 3 examples of context trees. The depth of a tree $\tau$ is given by the maximal length of a context belonging to $\tau$, defined as

$$ \ell(\tau) = \max\{\ell(z) : z \in \tau\}. $$

In this work we will assume that the depth of the tree is bounded by an integer $L$. In this case, it is straightforward to conclude that, for any string with at least $L$ symbols, there exists a suffix of the string and a leaf of $\tau$ such that the symbols between the leaf (inclusive) and the root node are exactly that suffix. galves2012context referred to this property as the properness of a context tree.
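As an illustration of properness, the suffix lookup can be sketched in a few lines of Python (a hypothetical helper, not the authors' code; contexts are stored as plain strings of symbols with the most recent symbol last):

```python
def context_of(past, tree):
    """Return the unique context in `tree` that is a suffix of `past`.

    `past` is a string of symbols (most recent symbol last) and `tree` is a
    set of context strings; by properness, exactly one context matches any
    sufficiently long past.
    """
    matches = [s for s in tree if past.endswith(s)]
    if len(matches) != 1:
        raise ValueError("tree is not proper for this past")
    return matches[0]

# A binary context tree with contexts {"1", "00", "10"} (depth 2).
tree = {"1", "00", "10"}
print(context_of("0010", tree))  # -> "10"
print(context_of("0001", tree))  # -> "1"
```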

For each context tree $\tau$, we can associate a family of probability measures indexed by the elements of $\tau$,

$$ p = \{\, p(\cdot \mid s) : A \to [0,1] \,;\; s \in \tau \,\}. $$

The pair $(\tau, p)$ is called a probabilistic context tree.

Given a tree $\tau$ with the described properties and depth bounded by $L$, define a suffix mapping function $\eta_\tau$ such that $\eta_\tau(z_1^t)$ is the unique suffix of $z_1^t$ belonging to $\tau$.

###### Definition 4

A sequence of random variables $(Z_t)_{t \ge 1}$ with state space $A$ is a Variable Length Markov Chain (VLMC) compatible with the probabilistic context tree $(\tau, p)$ if it satisfies

$$ P(Z_t = k \mid Z_1^{t-1} = z_1^{t-1}) = p(k \mid \eta_\tau(z_1^{t-1})), \qquad (1) $$

for all $t > L$ and $k \in A$, where $\eta_\tau(z_1^{t-1})$ is the suffix of $z_1^{t-1}$ belonging to $\tau$.

### 2.2 Likelihood Function

In order to extend the scope of the VLMC models introduced previously to data involving multiple sequences, we define a VLMC dataset of size $I$, denoted $\tilde{Z}$, as a set of $I$ independent VLMC sequences; $\tilde{z}$ will denote its observed realizations.

For each sequence $z^{(i)}$, with length $T_i$, we will consider its first $L$ elements as constant values, allowing us to write the joint probabilities as a product of the transition probabilities in (1) without requiring additional parameters to consistently define the probabilities of the first $L$ symbols of each sequence. Hence, the likelihood function is given by

$$ f(\tilde{z} \mid \tau, p) = \prod_{s \in \tau} \prod_{k=0}^{m-1} p(k \mid s)^{n_{sk}(\tilde{z})}, \qquad (2) $$

where $n_{sk}(\tilde{z})$ counts the number of occurrences of the symbol $k$ after strings with suffix $s$, across all sequences.
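As a concrete sketch (not the authors' implementation), the counts $n_{sk}$ and the resulting log-likelihood can be computed as follows, assuming sequences and contexts are strings with the most recent symbol last and the first `depth` symbols of each sequence held fixed:

```python
from collections import defaultdict
from math import log

def transition_counts(sequences, tree):
    """n[(s, k)]: occurrences of symbol k right after context s, over all sequences."""
    depth = max(len(s) for s in tree)
    n = defaultdict(int)
    for z in sequences:
        for t in range(depth, len(z)):  # the first `depth` symbols are treated as constants
            s = next(c for c in tree if z[:t].endswith(c))  # unique matching context
            n[(s, z[t])] += 1
    return n

def log_likelihood(sequences, tree, p):
    """log f(z | tau, p): sum of n_sk * log p(k | s) over contexts s and symbols k."""
    n = transition_counts(sequences, tree)
    return sum(cnt * log(p[s][k]) for (s, k), cnt in n.items())

tree = {"1", "00", "10"}
p = {"1": {"0": 0.3, "1": 0.7}, "00": {"0": 0.5, "1": 0.5}, "10": {"0": 0.9, "1": 0.1}}
print(log_likelihood(["0010110", "10010"], tree, p))
```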

### 2.3 Renewal States

A symbol $a \in A$ is called a renewal state if

$$ P(Z_{t+1}^{t'} = z_{t+1}^{t'} \mid Z_1^{t-1} = z_1^{t-1}, Z_t = a) = P(Z_{t+1}^{t'} = z_{t+1}^{t'} \mid Z_t = a), $$

for all $t' > t > L$ and all compatible strings. That is, conditioning on $Z_t = a$, the distribution of the chain after $t$, $Z_{t+1}^{t'}$, is independent from the past $Z_1^{t-1}$.

This conditional independence property of a Markov Chain can be directly associated with the structure of the context tree of a VLMC model. For a VLMC model with associated context tree $\tau$, a state $a$ is a renewal state if $a$ does not appear in any inner node of the context tree. That is, for any context $s \in \tau$, expressing $s$ as the concatenation of symbols $s = s_1 s_2 \cdots s_r$, we have $s_j \neq a$ for $j = 2, \dots, r$. In this case, we say that the tree $\tau$ is $a$-renewing.

Two out of the three trees displayed in Figure 1 present renewal states: Tree (I) has a renewal state; Tree (II) has no renewal states, due to the presence of certain contexts; Tree (III) has only one renewal state. If, in Tree (III), the branch formed by three of its contexts were pruned and substituted by a single context, then another symbol would also become a renewal state. Note that a VLMC may contain multiple renewal states.

A remarkable consequence of $a$ being a renewal state is that the random blocks between two occurrences of $a$ are independent and identically distributed. This feature allows the use of block Bootstrap methods and enables a straightforward construction of cross-validation schemes, as well as any other technique that relies on exchangeability properties.
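For instance, the blocks between consecutive occurrences of a renewal state can be extracted directly (a minimal sketch; sequences are plain strings and the renewal symbol is a single character):

```python
def renewal_blocks(z, a):
    """Split z into the blocks between consecutive occurrences of symbol a.

    If a is a renewal state, the returned blocks are i.i.d., so they can be
    resampled for a block Bootstrap or partitioned for cross-validation.
    """
    hits = [i for i, sym in enumerate(z) if sym == a]
    # each block starts at one occurrence of `a` (inclusive) and ends right
    # before the next occurrence
    return [z[hits[i]:hits[i + 1]] for i in range(len(hits) - 1)]

print(renewal_blocks("2103103403", "3"))  # -> ['310', '340']
```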

## 3 Bayesian Renewal Hypothesis Evaluation

A VLMC model is fully specified by the probabilistic context tree $(\tau, p)$. The dimension of $p$ depends on the branches of $\tau$. Both of these unobserved components can be treated as random elements with given prior distributions in order to carry out inference under the Bayesian paradigm.

From now on, we will use the following notation: for each $s \in \tau$,

$$ p_s = \big(p(0 \mid s), \dots, p(m-1 \mid s)\big) \in \Delta_m, $$

where $\Delta_m$ denotes the simplex of probability vectors indexed by $A$,

$$ \Delta_m = \Big\{ x \in \mathbb{R}^{\{0,1,\dots,m-1\}} : \sum_{k=0}^{m-1} x_k = 1 \text{ and } x_j \ge 0 \text{ for all } j \Big\}. $$

In this section we discuss the prior specification for the probabilistic context tree $(\tau, p)$, the resulting posterior distribution, and how to perform hypothesis testing using Partial and Intrinsic Bayes Factors.

### 3.1 A Bayesian Framework for VLMC models

We consider a general prior distribution for $\tau$ proportional to an arbitrary non-zero function $h(\tau)$ and, given $\tau$, each $p_s$, $s \in \tau$, will have an independent Dirichlet prior.

The complete Bayesian system can be described by the hierarchical structure

$$ \begin{aligned} \tau &\sim \frac{h(\tau)}{\zeta(h, L)}, & \tau &\in \mathcal{T}_L, \\ p \mid \tau &\sim \prod_{s \in \tau} \frac{\Gamma\big(\sum_{k=0}^{m-1} \alpha_{sk}\big)}{\prod_{k=0}^{m-1} \Gamma(\alpha_{sk})} \prod_{k=0}^{m-1} p(k \mid s)^{\alpha_{sk} - 1}, & p &\in \Delta_m^{|\tau|}, \\ \tilde{Z} \mid \tau, p &\sim f(\tilde{z} \mid \tau, p), & z^{(i)} &\in A^{T_i}, \end{aligned} \qquad (3) $$

where

$$ \zeta(h, L) = \sum_{\tau \in \mathcal{T}_L} h(\tau) \qquad (4) $$

is the normalizing constant of the tree prior distribution and $f(\tilde{z} \mid \tau, p)$ is given by (2). We are assuming that the prior distributions for the transition probabilities $p_s$ are independent Dirichlet distributions with hyper-parameters $(\alpha_{s0}, \dots, \alpha_{s,m-1})$. Therefore, the joint distribution of $(\tau, p, \tilde{Z})$ is given by

$$ \pi(\tau, p, \tilde{z}) = \frac{h(\tau)}{\zeta(h, L)} \prod_{s \in \tau} \frac{\Gamma\big(\sum_{k=0}^{m-1} \alpha_{sk}\big)}{\prod_{k=0}^{m-1} \Gamma(\alpha_{sk})} \prod_{k=0}^{m-1} p(k \mid s)^{n_{sk}(\tilde{z}) + \alpha_{sk} - 1}. $$

Since our interest lies in making inferences about the dependence structure represented by $\tau$ rather than the transition probabilities, we can simplify the analysis by marginalizing the joint probability function over $p$, obtaining a function that depends only on the context tree and the data. The product of Dirichlet densities, assigned as the prior distribution of $p$, is conjugate to the likelihood function, allowing us to express the integrated distribution in closed form as

$$ \pi(\tau, \tilde{z}) = \frac{h(\tau)}{\zeta(h, L)} \prod_{s \in \tau} \frac{\Gamma\big(\sum_{k=0}^{m-1} \alpha_{sk}\big)}{\prod_{k=0}^{m-1} \Gamma(\alpha_{sk})} \cdot \frac{\prod_{k=0}^{m-1} \Gamma\big(n_{sk}(\tilde{z}) + \alpha_{sk}\big)}{\Gamma\big(\sum_{k=0}^{m-1} (n_{sk}(\tilde{z}) + \alpha_{sk})\big)}, \qquad (5) $$

obtained by multiplying by the appropriate normalizing constants to form Dirichlet densities with parameters $n_{sk}(\tilde{z}) + \alpha_{sk}$ for each $s \in \tau$, so that the integration is done over a proper density. For less convoluted notation, we shall denote

$$ q(\tau, \tilde{z}) = \prod_{s \in \tau} \frac{\Gamma\big(\sum_{k=0}^{m-1} \alpha_{sk}\big)}{\prod_{k=0}^{m-1} \Gamma(\alpha_{sk})} \cdot \frac{\prod_{k=0}^{m-1} \Gamma\big(n_{sk}(\tilde{z}) + \alpha_{sk}\big)}{\Gamma\big(\sum_{k=0}^{m-1} (n_{sk}(\tilde{z}) + \alpha_{sk})\big)}, \qquad (6) $$

and use, from now on, the shorter expression $\pi(\tau, \tilde{z}) = \frac{h(\tau)}{\zeta(h, L)}\, q(\tau, \tilde{z})$.
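Since (6) involves only ratios of Gamma functions, $\log q(\tau, \tilde{z})$ is cheap to evaluate once the counts are available. A minimal Python sketch (a hypothetical helper; counts and hyper-parameters are dictionaries keyed by context-symbol pairs, with a default $\alpha_{sk} = 1$ purely for illustration):

```python
from math import lgamma

def log_q(counts, tree, alpha, m):
    """log q(tau, z) from (6): a product over contexts of Dirichlet-multinomial
    marginals with hyper-parameters alpha[(s, k)] and counts counts[(s, k)]."""
    total = 0.0
    for s in tree:
        a = [alpha.get((s, k), 1.0) for k in range(m)]  # prior pseudo-counts
        n = [counts.get((s, k), 0) for k in range(m)]   # observed transition counts
        total += lgamma(sum(a)) - sum(lgamma(ak) for ak in a)
        total += sum(lgamma(nk + ak) for nk, ak in zip(n, a))
        total -= lgamma(sum(n) + sum(a))
    return total

# One binary context "1", observed 4 times followed by 0 and 6 times by 1.
print(log_q({("1", 0): 4, ("1", 1): 6}, {"1"}, {}, 2))
```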

Finally, the model evidence (marginal likelihood) can now be obtained by summing (5) over all trees in $\mathcal{T}_L$,

$$ E(\tilde{z}; h) = \sum_{\tau \in \mathcal{T}_L} \pi(\tau, \tilde{z}) = \sum_{\tau \in \mathcal{T}_L} \frac{h(\tau)}{\zeta(h, L)}\, q(\tau, \tilde{z}). \qquad (7) $$

Note that we explicitly describe the model evidence in terms of the prior distribution as we will be interested in evaluating hypotheses based on different prior distributions.

### 3.2 Bayes Factors for Renewal State Hypothesis

Let $\tilde{Z}$ be a VLMC sample compatible with a probabilistic context tree $(\tau, p)$ where $\tau$ has maximum depth $L$. We will call maximal tree the complete tree with depth $L$, and let $a$ be a fixed state of the alphabet. Our goal is to use Bayes Factors (kass1995bayes) to evaluate the evidence in favor of the null hypothesis $H_a$ that $\tau$ is $a$-renewing against the alternative hypothesis $H_{\bar{a}}$ that $\tau$ is not $a$-renewing. We denote by $\mathcal{T}_L^a$ the set of $a$-renewing trees with depth no more than $L$ and by $\overline{\mathcal{T}_L^a}$ the set of trees with $a$ as an inner node; consequently, $a$ is not a renewal state for those trees.

We are interested in defining a metric for evaluating the hypothesis $H_a$ against its complement in a Bayesian framework. These hypotheses can be expressed in terms of special prior distributions proportional to functions $h_a$ and $h_{\bar{a}}$, respectively, such that $h_a(\tau) > 0$ if, and only if, $\tau \in \mathcal{T}_L^a$. Similarly, $h_{\bar{a}}(\tau) > 0$ if, and only if, $\tau \in \overline{\mathcal{T}_L^a}$.

The Bayes Factor for $H_a$ against $H_{\bar{a}}$ is defined as

$$ BF_{a,\bar{a}}(\tilde{z}) = \frac{E(\tilde{z}; h_a)}{E(\tilde{z}; h_{\bar{a}})} = \frac{\zeta(h_{\bar{a}}, L)}{\zeta(h_a, L)} \cdot \frac{\sum_{\tau \in \mathcal{T}_L} h_a(\tau)\, q(\tau, \tilde{z})}{\sum_{\tau \in \mathcal{T}_L} h_{\bar{a}}(\tau)\, q(\tau, \tilde{z})}, \qquad (8) $$

where $\zeta(\cdot, L)$ is given by (4) and $q(\tau, \tilde{z})$ is given by (6).

kass1995bayes proposed the following interpretation for the quantity $\log_{10} BF_{a,\bar{a}}(\tilde{z})$ as a measure of the evidence provided by the data in favor of the hypothesis that corresponds to $a$-renewing trees as opposed to the alternative one. A value between $0$ and $1/2$ is considered to provide evidence that is “Not worth more than a bare mention”; the evidence is “Substantial” for values between $1/2$ and $1$, “Strong” if between $1$ and $2$, and “Decisive” for values greater than $2$. By symmetry, the same intervals with negative sign provide the same strength of evidence with the hypotheses reversed. Therefore, the sign of $\log_{10} BF_{a,\bar{a}}(\tilde{z})$ provides a straightforward indication of whether the data provide more evidence that the chain is compatible with a context tree that is $a$-renewing or with one that belongs to $\overline{\mathcal{T}_L^a}$.
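This scale can be encoded directly; the sketch below (an illustrative helper, not part of the original methodology) maps a $\log_{10}$ Bayes Factor to its Kass and Raftery category and to the favored hypothesis:

```python
def kass_raftery_evidence(log10_bf):
    """Kass & Raftery (1995) evidence category for a log10 Bayes Factor.

    Negative values yield the same category with the hypotheses reversed.
    """
    v = abs(log10_bf)
    if v < 0.5:
        label = "not worth more than a bare mention"
    elif v < 1.0:
        label = "substantial"
    elif v < 2.0:
        label = "strong"
    else:
        label = "decisive"
    favored = "H_a" if log10_bf >= 0 else "complement of H_a"
    return label, favored

print(kass_raftery_evidence(2.3))   # -> ('decisive', 'H_a')
print(kass_raftery_evidence(-0.7))  # -> ('substantial', 'complement of H_a')
```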

### 3.3 Metropolis-Hastings algorithm for context tree posterior sampling

Before further development of methods to compute the Bayes Factor in (8), we need to introduce a Metropolis-Hastings algorithm for sampling from the marginal posterior distribution of context trees, $\pi(\tau \mid \tilde{z})$. From (5) and Bayes' rule we obtain

$$ \pi(\tau \mid \tilde{z}) = \frac{h(\tau)}{\zeta(h, L)} \cdot \frac{q(\tau, \tilde{z})}{E(\tilde{z}; h)} \propto h(\tau)\, q(\tau, \tilde{z}), \qquad (9) $$

which has a simple expression up to the intractable proportionality terms, suggesting that the Metropolis-Hastings algorithm (hastings1970monte; chib1995understanding) is an appropriate strategy to obtain an empirical sample from the posterior distribution given by (9).

The main step in constructing the algorithm is defining a suitable proposal kernel to move to a new context tree $\tau'$ from a current tree $\tau$. We propose the use of a graph-based kernel that can be viewed as a modification of the Markov Chain Monte Carlo Model Composition (MC$^3$) method of madigan1995bayesian, obtained by defining a neighborhood system over $\mathcal{T}_L$ and constructing a proposal kernel that allows transitions only between neighboring trees.

We first specify a set of directed edges such that an edge from $\tau$ to $\tau'$ is included if, and only if, $\tau'$ is obtained by substituting one of the contexts $s \in \tau$ by the $m$ contexts associated with its children nodes, $0s, 1s, \dots, (m-1)s$; equivalently, the symmetric difference between the two trees is exactly $\{s\} \cup \{ks : k \in A\}$. We refer to this substitution as growing a branch from $\tau$.

Additionally, we define the grow ($\oplus$) and prune ($\ominus$) operators as

$$ \oplus(\tau, h) = \{\tau' \in \mathcal{T}_L : (\tau, \tau') \text{ is an edge and } h(\tau') > 0\}, \qquad \ominus(\tau, h) = \{\tau' \in \mathcal{T}_L : (\tau', \tau) \text{ is an edge and } h(\tau') > 0\}. $$

The $\oplus$ operator maps a tree $\tau$ to the set of trees with positive prior probability that can be obtained by growing a new branch from $\tau$, whereas $\ominus$ maps $\tau$ to the set of trees in $\mathcal{T}_L$ from which $\tau$ can be obtained after growing a branch.

Some important properties that can be easily checked are:

1. For every $\tau, \tau' \in \mathcal{T}_L$, if $h(\tau) > 0$ and $\tau' \in \oplus(\tau, h)$, then $\tau \in \ominus(\tau', h)$.

2. For every $\tau, \tau' \in \mathcal{T}_L$, if $h(\tau) > 0$ and $\tau' \in \ominus(\tau, h)$, then $\tau \in \oplus(\tau', h)$.

3. For any finite sequence $\tau^{(1)}, \dots, \tau^{(n)}$ such that $h(\tau^{(i)}) > 0$ and $\tau^{(i+1)} \in \oplus(\tau^{(i)}, h) \cup \ominus(\tau^{(i)}, h)$ for all $i$, we have $\tau^{(i)} \in \oplus(\tau^{(i+1)}, h) \cup \ominus(\tau^{(i+1)}, h)$.

It follows from Properties 1 and 2 that any context tree can be recovered by sequentially applying grow and prune operations. Property 3 is a direct consequence of Properties 1 and 2 and means that any sequence of context trees obtained by grow or prune operations can also be visited in reverse order by a sequence of grow and prune operations. These properties also suggest that combining $\oplus$ and $\ominus$ to construct the set of transitions with positive probability in a proposal kernel is a good strategy for achieving the irreducibility condition.

We define the transition kernel

$$ \kappa(\tau' \mid \tau) = \begin{cases} \dfrac{1}{|\oplus(\tau, h)|}\, \mathbb{1}\big(\tau' \in \oplus(\tau, h)\big), & \text{if } \ominus(\tau, h) = \emptyset, \\[6pt] \dfrac{1}{|\ominus(\tau, h)|}\, \mathbb{1}\big(\tau' \in \ominus(\tau, h)\big), & \text{if } \oplus(\tau, h) = \emptyset, \\[6pt] \dfrac{1}{2\,|\oplus(\tau, h)|}\, \mathbb{1}\big(\tau' \in \oplus(\tau, h)\big) + \dfrac{1}{2\,|\ominus(\tau, h)|}\, \mathbb{1}\big(\tau' \in \ominus(\tau, h)\big), & \text{otherwise}, \end{cases} $$

which allows us to propose a tree $\tau'$ in a simple two-step process. First, pick the operator to be applied to $\tau$, $\oplus$ or $\ominus$, with probability $1/2$ each if both lead to non-empty sets of trees; otherwise, pick the operation that produces a non-empty set. Then, pick $\tau'$ from $\oplus(\tau, h)$ or $\ominus(\tau, h)$ with uniform probability.
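The two-step proposal can be sketched as follows (a sketch under stated assumptions: the hypothetical callables `neighbors_grow` and `neighbors_prune` are assumed to return $\oplus(\tau, h)$ and $\ominus(\tau, h)$; the function also returns $\kappa(\tau' \mid \tau)$, needed for the Metropolis-Hastings acceptance ratio):

```python
import random

def propose(tau, neighbors_grow, neighbors_prune):
    """Draw tau' ~ kappa(.|tau) for the grow/prune kernel; return (tau', kappa(tau'|tau))."""
    up = list(neighbors_grow(tau))      # trees reachable by growing a branch
    down = list(neighbors_prune(tau))   # trees reachable by pruning a branch
    if not down:                        # only growing is possible
        return random.choice(up), 1.0 / len(up)
    if not up:                          # only pruning is possible
        return random.choice(down), 1.0 / len(down)
    # both moves available: pick grow or prune with probability 1/2 each,
    # then a uniform tree within the chosen set
    if random.random() < 0.5:
        return random.choice(up), 0.5 / len(up)
    return random.choice(down), 0.5 / len(down)

# Toy neighborhoods: from "root" we can only grow into one of two trees.
tree, prob = propose("root", lambda t: ["grown1", "grown2"], lambda t: [])
print(tree, prob)  # one of the grown trees, each with probability 0.5
```

Note that, since growing strictly increases the number of contexts, the grow and prune sets are disjoint, so the two terms of the kernel never overlap for a given $\tau'$.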

The complete Metropolis-Hastings algorithm is described in Algorithm 1. This construction of a proposal kernel for context trees based on growing and pruning nodes of trees, although not formally defined in terms of graphs, was already used in kontoyiannis2020bayesian for a specific prior distribution .

### 3.4 Partial Bayes Factors and Intrinsic Bayes Factors

While $BF_{a,\bar{a}}(\tilde{z})$ in (8) provides a measure of the plausibility of one hypothesis with respect to the other, computing this quantity may present an enormous computational cost, as the sum over $\mathcal{T}_L$ involves a doubly exponential number of terms. In fact, even the normalizing constant $\zeta(h, L)$ of the tree prior distribution is intractable in the general case for moderate values of $L$, hindering the evaluation of the model evidence $E(\tilde{z}; h)$.

Previous works like the Context Tree Weighting (CTW) algorithm of willems1995context use similar ideas to compute the marginal likelihood; their weighting of context trees corresponds to a very specific choice of prior distribution for $\tau$, for which (7) can be computed recursively over the nodes of the maximal tree rather than by enumerating every subtree. However, CTW is not suitable for our purposes, since we aim to compute (7) for arbitrary prior distributions.

To overcome this difficulty, we consider the Partial Bayes Factor (PBF) described in o1995fractional as an alternative approach for model comparison when considering improper priors. The methodology consists of dividing the data into two independent chunks, $\tilde{z}_{train}$ and $\tilde{z}_{test}$, and then computing the Bayes Factor based on part of the data, $\tilde{z}_{test}$, conditioned on $\tilde{z}_{train}$, as follows:

$$ PBF_{a,\bar{a}}(\tilde{z}_{test} \mid \tilde{z}_{train}) = \frac{\sum_{\tau \in \mathcal{T}_L} \pi_a(\tau \mid \tilde{z}_{train})\, q(\tau, \tilde{z}_{test})}{\sum_{\tau \in \mathcal{T}_L} \pi_{\bar{a}}(\tau \mid \tilde{z}_{train})\, q(\tau, \tilde{z}_{test})}, \qquad (10) $$

where $\pi_a(\cdot \mid \tilde{z}_{train})$ and $\pi_{\bar{a}}(\cdot \mid \tilde{z}_{train})$ are the posterior distributions of $\tau$ conditioned on the training data under the hypotheses $H_a$ and $H_{\bar{a}}$, respectively.

Although the original goal of the PBF is to avoid undefined behavior when evaluating ratios involving improper priors, which are replaced by posterior distributions conditioned on the training sample, we can see that it is also very useful for avoiding the intractable normalizing constants of the prior distributions.

Note that, even though (10) still involves sums over $\mathcal{T}_L$, the terms

$$ \sum_{\tau \in \mathcal{T}_L} \pi_a(\tau \mid \tilde{z}_{train})\, q(\tau, \tilde{z}_{test}) \quad \text{and} \quad \sum_{\tau \in \mathcal{T}_L} \pi_{\bar{a}}(\tau \mid \tilde{z}_{train})\, q(\tau, \tilde{z}_{test}) $$

can be written as the expected values $E_{\pi_a}[q(\tau, \tilde{z}_{test})]$ and $E_{\pi_{\bar{a}}}[q(\tau, \tilde{z}_{test})]$, which can be approximated using ergodic Markov Chains with invariant measures $\pi_a(\cdot \mid \tilde{z}_{train})$ and $\pi_{\bar{a}}(\cdot \mid \tilde{z}_{train})$, respectively.

Therefore, we can use MCMC methods to approximate Partial Bayes Factors by sampling two Markov Chains of trees, $(\tau_a^{(t)})$ and $(\tau_{\bar{a}}^{(t)})$, and using the ratio of empirical averages in place of the expected values:

$$ \widehat{PBF}_{a,\bar{a}}(\tilde{z}_{test} \mid \tilde{z}_{train}) = \frac{\sum_{t=1}^{n_{iter}} q(\tau_a^{(t)}, \tilde{z}_{test})}{\sum_{t=1}^{n_{iter}} q(\tau_{\bar{a}}^{(t)}, \tilde{z}_{test})}. \qquad (11) $$

To avoid the arbitrary segmentation of the dataset into train and test subsets, berger1996intrinsic proposed the Intrinsic Bayes Factor (IBF), which averages the PBFs obtained from different segmentations based on minimal training samples. Denote by $\mathcal{I}_v$ the collection of subsets of $\{1, 2, \dots, I\}$ of size $v$, i.e.,

$$ \mathcal{I}_v = \big\{ \{i_1, i_2, \dots, i_v\} \subset \{1, 2, \dots, I\} \big\}. $$

The dataset is divided into a minimal training sample, a $v$-tuple of sequences indexed by $i_v \in \mathcal{I}_v$ and denoted $\tilde{z}^{(i_v)}$, and the remaining sequences, denoted $\tilde{z}^{(-i_v)}$, to be used as the test sample. For each possible subset of sequences $i_v \in \mathcal{I}_v$, we compute the Monte Carlo approximation of the PBF in (11) and take either the arithmetic average, to obtain the Arithmetic Intrinsic Bayes Factor (AIBF), or the geometric average, for the Geometric Intrinsic Bayes Factor (GIBF).

Denoting by $(\tau_{a, i_v}^{(t)})$ and $(\tau_{\bar{a}, i_v}^{(t)})$ the Markov Chains of context trees obtained using Algorithm 1 with target distributions $\pi_a(\cdot \mid \tilde{z}^{(i_v)})$ and $\pi_{\bar{a}}(\cdot \mid \tilde{z}^{(i_v)})$ (considering prior distributions proportional to $h_a$ and $h_{\bar{a}}$), respectively, the AIBF and GIBF are defined as

$$ AIBF_{a,\bar{a}}(\tilde{z}) = \frac{1}{|\mathcal{I}_v|} \sum_{i_v \in \mathcal{I}_v} \left( \frac{\sum_{t=1}^{n_{iter}} q(\tau_{a, i_v}^{(t)}, \tilde{z}^{(-i_v)})}{\sum_{t=1}^{n_{iter}} q(\tau_{\bar{a}, i_v}^{(t)}, \tilde{z}^{(-i_v)})} \right), $$

and

$$ GIBF_{a,\bar{a}}(\tilde{z}) = \prod_{i_v \in \mathcal{I}_v} \left( \frac{\sum_{t=1}^{n_{iter}} q(\tau_{a, i_v}^{(t)}, \tilde{z}^{(-i_v)})}{\sum_{t=1}^{n_{iter}} q(\tau_{\bar{a}, i_v}^{(t)}, \tilde{z}^{(-i_v)})} \right)^{1/|\mathcal{I}_v|}. $$

The complete procedure for obtaining these quantities for a given VLMC dataset is described in Algorithm 2.
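The final averaging step, given the per-split PBF estimates from (11), reduces to arithmetic and geometric means; a minimal sketch, working in log scale since the $q$ values typically under- or overflow otherwise:

```python
from math import exp, log

def intrinsic_bayes_factors(log_pbfs):
    """AIBF and GIBF from the log PBFs computed over all minimal training subsets."""
    n = len(log_pbfs)
    aibf = sum(exp(x) for x in log_pbfs) / n   # arithmetic mean of the PBFs
    gibf = exp(sum(log_pbfs) / n)              # geometric mean of the PBFs
    return aibf, gibf

# Two training splits with PBFs 2 and 8: arithmetic mean 5, geometric mean 4.
aibf, gibf = intrinsic_bayes_factors([log(2.0), log(8.0)])
print(aibf, gibf)
```

Note that the geometric mean aggregates in log scale, which tempers the effect of occasional outlier PBFs.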

Note that, while a single sequence ($v = 1$) is theoretically sufficient to identify the context tree and can be considered a minimal training sample, computing posterior distributions using small training sets may result in posterior distributions that assign very low probabilities to context trees with long branches, due to smaller total counts on those longer branches. Therefore, using more sequences (a higher value of $v$) may lead to more consistent results, as the posterior distribution used in each PBF is more likely to capture long-range contexts. On the other hand, the number of PBFs to be computed is $\binom{I}{v}$, which quickly becomes prohibitive as $v$ increases. The choice of $v$ is a trade-off between computational cost and the depth of the contexts to be captured by the partial posterior distributions.

## 4 Simulation Studies and Application

To show the strength of our method, we analyzed artificial VLMC datasets generated from two binary models and a real one coming from the field of Linguistics.

### 4.1 Simulation for binary models

In this section, the primary goal is to examine the performance of the AIBF and GIBF for evaluating the evidence in favor of a null hypothesis of $a$-renewing, considering the effect of the number of independent samples, the size of each chain, and the ability to discriminate between similar trees. We consider simulations of binary VLMC models with three different numbers of sampled chains. For each scenario we sample chains of equal length, with three different length values. A dataset was simulated for each combination of number of chains and chain length, resulting in 9 datasets.

The two models considered are presented in Figure 2. Model 1 has depth 6 with $0$ being a renewal state, while Model 2 is a modified version with an additional branch grown from one of the nodes, which is substituted by two longer suffixes. Therefore, in Model 2, $0$ is no longer a renewal state, although both trees are very similar.

For each possible renewal state, $a = 0$ or $a = 1$, we compute the Intrinsic Bayes Factor (both AIBF and GIBF) using Algorithm 2, considering prior distributions proportional to

$$ h_a(\tau) = \mathbb{1}(\tau \in \mathcal{T}_L^a) \quad \text{and} \quad h_{\bar{a}}(\tau) = \mathbb{1}(\tau \in \overline{\mathcal{T}_L^a}), $$

which correspond to the uniform distributions on the spaces of context trees allowed under $H_a$ and $H_{\bar{a}}$, respectively.

For the hyper-parameters $\alpha_{sk}$, we choose a common value for all $s$ and $k$, resulting in symmetric prior distributions for the transition probabilities, with higher density for vectors that are more concentrated.

In each scenario, we considered two values for the number $v$ of sequences in the minimal training sample and ran $n_{iter}$ Metropolis-Hastings steps for each PBF Monte Carlo approximation.

The results for Model 1, presented in Table 1, show that both AIBF and GIBF lead to the correct decision for both renewal hypotheses tested: the values were greater than $1$ (positive in $\log_{10}$ scale) for the state that is in fact a renewal state of the model from which the data were simulated and, in all cases, much smaller than $1$ (negative in $\log_{10}$ scale) for the state that is not a renewal state. The values obtained correspond to at least decisive evidence for the correct hypothesis in both tests.

The results for Model 2, which includes a new pair of contexts causing $0$ to no longer be a renewal state, are presented in Table 2 and show similar performance for the 1-renewing hypothesis when compared to Model 1, which is expected due to the similarity of both models with respect to the short-length nodes. The main difference occurred in the computed values for the 0-renewing hypothesis, where the computed GIBF was positive in logarithmic scale for the smallest chain length, mostly negative for the largest, and took both positive and negative values in the intermediate case. This suggests that, for smaller samples, the posterior distributions obtained from the training samples were insufficient to capture the long-range contexts that break the renewal condition of the state $0$, generating evidence for the incorrect hypothesis.

Figures 6 and 7 in Appendix A present the empirical distributions of the PBFs computed in $\log_{10}$ scale for each scenario and each model. In general, the strength of the evidence tends to be larger for smaller training samples, as more data is used in the test set; on the other hand, larger training samples lead to more stable PBFs. The same behavior is observed as the number of independent sequences increases, which is expected, as adding more data impacts the scale of the marginal likelihood, also rescaling the Bayes Factors.

With the smallest chain length, PBFs tend to be distributed with high variance, as can be observed in Figure 7 in Appendix A, resulting in a very unstable average and leading to correct results in some of the scenarios and incorrect ones in others. As the sample size increases, long-range contexts are more likely to be captured by the posterior distribution. With the largest chain length, we have decisive evidence that $0$ is not a renewal state, except for one scenario in which the value provides only very weak evidence; in general, the distribution of PBFs was highly concentrated on the negative side, except for a few outliers. The AIBF was highly affected by outliers and resulted in incorrect conclusions, with high evidence, in some scenarios.

Note that, especially for the datasets with small sample sizes, outliers with large values are observed, having a great effect on the computed averages, although the conclusions are not affected. For large datasets, we have smaller variance in the computed PBFs among different training sets compared to the scenarios with shorter sequences.

Therefore, we conclude that a decision based on the GIBF leads to at least strong evidence on the scale of kass1995bayes (greater than $1$ in $\log_{10}$ scale) for the correct hypothesis in all cases with sufficiently large samples. The AIBF was not robust to the presence of outlier PBFs, leading to incorrect conclusions in the harder scenarios. Context trees that break the renewal condition on long-range contexts require larger samples in order for the violation to be captured and produce evidence against the renewal hypothesis, whereas states that break the renewal condition in short contexts can be detected with smaller samples.

### 4.2 Application to rhythm analysis in Portuguese texts

It is known that Brazilian and European Portuguese (henceforth BP and EP) have different syntaxes. For example, galves2005syntax inferred that the placement of clitic pronouns in BP and EP differs in two senses, one of them being: “EP clitics, but not BP clitics, are required to be in a non-initial position with respect to some boundary”. However, the question remains: are the choices of word placement related to different preferences in stress patterns? This question was addressed by galves2012context, who found distinguishing rhythmic patterns for BP and EP based on written journalistic texts. The data consist of 40 BP texts and 40 EP texts randomly extracted from an encoded corpus of newspaper articles from the 1994 and 1995 editions of Folha de São Paulo (Brazil) and O Público (Portugal). Texts were encoded according to rhythmic features, resulting in discrete sequences of around 2500 symbols each, available at http://dx.doi.org/10.1214/11-AOAS511SUPP. After preprocessing of the texts (removing foreign words, rewriting of symbols, dates, compound words, etc.), each syllable was encoded by one of four symbols according to whether or not (i) the syllable is stressed and (ii) the syllable is the beginning of a prosodic word (a lexical word (nouns, verbs, …) together with the non-stressed functional words (articles, prepositions, …) which precede or succeed it). Additionally, an extra symbol was assigned to encode the end of each sentence. The resulting alphabet is as follows.

• 0 = non-stressed, non-prosodic word initial syllable;

• 1 = stressed, non-prosodic word initial syllable;

• 2 = non-stressed, prosodic word initial syllable;

• 3 = stressed, prosodic word initial syllable;

• 4 = end of each sentence.

For example, the sentence O sol brilha forte agora. (The sun shines bright now.) is coded as

| Sentence | O | sol | bri | lha | for | te | a | go | ra | . |
|----------|---|-----|-----|-----|-----|----|---|----|----|---|
| Code     | 2 | 1   | 3   | 0   | 3   | 0  | 2 | 1  | 0  | 4 |
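The two binary rhythmic features map naturally onto the four syllable symbols. The following is a minimal sketch of that mapping (the actual corpus preprocessing in galves2012context is more involved; here the feature annotations for the example sentence are taken as given):

```python
def encode_syllable(stressed: bool, prosodic_initial: bool) -> int:
    """Map the two binary rhythmic features to a symbol in {0, 1, 2, 3}:
    0/1 = non-initial syllable (unstressed/stressed),
    2/3 = prosodic-word-initial syllable (unstressed/stressed)."""
    return 2 * int(prosodic_initial) + int(stressed)

END_OF_SENTENCE = 4  # extra symbol appended after each sentence

# "O sol brilha forte agora." annotated syllable by syllable as
# (stressed?, prosodic-word-initial?):
syllables = [
    (False, True),   # O   -> 2
    (True, False),   # sol -> 1
    (True, True),    # bri -> 3
    (False, False),  # lha -> 0
    (True, True),    # for -> 3
    (False, False),  # te  -> 0
    (False, True),   # a   -> 2
    (True, False),   # go  -> 1
    (False, False),  # ra  -> 0
]
coded = [encode_syllable(s, p) for s, p in syllables] + [END_OF_SENTENCE]
print(coded)  # [2, 1, 3, 0, 3, 0, 2, 1, 0, 4]
```

The output reproduces the coded sequence in the table above.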

The Smallest Maximizer Criterion, proposed by galves2012context to select the best tree for BP and EP, uses the fact that the symbol 4 appears as a renewal state to perform Bootstrap sampling. Moreover, they conclude that “the main difference between the two languages is that whereas in BP both 2 (unstressed boundary of a phonological word) and 3 (stressed boundary of a phonological word) are contexts, in EP only 3 is a context.” These are exactly the type of questions to be addressed by the renewal state detection algorithm.

Due to the encoding used, the grammar of the language, and the general structure of written texts, some transitions are not possible. For example, two end-of-sentence marks (symbol 4) cannot occur consecutively; therefore, a transition from 4 to 4 is not allowed. Furthermore, there is one, and only one, stressed syllable in each prosodic word. Table 3 summarizes the allowed and prohibited one-step transitions.
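A scan for prohibited one-step transitions can be sketched as follows. Table 3 is not reproduced here, so the prohibition set below contains only transitions that follow directly from the constraints stated in the text (no consecutive end-of-sentence marks, and no second stressed syllable inside the same prosodic word); the full table may prohibit more:

```python
# Assumed (partial) prohibition set derived from the stated constraints.
PROHIBITED = {
    (4, 4),  # two end-of-sentence symbols cannot be consecutive
    (1, 1),  # a second stress before a new prosodic word begins
    (3, 1),  # stress on a word-initial syllable, then another stress in the same word
}

def violations(seq):
    """Return the positions where a prohibited one-step transition occurs."""
    return [i for i, pair in enumerate(zip(seq, seq[1:])) if pair in PROHIBITED]

print(violations([2, 1, 3, 0, 3, 0, 2, 1, 0, 4]))  # [] -> the example sentence is valid
print(violations([3, 1, 0, 4, 4]))                 # [0, 3]
```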

These prohibited transitions are incorporated into the model through modifications to the prior distribution, which assign zero probability to some context trees and force the probabilities of prohibited transitions to be zero. The modifications are:

1. If a transition from a symbol a to a symbol b is prohibited and a context s has a as its last symbol, we force p(b | s) = 0 in our prior distribution. The remaining probabilities, associated with allowed transitions, are then a priori distributed as a Dirichlet distribution of lower dimension.

For example, if for a context s the only allowed transitions are to two symbols b1 and b2, we have p(b | s) = 0 for every other symbol b, and the free probabilities (p(b1 | s), p(b2 | s)) are distributed as a 2-dimensional Dirichlet distribution with the corresponding hyper-parameters.

2. If a suffix s includes a prohibited transition, its occurrence count in the sample is zero, as such a sequence cannot appear. As a consequence, these suffixes make no contribution to the marginal likelihood, since the term related to s in the product in (5) cancels.

3. We define T∗5, the space of context trees that have no prohibited transitions in inner nodes. For example, since the transition from 4 to 4 is prohibited, a tree containing a suffix in which the pair 44 appears away from the final node cannot be in T∗5, because the prohibited transition is not in a leaf (final node), whereas a tree in T∗5 can contain a suffix in which 44 occupies the final node.

Note that allowing final nodes to contain a prohibited transition is necessary to keep the consistency of our definition based on full m-ary trees: a full tree contains a given allowed suffix if, and only if, it also contains its sibling suffixes, some of which end in a prohibited transition. This is not a problem because such prohibited suffixes do not contribute to the marginal likelihood.
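Modification 1 above (forcing prohibited transition probabilities to zero and placing a lower-dimensional Dirichlet prior on the rest) can be sketched as follows. This is an illustration of the standard Dirichlet-via-Gamma construction under an assumed prohibition set, not the authors' implementation:

```python
import random

def sample_restricted_dirichlet(last_symbol, alphabet, prohibited, alpha=1.0):
    """Sample transition probabilities p(. | s) for a context s ending in
    last_symbol: prohibited transitions get probability exactly 0, and the
    allowed ones follow a lower-dimensional Dirichlet(alpha, ..., alpha),
    sampled here via normalized Gamma draws."""
    allowed = [b for b in alphabet if (last_symbol, b) not in prohibited]
    draws = {b: random.gammavariate(alpha, 1.0) for b in allowed}
    total = sum(draws.values())
    return {b: (draws[b] / total if b in draws else 0.0) for b in alphabet}

# Example with the one prohibition stated in the text: a transition from
# 4 to 4 is not allowed, so p(4 | context ending in 4) = 0 and the other
# four probabilities sum to one.
random.seed(0)
probs = sample_restricted_dirichlet(4, range(5), {(4, 4)})
print(probs[4])                        # 0.0
print(round(sum(probs.values()), 10))  # 1.0
```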

For each set of sequences in BP and EP, we compute Intrinsic Bayes Factors as evidence for the five hypotheses that 0, 1, 2, 3, and 4 are renewal states. The maximum tree depth was taken large enough to cover all relevant context trees, based on the results from the original paper. For the prior distributions we used the same uniform distributions as in the simulation experiment, but restricted to the trees that do not include prohibited transitions on inner nodes, i.e.,

$h_a(\tau) = \mathbb{1}\left(\tau \in \mathcal{T}_a^{5} \cap \mathcal{T}_*^{5}\right) \quad \text{and} \quad \bar{h}_a(\tau) = \mathbb{1}\left(\tau \in \bar{\mathcal{T}}_a^{5} \cap \mathcal{T}_*^{5}\right).$
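The restricted indicator prior can be sketched as a membership check on a candidate tree. The representation below is our illustrative assumption: a tree is a collection of contexts written as tuples in time order (oldest symbol first), so the transition at the first pair of a context sits at the leaf and is allowed to be prohibited, while any later pair is an inner transition:

```python
def h_a(tree, a, prohibited):
    """Sketch of the indicator prior: 1 if the length-1 context (a,) is in
    the tree (so a is a renewal-state candidate) and no context carries a
    prohibited transition away from its final node; 0 otherwise."""
    if (a,) not in tree:
        return 0
    for s in tree:
        # Every consecutive pair except the leaf one (s[0], s[1]) is inner.
        inner_pairs = zip(s[1:], s[2:])
        if any(pair in prohibited for pair in inner_pairs):
            return 0
    return 1

# A prohibited pair at the final (oldest) node is tolerated...
print(h_a([(4,), (4, 4, 3)], 4, {(4, 4)}))  # 1
# ...but the same pair in an inner position excludes the tree.
print(h_a([(4,), (3, 4, 4)], 4, {(4, 4)}))  # 0
```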

We also set the Dirichlet hyper-parameters to a common value for every context and symbol. The algorithm was run to compute each tree posterior distribution under each hypothesis, using a subset of the sequences (around 5000 symbols in each training sample) for each Partial Bayes Factor, resulting in one tree posterior distribution per training sample under each hypothesis and the corresponding collection of PBFs to average. The empirical distributions of the estimated Partial Bayes Factors after a 10% trimming are shown in Figure 3.

Due to numerical instabilities, especially in the AIBF, caused by discrepancies in the Partial Bayes Factors when particular texts are used as the training sample, we also compute trimmed versions of the AIBF (and GIBF), which consist of computing the arithmetic (and geometric) average after excluding the lowest and highest PBFs, as defined in berger1996intrinsic. In this example, we removed the 2 highest and 2 lowest values for each set of sequences. From the results presented in Table 4, we can see that decisive evidence was obtained for the renewal hypothesis for states 2, 3 and 4 in the BP dataset and for states 3 and 4 in the EP dataset, which is consistent with the results from galves2012context.
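The trimmed averages can be sketched as follows (`trimmed_ibf` is a hypothetical helper name and the PBF values are toy numbers chosen to illustrate why trimming stabilizes the AIBF):

```python
import math

def trimmed_ibf(pbfs, k):
    """Trimmed intrinsic Bayes factors in the spirit of berger1996intrinsic:
    drop the k lowest and k highest Partial Bayes Factors, then average the
    rest arithmetically (AIBF) and geometrically (GIBF)."""
    kept = sorted(pbfs)[k:len(pbfs) - k]
    aibf = sum(kept) / len(kept)
    gibf = math.exp(sum(math.log(b) for b in kept) / len(kept))
    return aibf, gibf

# One wildly discrepant PBF (e.g. from an atypical training text) dominates
# the plain arithmetic mean but not the trimmed one.
pbfs = [0.9, 1.1, 1.2, 1.3, 1.4, 500.0]
aibf, gibf = trimmed_ibf(pbfs, 1)
print(round(aibf, 3), round(gibf, 3))  # 1.25 1.245
```

The geometric average (GIBF) is naturally less sensitive to a single extreme PBF, which is why the instabilities reported above affect mainly the AIBF.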