In several research problems we deal with probabilistic sequences of inputs (e.g., sequence of stimuli) from which an agent generates a corresponding sequence of responses and it is of interest to model or find some kind of relation between them. This is the case, for example, of many experiments related to the study of statistical learning in neuroscience. Statistical learning in this context refers to the ability to extract statistical regularities from the environment over time (Armstrong et al., 2017; Conway, 2020; Schapiro and Turk-Browne, 2015). In this kind of experiments, humans or animals are exposed to sequences of stimuli with some statistical regularity and it is conjectured that the brain is able to retrieve/learn the statistical regularities encoded in the stimuli (von Helmholtz, 1867; Wacongne et al., 2012; Garrido et al., 2013). The study of this conjecture usually involves data analysis of some physiological or behavioral responses recorded from participants during the performance of a suitable task. Statistical learning is also a widely used term in computer science but it is not that meaning we are referring to here.
Motivated by an auditory statistical learning problem, a new class of stochastic process was introduced in Duarte et al. (2019) to model the relation between the probabilistic sequence of inputs and the corresponding sequence of responses, namely sequences of random objects driven by context tree models. A process in this class has two elements: a stochastic source generating the sequence of stimuli and a sequence of responses generated by the agent during the exposure to the stimuli. The source generating the stimuli is assumed to be a context tree model (CTM) taking values in a categorical set (Rissanen, 1983; Bühlmann and Wyner, 1999; Galves and Löcherbach, 2008). In a context tree model, at each step, the occurrence of the next symbol is solely determined by a variable-length sequence of past symbols, called context. Any stationary stochastic chain can be approximate by a context tree model, therefore it constitutes a powerful tool to model the probabilistic structure of sequences of stimuli (Fernández and Galves, 2002; Duarte et al., 2006)
. The model also assumes the existence of a family of probability measures on the set of responses (hereafter called objects) indexed by the contexts characterizing the sequence of stimuli. This family of measures describes the relation between the source and the responses. More precisely, at each step, a new object of the response sequence is chosen according to the probability measure associated to the context ending at that step in the sequence of stimuli.
Given some data, statistical model selection methods can be applied to estimate the set of contexts and, in some cases, the family of probability measures characterizing the distribution of the response sequence. The theoretical framework introduced inDuarte et al. (2019) addressed a model selection algorithm for the particular case in which the objects of the response sequence refer to functions (e.g., electrophysiological data). Nevertheless, the proposed mathematical framework is general enough to model other kind of objects in the response sequences such as responses taking values in a categorical set, in the real line or in . In the same way that context tree models have found applications in several research areas such as neuroscience, genetics (Busch et al., 2009) and linguistics (Galves et al., 2012), this new theoretical framework can find applications in research areas beyond neuroscience. For this reason, its computational implementation is useful.
This paper introduces the SeqROCTM Matlab toolbox, a set of computational tools for working with sequences of random objects driven by context tree models. In addition to the statistical software, this article makes other contributions. We formalize two model selection procedures for sequences of random objects driven by context tree models with categorical responses. We also formalize and define general conditions for the use of the Smaller Maximizer Criteria (SMC) to tune model selection algorithms for both context tree models and sequence of random objects driven by context tree models.
Since the stimuli sequence is generated by context tree model, the SeqROCTM toolbox also implements several model selection algorithms and tuning procedures for context tree model. Up to our knowledge, there exist only an R package implementing model selection for context tree model (Mächler and Bühlmann, 2004).
The paper is organized as follows. Section 2 briefly describes the mathematical concepts and model selection algorithms included in the SeqROCTM toolbox. Section 3 introduces the architecture of the software. Sections 4 and 5 present the main functionalities of the toolbox through illustrative examples motivated by statistical learning problems. Conclusions are given in Section 6.
2 Model selection methods for sequence of random objects driven by context tree models
We begin this section by introducing some notation and formal definitions. Let be a finite set. Any string , is denoted by and its length by . Given two strings and of elements of , denotes the string of length obtained by concatenating and . The string is said to be a suffix of , and denote by , if there exists a string satisfying . When we say that is a proper suffix of and denote .
A context tree is defined as any set satisfying
Suffix Property. No string is a proper suffix of another string .
Irreducibility. No string belonging to can be replaced by a proper suffix without violating the suffix property.
The elements of are called contexts. The height and size of are defined as and , respectively.
In this work we deal with experiments in which an agent is exposed to an stochastic sequence of stimulus , taking values in . A key point is that the stochastic sequence is a context tree model (Rissanen, 1983; Bühlmann and Wyner, 1999). Formally, consider a stationary ergodic process and for any string denote by
We say is a context tree model with parameters if there exist a function such that
for any and any finite sequence such that , it holds that
no proper suffix of satisfies condition 2.
The function is said a context function.
A sequence of responses with values in some measurable space is recorded while the agent is exposed to the stimuli sequence (e.g., neurophysiology responses such as electroencephalographic data, behavioral responses, etc.). The relation between the responses and the stimuli is modeled through a class of stochastic processes called sequence of random objects driven by context tree model (Duarte et al., 2019).
The bivariate stochastic chain taking values in is a sequence of random objects driven by context tree models with parameters , where is a family of probability measures on , if
is a context tree model with parameters ;
conditionally to the sequence ,
are independent random variables and, for any, it holds that
Given some training data and, under the assumption that it was generated by a sequence of random objects driven by context tree models, the statistical problem of interest is to estimate the parameters and characterizing the response sequence.
To formulate the model selection methods we consider two scenarios. A first scenario in which the responses belong to a finite set and a second scenario in which takes values in a functional space. Hereafter we will refer to these two scenarios as categorical and functional case, respectively.
Before introducing the model selection procedures a few more definitions are needed. Given a finite string we denote by the number of occurrences of the string in the sequence , that is
Let be an integer such that , an admissible context tree of maximum height for the sample is any context tree satisfying
if and only if and .
Any string with is a suffix of some or has a suffix .
The set of all admissible context trees of maximal height is denoted by .
Let be a context tree and fix a finite string . A subtree in induced by is defined as the set . The set is called a terminal subtree if for all it holds that for some .
Given two context trees and , we denote by (resp. ) if for any there exists such that (resp. ).
The model selection procedure that will be introduced for the functional case and two out of three procedures developed for the categorical case are inspired by the algorithm Context (Rissanen, 1983). For this reason, we first describe below a general algorithm Context procedure, specifying later the difference on each case. In the sequence, we present a third model selection procedure for the categorical case which is based on BIC.
2.1 General algorithm Context
Given a sample , fix an integer and let be the context tree of maximum size in . We shorten this maximal candidate context tree by successively pruning the terminal subtrees according to some statistical criterion.
For any string such that is a terminal subtree, we decide to prune or not verifying a statistical criterion, say stat(u). If stat(u) is satisfied, we prune the subtree in ,
Otherwise, if stat(u) is not satisfied, we keep in . At each pruning step, we check a string which induces a terminal subtree in and that has not been checked yet. This pruning procedure is repeated until all the existing terminal subtrees have been checked and no more pruning is possible.
A pseudo code for the general algorithm Context is given in Algorithm 1.
The use of the general algorithm Context for the functional and categorical cases differs with respect to stat(u).
In the functional case stat(u) implements a two-side goodness of fit test for functional data (Cuesta-Albertos et al., 2006). For the categorical case, stat(u) implements two different procedures. The first one compares the conditional log-likelihoods between a string and its offspring while the second one compares the empirical distributions between a string and its offspring. The following sections describe the statistical criterion stat(u) implemented in each case.
2.2 Functional case
In this section it is defined the statistic criterion used in the general algorithm Context (Algorithm 1) for the functional case.
Consider the set of real-valued square integrable function in the interval and the Borel algebra on . Moreover, assume that is a sample of a sequence of random objects driven by context tree models with parameters , with a probability measure on For any string with , let be the set of indexes belonging to where the string occurs in the sample , that is
By definition, the set has elements. If , we set for each . Thus, is the subsample of induced by the string .
To each element of the terminal subtree we associate the subsample of induced by it. These sets of functions are used to decide whether the subtree is pruned or not: the stat(u) criterion tests the equality of distribution between the sets of functions associated to the elements of .
To test whether two samples of functions have the same distribution, it is used a projective method proposed in Cuesta-Albertos et al. (2006). In this method, each function of both sets is projected in a gaussian direction chosen at random. This produces two new projected samples of real numbers. Then, the equality of distribution for these new samples is tested using the Kolmogorov-Smirnov test. The results presented in Cuesta-Albertos et al. (2006)
guarantees that, if the two samples of functions have different distributions then the set of random directions where the projected samples have the same distribution has Lebesgue measure equal to zero. This implies that if we reject the null hypothesis of equality of distributions for two sets of projected samples, then we can also reject it for the corresponding functional sets.
Formally, for any string such that is a terminal subtree, we test the null hypothesis
using the test statistic
Here is a realization of a Brownian bridge in the interval and is the Kolmogorov-Smirnov distance between the empirical distributions and of the projections of the samples and onto the direction , respectively.
We reject the null hypothesis when , with and . Observe that for a string with only one pair of offspring (i.e., ) the null hypothesis is tested with significance level (because becames the -percentile of a KS distribution). On the other hand, for a string with more than two offspring, is tested with an unknown significance level upper bounded by . This is guaranteed because in the definition of we applied a terminal subtree correction, by using instead of . In this way, is tested with significance level at most for any string .
The statistic in equation (5) depends on the a random direction . For this reason, to improve the stability of the estimate we test the null hypothesis (4) several times using different Brownian bridges. We conclude that the sets of functions associated to the elements of the terminal subtree do not have the same distribution, and consequently, we do not prune the subtree
, if the number of rejections exceeds a certain threshold. This threshold is derived according to a binomial distribution with probability of success. Formally, fixed a string , a significant level and consider independent Brownian bridges . Compute the test statistics and define
In this case, stat(u) checks whether , where is the -percentile of a binomial distribution with parameters and . The elucidation for this procedure comes from the following proposition.
For any string and integer , consider the random variables defined in (5) with independent Brownian bridges in the interval . For any significance level define with . Under the null hypothesis (4), for any , it holds that,
where denotes the smallest constant such that with a random variable with Binomial distribution of parameters and .
2.3 Categorical case
2.3.1 General algorithm Context + Conditional log-likelihood
Given such that , we denote by the number of occurrences of the string in the sample followed by the occurrence of the symbol in the sample , that is
The maximum conditional likelihood for a sample is given by
and the maximum likelihood estimator of the conditional probability , defined as
Notice that is the portion of the conditional likelihood of the model given the data induced by the context .
Consider the statistic
and fix a threshold . For any string such that is a terminal branch, the function stat(u) verifies whether the following inequality holds
If the inequality is satisfied, we prune the subtree of . Otherwise, we keep in . The inequality (12) is equivalent to check whether
Notice that is the conditional log-likelihood ratio between a model with parameters and a model with parameters , where and they differ only by one set of offspring nodes branching from , that is .
The idea of comparing the maximum likelihood induced by a node with the maximum likelihood induced by its offspring was originally used in the algorithm Context introduced by Rissanen (1983).
Given , set for all , and, for any define
The context tree estimator obtained with this procedure can be defined as
Notice that once we have , for a given , equation (13) implies that for any , .
The consistency of the original algorithm Context was proved in Rissanen (1983). In this setting (i.e., model selection procedure for context tree model) the statistic used to identify the contexts is given by
The proof of consistency depends on through the ergodicity of and its memory relation with , which in our case also satisfies. Therefore, the consistency can be easily adapted for the formulation we are introducing here for sequences of random objects driven by context tree models.
2.3.2 General algorithm Context + Offspring empirical distributions
In this case, the function stat(u) inside the general algorithm Context compares the distance between the empirical distribution associate to the string and the ones associated to its offspring.
Formally, for any finite string , define the statistic
For any finite string such that is a terminal subtree, the function stat(u) verifies whether . If the inequality is satisfied the subtree is pruned, otherwise it is kept.
Given , set for all , and, for any define
The context tree estimator obtained with this procedure can be defined as
where refers to the largest suffix of . Note that, as well as in the classical algorithm Context, implies for all .
In Galves and Leonardi,Florencia (2008) it was proved the strong consistency of the estimator (17) (with instead of ) for the case of unbounded context tree models. The proof relies on a mixture property which is always satisfied in the case of finite context tree models. In particular, is also true for the law of the response sequence since its time memory depends on the time memory of the associated context tree model.
2.3.3 Bayesian Information Criterion (BIC)
This section describes a model selection procedure for the categorical case using the Bayesian Information Criterion. Model selection for context tree models via BIC was first addressed in Csiszár and Talata (2006). We formalize here how the procedure introduced in Csiszár and Talata (2006) is used in our case.
Given a sample and a constant , the BIC estimator for sequence of random objects driven by context tree models is defined as
stands for the degree of freedom of the model. Formally, for any admissible context tree, we define, for each ,
Csiszár and Talata (2006) showed that (18) can be computed efficiently through the following inductive procedure. Starting with , for any , define the quantity and the indicator , and for any define recursively the quantity
and the indicator
The estimate obtained solving (18) can be written as
Observe that, on the contrary to the algorithm Context, for a given , does not imply that for any .
The fact that the recursive procedure above effectively solves the BIC optimization problem and the consistency of the estimator were proved in Csiszár and Talata (2006) for the case of context tree models . In Csiszár and Talata (2006), the analogous to equation (18) is
in which the maximum log-likelihood is computed using the empirical probabilities of the distribution , instead of . Since the distribution affects these proofs only through the ergodic theorem and its memory dependency on , it is straight forward that all the proofs can be adapted to the case of the conditional log-likelihood considered in this section, which depends on instead of .
2.3.4 Tuning the model selection methods
The threshold used in the procedures based on the general algorithm Context and the penalization constant
involved in the model selection procedure based on BIC are hyperparameters whose values must be specified a priori. Small values ofand result in big context trees (big in the sense of its size) and, consequently, overfitted models while high values of these hyperparameters give rise to context trees of small size and underfitted models.
To choose the value of the hyperparameters one can use the Smallest Maximizer Criterion (SMC) (Galves et al., 2012). The SMC procedure was introduced in Galves et al. (2012) to tune the model selection method for context tree models based on BIC. Here we extend this framework to the case of sequence of random objects driven by context tree models for tuning the model selection methods proposed in the categorical case.
The SMC procedure consists of two steps. In the first step a set of candidate models is computed, namely the champion trees. In the second step, an optimal model is chosen within the set of champion trees. The champion trees obtained will depend on the model selection procedure being tuned and may differs from one procedure to another.
In this section, denote either , or .
Step 1. Compute the champion trees. The champion trees constitute a set of estimated context trees obtained by varying the value of the hyperparameter . When we obtain the admissible context tree of maximum size (the more complex model). By successively increasing the value of , we obtain a finite set of context trees totally ordered with respect to the order , say . It is not hard to see that there exists a value such that for any the estimated model is the empty tree, , which refers to the independent model.
A crucial fact for the consistency of SMC is that the context tree generating the sample data belongs eventually almost surely to the set of champion trees as goes to . This is the content of the theorem below and its proof is a co-factor of Theorem 6 in Galves et al. (2012) and an extra argument given in Appendix B.
Assume is a sample of a sequence of random objects driven by context tree model with parameters , with . Consider the map with denoting either , or and denote by
Then is totally ordered with respect to and eventually almost surely as .
It is well known that the bigger the context tree, the higher its sample likelihood. When SMC was introduced for tuning the BIC model selection algorithm for context tree models, Galves et al. (2012) theoretically proved the existence of a change of regime in the rate in which the sample likelihood increases in the set of champion trees (Theorem 7 in Galves et al. (2012)). The authors also showed that such changing point in the likelihood function occurs at the true model generating the data. A consequence of the proof of consistency of SMC is that the change of regime does not depends on the estimation method used to obtain the champion trees, but only on some properties of the set. For this reason we state the next theorem in a slightly more general form that stated in Galves et al. (2012) and in terms of sequences of random objects driven by context tree models.
Assume is a sample of a sequence of random objects driven by a context tree model with parameters , with . Given a set satisfying
is totally ordered with respect to and
eventually almost surely as .
The following holds:
For any with there exists a constant such that
For any with there exists a constant such that
Theorem 1 is a co-factor of Theorem 7 in Galves et al. (2012) and its proof is presented in Appendix C. This theorem provides a criterion to choose the optimal model (and consequently, the optimal value) among the champion trees. That is to say, the model in at which the change of regime occurs. This is the scope of the second step of the SMC.
Step 2. Identify the optimal tree. To select an optimal tree we use the following consequence of Theorem 1. For any ,
This suggest that should be the smallest context tree such that the rescaled difference between the conditional log-likelihood of and (its successor in the order ) decreases as
increases. This is done by comparing average bootstrapped conditional log-likelihood using a t-test, as follows.
Fix two different sample sizes . Obtain independent bootstrap resamples of , of size , say
Similarly, let be another set of independent bootstrap samples of size constructed by truncating the sequences to size .
For each and its successor () compute the rescaled log-likelihood differences
for and .
Apply a one-side t-test to compare the mean of the samples
Select as optimal tree the smallest champion tree such that the test rejects the equality of the means in favor of the alternative hypothesis .
Step 2 involves the computation of bootstrap resamples of a sequence of random objects driven by context tree models. The toolbox implements different bootstrap strategies for that. Before introducing them, we describe the bootstrap schemes implemented to resample a context tree model .
(Parametric bootstrap) The bootstrap samples are obtained by drawing from a parameterized distribution (Bühlmann, 2002) in the following way:
Choose a hyperparameter value and estimate the model using the data .
Generate the bootstrap samples by simulating from the approximated distribution ,
(Block bootstrap) Split the sample into non-overlapping blocks (see Figure 1a). These blocks are build by using a renewal context of to split the sequence. A renewal context is a string from which the next symbols can be generated without knowing further information from the past. A resample is obtained by repeatedly sampling uniformly a block from the set of blocks and concatenating them. In the toolbox, the user can specify the renewal context, or it can be computed from the estimated model .
A bootstrap resampling of the bivariate chain can be obtained with two different procedures:
Choose a hyperparameter value and estimate from .
Obtain a bootstrap resampling of the sequence using one of the bootstrap strategies for context tree models described above. The user can also choose not to resample the sequence .
Generate a sequence using the distribution .
(Block bootstrap) Split the sample into non-overlapping blocks using a renewal context (see Figure 1b). In this case, a renewal context is a string from such that from it is possible to generate both the next symbols of sequence and its associates responses , without further information from the past. The bootstrap samples can be obtain by repeatedly sampling uniformly from the set of blocks and concatenating them.
Bühlmann (2000) proposed a procedure based on Risk functions for tuning the algorithm Context. The SeqROCTM toolbox implements this tuning procedure for the particular case of the Final prediction error risk. This procedure is available in the toolbox for tuning the model selection procedures for context tree models and the model selection procedures for sequence of random objects driven by context tree models (for the categorical case).
3 Software Architecture
The SeqROCTM toolbox have been designed following a modular structure. The toolbox consists of functions written in Matlab that can be grouped in four modules regarding their functionalities (see Figure 2). This architecture makes the software easy to update, either by adding new functionalities or by improving the existing ones.
The module Data Simulation includes routines to simulate sequences of inputs (i.e., context tree models), to simulate the response sequence of a sequence of random objects driven by context tree models and to simulate the bivariate sequence. The module Visualization implements the algorithm described in Mill (2020) to graphically show a tree structure. This module contains also a routine to print the context tree in the console. The Tools module implements several functions that can assist the researcher during the experimental design and data analysis. Some of those functions are also invoked by functions in other modules. Some demos illustrating how to use the toolbox are included in the Demo module.
The main functions of the toolbox are in the Model Selection module. This module contains all the model selection procedures and tuning algorithms introduced in Section 2. Figure 3 presents a close-up of this module.
The novelty implemented in this toolbox it is illustrated in Figure 3 by the branch growing from the SeqROCTM node, that is, the mathematical framework to make inference in the class of sequences of random objects driven by context tree models. Nevertheless, due to the close relation to model selection in context tree models, we end up including in the toolbox several existing model selection algorithms for context tree models.
Up to our knowledge, there is only an R-package implementing model selection in context tree models introduced by Mächler and Bühlmann (2004). This package implements the algorithm Context and a tuning procedure for the algorithm Context based on Risk functions. Our toolbox also works as an alternative tool for this purpose.
All the implementations are self-contained. The only external function the toolbox uses is the function permn (van der Geest, 2019).
4 Illustrative example: The Goalkeeper game
This section presents the major functionalities of the SeqROCTM toolbox through an illustrative example.
The Goalkeeper game is a video game developed by the NeuroMat team (https://game.numec.prp.usp.br/) as a tool to investigate the conjecture that the brain does statistical model selection (Castro, 2006). During the game, the kicker can throw the ball in three directions: left, center or right. The agent, playing the role of the Goalkeeper in a soccer penalty shootout, has to defend as much penalties as possible by predicting, at each step, in which direction the kicker will throw the ball. The kicker’s choices are randomly generated according to a context tree model. The Goalkeeper game has been used in the experimental protocol of some neurobiological experiments (Stern et al., 2020).
Here, for simplicity, instead of collecting some data using the game, we will simulate different participant strategies to generate the responses. We show how the toolbox can be used to assess the participant strategy from the data. The aim is that the estimated strategy matches the one used to generate the responses matches.
To generate the sequence of kick directions, we define a context tree model that generates sequences according to the following rule: after a shot to the left the kicker always send the ball to the center, after a shot to the right he always send the ball to the left, but after a shot to the center, if one step back he sent the ball to the center, then he send the ball to the left. Otherwise, if one step back he sent the ball to the left, then he send the ball to the right with probability 0.8 and to the center with probability 0.2.
Formally, the directions left, center and right
are represented by the symbols 0, 1 and 2, respectively. These symbols will conform the alphabet. When using the toolbox, an alphabet is always represented by a vector of consecutive positive integers from zero to the number of elements minus one. A context tree is defined through a cell array of vectors, each vector representing a context (i.e., a leaf of the tree). The distributions associated to the contexts are specified by a matrix with the number of rows equals the number of contexts and the number of columns equals the number of symbols in the alphabet. In this way, the-th row contains the distribution associated to the -th context in the cell array defining the context tree. The following source code defines these variables according to the example.