Consider a stationary, finite-valued stochastic process with probability law $\mu$. According to the ergodic theorem, an observer of this process can reconstruct the ‘true’ ergodic component of the process from observing a single typical infinite realization. Decision problems, on the other hand, are often concerned with making predictions based on finite past observations. In such problems, the primary object of interest is the predictive distribution over the outcome of the process on a given day, conditional on the finite history of outcomes from previous days.
This paper relates these two perspectives on predictions and decisions. We consider the long-run properties of an observer’s predictive distribution over next period’s outcome as observations accumulate. We show that the predictive distribution becomes arbitrarily close to the predictive distribution conditioned on knowledge of the true ergodic component, in most periods almost surely. Thus, as data accumulates, an observer’s predictive distributions based on finite history become the ‘correct’ predictions, in the sense of becoming nearly as good as what he would have predicted given knowledge of the objective empirical frequencies of the process. We demonstrate that the various qualifications we impose cannot be dropped.
Our results connect several literatures on learning and predictions in stochastic environments. First, there is the literature on the strong merging of opinions, pioneered by Blackwell and Dubins [2]. (Footnote 1: Kalai and Lehrer [7] apply this concept to learning in games.) More directly relevant to our purpose are the weaker notions of merging introduced by Kalai and Lehrer [9] and Lehrer and Smorodinsky [11], which focus on closeness of near-horizon predictive distributions. While strong merging obtains only under stringent assumptions, weak merging can be more easily satisfied. In our setting, for example, the posteriors may fail to strongly merge with the true parameter, no matter how much data accumulates. This strong notion of merging is unnecessary in contexts where decision makers discount the future or care only about a fixed number of future periods. Weak merging, to which our results apply, is usually sufficient.
Another line of enquiry focuses on representations of the form $\mu = \int \mu_\theta\, \lambda(d\theta)$, where a probability measure $\mu$ (the law of the stochastic process) is expressed as a convex combination of distributions $\mu_\theta$ that may be viewed as especially “simple,” or “elementary.” Such representations, also called decompositions, are useful in models of learning where the set of parameters may be viewed as the main object of learning. Two seminal theorems are de Finetti’s representation of exchangeable distributions and the ergodic decomposition theorem for stationary processes. Exchangeability rules out many interesting patterns of inter-temporal correlation, so it is natural to consider the larger class of stationary distributions. For this class, the canonical decomposition is in terms of the ergodic distributions. This is the finest decomposition possible using parameters that are themselves stationary. Our main theorem states that a Bayesian decision maker’s predictions, based on finite histories, become arbitrarily close to those he would have made given knowledge of the true ergodic component.
Our result should also be contrasted with Doob’s consistency theorem, which states that Bayesian posteriors weakly converge to the true parameter. When the focus is the quality of decisions, what matters is not the agent’s belief about the true parameter but the quality of his predictions. Although the two concepts are related, they are not the same. The difference is seen in the following example from Jackson, Kalai and Smorodinsky [6, Example 5]: Assume that the outcomes Heads and Tails are generated by tossing a fair coin. If we take the set of all Dirac measures on infinite sequences of Heads-Tails outcomes as “parameters”, then the posterior about the parameter converges weakly to a belief that is concentrated on the true realization. On the other hand, the agent’s predictions about next period’s outcome are constant and never approach the predictions given the true “parameter.” This example highlights that convergence of posterior beliefs to the true parameters may have little relevance to an agent’s predictions and behavior.
Every process can be represented in an infinite number of ways, many of which, like the decomposition of the coin toss process above, are not very sensible. Jackson, Kalai and Smorodinsky [6] study the question of what makes a particular decomposition of a stochastic process sensible. One requirement is for the process to be learnable, in the sense that an agent’s predictions about near-horizon events become close to what he would have predicted had he known the true parameter. Given the close connection between ergodic distributions and long-run frequencies, the most natural decomposition of a stationary process is the one in which the parameters index the ergodic distributions. We show that their results do not apply to the class of stationary processes and their canonical ergodic decompositions; the ergodic decomposition is, however, learnable in a weaker yet meaningful sense, as described below.
A third related literature, which traces to Cover [3], is non-Bayesian estimation of stationary processes. See Morvai and Weiss [13] and the references therein. This literature looks for an algorithm that makes near-horizon predictions that are accurate for every stationary process. Our proofs of Theorem 3.1 and Example 3.3 rely on techniques that were developed in this literature. There is, however, a major difference between that literature and our work: we are interested in a specific algorithm, namely Bayesian updating. Our agent’s predictions and behavior are derived from this updating process. We show how to apply the mathematical apparatus developed for non-Bayesian estimation in our Bayesian setup.
2. Formal model
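An agent (a decision maker, a player, or a statistician) observes a stochastic process $(\zeta_0, \zeta_1, \zeta_2, \dots)$ that takes values in a finite set of outcomes $A$. Time is indexed by $n$ and the agent starts observing the process at $n = 0$. Let $\Omega = A^{\mathbb N}$ be the space of realizations of the process, with generic element denoted $\omega = (a_0, a_1, \dots)$. Endow $\Omega$ with the product topology and the induced Borel structure $\mathcal F$. Let $\Delta(\Omega)$ be the set of probability distributions over $\Omega$. The law of the process is an element $\mu$ of $\Delta(\Omega)$. A standard way to represent uncertainty about the process is in terms of an index set of “parameters”:
Definition 2.1. Let $\mu \in \Delta(\Omega)$. A decomposition of $\mu$ is a quadruple $(\Theta, \mathcal B, \lambda, (\mu_\theta)_{\theta \in \Theta})$ where $(\Theta, \mathcal B, \lambda)$ is a standard probability space of parameters and $\mu_\theta \in \Delta(\Omega)$ for every $\theta \in \Theta$, such that the map $\theta \mapsto \mu_\theta(E)$ is $\mathcal B$-measurable and
\[ \mu(E) = \int_\Theta \mu_\theta(E)\, \lambda(d\theta) \qquad (1) \]
for every $E \in \mathcal F$.
A decomposition captures a certain way in which a Bayesian agent arranges his beliefs: the agent views the process as a two-stage randomization. First a parameter $\theta$ is chosen according to $\lambda$, and then the outcomes are generated according to $\mu_\theta$. Beliefs can be represented in many ways. The two extreme decompositions are: (1) the trivial decomposition, with $\Theta = \{\bar\theta\}$, $\mathcal B$ trivial, and $\mu_{\bar\theta} = \mu$; and (2) the Dirac decomposition, with $\Theta = \Omega$, $\mathcal B = \mathcal F$, $\lambda = \mu$, and $\mu_\omega = \delta_\omega$. A “parameter” in this case is just a Dirac measure $\delta_\omega$ that assigns probability 1 to the realization $\omega$.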
We are interested in decompositions that identify “useful” patterns shared by many realizations. These patterns capture our intuition of fundamental properties of a process. The two extreme cases are usually unsatisfactory. In the Dirac decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter, which therefore cannot discriminate between different interesting patterns.
Stationary beliefs admit a well-known decomposition with natural properties. Recall that the set of stationary measures over $\Omega$ is convex and compact in the weak-* topology. Its extreme points are called ergodic beliefs. We denote the set of ergodic beliefs by $\mathcal E$. Every stationary belief $\mu$ admits a unique decomposition in which the parameter set is the set of ergodic beliefs: $\mu = \int_{\mathcal E} \nu\, \lambda(d\nu)$ for some belief $\lambda \in \Delta(\mathcal E)$. We call this decomposition the ergodic decomposition.
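According to the ergodic theorem, for every stationary belief $\mu$ and every block $(\bar a_0, \dots, \bar a_{k-1}) \in A^k$, the limit frequency
\[ \nu(\omega; \bar a_0, \dots, \bar a_{k-1}) = \lim_{n \to \infty} \frac{1}{n}\, \#\{0 \le t < n : a_t = \bar a_0, \dots, a_{t+k-1} = \bar a_{k-1}\} \]
exists for $\mu$-almost every realization $\omega = (a_0, a_1, \dots)$. When $\mu$ is ergodic this limit equals the probability $\mu(\zeta_0 = \bar a_0, \dots, \zeta_{k-1} = \bar a_{k-1})$. Thus, for ergodic processes, the probability of every block equals its (objective) empirical frequency.
The ergodic decomposition theorem states that for $\mu$-almost every $\omega$, the function $\nu(\omega; \cdot)$ defined over blocks can be extended to a stationary measure over $\Omega$ which is also ergodic. Moreover, $\mu = \int \nu(\omega; \cdot)\, \mu(d\omega)$, so that the function $\omega \mapsto \nu(\omega; \cdot)$ recovers the ergodic parameter from the realization of the process. Thus, the parameters in the ergodic decomposition represent the empirical distribution of finite sequences of outcomes along the realization of the stationary process. These parameters capture our intuition of fundamentals of the process.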
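The link between block probabilities and empirical frequencies can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes a symmetric two-state Markov chain (a stationary irreducible chain, hence ergodic), and the names `simulate_markov`, `block_frequency` and the value `stay=0.9` are illustrative choices.

```python
import random

def simulate_markov(n, stay=0.9, seed=0):
    """Simulate n outcomes of a symmetric two-state Markov chain started from
    its stationary distribution (uniform on {0, 1})."""
    rng = random.Random(seed)
    state = rng.randrange(2)
    path = []
    for _ in range(n):
        path.append(state)
        if rng.random() >= stay:  # switch state with probability 1 - stay
            state = 1 - state
    return path

def block_frequency(path, block):
    """Empirical frequency with which `block` occurs along `path`."""
    k = len(block)
    positions = len(path) - k + 1
    hits = sum(1 for t in range(positions) if tuple(path[t:t + k]) == tuple(block))
    return hits / positions

path = simulate_markov(200_000)
freq = block_frequency(path, (0, 1))
# Stationary probability of state 0 times the probability of switching.
exact = 0.5 * (1 - 0.9)
```

Along a long simulated realization, `freq` should be close to `exact`, which is the sense in which a single typical realization reveals the block probabilities of an ergodic process.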
A special case of the ergodic decomposition is the decomposition of an exchangeable distribution via i.i.d. distributions. For future reference, consider the following example:
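Example 2.2. The set of outcomes is $A = \{B, G\}$ and the agent’s belief $\mu$ is given by
\[ \mu(\zeta_0 = a_0, \dots, \zeta_{n-1} = a_{n-1}) = \frac{1}{(n+1)\binom{n}{d}} \]
for every $n$ and $a_0, \dots, a_{n-1} \in A$, where $d = \#\{0 \le k < n : a_k = G\}$. Thus, the agent believes that if he observes the process for $n$ consecutive periods then the number of good periods (periods with outcome $G$) is distributed uniformly in $\{0, 1, \dots, n\}$ and all configurations with $d$ good outcomes are equally likely.
De Finetti’s decomposition is given by $(\Theta, \mathcal B, \lambda, (\mu_\theta)_{\theta \in \Theta})$ where $\Theta = [0,1]$ is equipped with the standard Borel structure $\mathcal B$ and Lebesgue measure $\lambda$, and, for $\theta \in [0,1]$, $\mu_\theta$ is the distribution of i.i.d. coin tosses with probability of success $\theta$:
\[ \mu_\theta(\zeta_0 = a_0, \dots, \zeta_{n-1} = a_{n-1}) = \theta^d (1-\theta)^{n-d}. \]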
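As a sanity check on this example, the sketch below verifies with exact rational arithmetic that the exchangeable belief described in words above (a uniform number of good periods, with all arrangements equally likely) coincides with the uniform mixture of i.i.d. coins. The function names are hypothetical; the closed form of the Beta integral, $\int_0^1 \theta^d (1-\theta)^{n-d}\, d\theta = d!\,(n-d)!/(n+1)!$, is standard.

```python
from fractions import Fraction
from math import comb, factorial

def exchangeable_prob(n, d):
    """Probability of one specific length-n sequence with d good outcomes:
    d is uniform on {0, ..., n} and all arrangements are equally likely."""
    return Fraction(1, (n + 1) * comb(n, d))

def mixture_prob(n, d):
    """Integral of theta^d (1 - theta)^(n - d) over [0, 1] (a Beta integral),
    i.e. the same probability under a uniform prior over i.i.d. coins."""
    return Fraction(factorial(d) * factorial(n - d), factorial(n + 1))

# The two descriptions agree for every short history length checked here.
for n in range(1, 8):
    for d in range(n + 1):
        assert exchangeable_prob(n, d) == mixture_prob(n, d)
```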
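For every $\mu \in \Delta(\Omega)$ and sequence $(a_0, \dots, a_{n-1}) \in A^n$ with positive $\mu$-probability, the $n$-period predictive distribution is the element $\mu(\,\cdot \mid a_0, \dots, a_{n-1}) \in \Delta(A)$ representing the agent’s prediction about next period’s outcome given a prior $\mu$ and after observing the first $n$ outcomes of the process. Predictive distributions in this paper will always refer to one-step-ahead predictions. This is for expository simplicity; our analysis covers any finite horizon.
Kalai and Lehrer [9], and Kalai, Lehrer and Smorodinsky [8] introduced the following notions of merging. Note that in our setup, where the set of outcomes is the same in every period, this definition of merging is the same as ‘weak star merging’ in D’Aristotile, Diaconis and Freedman [4].
Definition 2.3. Let $\mu, \tilde\mu \in \Delta(\Omega)$. Then the belief $\tilde\mu$ merges to $\mu$ if
\[ \lim_{n \to \infty} \bigl\| \tilde\mu(\,\cdot \mid a_0, \dots, a_{n-1}) - \mu(\,\cdot \mid a_0, \dots, a_{n-1}) \bigr\| = 0 \]
for $\mu$-almost every realization $\omega = (a_0, a_1, \dots)$.
The belief $\tilde\mu$ weakly merges to $\mu$ if
\[ \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \bigl\| \tilde\mu(\,\cdot \mid a_0, \dots, a_{n-1}) - \mu(\,\cdot \mid a_0, \dots, a_{n-1}) \bigr\| = 0 \]
for $\mu$-almost every realization $\omega = (a_0, a_1, \dots)$.
Here and later, for every pair $p, q \in \Delta(A)$ we let $\|p - q\| = \max_{a \in A} |p[a] - q[a]|$. These definitions were inspired by Blackwell and Dubins’ idea of strong merging, which requires that the prediction of $\tilde\mu$ be similar to the prediction of $\mu$ not just for the next period but for the infinite horizon.
Definition 2.4. A decomposition $(\Theta, \mathcal B, \lambda, (\mu_\theta)_{\theta \in \Theta})$ of $\mu$ is learnable if $\mu$ merges with $\mu_\theta$ for $\lambda$-almost every $\theta$. The decomposition is weakly learnable if $\mu$ weakly merges with $\mu_\theta$ for $\lambda$-almost every $\theta$.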
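The gap between merging and weak merging is the gap between ordinary convergence and convergence on average. The sketch below illustrates this with a hypothetical sequence of prediction distances that equals 0.5 at periods that are powers of two and 0 otherwise. The sequence is invented for illustration and is not derived from any particular process, and the use of the maximum-difference distance is an assumption of the sketch.

```python
def sup_distance(p, q):
    """Maximal absolute difference between two distributions given as dicts
    over the same finite outcome set."""
    return max(abs(p[a] - q[a]) for a in p)

def cesaro_means(distances):
    """Running averages (1/N) * (sum of the first N distances)."""
    total, means = 0.0, []
    for i, dist in enumerate(distances, start=1):
        total += dist
        means.append(total / i)
    return means

# Hypothetical distance sequence: predictions disagree (distance 0.5)
# exactly at periods that are powers of two, and agree otherwise.
N = 2 ** 16
distances = [0.5 if n & (n - 1) == 0 else 0.0 for n in range(1, N + 1)]
means = cesaro_means(distances)
```

Here the distances never converge to zero (the last entry is still 0.5), so the merging condition fails, while the running averages in `means` shrink toward zero, which is what weak merging asks for.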
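As an example of a learnable decomposition, consider the Bayesian agent of Example 2.2. In this case
\[ \mu(G \mid a_0, \dots, a_{n-1}) = \frac{d+1}{n+2}, \]
where $d$ is the number of good outcomes among $a_0, \dots, a_{n-1}$, and the strong law of large numbers implies that for every parameter $\theta \in [0,1]$ this expression converges $\mu_\theta$-almost surely to $\theta = \mu_\theta(G \mid a_0, \dots, a_{n-1})$. Therefore $\mu$ merges with $\mu_\theta$ for every $\theta$, so De Finetti’s decomposition is learnable (and, a fortiori, weakly learnable). This is a rare case in which the predictions $\mu(\,\cdot \mid a_0, \dots, a_{n-1})$ and $\mu_\theta(\,\cdot \mid a_0, \dots, a_{n-1})$ can be calculated explicitly. In general merging and weak merging are difficult to establish, because the Bayesian prediction about the next period is a complicated expression which potentially depends on the entire observed past.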
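A short simulation makes this convergence concrete. It is a sketch under stated assumptions: outcomes are i.i.d. with success probability `theta`, the prior over the parameter is uniform so that the one-step prediction is Laplace's rule of succession $(d+1)/(n+2)$, and the function names, seed and horizon are illustrative.

```python
import random

def predictive_good(d, n):
    """One-step-ahead probability of a good outcome under a uniform prior,
    after d good outcomes in n periods (Laplace's rule of succession)."""
    return (d + 1) / (n + 2)

def prediction_errors(theta, periods, seed=0):
    """Distance between the Bayesian prediction and the prediction theta of an
    agent who knows the parameter, along one simulated i.i.d. realization."""
    rng = random.Random(seed)
    d, errors = 0, []
    for n in range(periods):
        errors.append(abs(predictive_good(d, n) - theta))
        if rng.random() < theta:  # a good outcome occurs with probability theta
            d += 1
    return errors

errors = prediction_errors(theta=0.3, periods=20_000)
```

With two outcomes, the absolute difference in the probability of a good outcome equals the maximal difference between the two predictive distributions, so `errors` tracks the distance that the definition of merging requires to vanish.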
2.3. Motivation for Weak Merging
In applications, $\mu$ represents the true process generating observations, and $\tilde\mu$ is a Bayesian agent’s belief. To say that $\tilde\mu$ weakly merges with $\mu$ means that the agent’s next-period predictions are accurate except at rare times.
To connect this concept with statistical decision problems, suppose that in every period, before the outcome is realized, the agent has to take some decision from a finite set . The agent’s payoff is represented by the payoff function . A strategy is given by , with denoting the action taken given the past realized outcomes. Let
be the expected average payoff in the first periods. Fix . A strategy is -optimal for periods under if for every strategy . Of course the optimal strategy depends on the agent’s belief . The following proposition, which is immediate from the definition of weak merging, says that an agent who maximizes according to a belief that weakly merges with the truth will play -optimal strategies against the truth if he is sufficiently patient. By ‘sufficiently patient’ we mean that the horizon is large. A similar result applies if the agent aggregates per-period payoffs using a discount factor, in which case ‘sufficiently patient’ means that the discount factor is close to 1.
Let be such that weakly merges with . For every there exists such that for every , in every decision problem, every -optimal strategy for periods under is -optimal for periods under .
Kalai, Lehrer and Smorodinsky [8] provide a motivation for weak merging in terms of the properties of calibration tests. The idea of calibration originated with Dawid [5]. A calibration test of a forecast compares the predicted frequency of events to their realized empirical frequencies. Kalai et al. showed that one belief weakly merges with another if and only if forecasts made by the former pass all calibration tests of a certain type when the outcomes are generated according to the latter.
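To fix ideas about what a calibration comparison looks like, the following sketch groups periods by the forecast issued and compares the average forecast in each group with the realized frequency. This is a generic binned calibration check, not the specific class of tests used by Kalai et al.; the function `calibration_table`, the number of bins, and the simulated Markov chain are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def calibration_table(forecasts, outcomes, bins=10):
    """Group periods by forecast bin and return, for each bin, the average
    forecast, the realized frequency of outcome 1, and the number of periods."""
    cells = defaultdict(lambda: [0.0, 0, 0])
    for p, x in zip(forecasts, outcomes):
        cell = cells[min(int(p * bins), bins - 1)]
        cell[0] += p   # sum of forecasts in this bin
        cell[1] += x   # number of periods in which outcome 1 occurred
        cell[2] += 1   # number of periods in this bin
    return {b: (s / c, hits / c, c) for b, (s, hits, c) in sorted(cells.items())}

# Hypothetical data: a symmetric two-state Markov chain, forecast by its true
# one-step conditional probabilities.
rng = random.Random(0)
stay, state = 0.9, 1
forecasts, outcomes = [], []
for _ in range(100_000):
    forecasts.append(stay if state == 1 else 1 - stay)  # P(next outcome = 1)
    if rng.random() >= stay:
        state = 1 - state
    outcomes.append(state)

table = calibration_table(forecasts, outcomes)
```

Because the forecasts here are the true conditional probabilities, the average forecast and the realized frequency should nearly coincide in every bin of `table`.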
Finally, Lehrer and Smorodinsky [12] provide a characterization of weak merging in terms of the relative entropy between the two beliefs. (Footnote 3: However, we do not know whether their condition can be used to prove our theorem without repeating the whole argument.) No similar characterization is known for merging.
2.4. Merging and the Consistency of Bayesian Estimators
The idea of learning captured by Definition 2.4 concerns the quality of predictions made about near-horizon events. Another, perhaps more common, way to think about Bayesian inference is in terms of the consistency of the Bayesian estimator. Consistency can be thought of as concerning learning the parameter itself. Recall that the Bayesian estimator of the parameter is the agent’s conditional belief over the parameter set after observing the outcomes of the process. It is well known that under any ‘reasonable’ decomposition, the Bayesian estimator is consistent, i.e., the estimator weakly converges to the Dirac measure over the true parameter as data accumulates. (Footnote 4: The argument traces back to Doob. See, for example, Weizsäcker [16] and the references therein. It holds whenever the decomposition has the property that the realization of the process determines the parameter.) However, consistency of the estimator does not imply that the agent can use what he has learned to make predictions about future outcomes. For example, consider the Dirac decomposition of the process of fair coin tosses. Suppose the true parameter is $\delta_\omega$ for some $\omega \in \Omega$. After observing the first $n$ outcomes of the process the agent’s belief about the parameter is uniform over all $\delta_{\omega'}$ such that $\omega'$ agrees with $\omega$ on the first $n$ coordinates. While this belief indeed converges to $\delta_\omega$, the agent does not gain any new insight about the future of the process from learning the parameter. This decomposition is therefore not learnable.
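The coin-toss example can be traced by brute force on a finite horizon. The sketch below is an illustration only: it truncates realizations to `T = 10` periods (an assumption, since the text works with infinite sequences), treats each full sequence as a "parameter", and uses hypothetical helper names.

```python
from itertools import product

T = 10  # finite horizon standing in for the infinite sequence (an assumption)
realizations = list(product("HT", repeat=T))  # each one is a "parameter"
truth = realizations[123]  # an arbitrary true realization

def posterior_support(history):
    """Parameters consistent with the observed history; the posterior is
    uniform over them because every sequence has prior probability 2**-T."""
    n = len(history)
    return [w for w in realizations if w[:n] == history]

def predictive_heads(history):
    """Bayesian probability that the next outcome is Heads."""
    support = posterior_support(history)
    n = len(history)
    return sum(1 for w in support if w[n] == "H") / len(support)

support_sizes = [len(posterior_support(truth[:n])) for n in range(T)]
predictions = [predictive_heads(truth[:n]) for n in range(T)]
```

The posterior support halves with every observation, so the belief about the parameter concentrates on the truth, yet every entry of `predictions` stays at one half, whereas an agent who knew the realization would predict the next outcome with certainty.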
3. Main Theorem
We are now in a position to state our main theorem.
Theorem 3.1. The ergodic decomposition of every stationary stochastic process is weakly learnable.
To see the implications of our theorem, consider the following hidden Markov process:
Example 3.2. An agent believes that the state of the economy in every period is a noisy signal of an underlying “hidden” state that changes according to a Markov chain with memory 1. Formally, let be the set of outcomes, the set of hidden (unobserved) states, and a -valued stationary Markov process with transition matrix given by
where . Thus, if the hidden state in period was then at period the hidden state remains with probability and changes with probability . The observed state of period will then be with probability and is different from with probability . Let be the distribution of . Then is a stationary process which is not Markov of any order. If the agent is uncertain about then his belief about the outcome process is again stationary, and can be represented by some prior over the parameter set . This decomposition of will be the ergodic decomposition.
The consistency of the Bayesian estimator implies that the conditional belief over the parameter converges almost surely in the weak-* topology to the belief concentrated on the true parameter. However, because next-period predictions involve complicated expressions that depend on the entire history of the process, it is not clear whether these predictions merge with the truth. It follows from our theorem that they weakly merge.
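The objects in this discussion can be computed explicitly in a toy version of the hidden Markov example. The sketch below is heavily simplified and all of its specifics are assumptions: hidden and observed states are coded as 0/1, `stay` is the probability that the hidden state persists, `acc` is the probability that the observation equals the hidden state, and the prior is uniform over a four-point grid of parameter pairs rather than a continuum. With a finite prior that puts positive weight on the truth, merging is much easier to obtain than in the general setting, so the code only illustrates how the two predictions are formed and compared.

```python
import random

def filter_step(belief, outcome, stay, acc):
    """Update P(hidden state = 1) after seeing `outcome`, then move the hidden
    chain one period forward."""
    like1 = acc if outcome == 1 else 1 - acc
    like0 = 1 - acc if outcome == 1 else acc
    post = belief * like1 / (belief * like1 + (1 - belief) * like0)
    return post * stay + (1 - post) * (1 - stay)

def predict_one(belief, acc):
    """Probability that the next observed outcome is 1."""
    return belief * acc + (1 - belief) * (1 - acc)

def run(true_param, grid, periods, seed=0):
    """Distances between the Bayesian mixture prediction and the prediction of
    an agent who knows the true parameter, along one simulated realization."""
    rng = random.Random(seed)
    stay, acc = true_param
    hidden = rng.randrange(2)
    beliefs = {param: 0.5 for param in grid}            # stationary start
    weights = {param: 1 / len(grid) for param in grid}  # uniform prior
    distances = []
    for _ in range(periods):
        preds = {param: predict_one(beliefs[param], param[1]) for param in grid}
        mixture = sum(weights[param] * preds[param] for param in grid)
        distances.append(abs(mixture - preds[true_param]))
        # Nature draws the outcome, then the hidden state moves.
        outcome = hidden if rng.random() < acc else 1 - hidden
        if rng.random() >= stay:
            hidden = 1 - hidden
        # Bayes update of the weights and of each parameter's filter.
        for param in grid:
            likelihood = preds[param] if outcome == 1 else 1 - preds[param]
            weights[param] *= likelihood
            beliefs[param] = filter_step(beliefs[param], outcome, *param)
        total = sum(weights.values())
        weights = {param: w / total for param, w in weights.items()}
    return distances

grid = [(0.9, 0.9), (0.6, 0.9), (0.9, 0.6), (0.6, 0.6)]
distances = run(true_param=(0.9, 0.9), grid=grid, periods=2000)
early = sum(distances[:50]) / 50
late = sum(distances[-500:]) / 500
```

Comparing `early` with `late` shows the average prediction distance shrinking as observations accumulate, even though each individual prediction depends on the whole history through the filter.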
Consider now the general case. If the agent knew the fundamental $\nu$, then at period $n$, after observing the partial history $(a_0, \dots, a_{n-1})$, his predictive probability that the next period outcome is $a$ would have been
\[ \nu(a \mid a_0, \dots, a_{n-1}) = \frac{\nu(a_0, \dots, a_{n-1}, a)}{\nu(a_0, \dots, a_{n-1})}, \qquad (3) \]
where $\nu$ applied to a block denotes the probability of that block under $\nu$. Again consistency of the Bayesian estimator implies that, given uncertainty about the fundamental, the agent’s assessment of $\nu(\bar a_0, \dots, \bar a_{k-1})$ becomes asymptotically accurate for every block $(\bar a_0, \dots, \bar a_{k-1})$. However, when the agent has to compute the next-period posterior probability (3), he has had only one observation of a block of size $n$ and no observation of the block of size $n+1$, so at that stage his assessment of the probabilities that appear in (3) may be completely wrong. Our theorem says that the agent would still weakly learn to make these predictions correctly.
Theorem 3.1 states that the agent will make predictions about near-horizon events as if he knew the fundamental of the process. Note, however, that it is not possible to ensure that the agent will learn to predict long-run events correctly, no matter how much data accumulates. For example, consider an agent who faces a sequence of i.i.d. coin tosses with parameter $\theta$ representing the probability of Heads. Suppose this agent has a uniform prior over $[0,1]$. This agent will eventually learn to predict near-horizon outcomes as if he knew the true parameter $\theta$, but he will continue to assign probability 0 to the event that the long-run frequency is $\theta$. In economic models, discounting implies that only near-horizon events matter.
We end this section with an example showing that in Theorem 3.1 weak learnability cannot be replaced by learnability. The example is a modification of an example given by Ryabko [14] for the forward prediction problem in a non-Bayesian setup.
Example 3.3. Every period there is a probability for eruption of war. If no war erupts then the outcome is either bad economy or good economy and is a function of the number of peaceful periods since the last war. The function from the number of peaceful periods to outcome is an unknown parameter of the process, and the agent has a uniform prior over this parameter.
Formally, let be the set of outcomes. We define through its ergodic decomposition. Let be the set of parameters with the standard Borel structure
and the uniform distribution. Thus, a parameter is a function
. We can think about this belief as a hidden Markov model where the unobservable process is the time that has elapsed since the last time a war occurred. Thus, is the -valued stationary Markov process with transition probability
for every , and is the distribution of a sequence of
-valued random variables such that
Consider a Bayesian agent who observes the process. After the first time a war erupts the agent keeps track of the state of the process at every period. If there is no uncertainty about the parameter, i.e., if the Bayesian agent knew , his prediction about the next outcome when gives probability to outcome W and probability to outcome . On the other hand, if the agent does not know but believes that it is randomized according to , he can deduce the values gradually while he observes the process. However, for every there will be a time when the agent observes consecutive peaceful periods for the first time, and at this point the agent’s prediction about the next outcome will be . Thus there will always be infinitely many occasions on which an agent who predicts according to will predict differently from an agent who predicts according to . Therefore the decomposition is not learnable. On the other hand, in agreement with our theorem, these occasions become more infrequent as time goes by, so the decomposition is weakly learnable.
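The rarity of these occasions can be seen in a simulation. The sketch below is an illustration under assumptions that are not taken from the text: the per-period war probability is set to `war_prob = 0.5`, the simulation starts immediately after a war, and a period counts as a "surprise" whenever the current peaceful spell is at least as long as any spell seen before, so that the agent has not yet observed the parameter's value for the next spell length.

```python
import random

def surprise_periods(periods, war_prob=0.5, seed=0):
    """Periods at which the peaceful spell has reached the longest length seen
    so far, so the uninformed agent cannot yet know the next peaceful outcome."""
    rng = random.Random(seed)
    spell = 0          # peaceful periods since the last war (start right after a war)
    longest_seen = 0   # the agent has learned the parameter up to this spell length
    surprises = []
    for n in range(periods):
        # A peaceful next period would reveal the value at spell length spell + 1.
        if spell + 1 > longest_seen:
            surprises.append(n)
        if rng.random() < war_prob:
            spell = 0
        else:
            spell += 1
            longest_seen = max(longest_seen, spell)
    return surprises

periods = 100_000
surprises = surprise_periods(periods)
fraction = len(surprises) / periods
```

At each surprise period the uninformed agent splits the no-war probability evenly between the two economic outcomes while the informed agent puts it all on one of them, so the predictions differ by `(1 - war_prob) / 2` in the maximum-difference distance. Such periods keep occurring, but `fraction` is small, in line with weak learnability without learnability.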
4. Proof of Theorem 3.1
Up to now we assumed that the stochastic process starts at time $n = 0$. When working with stationary processes it is natural to extend the index set of the process from $\mathbb N$ to $\mathbb Z$, i.e. to assume that the process has an infinite past. This is without loss of generality: every stationary stochastic process admits an extension to the index set $\mathbb Z$ [10, Lemma 10.2]. We therefore assume hereafter, in harmless conflict with our previous notation, that $\Omega = A^{\mathbb Z}$.
Let be a -algebra of Borel subsets of . The quotient space of with respect to is the unique (up to isomorphism of measure spaces) standard probability space and a measurable map such that is generated by , i.e., for every -measurable function from to some standard probability space there exists a (unique up to equality -almost surely) -measurable lifting defined over such that . The conditional distributions of over are the unique (up to equality -almost surely) family of probability measures over such that:
For every it holds that
The map is -measurable and (1) is satisfied for every .
We call the decomposition of induced by . For every belief , the trivial decomposition of is generated by the trivial sigma-algebra , the Dirac decomposition is generated by the sigma-algebra of all Borel subsets of . The ergodic decomposition is induced by the -algebra of all invariant Borel sets of , i.e. all Borel sets such that where is the left shift.
We will prove a more general theorem, which may be interesting in its own right. Let be the left shift so that for every . A sigma-algebra of Borel subsets of is shift-invariant if for every Borel subset of .
Theorem 4.1. Let be a stationary distribution over and let be a shift invariant -algebra of subsets of such that . Then the decomposition of induced by is weakly learnable.
Theorem 3.1 follows immediately from Theorem 4.1, since the sigma-algebra of invariant sets which induces the ergodic decomposition satisfies the assumption of Theorem 4.1. We will prove Theorem 4.1 using Lemma 4.2.
Lemma 4.2. Let be a stationary distribution over and let be a shift invariant -algebra of Borel subsets of . Then
Consider the case in which is trivial. Then Lemma 4.2 says that a Bayesian agent who observes a stationary process from time onwards will make predictions in the long run as if he knew the infinite history of the process.
Proof of Lemma 4.2.
For every let be a version of the conditional distribution of according to given the finite history and :
and let be a version of the conditional distribution of according to given the infinite history and :
Let . By the martingale convergence theorem and therefore
It follows from the stationarity of and the fact that is shift invariant that
Maker’s Ergodic Theorem.
Let be such that and let be such that and . Then
Proof of Theorem 4.1.
From it follows that . Therefore, from Lemma 4.2 we get that
By the same lemma (with trivial)
By the last two limits and the triangle inequality
Let be the quotient of over and let be the corresponding conditional distributions. Let be the set of all realizations such that
Then by (8). But . It follows that for -almost every , as desired. ∎
5. Ergodicity and mixing
Mixing conditions formalize the intuition that observing a sequence of outcomes of a process does not change one’s belief about events in the far future. Standard examples of mixing processes are i.i.d. processes and non-periodic Markov processes. In this section we recall a mixing condition that was called “sufficiency for prediction” by Jackson, Kalai and Smorodinsky [6] (henceforth JKS), show that the ergodic decomposition is not necessarily sufficient for prediction, and show that a finer decomposition than the ergodic decomposition is sufficient for prediction and also weakly learnable.
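Let $\mathcal T = \bigcap_{m \ge 0} \mathcal F_m^\infty$ be the future tail sigma-algebra, where $\mathcal F_m^\infty$ is the $\sigma$-algebra of $\Omega$ that is generated by the outcomes from period $m$ onwards. A probability distribution $\mu$ (not necessarily stationary) is mixing if it is $\mathcal T$-trivial, i.e., if $\mu(B) \in \{0, 1\}$ for every $B \in \mathcal T$. Footnote 5: An equivalent way to write this condition is that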
for every and , there is such that
If we want the components of the decomposition to be mixing we need a finer decomposition than the ergodic decomposition. This decomposition is the decomposition that is induced by the tail as shown in the following proposition.
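Proposition 5.1. Let $(\Theta, \mathcal B, \lambda, (\mu_\theta)_{\theta \in \Theta})$ be the decomposition of a belief $\mu$ that is induced by the tail. Then $\mu_\theta$ is mixing for $\lambda$-almost every $\theta$.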
This proposition is JKS’ Theorem 1. We repeat the argument here to clarify a gap in their proof.
The proposition follows from the fact that the conditional distributions of every probability distribution over the tail are almost surely tail-trivial (i.e., mixing). This fact was recently proved by Berti and Rigo [1, Theorem 15]. (Footnote 6: It is taken for granted in the first sentence of JKS’s proof of their Theorem 1.) We note that it is not true for every sigma-algebra $\mathcal B$ that the conditional distributions of $\mu$ over $\mathcal B$ are almost surely $\mathcal B$-trivial. This property is very intuitive (and indeed, easy to prove) when $\mathcal B$ is generated by a finite partition, or more generally when $\mathcal B$ is countably generated, but the tail is not countably generated, which is why Berti and Rigo’s result is required. ∎
The next theorem uses Lemma 4.2 to show that the tail decomposition is also weakly learnable. In particular, Theorem 5.2 implies that the ergodic decomposition does not capture all the learnable properties of a stationary process.
Theorem 5.2. The tail decomposition of a stationary stochastic process is weakly learnable.
From Lemma 4.2 it follows that the decomposition induced by the past tail is weakly learnable, since the past tail is shift invariant.
The theorem now follows from the fact that for every stationary belief over a finite set of outcomes, the completions of the past and future tails under the belief coincide. See Weiss [15, Section 7]. Therefore, the decomposition of the belief induced by the future tail equals the decomposition induced by the past tail, which is weakly learnable. We note that the equality of the past and future tails of a stationary process is not trivial: it relies on the finiteness of the set of outcomes, and the proof relies on the notion of entropy. ∎
We conclude with further comments on the relationship with JKS [6]. Their main result characterizes the class of distributions that admit a decomposition which is both learnable and sufficient for prediction. They dub these processes “asymptotically reverse mixing.” In particular, they prove that, for every such process, the decomposition induced by the future tail is learnable and sufficient for prediction. In our Example 3.3, the tail decomposition equals the ergodic decomposition, and, as we have shown, is not learnable. This shows that stationary processes need not be asymptotically reverse mixing. On the other hand, the class of asymptotically reverse mixing processes contains non-stationary processes. For example, the Dirac atomic measure $\delta_\omega$ is asymptotically reverse mixing for every realization $\omega \in \Omega$.
In this section we discuss to what extent the theorems and tools of this paper extend to a larger class of processes. Along the way, this sheds further light on the assumptions made in our work.
6.1. Infinite set of outcomes
The definitions of merging and weak merging can be extended to the case in which the outcome set $A$ is a compact metric space. (Footnote 7: Also for the case that $A$ is a separable metric space, but then there are several possible non-equivalent definitions.) Let $d$ be the Prohorov metric over $\Delta(A)$. Say that the belief $\tilde\mu$ merges to $\mu$ if
\[ \lim_{n \to \infty} d\bigl( \tilde\mu(\,\cdot \mid a_0, \dots, a_{n-1}),\ \mu(\,\cdot \mid a_0, \dots, a_{n-1}) \bigr) = 0 \]
for $\mu$-almost every realization $\omega = (a_0, a_1, \dots)$, and that $\tilde\mu$ weakly merges to $\mu$ if the limit holds in the strong Cesàro sense. Theorem 3.1 extends to the case of an infinite set of outcomes. However, Theorem 5.2 does not hold in this case. We used the finiteness in the proof when we appealed to the equality of the past and future tails of the process. The following example shows the problem when $A$ is infinite:
Let equipped with the standard Borel structure. Thus an element is given by where for every . Let be the belief over such that are i.i.d. fair coin tosses and for every . Note that in this case (so the future tail contains the entire history of the process) while (the past tail is empty). The tail decomposition in this case will be the Dirac decomposition. However, this decomposition is not learnable: an agent who predicts according to will at every period be completely in the dark about .
6.2. Relaxing stationarity
As we have argued earlier, stationary beliefs are useful to model situations where there is nothing remarkable about the point in time at which the agent started to keep track of the process (so other agents who start observing the process at different times have the same beliefs) and where the agent is a passive observer who has no impact on the process itself. The first assumption is rather strong, and can be somewhat relaxed. In particular, consider a belief that is the posterior of some stationary prior conditioned on the occurrence of some event. (A similar situation is an agent who observes a finite-state Markov process that starts at a given state rather than at the stationary distribution.) Let us say that a belief $\mu$ is conditionally stationary if there exists some stationary belief $\nu$ such that $\mu = \nu(\,\cdot \mid E)$ for some Borel subset $E$ of $\Omega$ with $\nu(E) > 0$. While such processes are not stationary, they still admit an ergodic decomposition and exhibit the same tail behavior as stationary processes. In particular, our theorems extend to such processes. The obvious details are omitted.
[1] P. Berti and P. Rigo. 0-1 laws for regular conditional distributions. The Annals of Probability, 35:649–662, 2007.
[2] David Blackwell and Lester Dubins. Merging of opinions with increasing information. Ann. Math. Statist., 33:882–886, 1962.
[3] Thomas M. Cover. Open problems in information theory. In 1975 IEEE Joint Workshop on Information Theory, pages 35–36, 1975.
[4] Anthony D’Aristotile, Persi Diaconis, and David Freedman. On merging of probabilities. Sankhyā: The Indian Journal of Statistics, Series A, pages 363–380, 1988.
[5] A. P. Dawid. The Well-Calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
[6] Matthew O. Jackson, Ehud Kalai, and Rann Smorodinsky. Bayesian representation of stochastic processes under learning: de Finetti revisited. Econometrica, 67:875–893, 1999.
[7] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61:1019–1045, 1993.
[8] E. Kalai, E. Lehrer, and R. Smorodinsky. Calibrated Forecasting and Merging. Games and Economic Behavior, 29(1):151–159, 1999.
[9] Ehud Kalai and Ehud Lehrer. Weak and Strong Merging of Opinions. Journal of Mathematical Economics, 23:73–86, 1994.
[10] O. Kallenberg. Foundations of Modern Probability. Springer-Verlag, New York, second edition, 2002.
[11] E. Lehrer and R. Smorodinsky. Compatible measures and merging. Mathematics of Operations Research, pages 697–706, 1996.
[12] E. Lehrer and R. Smorodinsky. Relative entropy in sequential decision problems. Journal of Mathematical Economics, 33:425–439, 2000.
[13] Gusztáv Morvai and Benjamin Weiss. Forward estimation for ergodic time series. In Annales de l’Institut Henri Poincaré (B) Probability and Statistics, volume 41, pages 859–870. Elsevier, 2005.
[14] Boris Yakovlevich Ryabko. Prediction of random sequences and universal coding. Problemy Peredachi Informatsii, 24(2):3–14, 1988.
[15] Benjamin Weiss. Single Orbit Dynamics. AMS Bookstore, 2000.
[16] H. V. Weizsäcker. Some reflections on and experiences with splifs. Lecture Notes-Monograph Series, pages 391–399, 1996.