In the traditional framework of spectral learning of stochastic time series models, model parameters are estimated based on trajectories of fully recorded observations. However, real-world time series data often contain missing values, and worse, the distributions of missingness events over time are often not independent of the visible process. Recently, a spectral OOM learning algorithm for time series with missing data was introduced and proved to be consistent, albeit under quite strong conditions. Here we refine the algorithm and prove that the original strong conditions can be substantially relaxed. We validate our theoretical findings by numerical experiments, showing that the algorithm can consistently handle missingness patterns whose dynamics interact with the visible process.
- 1 Introduction
- 2 Background
- 3 Spectral Learning for OOMs from Data Containing Missing Values
- 4 Empirical Results
- 5 Conclusion
Spectral methods have become widely used to model probabilistic grammars [Bailly et al., 2010, Balle and Mohri, 2012, Cohen et al., 2014, 2013, Balle and Mohri, 2015], stochastic processes [Hsu et al., 2012, Anandkumar et al., 2012, Rodu et al., 2013, Anandkumar et al., 2014, Thon and Jaeger, 2015, Wu and Noé, 2016], and controlled dynamical systems [Boots and Gordon, 2011, Hamilton et al., 2014, Hefny et al., 2015, Azizzadenesheli et al., 2016, Hefny et al., 2018]. Compared with likelihood-based methods such as Expectation Maximization (EM), spectral learning methods have two appealing properties: (i) they are consistent learning methods with convergence guarantees, and (ii) they are, in principle, non-iterative learning methods which are computationally inexpensive. These properties have made spectral learning a promising tool for learning and analyzing dynamical systems.
Stochastic time series modeling is one of the application areas for spectral methods. In the traditional framework of spectral learning, the training sequences of observations are assumed to contain no missing values. However, this assumption can be violated in real-world data. For instance, in longitudinal studies of disease treatment [Hedeker and Gibbons, 2006], the disease status of patients is often intermittently missing due to patients’ skipped visits; in gene expression analysis, the gene data generated by microarray experiments often contain missing expression values [Troyanskaya et al., 2001]; in affective computing for emotion measurement, the cognitive-affective states of subjects are usually sparsely annotated by human experts [Grafsgaard et al., 2011], where the un-annotated timestamps can be regarded as missing values.
There exist multiple methods to learn stochastic time series models from training data containing missing values. One obvious way is to assemble shorter trajectories that are free from missing values as the new training data. However, this approach might suffer from substantial information loss. Another way is to design algorithms that acknowledge the missing values in the training data. Along this line, likelihood-based methods such as EM algorithms for Hidden Markov Models have been investigated in previous research [Yeh et al., 2012, Yu and Kobayashi, 2003]. Nonetheless, like other EM-based algorithms, these algorithms rely on local search heuristics, which leads to locally optimal results and costly computation.
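The first approach mentioned above, keeping only the gap-free stretches of a trajectory, can be sketched as follows. This is a minimal illustration, not a procedure taken from the cited works: the function name `gap_free_segments`, the use of `None` to mark a missing value, and the `min_length` parameter are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def gap_free_segments(observations: Sequence[Optional[str]],
                      min_length: int = 1) -> List[List[str]]:
    """Split a trajectory at missing entries (None) and keep the complete runs.

    Runs shorter than min_length are discarded, which is where the
    information loss of this approach comes from.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    for symbol in observations:
        if symbol is None:
            # A missing value ends the current gap-free run.
            if len(current) >= min_length:
                segments.append(current)
            current = []
        else:
            current.append(symbol)
    if len(current) >= min_length:
        segments.append(current)
    return segments


# Hypothetical trajectory with two missing entries.
sequence = ["a", "b", None, "a", None, "b", "b", "a"]
segments = gap_free_segments(sequence, min_length=2)
kept = sum(len(s) for s in segments)
observed = sum(x is not None for x in sequence)
```

In this toy trajectory one of the six observed symbols is dropped because it is isolated between two gaps; with denser missingness the discarded fraction grows quickly.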
Thon [2017, Chapter 7] recently presented a spectral learning algorithm that admits missing values in the training data. In this novel approach, one first estimates an Input-Output Observable Operator Model from missingness-observation sequences using a spectral method, and then reduces it to an Observable Operator Model, which describes the underlying stochastic process. Here we first give a condensed yet self-contained introduction to the algorithm proposed in [Thon, 2017, Chapter 7]; we then analyze the theoretical properties of the algorithm by (i) presenting and analyzing a modified frequency estimator that acknowledges missing values in training data, and (ii) establishing the consistency of the proposed spectral algorithm under a much more relaxed set of assumptions than in [Thon, 2017, Chapter 7]. We provide numerical experiments to demonstrate the capabilities of the proposed algorithm and to support our theoretical findings.
Let and be alphabets for actions and observations of a system. We use the symbol to denote an action, and the symbol to denote an observation. We use a symbol with a bar to denote a word (a sequence of symbols), e.g., , and we use the lower index if the starting and ending time are specified for the word, e.g., . Let be the set of words over , and let be the set of words over with length . The upper index with square brackets is reserved to count the trajectories of words, e.g., forms the -th trajectory. In the same fashion, a sequence of action-observation pairs is denoted by , and if the starting and ending time are specified, . Let be the set of words over the set , where denotes the operation of cartesian product.
For a matrix , we use to denote the matrix transpose, to denote the matrix inverse, to denote the Moore-Penrose pseudo-inverse, and for the entry in the row indexed by and column indexed by , for the column indexed by , for the row indexed by . Throughout this manuscript we use to denote the set of positive integers. We write for some . A matrix with all entries 0 will be denoted by .
We denote the probability of an event by and the probabilistic condition by . The probability limit of a sequence , if such a limit exists, is denoted by . The convergence in probability is denoted by .
In this section, we review the basic definitions for stochastic processes and dynamical systems, drawing heavily on [Bauer, 1972, Schönhuth, 2006, 2008], sometimes verbatim. We then give a brief overview of Observable Operator Models and Input-Output OOMs under the framework of Sequential Systems, which have been systematically introduced in [Thon and Jaeger, 2015].
A discrete-time, finite-valued stochastic process is a quadruple , in which is a probability space and is a family of random variables on this probability space taking values in a measurable space, where is a finite set and is the power set of . We write , where every factor is equal to . That is, is the collection of -valued right-infinite sequences. We let be the $\sigma$-algebra over generated by the cylinder sets of sequences in . Further, we let be the product random variable, which is a map from to . Let be the distribution of the random variable .
We define a left shift transformation by for all . For a finite-length sequence for , we write . For a set of sequences , we define . The transformation for is defined as the -times composition of , i.e., . Similarly, for all and , we define the -times right shift of a set of sequences by setting .
Given a stochastic process , the quadruple is called the induced canonical dynamical system, which exists and is uniquely defined [Schönhuth, 2006, Definition 4.2] [Bauer, 1972, Corollary 12.1.4]. In this manuscript, we only work with such canonical dynamical systems induced by stochastic processes, so when we talk about the properties of a dynamical system (e.g., stationarity), we also refer to these properties of the corresponding stochastic process.
A dynamical system is said to be stationary (relative to ), if for all ; a dynamical system is called asymptotically stationary (relative to ), if there is a measure such that for all , where the measure is called the asymptotically stationary measure; a dynamical system is called asymptotically mean stationary (AMS) (relative to ), if there is a measure such that for all , where the measure is called the AMS measure. In this manuscript, we will be considering only asymptotically stationary dynamical systems, and in this case the asymptotically stationary measure is the same as AMS measure.
Given a dynamical system , a function is said to be a measurement of the dynamical system if is - measurable, where is the Borel $\sigma$-algebra of . The dynamical system is said to be ergodic with respect to the measurement if the sample average converges as for almost all . An event is called invariant (relative to ), if . The set of invariant events is a sub-$\sigma$-algebra of which we will denote by . A dynamical system is said to be ergodic (relative to ), if for any such invariant event .
(Corollary 7.2.1. of [Gray, 2009]) If a dynamical system is ergodic, AMS with stationary measure , and the sequence is uniformly integrable with respect to , where is a measurement of the dynamical system, then the following limit is true -a.e., -a.e., and in :
where denotes the expectation with respect to the AMS measure and is the space of all -integrable functions.
2.1 Sequential Systems, OOMs, and IO-OOMs
We now define the Sequential Systems, which are abstract linear algebraic models originally proposed to study Stochastic Finite Automata [Carlyle and Paz, 1971].
(Sequential System). A -dimensional linear Sequential System (SS) over the alphabet is a structure , where is a linear evaluation function , each is a linear operator, and is the initial state.
For a SS , its external function is defined by
We regard two SSs as equivalent if they describe the same external function .
(Equivalent SSs) Two SSs and are equivalent, denoted by , if they define the same external function, i.e., if .
Based on the above definition of equivalence, it is clear that two SSs are equivalent if they are related by a similarity transformation.
([Thon and Jaeger, 2015, Lemma 10]) Let be a -dimensional SS, and be non-singular. Then , where .
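The invariance of the external function under a similarity transformation can be checked numerically on a toy system. The sketch below rests on two assumptions that are not spelled out in the text above: that the external function follows the usual convention $f(z_1 \ldots z_n) = \sigma \tau_{z_n} \cdots \tau_{z_1} \omega$, and that the transformed system is $(\sigma \rho^{-1}, \{\rho \tau_z \rho^{-1}\}, \rho \omega)$. The concrete matrices and the helper names `matmul` and `external_function` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def external_function(sigma, taus, omega, word):
    """Assumed convention: f(z_1...z_n) = sigma tau_{z_n} ... tau_{z_1} omega."""
    state = omega                        # d x 1 column vector
    for z in word:
        state = matmul(taus[z], state)   # apply operators in word order
    return matmul(sigma, state)[0][0]    # 1 x d row vector times state


# Hypothetical 2-dimensional SS over the alphabet {"a", "b"}.
sigma = [[1, 1]]
omega = [[F(1, 2)], [F(1, 2)]]
taus = {"a": [[F(1, 2), F(1, 4)], [0, F(1, 4)]],
        "b": [[0, F(1, 4)], [F(1, 2), F(1, 4)]]}

# A non-singular rho and its inverse, written out by hand.
rho = [[1, 1], [0, 1]]
rho_inv = [[1, -1], [0, 1]]

# Transformed system (assumed form of the similarity transformation).
sigma2 = matmul(sigma, rho_inv)
omega2 = matmul(rho, omega)
taus2 = {z: matmul(matmul(rho, tau), rho_inv) for z, tau in taus.items()}

# Compare both external functions on all words of length 0 to 3.
words = [w for n in range(4) for w in product("ab", repeat=n)]
values = [external_function(sigma, taus, omega, w) for w in words]
values2 = [external_function(sigma2, taus2, omega2, w) for w in words]
```

Exact rational arithmetic is used so that the two lists can be compared with `==` rather than up to floating-point tolerance.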
A SS could be further specified as a Stochastic Multiplicity Automaton (SMA), an Observable Operator Model (OOM), or an Input-Output OOM (IO-OOM), depending on whether one is interested in modeling probabilistic languages, stochastic processes, or controlled processes. We proceed to define OOMs.
(OOM). An uncontrolled process over the alphabet is a function that satisfies (i) and (ii) for all . An Observable Operator Model (OOM) is a SS that models an uncontrolled process.
We see that OOMs are defined by letting the external functions of SSs be uncontrolled processes. Similarly, we could define Input-Output OOMs, or equivalently Predictive State Representations (PSRs), by setting the external functions of SSs to be controlled processes. (Footnote 1: IO-OOMs and PSRs have different formalisms, but using the formulation in Definition 2.6, IO-OOMs are equivalent to PSRs [Thon and Jaeger, 2015]. We use IO-OOMs in this report instead of PSRs only for consistency of notation. Note that the original definition of IO-OOMs in [Jaeger, 1998], under which IO-OOMs were more restrictive than PSRs [Singh et al., 2004], is by now deprecated.)
(IO-OOM) A controlled process over the alphabet is a function that satisfies (i) and (ii) . An Input-Output OOM (IO-OOM) is a SS that models a controlled process.
In general, OOMs and IO-OOMs are models with predictive states, meaning that their states encode the necessary information for predicting the future. For this reason, conceptually OOMs and IO-OOMs are very different from models with latent states such as HMMs [Bengio, 1999] and POMDPs [Kaelbling et al., 1998], where states are defined by probability distributions over hidden variables. It has been shown that OOMs and IO-OOMs have greater representational capacity: OOMs extend HMMs [Jaeger, 2000] and IO-OOMs extend POMDPs [Littman et al., 2001].
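To make the relation between HMMs and OOMs concrete, the following sketch converts a small HMM into observable operators using the standard construction $\tau_x = T^\top \mathrm{diag}(O_x)$, with $\omega$ the initial state distribution and $\sigma$ the all-ones row vector, and compares the resulting word probabilities with a brute-force sum over hidden paths. The HMM parameters and the function names are hypothetical, and the convention that a symbol is emitted before the state transition is an assumption of this example.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical 2-state HMM over the observation alphabet {"a", "b"}.
states = range(2)
initial = [F(1, 2), F(1, 2)]                  # initial[i] = P(s_1 = i)
transition = [[F(3, 4), F(1, 4)],             # transition[i][j] = P(j | i)
              [F(1, 3), F(2, 3)]]
emission = [{"a": F(9, 10), "b": F(1, 10)},   # emission[i][x] = P(x | i)
            {"a": F(1, 5), "b": F(4, 5)}]

# Observable operators: tau_x[j][i] = transition[i][j] * emission[i][x].
taus = {x: [[transition[i][j] * emission[i][x] for i in states]
            for j in states] for x in "ab"}


def hmm_probability(word):
    """Brute-force P(word): sum over all hidden state paths."""
    total = F(0)
    for path in product(states, repeat=len(word)):
        p = F(1)
        for t, (s, x) in enumerate(zip(path, word)):
            p *= initial[s] if t == 0 else transition[path[t - 1]][s]
            p *= emission[s][x]
        total += p
    return total


def oom_probability(word):
    """Evaluate sigma tau_{x_n} ... tau_{x_1} omega with sigma = (1, 1)."""
    state = list(initial)
    for x in word:
        state = [sum(taus[x][j][i] * state[i] for i in states)
                 for j in states]
    return sum(state)


words = [w for n in range(5) for w in product("ab", repeat=n)]
agree = all(hmm_probability(w) == oom_probability(w) for w in words)
total_length_3 = sum(oom_probability(w) for w in product("ab", repeat=3))
```

The converse direction does not hold in general, which is the sense in which OOMs extend HMMs; this example only illustrates the inclusion.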
3 Spectral Learning for OOMs from Data Containing Missing Values
The standard spectral learning algorithms for time series models require that the training sequences be fully recorded. This requirement, however, can be violated in real-world sequential data where missing values are not uncommon. In this section, based on [Thon, 2017], we present and analyze a spectral learning algorithm that acknowledges the missing values in the training data, and uses such data to learn OOMs which describe the underlying stochastic processes.
3.1 The Types of Missingness in Time Series Data
In this subsection, we review the basic definitions for spectral learning with missing values as introduced in [Thon, 2017], sometimes using his wording. Consider a stochastic process that takes values in . We will call this stochastic process the underlying stochastic process. Throughout this manuscript we will only deal with underlying stochastic processes that are asymptotically stationary (and therefore AMS). Let be an initial sample from the underlying stochastic process . In practice, for some time steps we do not observe the value , and in this situation we say that is missing. We let be a sequence of missingness, with if the value is missing, else . Let denote the sequence of observations of length , where if , i.e., if the observation at time is not missing, and otherwise. We can pair up for all as a missingness-observation sequence, such that is the initial sample of a missingness-observation process . We use an example to illustrate these notations: For and , let be the underlying sequence, and suppose the first symbol a is missing, then , , .
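The bookkeeping in this paragraph can be illustrated with a few lines of code. The encoding below is an assumption made for the sketch, not the paper's notation: a missingness flag of `1` marks a missing value and `0` an observed one, and the hypothetical placeholder symbol `"?"` stands for a missing observation. The underlying sequence `"abb"` is likewise only an example.

```python
MISSING = "?"  # hypothetical placeholder for a missing observation


def corrupt(underlying, missing_flags):
    """Build the observation and missingness-observation sequences.

    underlying    : symbols of the underlying process
    missing_flags : 1 where the value is missing, 0 where it is observed
    """
    if len(underlying) != len(missing_flags):
        raise ValueError("sequences must have the same length")
    observations = [MISSING if m else x
                    for x, m in zip(underlying, missing_flags)]
    pairs = list(zip(missing_flags, observations))
    return observations, pairs


# Underlying sequence whose first symbol "a" is missing.
underlying = ["a", "b", "b"]
flags = [1, 0, 0]
observations, pairs = corrupt(underlying, flags)
```

Note that the pairs are redundant by construction: the placeholder appears exactly where the flag is set, so the observation sequence alone already determines the missingness sequence.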
For an underlying stochastic process , our goal is to learn a model for using as training data. To achieve such a goal, we will treat missing values as wildcards or “don’t care” placeholders for observations (the purpose of which will become clear later). To have a convenient notation for describing the effects of wildcards, for all , we introduce an additional random variable , such that for all and for all . Additionally, for all , we introduce a missing value notation upon : by writing , we simply mean , reflecting the wildcard or “don’t care” placeholder nature of . This simply means that each is an identical copy of for all , with a special missing value notation equipped on the former but not on the latter.
Given a missingness-observation sequence , the joint probability of this missingness-observation sequence is governed by the missingness process and the observation process in the sense that
If the random variables are clear from context, we will simply drop them and write , , and . Note that if , the factorization at the right hand side of Equation 1 would not be defined. For this reason, we assume when using the factorization in Equation 1.
We now specify and in more detail. Although the random variables for all take values in , we are not interested in the pairs and for , as they do not make any sense for the underlying stochastic process. For this reason, we impose a restriction on the observation process by requiring that
for all .
We now specify a special case of missingness named AMSAR.
(AMSAR missingness). The values in the stochastic process are said to be always missing sequentially at random (AMSAR) if for all and for all we have:
where the index is the current time, is the observation sequence (containing the missing values) prior to the current time , and are the values of the underlying stochastic process from the current time to the future time .
Intuitively, AMSAR says that missingness at every time is conditionally independent of the current and future values of the underlying stochastic process, given the previously observed values. It also says that missingness at time is independent of which “true but unobserved” outputs have been emitted at times when there was missingness. We consider this a realistic assumption for the missing values in real-time sequential data, in the sense that the missingness at a given time can depend on the previous observations.
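A missingness mechanism of this kind can be simulated in a few lines. The sketch below is an illustration under stated assumptions rather than the setup used in the experiments: the underlying process is a hypothetical two-symbol Markov chain, `"?"` is a hypothetical placeholder for a missing value, and the missingness probabilities are arbitrary. The key point is that the decision to hide the value at time $t$ is drawn using only the previous entry of the observation sequence, never the current or future underlying values.

```python
import random

MISSING = "?"  # hypothetical placeholder for a missing observation

# Hypothetical missingness probabilities, indexed by the previous
# observation (None at the first time step).
P_MISSING = {None: 0.1, "a": 0.5, "b": 0.1, MISSING: 0.3}


def simulate(length, seed=0):
    """Simulate an underlying chain and a history-dependent missingness."""
    rng = random.Random(seed)
    underlying, flags, observations = [], [], []
    state = "a"
    for t in range(length):
        # Underlying process: stay in the current symbol with prob. 0.8.
        if t > 0 and rng.random() >= 0.8:
            state = "b" if state == "a" else "a"
        # Missingness: depends only on the previous observation, which may
        # itself be the placeholder; the current state is not consulted.
        previous = observations[-1] if observations else None
        missing = 1 if rng.random() < P_MISSING[previous] else 0
        underlying.append(state)
        flags.append(missing)
        observations.append(MISSING if missing else state)
    return underlying, flags, observations


underlying, flags, observations = simulate(1000, seed=1)
```

Because the probability of hiding a value is higher after an observed `"a"` than after an observed `"b"`, the missingness pattern interacts with the visible process while still satisfying the conditional independence described above.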
(Lemma 70 of [Thon, 2017]) Supposing the missingness is AMSAR, for any , we have
We assume as otherwise the statement is trivial.
where follows from the AMSAR assumption:
Let be an underlying stochastic process and let be the corresponding missingness-observation process which results from corrupting the underlying stochastic process with an AMSAR missingness. Let be a missingness-observation sequence such that , then
This proposition was stated in [Thon, 2017, Equation (18) - (19)].
where the equation (1) is by the definition of the observation process ; (2) reduces the redundant missingness information of (as has already contained the missingness information by Equation 2); (3) follows as the missing values are wildcards; the equation (4) can be established by only considering the cases and , as otherwise by Equation 1 and 2, violating the assumption of the proposition. First assume , then
Next assume . This means
as shown in Lemma 3.2. Hence (4) is true for both of the cases. Equation (5) is by the general product rule of conditional probabilities. ∎
Under the same assumption of Proposition 3.3, additionally assuming that the underlying stochastic process is asymptotically stationary with stationary probability measure , the following equation holds:
Repeating the argument in Proposition 3.3, it is clear that
for all . Hence
where the last equation is by the assumption that the underlying stochastic process is asymptotically stationary. ∎
Under the same assumption of Corollary 3.4, the following equation holds:
This directly follows from Corollary 3.4 by letting . ∎
3.2 Spectral Learning for OOMs from data containing missing values
Recall that, given a sequence (or sequences) of missingness-observation pairs for some , our goal is to learn a model for that describes the underlying stochastic process which does not contain missing values. Thon [2017] observed that, if the underlying stochastic process can be described by an OOM , then the observation process can be described by an IO-OOM in terms of the OOM parameters:
Thus, in light of Proposition 3.3, to learn an OOM which approximates the OOM of the underlying stochastic process, we can first learn the IO-OOM that approximates using the training data of missingness-observation sequences, and then reduce to by reusing the observable operators with missingness and discarding other observable operators. More concretely, Thon proposed the following Algorithm 1, which takes missingness-observation sequences as input and estimates an OOM which describes the underlying stochastic process as output.
In practice, we do not need to compute the IO-OOM operators for in step 5, because only observable operators for all are relevant to the algorithm output.
3.3 Theoretical Analysis of a Frequency Estimator
It turns out that a well-designed frequency estimator is crucial to achieving consistency of Algorithm 1. Since an AMSAR missingness can be seen as a non-blind policy if we interpret a missingness-observation process as an input-output process, at first glance we could simply re-use an off-the-shelf frequency estimator for input-output processes with non-blind and unknown policies (e.g., [Bowling et al., 2006]). A closer look, however, reveals that directly using such estimators is not entirely optimal for at least three reasons. First, by default, these estimators treat all system outputs as genuine symbols, which means that the missing value symbols will not be treated as wildcards. Second, to derive consistency using these estimators, we need to impose assumptions (e.g., asymptotic stationarity) on the missingness-observation processes; we want to avoid this because such missingness-observation processes are artificial objects, which lack the transparency of underlying stochastic processes, the processes we eventually want to model. Third, these estimators take the form of multi-step chained products of counting statistics (e.g., [Bowling et al., 2006, Equation 6]), giving rise to a higher computational cost than one customarily expects for frequency estimators for stochastic processes.
We now introduce and analyze a new frequency estimator, which addresses the above issues. For a given sequence of observations , we first define an indicator function , which takes an (infinite-length) sequence of observations , compares its initial -length sub-sequence with , and then produces an integer or , depending on whether the sequence is identical to the sequence up to the entries that are missing in . More precisely, we define
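One possible reading of this wildcard-aware indicator, together with a way of sliding it along a finite sequence to obtain counts, is sketched below. Several details are assumptions of the sketch and not taken from the formal definition: missing entries are taken to occur in the data rather than in the query word, the hypothetical placeholder `"?"` marks them, the indicator returns `1` on a match and `0` otherwise, and no normalization of the counts is attempted. The names `matches` and `wildcard_count` are likewise hypothetical.

```python
MISSING = "?"  # hypothetical placeholder for a missing observation


def matches(query, data):
    """Return 1 if the initial part of data agrees with query, else 0.

    A missing entry in data acts as a wildcard: it is compatible with
    any symbol of the query at that position.
    """
    if len(data) < len(query):
        return 0
    return int(all(y == MISSING or y == x for x, y in zip(query, data)))


def wildcard_count(query, data):
    """Count the positions at which query matches data up to wildcards."""
    n = len(query)
    return sum(matches(query, data[t:t + n])
               for t in range(len(data) - n + 1))


# Hypothetical observation sequence with two missing entries.
data = list("ab?ab?")
count_ab = wildcard_count("ab", data)
count_ba = wildcard_count("ba", data)
```

In this toy sequence the word `ba` never occurs among fully observed neighbours, yet it is counted three times because every window that could contain it includes a wildcard; this is the kind of bias that a careful analysis of the estimator has to account for.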
To count the number of appearances of a sequence as a subsequence of a finite sequence for some , we require the first symbols of to be and write
As a concrete example, consider the sequence and . In this case,