# State Aggregation Learning from Markov Transition Data

State aggregation is a model reduction method rooted in control theory and reinforcement learning. It reduces the complexity of engineering systems by mapping the system's states into a small number of meta-states. In this paper, we study the unsupervised estimation of unknown state aggregation structures from Markov trajectories. We formulate the state aggregation of Markov processes as a nonnegative factorization model, in which the left and right factor matrices correspond to aggregation and disaggregation distributions, respectively. By leveraging techniques developed in the context of topic modeling, we propose an efficient polynomial-time algorithm for computing the estimated state aggregation model. Under an "anchor state" assumption, we show that one can reliably recover the state aggregation structure from sample transitions with high probability. Sharp divergence error bounds are proved for the estimated aggregation and disaggregation distributions, and experiments with Manhattan traffic data are provided.


## 1 Introduction

State aggregation is a long-standing approach for model reduction of complicated systems. It is widely used as a heuristic to reduce the complexity of control systems and reinforcement learning. The earliest idea of state aggregation is to aggregate "similar" states into a small number of subsets through a partition map. However, the meta-states and the partition map are often handpicked by practitioners based on domain-specific knowledge [20, 4]. Alternatively, the partition can be chosen via discretization in accordance with some previously known similarity metric or feature functions [22]. More generally, state aggregation also allows one to map states into meta-states through a probabilistic relation. Such models have been used for modeling large Markov decision processes, where each observed state is represented as a mixture over latent states [21, 4]. This model is known as soft state aggregation, where states and meta-states are mapped to each other via aggregation distributions and disaggregation distributions. Empirical and theoretical evidence suggests that knowledge of a state aggregation model significantly reduces the complexity of solving the Markov decision problem [21].

Unfortunately, existing approaches that use state aggregation for optimal control and reinforcement learning typically require prior knowledge of the exact system dynamics and/or the state aggregation structure. Neither is available in emerging reinforcement learning applications that involve complicated black-box systems, such as autonomous driving, computer games, and cell evolution. In these cases, little prior structure is known about the underlying process, while a substantial amount of transition data is available. A principled approach that infers the state aggregation structure of a black-box system from unsupervised data is lacking.

In this paper, we study the estimation of a state aggregation model from sample transition data. We hope to develop an unsupervised state aggregation approach that requires minimal prior knowledge. Suppose that we observe a Markov trajectory generated by a state-transition system. In the case of hard state aggregation, the goal would be to find a partition map from states to meta-states. More generally, we consider the soft aggregation structure, in which every state is represented by a membership distribution over a number of meta-states. To this end, we hope to develop an efficient statistical procedure that estimates the state aggregation model from trajectories of the Markov chain. The soft state aggregation structure of a Markov chain can be formulated as a nonnegative factorization of the transition probability matrix. Experience suggests that spectral decomposition is effective for the analysis and state compression of Markov trajectories [23]. However, plain spectral decomposition does not recover the latent states or the aggregation and disaggregation distributions. It remains open how to identify the state aggregation model effectively from data.

Inspired by a technique used in topic modeling [11], we develop an efficient method to solve this open problem. We introduce the notion of an "anchor state", an analog of anchor words, to uniquely identify the meta-states of the Markov chain. Assuming the existence of anchor states, our method takes sample state transitions as input and performs a multi-step SVD procedure, where the key is to properly normalize the data and rotate the singular vectors into the nonnegative cone by a simplex-hunting algorithm. The proposed method outputs the estimated state aggregation model, including the meta-states, aggregation/disaggregation distributions, and sets of anchor states that represent each meta-state. Numerical experiments are performed on simulated random walks and a Manhattan taxi-trip data set, in which we identified leading modes of the taxi traffic as well as anchor regions in the city. On the theoretical side, we prove high-probability error bounds on the total variation divergence between the estimated distributions and the ground truth. To the authors' best knowledge, this is the first statistically proven approach that uncovers an unknown state aggregation structure from time series data.

## 2 Related Literature

State aggregation is one of the earliest model reduction methods used to approximate known control systems with smaller ones. It also applies to reinforcement learning, where the underlying transition probabilities are not known a priori. State aggregation reduces the complexity of the state space and therefore reduces the computational cost of approximating the optimal value function or policy; see, e.g., [16, 4, 21, 22, 19]. Similar to but more general than the state aggregation approach, a line of work known as representation learning constructs basis functions for representing high-dimensional value functions, for which many methods have been developed, e.g., [10, 14, 17, 18, 15]. The aforementioned methods typically require prior knowledge about the structure of the problem and lack statistical guarantees.

Two recent papers studied the statistical model reduction of Markov time series. The paper [23] proposed a class of spectral state compression methods for Markov processes based on singular value thresholding of empirical transition matrices. It gives statistical error bounds for estimating the transition matrix and recovering the principal subspace associated with it. Another related paper [13] studied the estimation of a Markov transition matrix by maximum likelihood with a low-rank constraint. Both [23] and [13] focus on the estimation of a low-rank Markov model. However, their results do not apply to the recovery of a state aggregation model, which corresponds to a particular nonnegative factorization. In particular, spectral methods cannot recover the aggregation and disaggregation distributions.

Estimating the state aggregation structure is essentially a problem of nonnegative matrix factorization (NMF) [12] under noise corruption. NMF is widely applied and many methods exist (see [6] for a comprehensive overview). However, many classical NMF methods only perform well with independent and homoscedastic noise, and the empirical transition matrix generated by a Markov chain satisfies neither condition. Fortunately, we discover that state aggregation is closely related to topic modeling: the two share similar nonnegative factorization structures and similar noise structures. In particular, the empirical transition distribution conditioned on a state is analogous to the bag-of-words representation of a "document", each outgoing state can be viewed as a "dictionary word", and a meta-state is analogous to a "topic". This inspires us to borrow ideas from topic modeling to develop tractable methods for estimating state aggregation.

In particular, we find that the state aggregation model is strongly related to topic models with an "anchor word" assumption [2, 11]. The anchor word assumption has a natural analog in state aggregation models, which we call the anchor state assumption: an anchor state is visited with nonzero probability by only one meta-state. The anchor state assumption is meaningful in applications. In the Manhattan taxi-trip data, we successfully identified a number of anchor states, each representing a location in the city that reflects the leading traffic modes hidden in the taxi-trip data.

The proposed method is adapted from the topic modeling method in [11]. It identifies the aggregation/disaggregation distributions by leveraging the technique in [11], which, in the context of a topic model with anchor words, provides a principled way to find the simplex vertices associated with the singular vectors. Our method is customized to analyzing Markov transition data for the state aggregation problem. It uses a normalization technique known as Jin's SCORE [8], which has proved very useful in the analysis of network and text data [9, 11]. The normalization scheme is particularly crafted to address the dependence issue in Markov data. Our method computes estimates of both the aggregation and disaggregation distributions, while the topic modeling method in [11] recovers only the right factors. To the best of the authors' knowledge, this is the first work that develops a tractable method for estimating the meta-states and the aggregation and disaggregation distributions with statistical guarantees.

## 3 The State Aggregation Model

In this section, we introduce the model of soft state aggregation and the notion of anchor states. We also discuss its relation to topic models and other latent-variable models for Markov chains.

###### Assumption 1 (Soft State Aggregation).

Let $r$ be a positive integer. The Markov chain $\{X_t\}_{t\ge 0}$ admits a soft state aggregation with $r$ meta-states, i.e., there exist random variables $Z_t \in \{1,\ldots,r\}$ such that

$$\mathbb{P}(X_{t+1}\mid X_t)=\sum_{k=1}^{r}\mathbb{P}(Z_t=k\mid X_t)\,\mathbb{P}(X_{t+1}\mid Z_t=k)$$

for all $t$ with probability 1, where $\mathbb{P}(Z_t=\cdot\mid X_t)$ and $\mathbb{P}(X_{t+1}\mid Z_t=\cdot)$ are independent of $t$ and are referred to as the aggregation distributions and disaggregation distributions, respectively.

The soft state aggregation model has been discussed in various works, e.g., [21, 23]; see [3], Section 6.3.7, for a textbook review. Admitting a soft state aggregation is equivalent to a nonnegative factorization of the transition matrix: a Markov chain on $p$ states satisfies the soft state aggregation model with $r$ meta-states if and only if its transition matrix $P$ can be written in the form

$$P=UV^\top, \qquad (1)$$

where $U,V\in\mathbb{R}^{p\times r}$ are nonnegative, the rows of $U$ sum to one, and the columns of $V$ sum to one. This decomposition means that one can map the $p$ states into $r$ meta-states while preserving the system's dynamics. It also shows that the transition dynamics is the sum of $r$ leading modes, each corresponding to a rank-one matrix. The smallest integer $r$ such that the nonnegative factorization (1) holds is called the nonnegative rank of $P$. Throughout the paper, we assume that the nonnegative rank of $P$, which equals the smallest $r$ such that Assumption 1 holds, is a known constant and does not scale with the number of states $p$. Note that the nonnegative factorization (1) is a stronger model than a spectral decomposition. As a result, the spectral decomposition method proposed in [23] does not apply to estimation of the state aggregation model.
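To make the factorization concrete, here is a minimal numerical sketch (ours, not from the paper; the sizes and the Dirichlet sampler are illustrative) that assembles a soft-aggregation transition matrix and checks the two stochasticity constraints:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 20, 3  # illustrative numbers of states and meta-states

# Aggregation matrix U: row i is the meta-state distribution given state i.
U = rng.dirichlet(np.ones(r), size=p)        # shape (p, r), each row sums to 1
# Disaggregation matrix V: column k is the next-state distribution of meta-state k.
V = rng.dirichlet(np.ones(p), size=r).T      # shape (p, r), each column sums to 1

P = U @ V.T                                  # the factorization P = U V^T of Eq. (1)

# P is a valid (row-stochastic) transition matrix of rank at most r.
assert np.allclose(P.sum(axis=1), 1.0)
assert np.linalg.matrix_rank(P) <= r
```

The row sums of $P$ equal one because each row of $U$ and each column of $V$ sums to one, illustrating why the factorization preserves stochasticity.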

We hope to estimate the state aggregation structure described in Assumption 1 and identify the meta-states. However, the latent process $\{Z_t\}$ is not necessarily identifiable: there exist infinitely many ways to specify meta-states and aggregation distributions that represent the same Markov process. In order to find a meaningful state aggregation structure, we require that each meta-state be specified by at least one "anchor state".

###### Assumption 2 (Anchor State Condition).

For every meta-state $k\in\{1,\ldots,r\}$, there exist an anchor state $s_k$ and a nonnegative decomposition $P=UV^\top$ as in (1) such that $V_{s_k k}>0$ and $V_{s_k \ell}=0$ for all $\ell\neq k$, where the columns of $V$ are the disaggregation distributions.

The anchor state condition is a natural analog of the anchor word condition for topic modeling [2] and the separability condition for nonnegative matrix factorization [5]. Under additional technical conditions, the anchor state condition ensures unique identifiability of the meta-states and of the nonnegative matrix factorization [5]. Notably, the state aggregation model with the anchor state condition is related to several other models, which we detail below:

1. Relation to Topic Models. The topic model is a statistical model of the generating process of text documents. Given a dictionary of distinct words, there exist unknown topics, each corresponding to a probability mass function over the dictionary. Each document admits a mixed membership over the topics, and the words in a document are drawn from the dictionary according to the resulting mixture distribution. Let $D$ be the matrix whose rows contain the expected word distributions of the documents. Then $D$ admits a nonnegative factorization

$$D=WA^\top,$$

where $A$ is called the topic matrix and $W$ is called the weight matrix. The state aggregation model has a natural analog to topic models: the empirical transitions conditioned on each state can be thought of as a document, and $U$ and $V$ are analogous to the weight matrix and the topic matrix, respectively. The state aggregation model differs from topic models in that both documents and words correspond to the same set of states, and the factor matrices may admit special structures particular to the Markov system.

2. Relation to Hard Aggregation. A Markov chain admits a hard aggregation if there exists a partition of the state space such that every state belongs deterministically to one meta-state. This implies that the transition matrix has a nonnegative decomposition $P=UV^\top$ in which $U$ is a blockwise-constant matrix whose columns are indicator functions of the subsets in the partition (Prop. 3 of [23]). Admitting a hard state aggregation model does not necessarily imply the existence of an anchor state for every meta-state.

3. Relation to Lumpable Markov Models. A Markov process is called lumpable if the state space can be partitioned into blocks while still preserving the strong Markov property. However, a lumpable Markov chain is not necessarily low-rank (Prop. 4 of [23]), as there may be local dynamics within the blocks leading to a full-rank transition matrix. A particularly interesting case is when a Markov chain is both aggregatable and lumpable, which happens if and only if

$$P=Z\tilde{P}D^{-1}Z^\top,$$

where the columns of $Z$ are indicator functions of the blocks, $\tilde{P}$ is a stochastic matrix governing the block-to-block transitions, and $D$ is a diagonal matrix with the block sizes along its diagonal. Lumpability implies some form of symmetry between the left and right singular spaces of $P$. When an aggregatable Markov chain is also lumpable, the transition matrix can be decomposed into the form (1) by letting $U=Z\tilde{P}$ and $V=ZD^{-1}$. In this case, one can verify that the anchor state condition holds; therefore this is a special case that falls under our assumptions.
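As a sanity check on this special case, the following sketch (ours, not the paper's; a toy chain with two blocks) builds a chain that is both lumpable and aggregatable and verifies the factorization with $U=Z\tilde{P}$ and $V=ZD^{-1}$:

```python
import numpy as np

# Indicator matrix Z for a partition into two blocks of sizes 2 and 3.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
P_blk = np.array([[0.7, 0.3], [0.4, 0.6]])    # block-to-block transition matrix
D = np.diag(Z.sum(axis=0))                     # block sizes on the diagonal

P = Z @ P_blk @ np.linalg.inv(D) @ Z.T         # lumpable and aggregatable chain

U = Z @ P_blk                                  # rows sum to one (aggregation)
V = Z @ np.linalg.inv(D)                       # columns sum to one (disaggregation)

assert np.allclose(P, U @ V.T)                 # the factorization of Eq. (1)
assert np.allclose(P.sum(axis=1), 1.0)
# Every state is an anchor state here: each row of V has one nonzero entry.
assert np.all((V > 0).sum(axis=1) == 1)
```

The last assertion makes the anchor state condition visible: because each row of $V$ is supported on a single meta-state, every state anchors its own block.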

## 4 State Aggregation Learning

We develop a spectral state aggregation method for analyzing Markov transition data by adapting a technique in [11] that was developed for topic modeling. Suppose that we are given $n$ sample pairs of state transitions, each consisting of a state and its observed successor under the transition kernel $P$. We form a matrix $N\in\mathbb{R}^{p\times p}$ of empirical state-transition counts, where $N_{ij}$ is the number of observed transitions from state $i$ to state $j$. We use the count matrix $N$ as the input of our estimation procedure, which is described in detail in Algorithm 1.

In what follows, we explain the intuition behind Algorithm 1 and assume for simplicity that the initial states of the sample transitions are independently generated from some distribution, so that the rows of $N$ are independent. Given the count matrix $N$, the first step is a column-wise normalization of $N$:

$$N\mapsto\tilde{N}=N\,[\mathrm{diag}(N^\top\mathbf{1})]^{-1/2}. \qquad (2)$$

This normalization helps reduce noise heteroscedasticity and dependence. Roughly speaking, each row of $N$ has a multinomial distribution; as a result,

$$\mathbb{E}\big[(N-\mathbb{E}N)^\top(N-\mathbb{E}N)\big]\approx\mathrm{diag}(\mathbb{E}N^\top\mathbf{1}).$$

Hence, the normalization step (2) makes the "covariance" matrix of $\tilde{N}$ approximately an identity matrix.
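The normalization (2) can be sketched in a few lines (ours, with illustrative sizes; multinomial rows stand in for the empirical counts):

```python
import numpy as np

rng = np.random.default_rng(1)
p, trips = 6, 5000
P = rng.dirichlet(np.full(p, 5.0), size=p)    # some dense transition matrix

# Row i of the count matrix N holds multinomial transition counts out of state i.
N = np.vstack([rng.multinomial(trips, P[i]) for i in range(p)]).astype(float)

# Column-wise normalization of Eq. (2): scale column j by (sum of column j)^{-1/2}.
m = N.sum(axis=0)                              # column totals, N^T 1
N_tilde = N / np.sqrt(m)                       # N [diag(N^T 1)]^{-1/2}

# After normalization the column sums equal sqrt(m): columns are placed on a
# comparable scale, which tames the heteroscedastic multinomial noise.
assert np.allclose(N_tilde.sum(axis=0), np.sqrt(m))
```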

The state aggregation model implies

$$\tilde{N}\approx[\mathrm{diag}(N\mathbf{1})]\,P\,[\mathrm{diag}(N^\top\mathbf{1})]^{-1/2}=U^*(V^*)^\top,$$

where $U^*=[\mathrm{diag}(N\mathbf{1})]U$ and $V^*=[\mathrm{diag}(N^\top\mathbf{1})]^{-1/2}V$. This inspires the following steps for estimating $V$:

1. Estimate the column space of $V^*$ by the leading $r$ right singular vectors of $\tilde{N}$.

2. Recover $V^*$ from its column space using the nonnegativity constraint.

3. Estimate $V$ from the estimate of $V^*$.

The first and third steps are straightforward, so what remains is how to recover $V^*$ from its column space.

Borrowing the idea of [11], we apply a normalization scheme, known as SCORE [8], to the singular vectors of $\tilde{N}$. Let $\hat{h}_1,\ldots,\hat{h}_r$ be the leading right singular vectors of $\tilde{N}$. The SCORE normalization divides each of $\hat{h}_2,\ldots,\hat{h}_r$ by the leading singular vector $\hat{h}_1$ in an entry-wise fashion and obtains the matrix

$$\hat{D}=[\mathrm{diag}(\hat{h}_1)]^{-1}[\hat{h}_2,\ldots,\hat{h}_r]. \qquad (3)$$

Under mild conditions, $\hat{h}_1$ is a positive vector, so $\hat{D}$ is well-defined. It turns out that the rows of $\hat{D}$ exhibit a simplex geometry:

• There exists a low-dimensional simplex with $r$ vertices such that all rows of $\hat{D}$ are approximately contained in it.

• When each meta-state has an anchor state, each vertex of the simplex has a row of $\hat{D}$ located around it. Therefore, we can use the point cloud formed by the rows of $\hat{D}$ to estimate these vertices.

• Given the simplex vertices, one can directly transform the singular vectors into the nonnegative cone and obtain an estimate of $V^*$.

Algorithm 1 implements these ideas and customizes them to the state aggregation learning problem. We refer the reader to the appendix for a detailed explanation of the proposed method.
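The SVD and SCORE steps can be sketched as follows (a minimal illustration of Eq. (3), ours and not the full Algorithm 1: vertex hunting and the final transformation to the estimate of $V^*$ are omitted):

```python
import numpy as np

def score_embedding(N_tilde, r):
    """SCORE embedding of Eq. (3): rows lie near a simplex with r vertices."""
    # Leading r right singular vectors of the normalized count matrix.
    _, _, Vt = np.linalg.svd(N_tilde, full_matrices=False)
    H = Vt[:r].T                               # columns h_1, ..., h_r
    h1 = H[:, 0]
    if h1.sum() < 0:                           # fix the arbitrary SVD sign so that
        h1 = -h1                               # h_1 is entry-wise positive
    # Entry-wise division by the leading singular vector: the SCORE step.
    return H[:, 1:] / h1[:, None]              # shape (p, r - 1)
```

Feeding the matrix returned here to a vertex-hunting subroutine would complete the pipeline; the remaining columns of `H` are only defined up to sign, which does not affect the simplex structure.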

In Step 5, the estimated $U$ is not guaranteed to be a nonnegative matrix. There is an alternative to Step 5: for each state $j$, project the $j$-th row onto the convex hull of the estimated simplex vertices, where the projected vector is uniquely expressed as a convex combination of the vertices; the vector of combination coefficients is then used to estimate the $j$-th row of $U$. Numerical experiments suggest that both subroutines for estimating $U$ work equally well. When $r$ is a known constant, the overall method has a runtime polynomial in $p$ and $n$.

## 5 Main Statistical Results

In this section, we establish our main statistical upper bounds for estimating the aggregation and disaggregation matrices $U$ and $V$.

First, we present a statistical error bound for recovering the disaggregation distributions (the columns of $V$):

###### Theorem 1 (Statistical error bound for V).

Consider a $p$-state Markov chain with transition matrix $P$ that satisfies Assumptions 1 and 2 with a fixed $r$. Let $n$ sample transitions be generated independently from the stationary distribution. Suppose that the meta-state transition matrix $V^\top U$ is irreducible, that the $r$-th largest singular value of $P$ is bounded below by a constant, that the ratio between the maximum and minimum entries of the first right singular vector of $P$ is bounded by a constant, and that $\max_k V_{jk}\ge c/p$ for every state $j$ and some constant $c>0$. When $n$ is sufficiently large, with probability $1-o(1)$, the estimate $\hat{V}$ given by Algorithm 1 satisfies

$$\max_{1\le k\le r}\|\hat{v}_k-v_k\|_1\le C\sqrt{\frac{p\log(n)}{n}}.$$

Next, we present a statistical error bound for recovering the aggregation distributions (the rows of $U$):

###### Theorem 2 (Statistical error bounds for U).

Under the conditions of Theorem 1, if $n$ is additionally large enough, then with probability $1-o(1)$, the estimate $\hat{U}$ given by Algorithm 1 satisfies

$$\max_{1\le j\le p}\|\hat{u}_j-u_j\|_1\le C\left(1+\frac{p^{3}\sqrt{p}}{n\sqrt{n}}\right)\sqrt{\frac{p\log(n)}{n}}.$$

Theorems 1 and 2 provide total variation estimation error bounds that apply uniformly to every individual aggregation distribution and every individual disaggregation distribution.

In what follows, we explain the technical conditions required by Theorems 1 and 2 and their theoretical implications.

Remark (About regularity conditions). Our theoretical bounds require a few regularity conditions, but they are rather mild:

• The requirement that $V^\top U$ is irreducible means that it cannot be converted to a block-wise diagonal matrix via simultaneous row/column permutations. The case where $V^\top U$ has a block-wise diagonal structure rarely happens. Note that hard state aggregation only implies that $U$ has a block structure, not $V^\top U$. For lumpable Markov models, this condition is also satisfied as long as the latent Markov process is irreducible.

• The condition on the $r$-th singular value of $P$ is mild because $P$ is a low-rank matrix and its nonzero singular values are typically of constant order.

• The condition on the first right singular vector of $P$ is a form of incoherence and is commonly used in the literature on matrix completion and spectral analysis. We note that all coordinates of the first right singular vector are strictly positive, as a result of the irreducibility of $P$.

• The condition $\max_k V_{jk}\ge c/p$ is satisfied if every state can be reached sufficiently often (with probability at least of order $1/p$) in one step conditioned on some meta-state.

Informally speaking, these conditions require some form of irreducibility and reachability, as well as that the aggregation/disaggregation distributions be sufficiently independent so that one can separate them.

Remark (Comparison with existing bounds). Existing works on estimating low-rank Markov models rely on spectral decomposition; they only give error bounds for estimating $P$ or its principal subspace [23, 13]. However, these works did not account for the nonnegativity of the factorization, so they cannot recover any state aggregation structure. To the best of our knowledge, Theorems 1 and 2 give the first direct error bounds for estimating individual aggregation and disaggregation distributions.

Remark (Optimality of the error bounds). Consider the special case where $r=1$, so that the Markov chain amounts to repeated independent sampling from a fixed $p$-state distribution, which equals the single column of $V$. In this case, the leading term of the error bound in Theorem 1 is $\sqrt{p\log(n)/n}$, which matches the minimax total variation risk (up to a logarithmic factor) for estimating a $p$-state discrete distribution [7]. By a similar argument, the leading term of the error bound in Theorem 2 is also unimprovable for estimating the aggregation distributions, because there are $p$ aggregation distributions (each with support size $r$, which is treated as a constant) and each distribution is sampled roughly $n/p$ times. Therefore, our error bounds are optimal in their dependence on $p$ and $n$, as long as $n$ is sufficiently large.

Remark. The current results leave a few issues open. First, we assumed for simplicity that the number of meta-states $r$ is a constant, so our error bounds depend only on $p$ and $n$; it remains to analyze how the recovery error scales as $r$ varies. Second, we assumed for simplicity that the sample transitions are generated independently from the stationary distribution; it remains to analyze how the error changes when the sample transitions are drawn from a single long state trajectory, in which case one needs to analyze the mixing properties of the Markov chain. These issues are left for future research.

## 6 Numerical Experiments

Finally, we test the proposed state aggregation procedure on both simulated and real data.

### 6.1 Simulation Results

First, we test our approach on simulated sample transitions. For a $p$-state Markov chain with $r$ meta-states, we first randomly create two matrices $U$ and $V$ such that each meta-state has the same number of anchor states. After assembling the transition matrix $P=UV^\top$, we generate random-walk data as the input of Algorithm 1.

We carry out experiments with several numbers of anchor states per meta-state. The results with varying sample size $n$ are plotted in Figure 2. We observe that, when there are more anchor states, Algorithm 1 performs slightly better at recovering the columns of $V$. In the log-log plot, the error in $\hat{V}$ scales linearly with $\log n$; a fitted curve shows that the error decays at a rate close to $n^{-1/2}$, which is consistent with the error upper bound in Theorem 1.

We next compare Algorithm 1 with the spectral state compression method proposed in [23] in terms of recovering the transition matrix $P$. Let $\hat{P}$ be our estimator and let the competing estimator be computed by singular value thresholding. We measure the recovery error by an averaged per-row total variation divergence. The numerical results are plotted in Figure 2; our proposed method consistently outperforms the competing estimator.

### 6.2 Experiments with NYC Taxi Data

We analyze a data set of NYC yellow cab trips collected in January 2016 [1]. We treat each taxi trip as a sample transition generated by a city-wide Markov process over New York City, where the transition is from a pick-up location to a drop-off location. We discretize the map into cells (1922 locations in total) so that the Markov process has finitely many states.

We apply the proposed state aggregation method to the taxi-trip data and test various numbers of meta-states $r$. For each vertex produced by the algorithm, there exists a cluster of anchor states around it that represent the same meta-state. In order to connect these anchor states with a region in the city, we select the points closest to the vertex and locate the corresponding cells on the map. The convex hull of these cells forms an "anchor region", which is plotted in Figure 5.

Let $\hat{U}$, $\hat{V}$ be the estimated aggregation and disaggregation matrices. We use heat maps to visualize their columns. As an example, we pick two meta-states, with anchor states in the downtown and midtown areas, respectively, and plot in Figure 5 the corresponding columns of $\hat{U}$ and $\hat{V}$. Recall that the columns of $\hat{V}$ are the disaggregation distributions, and the columns of $\hat{U}$ can be thought of as likelihood functions for transitioning to the corresponding meta-states. The heat maps can be viewed as leading "modes" of the traffic dynamics, corresponding to anchor regions in the downtown and midtown areas, respectively.

Furthermore, we use the estimated aggregation and disaggregation distributions to partition Manhattan. Recall that in Algorithm 1, each state is projected onto a simplex and represented as a convex combination of the simplex's vertices. We assign each state to the cluster corresponding to the largest weight in this convex combination. In this way, we cluster the 1922 locations into a small number of regions. The partition results are shown in Figure 5, where the anchor regions are marked in each cluster.
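This final hard-partition step can be sketched as follows (`W` and `partition_states` are our illustrative names, not part of Algorithm 1; `W` holds the convex combination weights, one row per state):

```python
import numpy as np

def partition_states(W):
    """Assign each state to the meta-state with the largest membership weight."""
    return np.argmax(W, axis=1)

# Three states with soft memberships over three meta-states (rows sum to one).
W = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
print(partition_states(W))                    # → [0 1 2]
```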

## 7 Summary

This paper proposes a spectral method for uncovering the state aggregation structure from Markov transition data. We establish sharp statistical error bounds for each aggregation/disaggregation distribution. While our current results apply only to finite-state Markov chains, we hope they provide an initial step toward more powerful tools for statistical system identification. We also hope this paper points out an intriguing connection between state aggregation and topic modeling, which may inspire further work in this area and motivate the development of novel statistical methods for modeling complex systems.

## References

• [1] NYC Taxi and Limousine Commission (TLC) trip record data. Accessed June 11, 2018.
• [2] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going beyond SVD. In Foundations of Computer Science (FOCS), pages 1–10, 2012.
• [3] Dimitri P Bertsekas. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2007.
• [4] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.
• [5] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, pages 1141–1148, 2004.
• [6] Nicolas Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.
• [7] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
• [8] Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015.
• [9] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertex hunting. arXiv:1708.07852, 2017.
• [10] Jeff Johns and Sridhar Mahadevan. Constructing basis functions from directed graphs for value function approximation. In Proceedings of the 24th international conference on Machine learning, pages 385–392. ACM, 2007.
• [11] Zheng Tracy Ke and Minzhe Wang. A new SVD approach to optimal topic estimation. arXiv:1704.07016, 2017.
• [12] Daniel Lee and Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
• [13] Xudong Li, Mengdi Wang, and Anru Zhang. Estimation of Markov chain via rank-constrained likelihood. Proceedings of the 35th international conference on Machine learning, 2018.
• [14] Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 553–560. ACM, 2005.
• [15] Sridhar Mahadevan. Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends® in Machine Learning, 1(4):403–565, 2009.
• [16] Andrew W Moore. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning Proceedings 1991, pages 333–337. Elsevier, 1991.
• [17] Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. Analyzing feature generation for value-function approximation. In Proceedings of the 24th international conference on Machine learning, pages 737–744. ACM, 2007.
• [18] Marek Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In IJCAI, pages 2574–2579, 2007.
• [19] Zhiyuan Ren and Bruce H Krogh. State aggregation in Markov decision processes. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 4, pages 3819–3824. IEEE, 2002.
• [20] David F Rogers, Robert D Plante, Richard T Wong, and James R Evans. Aggregation and disaggregation techniques and methodology in optimization. Operations Research, 39(4):553–582, 1991.
• [21] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pages 361–368, 1995.
• [22] John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
• [23] Anru Zhang and Mengdi Wang. Spectral state compression of Markov processes. arXiv:1802.02920, 2018.

## Appendix A Rationale of Algorithm 1

In this section, we explain the rationale of Algorithm 1, especially:

• Why the SCORE normalization [8] produces a simplex geometry.

• How the simplex geometry is used for estimating $V$ and $U$.

Write $n=(n_1,\ldots,n_p)^\top$ and $m=(m_1,\ldots,m_p)^\top$ for short. The normalized data matrix satisfies

 $\tilde{N} \approx \mathrm{diag}(n)\, P\, [\mathrm{diag}(m)]^{-1/2} \equiv \tilde{N}^*.$ (4)

The matrix $\tilde{N}^*$ can be viewed as the “signal” part of $\tilde{N}$. Let $h_1,\ldots,h_r$ be the right singular vectors of $\tilde{N}^*$. They can be viewed as the population counterparts of the empirical singular vectors. We define a population counterpart of the matrix produced by SCORE:

 $D = [\mathrm{diag}(h_1)]^{-1}\,[h_2,\ldots,h_r].$ (5)

From now on, we pretend that the matrix $\tilde{N}^*$ is directly given and study the geometric structures associated with the singular vectors $h_1,\ldots,h_r$ and the SCORE matrix $D$.

### A.1 The simplex geometry and explanation of steps of Algorithm 1

When $P = UV^\top$, the matrix $\tilde{N}^*$, defined in (4), also admits a low-rank decomposition:

 $\tilde{N}^* = U^*(V^*)^\top,$

where

 $U^* = [\mathrm{diag}(n)]\,U, \qquad V^* = [\mathrm{diag}(m)]^{-1/2}\,V.$

By linear algebra, the span of the right singular vectors $h_1,\ldots,h_r$ is the same as the column space of $V^*$. It implies that there exists a nonsingular linear transformation $L\in\mathbb{R}^{r\times r}$ such that

 $H \equiv [h_1,\ldots,h_r] = V^* L.$ (6)

Since $V$ is a nonnegative matrix, each row of $H$ is a nonnegative combination of the rows of $L$. Furthermore, if $j$ is an anchor state, then the $j$-th row of $V$ has exactly one nonzero entry, and so the $j$-th row of $H$ is proportional to one row of $L$. This gives rise to the following simplicial-cone geometry:

###### Proposition 1 (Simplicial cone geometry).

Suppose $P = UV^\top$, each meta state has an anchor state, and $\mathrm{rank}(P)=r$. Let $H=[h_1,\ldots,h_r]$ contain the right singular vectors of $\tilde{N}^*$. There exists a simplicial cone in $\mathbb{R}^r$, which has $r$ extreme rays, such that all rows of $H$ are contained in this simplicial cone. Furthermore, for all anchor states $j$ of a meta state, the $j$-th row of $H$ lies exactly on one extreme ray of this simplicial cone.

Remark. Similar simplicial cone geometry has been discovered in the literature of nonnegative matrix factorization [5]. The simplicial cone there is associated with the rows of the matrix that admits a nonnegative factorization, but the simplicial cone here is associated with the singular vectors of the matrix. Since SVD is a linear projection, it is not surprising that the simplicial cone structure is retained in the singular vectors.

However, in practice, we have to apply SVD to the noisy matrix $\tilde{N}$; then the simplicial cone is corrupted by noise and hardly visible. We hope to find a proper normalization of $H$, so that the normalized rows are all contained in a simplex, where all points on the same extreme ray of the previous simplicial cone (these points do not overlap) fall onto one vertex of the current simplex (these points now overlap). Such a simplex geometry is much more robust to noise corruption and is easier to estimate.

How to normalize $H$ to obtain a simplex geometry is tricky. If all entries of $H$ were nonnegative, we could normalize each row of $H$ by the $\ell^1$-norm of that row, and the rows of the resulting matrix would be contained in a simplex. However, $H$ consists of singular vectors and often has negative entries, so such a normalization won’t work.

By Perron’s theorem in linear algebra, the leading right singular vector $h_1$ has all positive coordinates. It turns out that normalizing each row of $H$ by the corresponding coordinate of $h_1$ is a proper normalization that produces a simplex geometry. This is the idea of SCORE [8, 9, 11].

###### Proposition 2 (Post-SCORE simplex geometry).

In the setting of Proposition 1, additionally, we assume $P$ is an irreducible matrix. Consider the matrix $D$ defined in (5). Then, there exists a simplex $\mathcal{S}\subset\mathbb{R}^{r-1}$, which has $r$ vertices $b_1,\ldots,b_r$, such that all rows of $D$ are contained in this simplex. Furthermore, for all anchor states $j$ of a same meta state, the $j$-th row of $D$ falls exactly onto one vertex of this simplex.

Proposition 2 explains the rationale of the vertex hunting step. The vertex hunting step we used was borrowed from [9, 11]; see explanations therein.
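To make the geometry concrete, here is a small numerical sketch (all names and dimensions are hypothetical, not taken from the paper’s code). It builds a synthetic $P=UV^\top$ whose first $r$ states are anchor states, forms the SCORE matrix from the right singular vectors of $P$ (for simplicity we skip the $\mathrm{diag}(n)$ and $\mathrm{diag}(m)$ scalings in (4), which do not change the geometry), and hunts the simplex vertices with a simple successive-projection heuristic standing in for the vertex hunting of [9, 11]:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 12, 3

# Synthetic soft aggregation model P = U V^T; states 0..r-1 are anchor states,
# i.e., their rows of V have a single nonzero entry.
V = rng.uniform(0.2, 1.0, size=(p, r))
V[:r] = np.diag(rng.uniform(0.5, 1.5, size=r))
V /= V.sum(axis=0)                      # columns are disaggregation distributions
U = rng.dirichlet(np.ones(r), size=p)   # rows are aggregation distributions
P = U @ V.T

# SCORE normalization: divide [h_2, ..., h_r] entrywise by the leading
# right singular vector h_1 (entrywise of one sign, by Perron's theorem).
_, _, Vt = np.linalg.svd(P)
H = Vt[:r].T                            # right singular vectors h_1, ..., h_r
h1 = np.abs(H[:, 0])                    # resolve the SVD sign ambiguity
D = H[:, 1:] / h1[:, None]              # rows of D live in a simplex

def successive_projection(X, r):
    """Greedy vertex hunting: repeatedly pick the farthest point and project
    the rest onto the orthogonal complement (a simple SPA-type heuristic)."""
    Y = np.hstack([X, np.ones((X.shape[0], 1))]).T   # homogeneous coordinates
    idx = []
    for _ in range(r):
        j = int(np.argmax((Y ** 2).sum(axis=0)))
        idx.append(j)
        u = Y[:, j] / np.linalg.norm(Y[:, j])
        Y = Y - np.outer(u, u @ Y)
    return idx

vertices = successive_projection(D, r)
```

On this noiseless input the heuristic returns exactly the anchor states, and every row of D is a convex combination of the hunted vertices.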

Let $b_1,\ldots,b_r$ be the vertices of the simplex $\mathcal{S}$. By vertex hunting, we obtain estimates of these vertices. The next question is: How can we recover $V$ from the simplex vertices $b_1,\ldots,b_r$?

Let $d_j^\top$ be the $j$-th row of $D$, for $1\le j\le p$. By the nature of a simplex, each point in it can be uniquely expressed as a convex combination of the vertices. This means, for each $j$, there exists a weight vector $w_j$ in the standard simplex such that

 $d_j = \sum_{k=1}^r w_j(k)\, b_k.$
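Given (estimated) vertices, each weight vector can be computed by solving a small linear system that combines this equation with the sum-to-one constraint; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical simplex in R^2 (r = 3 vertices) and one point inside it.
b = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # vertices b_1, b_2, b_3
d_j = np.array([0.2, 0.3])                           # a row of the SCORE matrix D

# Solve  sum_k w(k) b_k = d_j  subject to  sum_k w(k) = 1:
# stack the vertex coordinates with a row of ones.
A = np.vstack([b.T, np.ones(3)])
w_j = np.linalg.solve(A, np.append(d_j, 1.0))
# w_j = [0.5, 0.2, 0.3]: nonnegative and summing to one, so d_j is in the simplex
```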

The next proposition shows that we can recover $V$ from these weight vectors.

###### Proposition 3 (Relation of simplex and matrix V).

In the setting of Proposition 2, each row of $D$ is a convex combination of the vertices of $\mathcal{S}$, i.e., for each $1\le j\le p$, there exists $w_j$ in the standard simplex such that $d_j=\sum_{k=1}^r w_j(k)b_k$. Furthermore, consider the matrix $W\in\mathbb{R}^{p\times r}$, whose $j$-th row equals $w_j^\top$. Then, $W$ and $V$ are connected by

 $[\mathrm{diag}(h_1)]\,[\mathrm{diag}(m)]^{1/2}\,W = V\,[\mathrm{diag}(l_1)],$

where $l_1$ is the first column of $L$ as defined in (6).

By Proposition 3, each column of the matrix

 $[\mathrm{diag}(h_1)]\,[\mathrm{diag}(m)]^{1/2}\,W$

is proportional to the corresponding column of $V$. Since each column of $V$ has a unit $\ell^1$-norm, if we normalize each column of the above matrix by its $\ell^1$-norm, we exactly recover $V$.
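A minimal sketch of this normalization step, with hypothetical positive inputs $h_1$, $m$, $l_1$ playing the roles in Proposition 3:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 10, 3

# Hypothetical ground truth: V with unit l1-norm columns; positive h1, m, l1.
V_true = rng.uniform(0.1, 1.0, size=(p, r))
V_true /= V_true.sum(axis=0)
h1 = rng.uniform(0.5, 1.5, size=p)
m = rng.uniform(0.5, 1.5, size=p)
l1 = rng.uniform(0.5, 1.5, size=r)

# W as in Proposition 3 (in the true model its rows lie in the standard simplex).
W = (V_true * l1) / (h1 * np.sqrt(m))[:, None]

# The matrix diag(h1) diag(m)^{1/2} W has columns proportional to those of V.
M = (h1 * np.sqrt(m))[:, None] * W
# Normalizing each column to unit l1-norm recovers V exactly.
V_hat = M / np.abs(M).sum(axis=0)
```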

Once we have obtained $V$, we can immediately recover $U$ from $P$ by the relation:

 $U = U(V^\top V)(V^\top V)^{-1} = PV(V^\top V)^{-1}.$

The above gives the following proposition:

###### Proposition 4 (Exact recovery of U and V).

In the setting of Proposition 2, if we apply Algorithm 1 to the matrix $\tilde{N}^*$, it exactly outputs $U$ and $V$.
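The whole noiseless pipeline can be sketched end to end (hypothetical synthetic data; the $\mathrm{diag}(n)$ and $\mathrm{diag}(m)$ scalings are skipped, and the known anchor rows stand in for vertex hunting, since on noiseless data the anchors sit exactly on the vertices):

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 15, 3

# Synthetic model P = U V^T; states 0..r-1 are anchor states of V.
V = rng.uniform(0.2, 1.0, size=(p, r))
V[:r] = np.diag(rng.uniform(0.5, 1.5, size=r))
V /= V.sum(axis=0)                       # disaggregation distributions
U = rng.dirichlet(np.ones(r), size=p)    # aggregation distributions
P = U @ V.T

# Step 1: SVD and SCORE normalization.
_, _, Vt = np.linalg.svd(P)
H = Vt[:r].T
h1 = np.abs(H[:, 0])
D = H[:, 1:] / h1[:, None]

# Step 2: vertex "hunting" -- on noiseless data the anchor rows are the vertices.
B = D[:r]

# Step 3: convex weights w_j of every row of D with respect to the vertices.
A = np.vstack([B.T, np.ones(r)])
rhs = np.vstack([D.T, np.ones(p)])       # each column encodes [d_j; 1]
W = np.linalg.solve(A, rhs).T            # rows w_j in the standard simplex

# Step 4: recover V (normalize diag(h1) W columnwise to unit l1-norm),
# then U = P V (V^T V)^{-1}.
M = h1[:, None] * W
V_hat = M / np.abs(M).sum(axis=0)
U_hat = P @ V_hat @ np.linalg.inv(V_hat.T @ V_hat)
```

Up to floating-point error, `V_hat` and `U_hat` coincide with the ground truth, matching Proposition 4.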

### A.2 Proof of propositions

Proposition 1 follows from (6) and the definition of a simplicial cone. Proposition 4 is proved in Section A.1. We now prove Propositions 2-3. Recall that by (6), for $1\le k\le r$,

 $h_k = V^* l_k,$

where $l_k$ is the $k$-th column of $L$. When $P$ is an irreducible matrix, by Perron’s theorem, all the coordinates of $h_1$ are strictly positive. This guarantees that $D$ is well-defined. Additionally, for an anchor state $j$ of the $k$-th meta state, $h_1(j) = m_j^{-1/2} V_{jk}\, l_1(k)$, where $V_{jk}>0$. Therefore, $l_1(k)>0$ for $1\le k\le r$. We define a matrix

 $B = [\mathrm{diag}(l_1)]^{-1}\,[l_2,\ldots,l_r].$

By definition,

 $[1, D] = [\mathrm{diag}(h_1)]^{-1} H,$ (7)

and

 $[1, B] = [\mathrm{diag}(l_1)]^{-1} L.$ (8)

Combining them with (6) gives

 $[1, D] = [\mathrm{diag}(h_1)]^{-1}\, V^*\, [\mathrm{diag}(l_1)]\,[1, B].$ (9)

Let

 $W = [\mathrm{diag}(h_1)]^{-1}\, V^*\, [\mathrm{diag}(l_1)].$

Then (9) implies

 $1 = W1, \qquad D = WB.$

Since $W$ is a nonnegative matrix, the first equation implies that each row of $W$ lies in the standard simplex, and the second equation implies that each row of $D$ is a convex combination of the rows of $B$, where the combination coefficients come from the corresponding row of $W$. This proves the simplex geometry stated in Proposition 2.

Note that the $j$-th row of $D$ is located on one vertex of the simplex if and only if the $j$-th row of $W$ is located on one vertex of the standard simplex. From the way we define $W$, its $j$-th row equals

 $w_j^\top = \frac{1}{h_1(j)\sqrt{m_j}}\,\bigl[V_{j1}\,l_1(1),\; V_{j2}\,l_1(2),\; \ldots,\; V_{jr}\,l_1(r)\bigr].$

Since $h_1$, $m$ and $l_1$ are all positive vectors, $w_j$ is located on one vertex of the standard simplex if and only if exactly one of $V_{j1},\ldots,V_{jr}$ is nonzero, where the latter is true if and only if $j$ is an anchor state. This proves Proposition 2.

Furthermore, from the way $W$ is defined above, using the fact that $V^* = [\mathrm{diag}(m)]^{-1/2}V$, we immediately find that

 $W = [\mathrm{diag}(h_1)]^{-1}\,[\mathrm{diag}(m)]^{-1/2}\,V\,[\mathrm{diag}(l_1)],$

which is equivalent to

 $[\mathrm{diag}(h_1)]\,[\mathrm{diag}(m)]^{1/2}\,W = V\,[\mathrm{diag}(l_1)].$

This proves Proposition 3.

## Appendix B Proof of Theorems 1-2

Theorem 1 is adapted from Theorem 2.2 of [11], by connecting our setting with a topic modeling setting. Since the sample transitions are generated independently from the stationary distribution, we have:

• Each row count $n_j$ follows a binomial distribution with mean $n\pi_j$, where $\pi_j$ is the stationary probability of state $j$, so $\mathbb{E}[n_j]$ is of order $n/p$. Combining this with large-deviation inequalities for Bernoulli variables (e.g., the Bernstein inequality), with probability $1-o(1)$, simultaneously for all $1\le j\le p$, $n_j \ge Cn/p$ for some constant $C>0$.

• Conditioning on $(n_1,\ldots,n_p)$, the rows of the count matrix $N$ are independent of each other, where the $j$-th row follows a multinomial distribution with $n_j$ trials and probability vector $p_j$, where $p_j^\top$ is the $j$-th row of $P$.

Therefore, conditioning on $(n_1,\ldots,n_p)$, each row of $N$ follows the same statistical model as the bag-of-words representation of a document of length $n_j$, generated from the topic matrix $V$ with a weight vector $u_j$ over the topics. Therefore, Theorem 1 is an application of Theorem 2.2 of [11] to the case where the dictionary size is $p$, the number of documents is $p$, and the minimum document length is $\min_{1\le j\le p} n_j$.
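This correspondence is easy to simulate; a sketch with hypothetical dimensions, which draws the row counts from independent samples of the stationary distribution and then fills each row of the count matrix with one multinomial draw (equivalent in distribution to tallying the sampled transitions):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
n = 200_000

# A small ergodic transition matrix P (rows are distributions).
P = rng.uniform(0.5, 1.5, size=(p, p))
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution pi: leading left eigenvector of P.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Row counts n_j ~ Binomial(n, pi_j); row j | n_j ~ Multinomial(n_j, p_j),
# exactly like a bag-of-words vector for a document of length n_j.
x = rng.choice(p, size=n, p=pi)
N = np.zeros((p, p))
for j in range(p):
    n_j = int((x == j).sum())
    N[j] = rng.multinomial(n_j, P[j])

P_hat = N / N.sum(axis=1, keepdims=True)   # empirical transition rows
```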

We notice that Theorem 2.2 of [11] assumes all documents have equal length. Here, the $n_j$’s are different; but since they are of the same order, the proof of Theorem 2.2 of [11] still applies. Moreover, our regularity conditions look different from those of Theorem 2.2 of [11], as we have adapted them to the state aggregation setting.

We now prove Theorem 2. Fix $1\le j\le p$. Note that

 $\hat{u}_j^\top = \hat{p}_j^\top\,\hat{V}(\hat{V}^\top\hat{V})^{-1}, \qquad u_j^\top = p_j^\top\, V(V^\top V)^{-1}.$

As a result,

 $\|\hat{u}_j - u_j\|_1 \le \|(\hat{p}_j - p_j)^\top V(\hat{V}^\top\hat{V})^{-1}\|_1$ (10)
 $\qquad + \|p_j^\top V[(\hat{V}^\top\hat{V})^{-1} - (V^\top V)^{-1}]\|_1$ (11)
 $\qquad + \|\hat{p}_j^\top(\hat{V} - V)(\hat{V}^\top\hat{V})^{-1}\|_1$ (12)
 $\qquad \equiv (I_1) + (I_2) + (I_3).$ (13)
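Before taking norms, the split into $(I_1)$, $(I_2)$, $(I_3)$ in (10) is an exact algebraic identity; a quick numerical check with random stand-ins (hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 8, 3

# Random stand-ins for p_j, its estimate, V and V-hat.
p_j = rng.random(p)
p_hat = rng.random(p)
V = rng.random((p, r))
V_hat = V + 0.01 * rng.random((p, r))

G_hat = np.linalg.inv(V_hat.T @ V_hat)
G = np.linalg.inv(V.T @ V)
u_hat = p_hat @ V_hat @ G_hat
u = p_j @ V @ G

# The three terms behind (I1), (I2), (I3): they telescope exactly to
# u_hat - u, so the l1 triangle inequality gives the bound above.
t1 = (p_hat - p_j) @ V @ G_hat
t2 = p_j @ V @ (G_hat - G)
t3 = p_hat @ (V_hat - V) @ G_hat
```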

We shall bound the three terms separately. Consider $(I_1)$. Let $V_k$ be the $k$-th column of $V$. We first give a bound for $|(\hat{p}_j - p_j)^\top V_k|$. Let $y_1,\ldots,y_{n_j}$ be drawn independently from a multinomial distribution with the number of trials equal to $1$ and the probability vector equal to $p_j$. Then,

 $(\hat{p}_j - p_j)^\top V_k \overset{(d)}{=} \frac{1}{n_j}\sum_{m=1}^{n_j} V_k^\top (y_m - p_j).$

Each summand $V_k^\top(y_m - p_j)$ is bounded by $2\|V_k\|_\infty$. Moreover, the variance of $V_k^\top(y_m - p_j)$ is bounded by $\|V_k\|_\infty^2 \le Cp^{-2}$, where we have used the fact that $\|V_k\|_\infty \le Cp^{-1}$. Using the Bernstein inequality, we know that with probability $1-o(1)$,

 $|(\hat{p}_j - p_j)^\top V_k| \le \frac{C\sqrt{\log(n)}}{p\sqrt{n_j}} \le \frac{C\sqrt{\log(n)}}{\sqrt{np}}.$

Since this is true for every $1\le k\le r$, we immediately find that

 $\|(\hat{p}_j - p_j)^\top V\|_1 \le \frac{C\sqrt{\log(n)}}{\sqrt{np}},$ (14)

where we note that $r = O(1)$ by our assumption. Write $\hat{v}_j^\top$ and $v_j^\top$ for the $j$-th rows of $\hat{V}$ and $V$, respectively. Theorem 2.2 of [11] provides a row-wise error bound on $\hat{V}$: with probability $1-o(1)$, for all $1\le j\le p$,

 $\|\hat{v}_j - v_j\|_1 \le \|v_j\|_1 \cdot err_n.$ (15)

As a result, we can get an upper bound on $\|\hat{V} - V\|_F$:

 $\|\hat{V} - V\|_F^2 = \sum_{j=1}^p \|\hat{v}_j - v_j\|^2 \le \sum_{j=1}^p \|\hat{v}_j - v_j\|_1^2$ (16)
 $\qquad \le err_n^2 \sum_{j=1}^p \|v_j\|_1^2$ (17)
 $\qquad \le r\cdot err_n^2 \sum_{j=1}^p \|v_j\|^2$ (18)
 $\qquad \le C\,err_n^2\,\|V\|_F^2 \le C p^{-1} err_n^2.$ (19)

In particular, it implies $\|\hat{V} - V\| \le \|\hat{V} - V\|_F \le Cp^{-1/2}\,err_n$. We notice that

 $\|\hat{V}^\top\hat{V} - V^\top V\| \le 2\|(\hat{V} - V)^\top V\| + \|(\hat{V} - V)^\top(\hat{V} - V)\| \le 3\|\hat{V} - V\|\,\|V\| \le 3\|\hat{V} - V\|_F\,\|V\| \le C\,err_n/p,$

where in the last inequality we have used (16) and that $\|V\| \le \|V\|_F \le Cp^{-1/2}$. As a result,

 $\|(\hat{V}^\top\hat{V})^{-1} - (V^\top V)^{-1}\|$ (20)
 $\qquad = \|(\hat{V}^\top\hat{V})^{-1}\,(V^\top V - \hat{V}^\top\hat{V})\,(V^\top V)^{-1}\|$ (21)
 $\qquad \le Cp\cdot err_n.$ (22)

In particular, $\|(\hat{V}^\top\hat{V})^{-1}\| \le Cp$. Combining it with (14), we find that

 $|(I_1)| \le \|(\hat{p}_j - p_j)^\top V\|_1\,\|(\hat{V}^\top\hat{V})^{-1}\|_1$ (23)
 $\qquad \le \sqrt{r}\,\|(\hat{p}_j - p_j)^\top V\|_1\,\|(\hat{V}^\top\hat{V})^{-1}\|$ (24)
 $\qquad \le C\sqrt{\frac{p\log(n)}{n}}.$ (25)

Consider $(I_2)$. It is seen that

 $|(I_2)| \le \|p_j^\top V\|_1\,\|(\hat{V}^\top\hat{V})^{-1} - (V^\top V)^{-1}\|_1 \le \sqrt{r}\,\|p_j^\top V\|_1\,\|(\hat{V}^\top\hat{V})^{-1} - (V^\top V)^{-1}\|.$

Note that we have $\|p_j^\top V\|_1 \le Cp^{-1}$. Combining it with the bound in (20), we immediately find that

 $|(I_2)| \le C\,err_n.$ (26)

Consider $(I_3)$. It is seen that

 $|(I_3)| \le \|\hat{p}_j^\top(\hat{V} - V)\|_1\,\|(\hat{V}^\top\hat{V})^{-1}\|_1 \le \sqrt{r}\,\|\hat{p}_j^\top(\hat{V} - V)\|_1\,\|(\hat{V}^\top\hat{V})^{-1}\|$
 $\qquad \le Cp\cdot\|\hat{p}_j^\top(\hat{V} - V)\|_1 \le Cp\cdot\bigl(\|p_j^\top(\hat{V} - V)\|_1 + \|(\hat{p}_j - p_j)^\top V\|_1 + \|(\hat{p}_j - p_j)^\top(\hat{V} - V)\|_1\bigr).$

The third term is dominated by the first two terms. We already have the bound (14) for $\|(\hat{p}_j - p_j)^\top V\|_1$. Additionally, by (15),

 $\|p_j^\top(\hat{V} - V)\|_1 \le \sum_{i=1}^p p_j(i)\,\|\hat{v}_i - v_i\|_1$ (27)
 $\qquad \le err_n \sum_{i=1}^p p_j(i)\,\|v_i\|_1$ (28)
 $\qquad \le C\,err_n\,\max_i \|v_i\|_1$ (29)
 $\qquad \le Cp^{-1}\,err_n.$ (30)

Combining (14) and (27) gives

 $|(I_3)| \le C\Bigl(err_n + \sqrt{\frac{p\log(n)}{n}}\,\Bigr) \le C\,err_n.$ (31)

The claim follows from (23), (26) and (31).