## 1 Introduction

Latent variable models are classical approaches to explain observed data through unobserved concepts. They have been successfully applied in a wide variety of fields, such as speech recognition, natural language processing, and computer vision [15, 19, 14, 4]. Despite their successes, latent variable models are typically studied in the offline setting. However, in many practical problems, a learning agent needs to learn a latent variable model online while interacting with real-time data with unobserved concepts. For instance, a recommender system may want to learn to cluster its users online based on their real-time behavior. This paper aims to develop algorithms for such online learning problems.

Previous works proposed several algorithms to learn latent variable models online by extending the expectation maximization (EM) algorithm. Those algorithms are known as online EM algorithms, and include the stepwise EM [5, 12] and the incremental EM [13]. Similar to the offline EM, each iteration of an online EM algorithm includes an E-step to fill in the values of latent variables based on their estimated distribution, and an M-step to update the model parameters. The main difference is that each step of online EM algorithms only uses data received in the current iteration, rather than the whole dataset. This ensures that online EM algorithms are computationally efficient and can be used to learn latent variable models online. However, similarly to the offline EM, online EM algorithms have one major drawback: they may converge to a local optimum and hence suffer from a non-diminishing performance loss.

To overcome these limitations, we develop an online learning algorithm that performs almost as well as the globally optimal latent variable model, which we call . Specifically, we propose an online learning variant of the spectral method [2], which can learn the parameters of latent variable models offline with guarantees of convergence to a global optimum. Our online learning setting is defined as follows. We have a sequence of topic models, one at each time . The prior distribution of topics can change arbitrarily over time, while the conditional distribution of words is stationary. At time , the learning agent observes a document of words, which is sampled i.i.d. from the model at time . The goal of the agent is to predict a sequence of model parameters with low cumulative regret with respect to the best solution in hindsight, which is constructed based on the sampling distribution of the words over steps.

This paper makes several contributions. First, it is the first paper to formulate online learning with the spectral method as a regret minimization problem. Second, we propose , an online learning variant of the spectral method for our problem. To reduce computational and space complexities of , we introduce reservoir sampling. Third, we prove a sublinear upper bound on the -step regret of . Finally, we compare to the stepwise EM in extensive synthetic and real-world experiments. We observe that the stepwise EM is sensitive to the setting of its hyper-parameters. In all experiments, performs similarly to or better than the stepwise EM with tuned hyper-parameters.

## 2 Related Work

The spectral method by tensor decomposition has been widely applied in different latent variable models, such as mixtures of tree graphical models [2], mixtures of linear regressions [6], hidden Markov models (HMM) [3], latent Dirichlet allocation (LDA) [1], Indian buffet process [17], and hierarchical Dirichlet process [18]. One major advantage of the spectral method is that it learns globally optimal solutions [2]. The spectral method first empirically estimates low-order moments of observations and then applies decomposition methods with a unique solution to recover the model parameters.

Online learning methods for latent variable models usually extend iterative methods that were designed for the offline setting. Offline EM calculates the sufficient statistics based on all the data, while in online EM the sufficient statistics are updated with the recent data in each iteration [5, 13, 12]. Online variational inference is used to learn LDA efficiently [10]. These online algorithms may converge to local optima, whereas we aim to develop an algorithm with a theoretical guarantee of convergence to a global optimum.

An online spectral learning method has also been developed [11], with a focus on improving computational efficiency, by conducting optimization of multilinear operations in SGD and avoiding directly forming the tensors. Online stochastic gradient for tensor decomposition has been analyzed [8] with a different online setting: they do not look at the online problem as regret minimization and the analysis focuses on convergence to a local minimum. In contrast, we develop an online spectral method with a theoretical guarantee of convergence to a global optimum. Besides, our method is robust in the non-stochastic setting where the topics of documents are correlated over time. This non-stochastic setting has not been previously studied in the context of online spectral learning [11].

## 3 Spectral Method for Topic Model

This section introduces the spectral method in latent variable models. Specifically, we describe how the method works in the simple bag-of-words model [2].

In the bag-of-words model, the goal is to understand the latent topic of documents based on the observed words in each document. Without loss of generality, we describe the spectral method and (Section 5) in the setting where the number of words in each document is three. The extension to more than three words is straightforward [2]. Let the number of distinct topics be and the size of the vocabulary be . Then our model can be viewed as a mixture model, where the observed words , , and are conditionally i.i.d. given topic , which is also i.i.d. Later, in Section 4, we study a more general setting where the topics are non-stationary, in the sense that the distributions of topics can change arbitrarily over time. Each word is one-hot encoded, if and only if represents word , where is the standard coordinate basis in . The model is parameterized by the probability of each topic , for , and the conditional probability of all words given topic . The th entry of is for . With three observed words, it suffices to construct a third-order tensor as

To recover the parameters of the topic model, we want to decompose as

(1) |

Unfortunately, such a decomposition is generally NP-hard [2]. Instead, we can decompose an orthogonally decomposable tensor. One way to make orthogonally decomposable is by whitening. We can define a whitening matrix as , where is the diagonal matrix of positive eigenvalues of , and is the matrix of eigenvectors associated with those eigenvalues. After whitening, instead of decomposing , we can decompose as by the *power iteration method* [2]. Finally, the model parameters are recovered by and , where is the pseudoinverse of . In practice, only a noisy realization of , , is typically available, and it is constructed from empirical counts. Such tensors can be decomposed approximately, and the error of such a decomposition is analyzed in Theorem 5.1 of Anandkumar *et al.* [2].
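The whitening and power iteration steps described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it works on exact (noise-free) moments, and the names `w` (topic probabilities), `U` (conditional word probabilities, one column per topic), `M2`, `M3`, and `spectral_decompose` are assumptions introduced only for this example. The recovery formulas follow the standard orthogonal tensor decomposition of Anandkumar *et al.* [2].

```python
import numpy as np

def spectral_decompose(M2, M3, K, n_iter=100, seed=0):
    """Recover topic probabilities w (K,) and word distributions U (d, K)
    from second- and third-order moments M2 (d, d) and M3 (d, d, d)."""
    rng = np.random.default_rng(seed)
    # Whitening matrix W = A diag(lam)^(-1/2) from the top-K eigenpairs of M2,
    # so that W^T M2 W is the K x K identity.
    eigvals, eigvecs = np.linalg.eigh(M2)
    top = np.argsort(eigvals)[::-1][:K]
    W = eigvecs[:, top] / np.sqrt(eigvals[top])
    # Whitened third-order tensor T = M3(W, W, W), shape (K, K, K).
    T = np.einsum("abc,ai,bj,ck->ijk", M3, W, W, W)
    lams, vecs = [], []
    for _ in range(K):
        v = rng.standard_normal(K)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            # Tensor power update: v <- T(I, v, v), then normalize.
            v = np.einsum("ijk,j,k->i", T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum("ijk,i,j,k->", T, v, v, v)
        lams.append(lam)
        vecs.append(v)
        # Deflation: remove the recovered rank-one component.
        T = T - lam * np.einsum("i,j,k->ijk", v, v, v)
    lams = np.array(lams)
    V = np.stack(vecs, axis=1)
    w = 1.0 / lams ** 2                      # topic probabilities
    U = (np.linalg.pinv(W.T) @ V) * lams     # word distributions, one column per topic
    return w, U

# Toy model with K = 3 topics and a vocabulary of d = 4 words (made-up numbers).
w_true = np.array([0.5, 0.3, 0.2])
U_true = np.array([[0.7, 0.1, 0.1],
                   [0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.7],
                   [0.1, 0.1, 0.1]])
M2 = np.einsum("z,iz,jz->ij", w_true, U_true, U_true)
M3 = np.einsum("z,iz,jz,kz->ijk", w_true, U_true, U_true, U_true)
w_hat, U_hat = spectral_decompose(M2, M3, K=3)
```

The recovered topics come back in arbitrary order, so any comparison with the ground truth has to match them up to a permutation. With empirical rather than exact moments, the recovery is only approximate.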

## 4 Online Learning for Topic Models

We study the following online learning problem in a single topic model. We have a sequence of topic models, one at each time . The prior distribution of topics can change arbitrarily over time, while the conditional distribution of words is stationary. We denote by a tuple of one-hot encoded words in the document at time , which is sampled i.i.d. from the model at time . Non-stationary distributions of topics are common in practice. For instance, in the recommender system example in Section 1, user clusters tend to be correlated over time. The clusters can be viewed as topics.

We represent the distribution of words at time by a cube . In particular, the probability of observing the triplet of words at time is

(2) |

where is the prior distribution of topics at time . This prior distribution can change arbitrarily with .
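As a small illustration of this construction, the sketch below builds the cube of triplet probabilities for a given prior over topics, using the fact that the three words are conditionally i.i.d. given the topic. The names `word_cube`, `prior_t`, and `U` are hypothetical and the numbers are made up.

```python
import numpy as np

def word_cube(prior_t, U):
    """Cube P with P[i, j, k] = sum_z prior_t[z] * U[i, z] * U[j, z] * U[k, z],
    the probability of observing the word triplet (i, j, k)."""
    return np.einsum("z,iz,jz,kz->ijk", prior_t, U, U, U)

# Stationary conditional word distributions (columns) for 2 topics, 3 words.
U = np.array([[0.8, 0.1],
              [0.1, 0.1],
              [0.1, 0.8]])
# The prior over topics may differ arbitrarily between time steps.
cube_t1 = word_cube(np.array([0.9, 0.1]), U)
cube_t2 = word_cube(np.array([0.2, 0.8]), U)
```

Each cube is a valid distribution over triplets and is symmetric under any permutation of its three axes. Only the prior changes between `cube_t1` and `cube_t2`.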

The learning agent predicts the distribution of words at time and is evaluated by its per-step loss . The agent aims to minimize its cumulative loss, which measures the difference between the predicted distribution and the observations over time.

But what should the loss be? In this work, we define the *loss* at time as

(3) |

where is the *Frobenius norm*. For any tensor , we define its Frobenius norm as . This choice can be justified as follows. Let

(4) |

be the average of distributions from which are generated in steps. Then

(5) |

as shown in Lemma 1 in Section 6.4. In other words, the loss function is chosen such that a natural *best solution in hindsight*, in (5), is the minimizer of the cumulative loss.

With the definition of the loss function and the best solution in hindsight, the goal of the learning agent is to minimize the regret

(6) |

where is the loss of our estimated model at time and is the loss of the best solution in hindsight, respectively.
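The relationship between the loss, the best solution in hindsight, and the regret can be sketched numerically. The exact form of the loss in (3) is not reproduced here, so this example assumes a squared Frobenius distance between the predicted tensor and a per-step target tensor. Under that assumption, the average of the targets minimizes the cumulative loss. The running-average predictor is only an illustration and is not the algorithm proposed in this paper. All names are hypothetical.

```python
import numpy as np

def frobenius_sq(A, B):
    """Squared Frobenius distance between two tensors of the same shape."""
    return float(np.sum((A - B) ** 2))

def regret(predictions, targets):
    """Cumulative loss of the predictions minus that of the best fixed tensor
    in hindsight (the average of the targets under the assumed loss)."""
    best = np.mean(targets, axis=0)
    learner = sum(frobenius_sq(M, X) for M, X in zip(predictions, targets))
    hindsight = sum(frobenius_sq(best, X) for X in targets)
    return learner - hindsight, best

rng = np.random.default_rng(0)
d, n = 3, 50
# Synthetic per-step target distributions over word triplets (made-up data).
targets = rng.dirichlet(np.ones(d ** 3), size=n).reshape(n, d, d, d)
# Illustrative predictor: uniform at the first step, then the running average
# of the targets seen so far.
predictions = [np.full((d, d, d), 1.0 / d ** 3)]
for t in range(1, n):
    predictions.append(targets[:t].mean(axis=0))
reg, best = regret(predictions, targets)
```

Any fixed prediction, such as the uniform cube, has nonnegative regret against `best`, because `best` minimizes the cumulative squared distance.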

## 5 Algorithm

We propose , an online learning algorithm for minimizing the regret in (6). Its pseudocode is in Algorithm 1. At each time , the input is the observation . We also maintain a set of reservoir samples , where is the set of the time indices of these reservoir samples from the previous time steps.

The algorithm operates as follows. First, in line we construct the second-order moment from the reservoir samples, where is the set of all -permutations of . Then we estimate and by eigendecomposition, and construct the whitening matrix in line . After whitening, in line we build the third-order tensor from whitened words , where is the set of all -permutations of . Then in line with the power iteration method [2], we decompose and get its eigenvalues and eigenvectors . Finally, in line we recover the parameters of the model, the probability of topics and the conditional probability of words . After recovering the parameters, we update the set of reservoir samples from line to line . We keep reservoir samples , . When , the new observation is added to the pool. When , the new observation replaces a random observation in the pool with probability .
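The reservoir update at the end of each step can be sketched as follows. This is the textbook reservoir sampling scheme, in which the new observation replaces a uniformly chosen stored one with probability equal to the reservoir capacity divided by the number of observations seen so far. That replacement probability and the names `reservoir_update` and `capacity` are assumptions of this sketch.

```python
import random

def reservoir_update(reservoir, observation, t, capacity, rng=random):
    """Update the reservoir in place with the t-th observation (t is 1-indexed)."""
    if len(reservoir) < capacity:
        # The pool is not full yet: always keep the new observation.
        reservoir.append(observation)
    elif rng.random() < capacity / t:
        # The pool is full: overwrite a uniformly random slot.
        reservoir[rng.randrange(capacity)] = observation
    return reservoir

# Stream 1000 observations (here just their time indices) through a pool of 10.
rng = random.Random(0)
pool = []
for t in range(1, 1001):
    reservoir_update(pool, t, t, capacity=10, rng=rng)
```

With this scheme, every observation seen so far is equally likely to be in the pool, and the memory footprint stays fixed regardless of how long the stream runs.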

In , we introduce reservoir sampling for computational efficiency reasons. Without reservoir sampling, the operations in lines and of Algorithm 1 would depend on because all past observations are used to construct and . Besides, the whitening operation in line would depend on because all past observations are whitened by a matrix that changes with . With reservoir sampling, we approximate , , and with reservoir samples. We discuss how to set in detail in Section 6.2.
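The moment estimates built from reservoir samples can be sketched as below. Each sample is assumed to be a triplet of word indices, and the estimates average outer products over all orderings of the words within a document. The averaging normalization and the names `second_moment` and `whitened_third_moment` are assumptions of this sketch, since the exact expressions are not reproduced here.

```python
import itertools
import numpy as np

def second_moment(samples, d):
    """Empirical second-order moment (d, d) from word-index triplets,
    averaged over all 2-permutations of the three word positions."""
    perms = list(itertools.permutations(range(3), 2))
    M2 = np.zeros((d, d))
    for words in samples:
        for a, b in perms:
            # Outer product of two one-hot vectors is a single unit entry.
            M2[words[a], words[b]] += 1.0
    return M2 / (len(samples) * len(perms))

def whitened_third_moment(samples, W):
    """Empirical third-order tensor (K, K, K) from whitened words,
    averaged over all 3-permutations of the three word positions."""
    K = W.shape[1]
    perms = list(itertools.permutations(range(3), 3))
    T = np.zeros((K, K, K))
    for words in samples:
        # Whitening a one-hot word selects the matching row of W.
        y = W[list(words)]
        for a, b, c in perms:
            T += np.einsum("i,j,k->ijk", y[a], y[b], y[c])
    return T / (len(samples) * len(perms))

# Two made-up documents over a vocabulary of 3 words.
samples = [(0, 1, 2), (1, 1, 0)]
M2_hat = second_moment(samples, d=3)
# The identity matrix stands in for a whitening matrix in this toy check.
T_hat = whitened_third_moment(samples, np.eye(3))
```

Working with whitened words keeps the third-order tensor at size K by K by K rather than the cube of the vocabulary size, and the cost of both functions scales with the number of reservoir samples rather than with the number of elapsed steps.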

## 6 Analysis

In this section, we bound the regret of . In Section 6.1, we analyze the regret of without reservoir sampling in the noise-free setting. In this setting, the agent knows at time . The regret is due to not knowing at time . In Section 6.2, we analyze the regret of with reservoir sampling in the noise-free setting. In this setting, the agent knows at time , which is a subset of . In comparison to Section 6.1, the additional regret is due to reservoir sampling. In Section 6.3, we discuss the regret of with reservoir sampling in the noisy setting. In this setting, the agent approximates each distribution with its single empirical observation , for any . In comparison to Section 6.2, the additional regret is due to noisy observations.

### 6.1 without Reservoir Sampling in Noise-Free Setting

We first analyze an idealized variant of , where the agent knows at time . In this setting, the algorithm is similar to Algorithm 1, except that lines and are replaced by

We denote by the corresponding whitening matrix in line , and by and the estimated model parameters. Note that in the noise free setting, the power iteration method in line is exact, and therefore

at any time , according to (1). The main result of this section is stated below.

###### Theorem 6.1.

Let at all times . Then

###### Proof.

From Lemma 2 in Section 6.4, . Now note that at any time , and therefore

At any time and for any ,

where the first inequality is by Lemma 3 and the second inequality is from the fact that all entries of are in at any time .

This concludes our proof.

### 6.2 with Reservoir Sampling in Noise-Free Setting

We further analyze with reservoir sampling in the noise-free setting. As discussed in Section 5, without reservoir sampling, the construction time of the decomposed tensor at time would grow linearly with , which is undesirable. In this setting, the algorithm is similar to Algorithm 1, except that lines and are replaced by

where are indices of the reservoir samples at time . We denote by the corresponding whitening matrix in line , and by and the estimated model parameters. As in Section 6.1, the power iteration method in line is exact, and therefore

at any time . The main result of this section is stated below.

###### Theorem 6.2.

Let all corresponding entries of and be close with a high probability,

(7) |

for some small and . Let at all times . Then

###### Proof.

From the definition of and the bound in Theorem 6.1,

We bound the first term above as follows. Suppose that the event in (7) does not happen. Then , from Lemma 3 and the fact that all corresponding entries of and are close. Now suppose that the event in (7) happens. Then , from Lemma 3 and the fact that all entries of and are in . Finally, note that the event in (7) happens with probability . Now we chain all inequalities and get that .

Note that the reservoir at time , , is a random sample of size for any . Therefore, from Hoeffding’s inequality [9] and the union bound, we get that

In addition, let the size of the reservoir be . Then the regret bound in Theorem 6.2 simplifies to

The above bound can be sublinear in only if . Moreover, the definitions of and imply that . As a result of these constraints, the range of reasonable values for is .

For any , the regret is sublinear in , where is a tunable parameter. At lower values of , but the reservoir size approaches . At higher values of , the reservoir size is but approaches .

### 6.3 with Reservoir Sampling in Noisy Setting

Finally, we discuss the regret of with reservoir sampling in the noisy setting. In this setting, the analyzed algorithm is Algorithm 1. The predicted distribution at time is .

From the definition of and the discussion in Section 6.2, can be decomposed and bounded from above as

(8) |

when the size of the reservoir is .

Suppose that as , for instance by setting . Under this assumption, in approaches (Section 6.2) because is an empirical estimator of on observations. By Weyl’s and Davis-Kahan theorems [20, 7], the eigenvalues and eigenvectors of approach those of as , and thus the whitening matrix in approaches (Section 6.2). Since in is an empirical estimator of (Section 6.2) on whitened observations, where , as . By Theorem 5.1 of Anandkumar *et al.* [2], the eigenvalues and eigenvectors of approach those of as . This implies that , as all quantities in the definitions of and approach each other as . Therefore, and the regret bound in (8) is , sublinear in , as .

### 6.4 Technical Lemmas

###### Lemma 1.

###### Proof.

It is sufficient to show that

(10) |

for any , where and are the -th entries of tensors and , respectively. To prove the claim, let . Then

Now we put the derivative equal to zero and get . This concludes our proof.

###### Lemma 2.

For any , .

###### Proof.

###### Lemma 3.

For any tensors satisfying , and satisfying , we have

###### Proof.

The proof follows from elementary algebra.

The first equality is from . The first inequality is from the reverse triangle inequality. The second inequality is from the triangle inequality. The third equality is from the fact that only one entry of is and all the rest are , by the definition of in Section 4. The third inequality is from , and , which follow from the fact that tensors and represent distributions with all entries in and summing up to .

## 7 Experiments

In this section, we evaluate and compare it with stepwise EM [5]. We experiment with both stochastic and non-stochastic synthetic problems, as well as with two real-world problems.

Our chosen baseline is stepwise EM [5], an online EM algorithm. We choose this baseline as it outperforms other online EM algorithms, such as incremental EM [12]. Stepwise EM has two key tuning parameters: the step-size reduction power and the mini-batch size [12, 5]. The smaller the , the faster the old sufficient statistics are forgotten. The mini-batch size is the number of documents used to calculate the sufficient statistics for each update of stepwise EM. In the following experiments, we compare to stepwise EM with varying and .

All compared algorithms are evaluated by their models at time , , which are learned from the first steps. We report two metrics: *average negative predictive log-likelihood up to step *,
