## 1 Introduction

We extend previous work on Hidden Quantum Markov Models (HQMMs), and propose a novel approach to learning these models from data.
HQMMs can be thought of as a new, expressive class of graphical models that have adopted the mathematical formalism for reasoning about uncertainty from quantum mechanics. We stress that while HQMMs could naturally be implemented on quantum computers, we do not need such a machine for these models to be of value. Instead,
HQMMs can be viewed as novel models inspired by quantum mechanics that can be run on classical computers.
In considering these models, we are interested in answering three questions: (1) how can we construct quantum circuits to simulate classical Hidden Markov Models (HMMs); (2) what happens if we take full advantage of this quantum circuit instead of enforcing the classical probabilistic constraints; and (3) how do we learn the parameters for quantum models from data?

The paper is structured as follows: first we describe related work and provide background on quantum information theory as it relates to our work. Next, we describe the hidden quantum Markov model, compare our approach to previous work in detail, and give a scheme for writing *any* hidden Markov model as an HQMM. Finally, our main contribution is the introduction of a maximum-likelihood-based unsupervised learning algorithm that can estimate the parameters of an HQMM from data. Our current implementation is slow to train on large datasets and will require further optimization, so we evaluate our learning algorithm on several simple synthetic datasets by learning a quantum model from data and then filtering and predicting with the learned model. We also compare our model and learning algorithm to maximum likelihood for learning hidden Markov models, and show that the more expressive HQMM can match HMMs’ predictive capability with fewer hidden states on data generated by HQMMs.

## 2 Background

### 2.1 Related Work

Hidden Quantum Markov Models were introduced by Monras et al. (2010), who discussed their relationship to classical HMMs, and parameterized these HQMMs using a set of Kraus operators. Clark et al. (2015) further investigated HQMMs, and showed that they could be viewed as open quantum systems with instantaneous feedback. We arrive at the same Kraus operator representation by building a quantum circuit to simulate a classical HMM and then relaxing some constraints.

Our work can be viewed as extending previous work by Zhao and Jaeger (2010) on Norm-observable operator models (NOOM) and Jaeger (2000) on observable-operator models (OOM). We show that HQMMs can be viewed as complex-valued extensions of NOOMs, formulated in the language of quantum mechanics. We use this connection to adapt the learning algorithm for NOOMs in M. Zhao (2007) into the first known learning algorithm for HQMMs, and demonstrate that the theoretical advantages of HQMMs also hold in practice.

Schuld et al. (2015a) and Biamonte et al. (2016) provide general overviews of quantum machine learning, and describe relevant work on HQMMs. They suggest that developing algorithms that can learn HQMMs from data is an important open problem. We provide just such a learning algorithm in Section 4.

Other work at the intersection of machine learning and quantum mechanics includes Wiebe et al. (2016) on quantum perceptron models and learning algorithms. Schuld et al. (2015b) discuss simulating a perceptron on a quantum computer.

### 2.2 Belief States and Quantum States

Classical discrete latent variable models represent uncertainty with a probability distribution using a vector whose entries describe the probability of being in the corresponding system state. Each entry is real and non-negative, and the entries sum to 1. In general, we refer to the run-time system component that maintains a state estimate of the latent variable as an ‘observer’, and we refer to the observer’s state as a ‘belief state.’ A common example is the belief state that results from conditioning on observations in an HMM.

In quantum mechanics, the quantum state of a particle can be written using Dirac notation as $|\psi\rangle$, a column-vector in some orthonormal basis (the row-vector $\langle\psi|$ is the complex-conjugate transpose of $|\psi\rangle$) with each entry being the ‘probability amplitude’ corresponding to that system state. The squared norm of the probability amplitude for a system state is the probability of observing that state, so the sum of squared norms of probability amplitudes over all the system states must be 1 to conserve probability.
For example, is a valid quantum state, with basis states 0 and 1 having equal probability . However, unlike classical belief states such as , where the probability of different states reflects an ignorance of the underlying system, a pure quantum state like the one described above is the *true* description of the system; the system is in both states simultaneously.

But how can we describe classical mixtures of quantum systems (‘mixed states’), where we maintain classical uncertainty about the underlying quantum states? Such information can be captured by a ‘density matrix.’ Given a mixture of $N$ quantum systems $|\psi_i\rangle$, each with probability $p_i$, the density matrix $\hat{\rho}$ for this ensemble is defined as follows:

$$\hat{\rho} = \sum_{i=1}^{N} p_i\, |\psi_i\rangle\langle\psi_i| \tag{1}$$

The density matrix is the general quantum equivalent of the classical belief state and has diagonal elements representing the probabilities of being in each system state. Consequently, the normalization condition is $\mathrm{tr}(\hat{\rho}) = 1$. The off-diagonal elements represent quantum coherences and entanglement, which have no classical interpretation. The density matrix can be used to describe the state of any quantum system.
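As a minimal sketch of the density-matrix definition above, the following pure-Python snippet builds the matrix for an ensemble of pure states. The helper names (`outer`, `density_matrix`, `trace`) and the two example ensembles are illustrative assumptions, not taken from the paper. It contrasts a pure equal superposition with a classical 50/50 mixture: both have the same diagonal, but only the pure state has non-zero off-diagonal coherences.

```python
import math

def outer(ket):
    """Return |ket><ket| as a nested list; ket is a list of complex amplitudes."""
    return [[a * b.conjugate() for b in ket] for a in ket]

def density_matrix(ensemble):
    """ensemble: list of (probability, ket) pairs; returns sum_i p_i |psi_i><psi_i|."""
    dim = len(ensemble[0][1])
    rho = [[0j] * dim for _ in range(dim)]
    for p, ket in ensemble:
        proj = outer(ket)
        for r in range(dim):
            for c in range(dim):
                rho[r][c] += p * proj[r][c]
    return rho

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

s = 1 / math.sqrt(2)
plus = [complex(s), complex(s)]   # assumed example: equal superposition of states 0 and 1
zero = [1 + 0j, 0j]               # basis state 0
one = [0j, 1 + 0j]                # basis state 1

rho_pure = density_matrix([(1.0, plus)])                # a single pure state
rho_mixed = density_matrix([(0.5, zero), (0.5, one)])   # classical 50/50 mixture
```

Both matrices satisfy the normalization condition, so the difference between quantum superposition and classical ignorance shows up only in the off-diagonal entries.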

The density matrix can also be extended to represent the joint state of multiple variables, or that of ‘multi-particle’ systems, to use the physical interpretation. If we have density matrices $\hat{\rho}_A$ and $\hat{\rho}_B$ for two qudits $A$ and $B$ (a qudit is a $d$-state quantum system, akin to qubits or ‘quantum bits’, which are 2-state quantum systems), we can take the tensor product to arrive at the density matrix for the joint state of the particles, as $\hat{\rho}_{AB} = \hat{\rho}_A \otimes \hat{\rho}_B$. As a valid density matrix, the diagonal elements of this joint density matrix represent probabilities; $\mathrm{tr}(\hat{\rho}_{AB}) = 1$, and the probabilities correspond to the states in the Cartesian product of the basis states of the composite particles. In this paper, the joint density matrix will serve as the analogue to the classical joint probability distribution, with the off-diagonal terms encoding extra ‘quantum’ information.

Given the joint state of a multi-particle system, we can examine the state of just one or a few of the particles using the ‘partial trace’ operation, where we trace over the diagonal elements of the particles we wish to disregard. This lets us recover a ‘reduced density matrix’ for a subsystem of interest. The partial trace for a two-particle system where we trace over the second particle to obtain the state of the first particle is:

$$\hat{\rho}_A = \mathrm{tr}_B(\hat{\rho}_{AB}) = \sum_j \left(I_A \otimes \langle j|_B\right) \hat{\rho}_{AB} \left(I_A \otimes |j\rangle_B\right) \tag{2}$$

where $|j\rangle_B$ ranges over the basis states of the second particle.

For our purposes, this operation will serve as the quantum analogue of classical marginalization. Finally, we discuss the quantum analogue of ‘conditioning’ on an observation. In quantum mechanics, the act of measuring a quantum system can change the underlying distribution, i.e., collapses it to the observed state in the measurement basis, and this is represented mathematically by applying von Neumann projection operators (denoted in this paper) to density matrices describing the system. One can think of the projection operator as a matrix of zeros with ones in the diagonal entries corresponding to observed system states. If we are only observing one part of a larger joint system, the system collapses to the states where that subsystem had the observed result. For example, suppose we have the following density matrix, for a two-state two-particle system with basis :

(3) |

Suppose we measure the state of particle , and find it to be in state . The corresponding projection operator is and the collapsed state is now: . When we trace over particle to get the state of particle , the result is , reflecting the fact that particle is now in state with certainty. Tracing over particle , we find , indicating that particle still has an equal probability of being in either state. Note that measuring has changed the underlying distribution of the system ; the probability of measuring the state of particle to be is now 0, whereas before measurement we had a chance of measuring

. This is unlike classical probability where measuring a variable doesn’t change the joint distribution. We will use this fact when we construct our quantum circuit to simulate HMMs.

Thus, if we have an -state quantum system that tracks a particle’s evolution, and an -state quantum system that tracks the likelihood of observing various outputs as they depend (probabilistically) on the -state system, upon observing an output , we apply the projection operator on the joint system, and trace over the second particle to obtain the -state system conditioned on observation .
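The conditioning recipe described above (project the joint state onto the observed outcome, then trace out the observed particle) can be sketched in pure Python as follows. The function names and the entangled two-qubit example are assumptions made for illustration; the joint basis is ordered so that the index of a joint state is `a * dim_b + b`.

```python
import math

def outer(ket):
    return [[a * b.conjugate() for b in ket] for a in ket]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def projector_on_B(y, dim_a, dim_b):
    """Diagonal 0/1 matrix keeping the joint states whose B index equals y."""
    size = dim_a * dim_b
    return [[1.0 if (r == c and r % dim_b == y) else 0.0 for c in range(size)]
            for r in range(size)]

def partial_trace_B(rho, dim_a, dim_b):
    """Sum over the basis states of B to obtain the reduced state of A."""
    return [[sum(rho[i * dim_b + j][k * dim_b + j] for j in range(dim_b))
             for k in range(dim_a)] for i in range(dim_a)]

def condition_on_B(rho, y, dim_a, dim_b):
    """Return (probability of observing y on B, normalized state of A given y)."""
    P = projector_on_B(y, dim_a, dim_b)
    collapsed = matmul(matmul(P, rho), P)   # P is real and diagonal, so P^dagger = P
    prob = trace(collapsed).real
    reduced = partial_trace_B(collapsed, dim_a, dim_b)
    return prob, [[v / prob for v in row] for row in reduced]

s = 1 / math.sqrt(2)
bell = [complex(s), 0j, 0j, complex(s)]   # assumed example: (|00> + |11>) / sqrt(2)
rho_ab = outer(bell)

prior_a = partial_trace_B(rho_ab, 2, 2)             # state of A before any measurement
prob_y1, post_a = condition_on_B(rho_ab, 1, 2, 2)   # observe B in state 1
```

In this example particle A is equally likely to be in either state before the measurement, and is in state 1 with certainty once B has been observed in state 1.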

| Classical description | Classical representation | Quantum representation | Quantum description |
|---|---|---|---|
| Belief State | | | Density Matrix |
| Joint Distribution | | | Multi-particle Density Matrix |
| Marginalization | | | Partial Trace |
| Conditional probability | | | Projection + Partial Trace |

### 2.3 Hidden Markov Models

Classical Hidden Markov Models (HMMs) are graphical models used to model dynamic processes that exhibit Markovian state evolution. Figure 1 depicts a classical HMM, where the transition matrix $A$ and emission matrix $C$ are column-stochastic matrices that determine the Markovian hidden state-evolution and observation probabilities respectively. Bayesian inference can be used to track the evolution of the hidden variable.

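The belief state at time $t$ is a probability distribution over states, and prior to any observation is written in terms of the transition matrix $A$ as:

$$\vec{x}'_t = A\,\vec{x}_{t-1} \tag{4}$$

The probabilities of observing each output at time $t$ are given by the vector $\vec{s}_t$, where $C$ is the emission matrix:

$$\vec{s}_t = C\,\vec{x}'_t = C A\,\vec{x}_{t-1} \tag{5}$$

We can use Bayesian inference to write the belief state vector after conditioning on observation $y$:

$$\vec{x}_t = \frac{\mathrm{diag}(C_{(y,:)})\, A\, \vec{x}_{t-1}}{\mathbb{1}^T\, \mathrm{diag}(C_{(y,:)})\, A\, \vec{x}_{t-1}} \tag{6}$$

where $\mathrm{diag}(C_{(y,:)})$ is a diagonal matrix with the entries of the $y$th row of $C$ along the diagonal, and the denominator renormalizes the vector $\vec{x}_t$.

An alternate representation of the Hidden Markov Model uses ‘observable’ operators (Jaeger (2000)). Instead of using the matrices $A$ and $C$, we can write $T_y = \mathrm{diag}(C_{(y,:)})\, A$. There is a different operator $T_y$ for each possible observable output $y$, and $[T_y]_{ij} = P(y, i_t \mid j_{t-1})$. We can then rewrite Equation 6 as:

$$\vec{x}_t = \frac{T_y\, \vec{x}_{t-1}}{\mathbb{1}^T\, T_y\, \vec{x}_{t-1}}$$

If we observe outputs $y_1, \ldots, y_\ell$, we apply $T_{y_\ell} \cdots T_{y_1} \vec{x}$ and take the sum of the resulting vector to find the probability of observing the sequence, or renormalize to find the belief state after the final observation.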
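The observable-operator form of the forward algorithm can be sketched in a few lines of pure Python. The matrices `A` and `C` below are made-up column-stochastic examples, and the function names are hypothetical; the operator for output `y` is formed as the diagonal matrix of the `y`th row of the emission matrix times the transition matrix, as described above.

```python
from itertools import product

def observable_operators(A, C):
    """One operator per output y: T_y[i][j] = C[y][i] * A[i][j]."""
    n = len(A)
    return [[[C[y][i] * A[i][j] for j in range(n)] for i in range(n)]
            for y in range(len(C))]

def filter_step(T_y, x):
    """Apply T_y to the belief state x; return (renormalized belief, P(y | past))."""
    unnorm = [sum(T_y[i][j] * x[j] for j in range(len(x))) for i in range(len(T_y))]
    prob = sum(unnorm)
    return [u / prob for u in unnorm], prob

def sequence_probability(T, x0, ys):
    """Probability of the output sequence ys and the final belief state."""
    x, total = list(x0), 1.0
    for y in ys:
        x, prob = filter_step(T[y], x)
        total *= prob
    return total, x

# Assumed example HMM: 2 hidden states, 2 outputs, columns sum to 1.
A = [[0.9, 0.2], [0.1, 0.8]]
C = [[0.7, 0.1], [0.3, 0.9]]
T = observable_operators(A, C)
x0 = [0.5, 0.5]

# The probabilities of all length-3 output sequences should sum to 1.
total_mass = sum(sequence_probability(T, x0, ys)[0]
                 for ys in product(range(2), repeat=3))
```

Multiplying the per-step normalizers gives the same sequence probability as summing the un-normalized vector at the end, while keeping the belief state well scaled at every step.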

## 3 Hidden Quantum Markov Models

### 3.1 A Quantum Circuit to Simulate HMMs

Let us now contrast state evolution in quantum systems with state evolution in HMMs. The quantum analogue of observable operators is a set of non-trace-increasing Kraus operators $\{\hat{K}_i\}$ that together define a completely positive (CP) linear map. Trace-preserving Kraus operators, $\sum_i \hat{K}_i^\dagger \hat{K}_i = I$, can map a density operator to another density operator. Trace-decreasing Kraus operators, $\sum_i \hat{K}_i^\dagger \hat{K}_i < I$, represent operations on a smaller part of a quantum system that can allow probability to ‘leak’ to other states that aren’t being considered. This paper will formulate problems such that all sets of Kraus operators are trace-preserving. When there is only one operator in the set, i.e., $\hat{K}$ such that $\hat{K}^\dagger \hat{K} = I$, then $\hat{K}$ is a unitary matrix. Unitary operators generally model the evolution of the ‘whole’ system, which may be high-dimensional. But if we care only about tracking the evolution of a smaller sub-system, which may interact with its environment, we can use Kraus operators. The most general quantum operation that can be performed on a density matrix is $\hat{\rho}' = \frac{\sum_i \hat{K}_i \hat{\rho} \hat{K}_i^\dagger}{\mathrm{tr}\left(\sum_i \hat{K}_i \hat{\rho} \hat{K}_i^\dagger\right)}$, where the denominator re-normalizes the density matrix.

Now, how do we simulate classical HMMs on quantum circuits with qudits, where computation is done using unitary operations? There is no general way to convert column-stochastic transition and emission matrices to unitary matrices, so we prepare ‘ancilla’ particles and construct unitary matrices (see Algorithm 1) to act on the joint state. We then trace over one particle to obtain the state of the other.

Figure 1(a) illustrates a quantum circuit constructed with these unitary matrices. By preparing the ‘ancilla’ states and appropriately (i.e., entirely in system state 1, represented by a density matrix of zeros except ), we construct and from transition matrix and emission matrix , respectively. evolves to perform Markovian transition, while updates to contain the probabilities of measuring each observable output. At runtime, we measure which changes the joint distribution of to give the updated conditioned state . Mathematically, this is equivalent to applying a projection operator on the joint state and tracing over . Thus, the forward algorithm corresponding to Figure 1(a) that explicitly models a hidden Markov Model on a quantum circuit can be written as:

(7) |

We can simplify this circuit to use Kraus operators acting on the lower-dimensional state space of . Since we always prepare in the same state, the operation on the joint state of followed by the application of the projection operator can be more concisely written as a Kraus operator on just , so that we need only be concerned with representing how the particle evolves. We would need to construct a set of Kraus operators for each observable output , such that .

Tensoring with an ancilla qudit and tracing over a qudit can be achieved with an matrix and an matrix respectively, since we always prepare our ancilla qudits in the same state (details on constructing these matrices can be found in the Appendix), so that:

(8) |

We can then construct Kraus operators such that . Figure 1(b) shows this updated circuit, where is still the quantum implementation of the transition matrix and is the quantum implementation of the Bayesian update after observation. This scheme to model a classical HMM can be written as:

(9) |

We can similarly simplify to a set of Kraus operators. We write the unitary operation in terms of a set of Kraus operators as if we were to measure immediately after the operation . However, instead of applying one Kraus operator associated with measurement as we do with Figure 1(b), we sum over all of possible ‘observations’, as if to ‘ignore’ the observation on . Post-multiplying each Kraus operator in with each operator in , we have a set of Kraus operators that can be used to model a classical HMM as follows (the full procedure is described in Algorithm 2):

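$$\hat{\rho}_t = \frac{\sum_{i=1}^{w} \hat{K}_{i,y}\, \hat{\rho}_{t-1}\, \hat{K}_{i,y}^\dagger}{\mathrm{tr}\left(\sum_{i=1}^{w} \hat{K}_{i,y}\, \hat{\rho}_{t-1}\, \hat{K}_{i,y}^\dagger\right)} \tag{10}$$

where $\hat{K}_{i,y}$ ($i = 1, \ldots, w$) are the Kraus operators associated with the observed output $y$, and $\hat{\rho}_{t-1}$ is the state before the update.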

We believe this procedure to be a useful illustration of performing classical operations on graphical models using quantum circuits. In practice, we needn’t construct the Kraus operators in this peculiar fashion to simulate HMMs; an equivalent but simpler approach is to construct observable operators from transition and emission matrices as described in section 2.3, and set the th column of , with all other entries being zero. This ensures .
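The simpler construction mentioned above can be sketched as follows, under one explicit assumption: for each output and each column index, we build a Kraus operator whose only non-zero column holds the element-wise square root of the corresponding column of the observable operator. This reading, the function names, and the example matrices are assumptions for illustration; the checks confirm that the resulting operators sum to the identity and that the diagonal of the updated density matrix reproduces the classical HMM update.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def kraus_from_hmm(A, C):
    """Assumed construction: K[(col, y)] has sqrt(T_y[:, col]) in column `col`."""
    n, s = len(A), len(C)
    kraus = {}
    for y in range(s):
        for col in range(n):
            K = [[0.0] * n for _ in range(n)]
            for i in range(n):
                K[i][col] = math.sqrt(C[y][i] * A[i][col])   # sqrt of T_y[i][col]
            kraus[(col, y)] = K
    return kraus

def completeness(kraus, n):
    """Sum of K^T K over all operators (entries are real, so dagger = transpose)."""
    total = [[0.0] * n for _ in range(n)]
    for K in kraus.values():
        KtK = matmul(transpose(K), K)
        for r in range(n):
            for c in range(n):
                total[r][c] += KtK[r][c]
    return total

def unnormalized_update(kraus, rho, y, n):
    """Sum over the operators for output y of K rho K^T."""
    out = [[0.0] * n for _ in range(n)]
    for col in range(n):
        K = kraus[(col, y)]
        term = matmul(matmul(K, rho), transpose(K))
        for r in range(n):
            for c in range(n):
                out[r][c] += term[r][c]
    return out

A = [[0.9, 0.2], [0.1, 0.8]]   # assumed column-stochastic transition matrix
C = [[0.7, 0.1], [0.3, 0.9]]   # assumed column-stochastic emission matrix
kraus = kraus_from_hmm(A, C)
x = [0.5, 0.5]
rho = [[x[0], 0.0], [0.0, x[1]]]            # classical belief state on the diagonal
identity_check = completeness(kraus, 2)
rho_after_y0 = unnormalized_update(kraus, rho, 0, 2)
hmm_after_y0 = [sum(C[0][i] * A[i][j] * x[j] for j in range(2)) for i in range(2)]
```

The updated matrix generally has non-zero off-diagonal entries, but under this construction they never influence later diagonal entries, which is why the classical dynamics are recovered.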

### 3.2 Formulating HQMMs

Monras et al. (2010) formulate Hidden Quantum Markov Models by defining a set of Kraus operators $\{\hat{K}_{i,y}\}$, where each observable $y$ has $w$ associated Kraus operators $\hat{K}_{1,y}, \ldots, \hat{K}_{w,y}$ acting on a state with hidden dimension $n$, and they form a complete set such that $\sum_{y}\sum_{i=1}^{w} \hat{K}_{i,y}^\dagger \hat{K}_{i,y} = I$. The update rule for a quantum operation is exactly the same as Equation 10, which we arrived at by first constructing a quantum circuit to simulate HMMs with known parameters and then constructing operators in a very peculiar way. The process outlined in the previous section is a particular parameterization of HQMMs to model HMMs. If we let the two unitary operators in the circuit of Figure 1(a) be any unitary matrices, or the Kraus operators be any set of complex-valued matrices that satisfy the completeness condition above, then we have a general and fully quantum HQMM.

Indeed, Equation 10 gives the forward algorithm for HQMMs. To find the probability of emitting an output $y$ given the previous state $\hat{\rho}_{t-1}$, we simply take the trace of the numerator in Equation 10, i.e., $p(y \mid \hat{\rho}_{t-1}) = \mathrm{tr}\left(\sum_{i=1}^{w} \hat{K}_{i,y}\, \hat{\rho}_{t-1}\, \hat{K}_{i,y}^\dagger\right)$.

The number of parameters for an HQMM is determined by the number of latent states $n$, outputs $s$, and Kraus operators associated with an output $w$. To exactly simulate HMM dynamics with an HQMM, we need $w = n$ as per the derivation above. However, this constraint need not hold for a general HQMM, which can have any number of Kraus operators we apply and sum for a given output. $w$ can also be thought of as the dimension of the ancilla that we tensor with in Figure 1(a) before the unitary operation. Consequently, if we set $w = 1$, we do not tensor with an additional particle, but model the evolution of the original particle as unitary. In all, an HQMM requires learning $n^2 s w$ parameters, which is a factor $w$ times more than an HMM with the observable operator representation, which has $n^2 s$ parameters. The canonical representation of HMMs with an $n \times n$ transition matrix and an $s \times n$ emission matrix has $n^2 + ns$ parameters.
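The parameter counts can be checked with a small sketch. It assumes `n` hidden states, `s` outputs and `w` Kraus operators per output (symbol and function names chosen here for illustration), with the counts being `n*n*s*w` for an HQMM, `n*n*s` for an HMM in observable-operator form, and `n*n + n*s` for the canonical HMM. These formulas are consistent with the P column of the results tables in Section 5.

```python
def hqmm_params(n, s, w):
    """s * w Kraus operators, each an n x n matrix."""
    return n * n * s * w

def hmm_operator_params(n, s):
    """s observable operators, each an n x n matrix."""
    return n * n * s

def hmm_canonical_params(n, s):
    """An n x n transition matrix plus an s x n emission matrix."""
    return n * n + n * s

# (states, outputs, operators per output) for the generating models of Section 5
clock_hqmm = hqmm_params(2, 2, 1)          # probability clock
monras_hqmm = hqmm_params(2, 4, 1)         # 2-state, 4-output HQMM
spin_hqmm = hqmm_params(2, 6, 1)           # 2-state, 6-output HQMM
six_state_hmm = hmm_canonical_params(6, 6)
```

For a fixed hidden dimension the HQMM count grows linearly in `w`, which is why the larger learned HQMMs in Section 5.4 carry many more parameters than HMMs with the same number of hidden states.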

HQMMs can also be seen as a complex-valued extension of norm-observable operator models defined by Zhao and Jaeger (2010). Indeed, the HQMM we get by applying Algorithm 2 on a HMM is also a valid NOOM (allowing for multiple operators per output), implying that HMMs can be simulated by NOOMs. We can also state that both HMMs and NOOMs can be simulated by HQMMs (the latter is trivially true). While Zhao and Jaeger (2010) show that any NOOM can be written as an OOM, the exact relationship between HQMMs and OOMs is not straightforward owing to the complex entries in HQMMs and requires further investigation.

## 4 An Iterative Algorithm For Learning HQMMs

We present an iterative maximum-likelihood algorithm to *learn* Kraus operators to model sequential data using an HQMM. Our algorithm is general enough that it can be applied to *any* quantum version of a classical machine learning algorithm for which the loss is defined in terms of the Kraus operators to be learned.

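We begin by writing the likelihood of observing some sequence $y_1, \ldots, y_\ell$. Recall that for a given output $y$, we apply the $w$ Kraus operators associated with that observable in the ‘forward’ algorithm, as $\sum_{i=1}^{w} \hat{K}_{i,y}\, \hat{\rho}\, \hat{K}_{i,y}^\dagger$. If we do not renormalize the density matrix after applying these operators, the diagonal entries contain the joint probability of the corresponding system states and observing the associated sequence of outputs. The trace of this un-normalized density matrix gives the probability of observing the sequence, since we have summed over (i.e., marginalized) all the ‘hidden’ states. Thus, the general log-likelihood of a sequence of length $\ell$ being predicted by an HQMM where each observable has $w$ associated Kraus operators is:

$$\mathcal{L} = \ln \mathrm{tr}\left(\sum_{i_\ell=1}^{w} \hat{K}_{i_\ell,y_\ell} \cdots \left(\sum_{i_1=1}^{w} \hat{K}_{i_1,y_1}\, \hat{\rho}_0\, \hat{K}_{i_1,y_1}^\dagger\right) \cdots \hat{K}_{i_\ell,y_\ell}^\dagger\right) \tag{11}$$

where $\hat{\rho}_0$ is the initial state.

It is not straightforward to directly maximize this log-likelihood using gradient-based methods; we must preserve the Kraus operator constraints, and long sequences can quickly lead to underflow issues. Our approach is to learn a matrix $\kappa$ of size $swn \times n$ (for $s$ outputs, $w$ operators per output, and hidden dimension $n$), which is essentially the set of $sw$ Kraus operators of dimension $n \times n$, stacked vertically.
The Kraus operator constraint requires $\sum_{y}\sum_{i=1}^{w} \hat{K}_{i,y}^\dagger \hat{K}_{i,y} = I$, which implies $\kappa^\dagger \kappa = I$, i.e., the columns of $\kappa$ are orthonormal.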
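One common way to sidestep the underflow issue when evaluating such a log-likelihood is to renormalize the density matrix after every observation and accumulate the logarithms of the normalizers. The sketch below illustrates this on a toy model; it is not the paper's MATLAB implementation. The toy HQMM (a rotation followed by a projection, one Kraus operator per output) and all function names are assumptions chosen so that the completeness constraint holds.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def dagger(m):
    return [[m[j][i].conjugate() for j in range(len(m))] for i in range(len(m[0]))]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def apply_output(kraus_y, rho):
    """Un-normalized update: sum over the operators of one output of K rho K^dagger."""
    n = len(rho)
    out = [[0j] * n for _ in range(n)]
    for K in kraus_y:
        term = matmul(matmul(K, rho), dagger(K))
        for r in range(n):
            for c in range(n):
                out[r][c] += term[r][c]
    return out

def log_likelihood(kraus, rho0, ys):
    """Accumulate log-probabilities step by step, renormalizing the state each time."""
    rho, total = rho0, 0.0
    for y in ys:
        rho = apply_output(kraus[y], rho)
        prob = trace(rho).real
        total += math.log(prob)
        rho = [[v / prob for v in row] for row in rho]
    return total

def unnormalized_likelihood(kraus, rho0, ys):
    """Direct evaluation of the trace of the un-normalized state (underflows for long ys)."""
    rho = rho0
    for y in ys:
        rho = apply_output(kraus[y], rho)
    return trace(rho).real

# Assumed toy HQMM: K_y = P_y U with U a rotation and P_y a basis projector.
c, s = math.cos(0.3), math.sin(0.3)
U = [[c, -s], [s, c]]
P0, P1 = [[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]
kraus = [[matmul(P0, U)], [matmul(P1, U)]]   # kraus[y] lists the operators for output y
rho0 = [[0.5, 0.0], [0.0, 0.5]]
ys = [0, 1, 1, 0]
ll = log_likelihood(kraus, rho0, ys)
```

For short sequences both routes agree; for long sequences only the accumulated version stays within floating-point range.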

Let $\kappa$ be our guess and $\kappa^*$ be the *true* matrix of stacked Kraus operators that maximizes the likelihood under the observed data. Then, there must exist some unitary operator $\hat{U}$ that maps $\kappa$ to $\kappa^*$, i.e., $\kappa^* = \hat{U} \kappa$. Our goal is now to find the matrix $\hat{U}$. To do this, we use the fact that the matrix $\hat{U}$ can be written as the product of simpler matrices $H$ (see appendix for proof), where

(12) |

The two indices specify the two rows in the matrix with the non-trivial entries, and the other parameters are angles that parameterize the non-trivial entries. The matrices $H$ can be thought of as Givens rotations generalized for complex-valued unitary matrices. Applying such a matrix on $\kappa$ has the effect of combining the two specified rows of $\kappa$ like so:

(13) |

Now the problem becomes one of identifying the sequence of matrices that can take our guess to the true matrix. Since the optimization is non-convex and the matrices need not commute, we are not guaranteed to find the global maximum. Instead, we look for a local maximum that is reachable by only multiplying matrices that increase the log-likelihood. To find this sequence, we iteratively find the parameters that, if used in Equation 13, would increase the log-likelihood. To perform this optimization, we use the fmincon function in MATLAB, which uses interior-point optimization. It can also be computationally expensive to find the best pair of rows to combine at a given step, so in our implementation we pick the rows at random. See Algorithm 3 for a summary. We believe more efficient implementations are possible, but we leave this to future work.
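The key property that makes this search feasible is that combining two rows of the stacked matrix with a 2×2 unitary block leaves the Kraus constraint intact. The sketch below demonstrates that property. The four-angle parameterization in `rotation_block` is a standard way to write a general 2×2 unitary and is an assumption here; it is not necessarily the exact form of Equation 12. The example matrix and helper names are likewise hypothetical.

```python
import cmath
import math

def rotation_block(theta, phi, psi, delta):
    """Assumed parameterization of a general 2x2 unitary by four angles."""
    g = cmath.exp(1j * phi / 2)
    return ((g * cmath.exp(1j * psi) * math.cos(theta),
             g * cmath.exp(1j * delta) * math.sin(theta)),
            (-g * cmath.exp(-1j * delta) * math.sin(theta),
             g * cmath.exp(-1j * psi) * math.cos(theta)))

def rotate_rows(kappa, p, q, angles):
    """Replace rows p and q of kappa by unitary combinations of the two rows."""
    (a, b), (c, d) = rotation_block(*angles)
    rotated = [list(row) for row in kappa]
    rotated[p] = [a * x + b * y for x, y in zip(kappa[p], kappa[q])]
    rotated[q] = [c * x + d * y for x, y in zip(kappa[p], kappa[q])]
    return rotated

def gram(kappa):
    """kappa^dagger kappa; equals the identity when the Kraus constraint holds."""
    n = len(kappa[0])
    return [[sum(row[i].conjugate() * row[j] for row in kappa) for j in range(n)]
            for i in range(n)]

def max_deviation_from_identity(kappa):
    g = gram(kappa)
    return max(abs(g[i][j] - (1 if i == j else 0))
               for i in range(len(g)) for j in range(len(g)))

# Assumed example: two stacked 2x2 Kraus operators (rows 0-1 and rows 2-3).
c, s = math.cos(0.3), math.sin(0.3)
kappa = [[c, -s], [0.0, 0.0], [0.0, 0.0], [s, c]]
rotated = rotate_rows(kappa, 0, 2, (0.7, 0.2, -0.4, 1.1))
```

Because every candidate step keeps the columns orthonormal, an optimizer only has to search over the angles and can accept a step whenever the log-likelihood increases.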

## 5 Experimental Results

In this section, we evaluate the performance of our learning algorithm on simple synthetic datasets, and compare it to the performance of Expectation Maximization for HMMs (Rabiner (1989)). We judge the quality of the learnt model using its Description Accuracy (DA) (M. Zhao (2007)), defined as:

$$DA = f\left(1 + \frac{\log_s P(Y \mid D)}{\ell}\right) \tag{14}$$

where $\ell$ is the length of the sequence, $s$ is the number of output symbols in the sequence, $Y$ is the data, and $D$ is the model. Finally, the function $f(\cdot)$ is a non-linear function that rescales its argument, which is at most 1 because $P(Y \mid D) \le 1$:

(15) |

If $DA = 1$, the model perfectly predicted the stochastic sequence, while $DA > 0$ would mean that the model predicted the sequence better than random.

In each experiment, we generate 20 training sequences of length 3000, and 10 validation sequences of length 3000, with a ‘burn-in’ of 1000 to disregard the influence of the starting distribution. We use QETLAB (a MATLAB Toolbox developed by Johnston (2016)) to generate random HQMMs. We apply our learning algorithm once to learn HQMMs from data and report the DA. We use the Baum-Welch algorithm implemented in the hmmtrain function from MATLAB’s Statistics and Machine Learning Toolbox to learn HMM parameters. When training HMMs, we train 10 models and report the best DA.

We found that starting with a batch size of 1 with 5-6 iterations to get close to the local maximum, and then increasing the batch size to 3-4 and smaller was a good way to reach convergence. We also found that training models with $w > 1$ becomes very slow; when $w = 1$, to compute the log-likelihood, we can simply take the product of all the Kraus operators corresponding to the observed sequence, and apply it on either side of the density matrix. However, with $w > 1$, we have to perform a sum over the Kraus operators corresponding to a given observation before we can apply the next set of Kraus operators.

The first experiment compares learned models on data generated by a valid ‘probability clock’ NOOM/HQMM model (M. Zhao (2007)) that theoretically cannot be modeled by a finite-dimensional HMM. The second experiment considers data generated by the 2-state, 4-output HQMM proposed in Monras et al. (2010), which requires at least 3 hidden states to be modeled with an HMM. The third experiment is performed on data generated by a physically motivated, fully quantum 2-state, 6-output HQMM that requires at least 4 classical states for an HMM to model, and can be seen as an extension of the Monras et al. (2010) model. Finally, we compare the performance of our algorithm with EM for HMMs on data that was generated by a hand-written HMM. These experiments are meant to showcase the greater expressiveness of HQMMs compared with HMMs. While we see mixed performance on HMM-generated data, we are able to empirically demonstrate that on the HQMM-generated datasets, our algorithm learns an HQMM that predicts the generated data better than EM for classical HMMs, and with fewer hidden states.

### 5.1 Probability Clock

Zhao and Jaeger (2010) describe a 2-hidden state, 2-observable NOOM ‘probability clock,’ where the probability of generating an observable $a$ changes periodically with the length of the sequence of $a$s preceding it, and which cannot be modeled with a finite-dimensional HMM:

(16) |

This is a valid HQMM since $\sum_{y} \hat{K}_{y}^\dagger \hat{K}_{y} = I$. Observe that this HQMM has only 1 Kraus operator per observable, which means it models the state evolution as unitary.

Our results in Table 2 demonstrate that a probability clock generates data that is hard for HMMs to model and that our iterative algorithm yields a simple HQMM that matches the predictive power of the original model.

| Model | P | Train DA | Test DA |
|---|---|---|---|
| HQMM (T) | 8 | () | () |
| HQMM (L) | 8 | () | () |
| HMM (L) | 8 | () | () |
| HMM (L) | 24 | () | () |
| HMM (L) | 80 | () | () |

### 5.2 Monras et al. (2010) 2-state HQMM

Monras et al. (2010) present a 4-state, 4-output HMM with a loose lower bound requirement of 3 classical latent states that can be modeled by the following 2-state, 4-output HQMM:

(17) | |||||

(18) |

This model also treats state evolution as unitary since there is only 1 Kraus operator per observable. We generate data using this model, and our results in Table 3 show that our algorithm is capable of learning an HQMM that can match the DA of the original model, while the HMM needs more states to match the DA.

| Model | P | Train DA | Test DA |
|---|---|---|---|
| HQMM (T) | 16 | () | () |
| HQMM (L) | 16 | () | () |
| HQMM (L) | 32 | () | () |
| HMM (L) | 12 | () | () |
| HMM (L) | 21 | () | () |
| HMM (L) | 32 | () | () |

### 5.3 A Fully Quantum HQMM

In the previous two experiments, the HQMMs we used to generate data were also valid NOOMs since they used only real-valued entries.
Here, we present the results of our algorithm on a fully quantum HQMM. Since we use complex-valued entries, there is no known way of writing our model as an equivalent-sized HMM, NOOM, or OOM.

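We motivate this model with a physical system. Consider electron spin: quantized angular momentum that can either be ‘up’ or ‘down’ along whichever spatial axis the measurement is made, but not in between. There is no well-defined 3D vector describing electron spin along the 3 spatial dimensions, only ‘up’ or ‘down’ along a chosen axis of measurement (i.e., measurement basis). This is unlike classical angular momentum, which can be represented by a vector with well-defined components in three spatial dimensions. Picking an arbitrary direction as the $z$-axis, we can write the electron’s spin state in the $\{|{+z}\rangle, |{-z}\rangle\}$ basis so that $|{+z}\rangle$ is $[1\ 0]^T$ and $|{-z}\rangle$ is $[0\ 1]^T$. But electron spin constitutes a two-state quantum system, so it can be in superpositions of the orthogonal ‘up’ and ‘down’ quantum states, which can be parameterized with $(\theta, \phi)$ and written as $|\psi\rangle = \alpha|{+z}\rangle + \beta|{-z}\rangle$, where $\alpha = \cos(\theta/2)$ and $\beta = e^{i\phi}\sin(\theta/2)$. The Bloch sphere (sphere with radius 1) is a useful tool to visualize qubits since it can map any pure state of a two-state system to a point on the surface of the sphere using $(\theta, \phi)$ as polar and azimuthal angles. We could also have chosen the $\{|{+x}\rangle, |{-x}\rangle\}$ or $\{|{+y}\rangle, |{-y}\rangle\}$ basis, whose states can be written in our original basis:

$$|{+x}\rangle = \tfrac{1}{\sqrt{2}}\left(|{+z}\rangle + |{-z}\rangle\right) \tag{19}$$

$$|{-x}\rangle = \tfrac{1}{\sqrt{2}}\left(|{+z}\rangle - |{-z}\rangle\right) \tag{20}$$

$$|{+y}\rangle = \tfrac{1}{\sqrt{2}}\left(|{+z}\rangle + i\,|{-z}\rangle\right) \tag{21}$$

$$|{-y}\rangle = \tfrac{1}{\sqrt{2}}\left(|{+z}\rangle - i\,|{-z}\rangle\right) \tag{22}$$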

Now consider the following process, inspired by the Stern-Gerlach experiment (Gerlach and Stern (1922)) from quantum mechanics. We begin with an electron whose spin we represent in the basis. At each time step, we pick one of the , , or directions uniformly and at random, and apply an inhomogeneous magnetic field along that axis. This is an act of measurement that collapses the electron spin to either ‘up’ or ‘down’ along that axis, which will deflect the electron in that direction. Let us use the following encoding scheme for the results of the measurement: : , : , : , : , : , : . Consequently, at each time step, the observation tells us which axis we measured along, and whether the spin of the particle is now ‘up’ or ‘down’ along that axis. As an example, if we prepare an electron spin ‘up’ along the -axis, and observe the following sequence: , it means that we applied the inhomogeneous magnetic field in the -direction, then -direction, then -direction, and finally the -direction, causing the electron spin state to evolve as .

Note that transitions , , and are not allowed, since there are no spin-flip operations in our process. Admittedly, this is a slightly contrived example, since normally we think of a hidden state that evolves according to some rules, producing noisy observation. Here, we select the observation (down to the pair, , , ) that we wish to observe, and that tells us how the ‘hidden state’ evolves as described by a chosen basis.

This model is related to the 2-state HQMM requiring 3 classical states described in Monras et al. (2010). It is still a 2-state system, but we add two new Kraus operators with complex entries and renormalize:

(23) | |||||

(24) | |||||

(25) |

Physically, Kraus operators and keep the spin along the -axis, Kraus operators and rotate the spin to lie along the -axis, while Kraus operators and rotate the spin to lie along the -axis. Following the approach of Monras et al. (2010), we write down an equivalent 6-state HMM, and compute the rank of a Hankel matrix with the statistics of this process, yielding a requirement of 4 classical states as a weak lower bound.
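The measurement process described in this section can be sketched directly. The code assumes that each of the six outcomes corresponds to a Kraus operator equal to $1/\sqrt{3}$ times the projector onto the matching spin eigenstate, which follows from choosing one of three axes uniformly at random and then performing a projective measurement; the paper's own operators in Equations 23 to 25 may be written differently. Outcomes are labeled by strings such as `"+z"` rather than by the paper's numeric encoding, and all names are illustrative.

```python
import math

R = 1 / math.sqrt(2)
SPIN_STATES = {                      # spin eigenstates written in the z basis
    "+z": [1, 0],        "-z": [0, 1],
    "+x": [R, R],        "-x": [R, -R],
    "+y": [R, 1j * R],   "-y": [R, -1j * R],
}

def projector(ket):
    return [[complex(a) * complex(b).conjugate() for b in ket] for a in ket]

def kraus_operator(ket):
    """Assumed form: (1/sqrt(3)) |ket><ket|, one axis out of three chosen uniformly."""
    scale = 1 / math.sqrt(3)
    return [[scale * v for v in row] for row in projector(ket)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def dagger(m):
    return [[m[j][i].conjugate() for j in range(len(m))] for i in range(len(m[0]))]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

KRAUS = {label: kraus_operator(ket) for label, ket in SPIN_STATES.items()}

def emission_probability(label, rho):
    """tr(K rho K^dagger) for the Kraus operator of the given outcome."""
    K = KRAUS[label]
    return trace(matmul(matmul(K, rho), dagger(K))).real

def completeness():
    """Sum of K^dagger K over all six operators; should be the 2x2 identity."""
    total = [[0j, 0j], [0j, 0j]]
    for K in KRAUS.values():
        KdK = matmul(dagger(K), K)
        for r in range(2):
            for c in range(2):
                total[r][c] += KdK[r][c]
    return total

rho_up_z = projector(SPIN_STATES["+z"])   # electron prepared spin-up along z
probs = {label: emission_probability(label, rho_up_z) for label in KRAUS}
```

Starting from spin ‘up’ along $z$, the outcome ‘down’ along $z$ has probability zero in this sketch, which matches the observation that the process contains no spin-flip operations.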

We present the results of our learning algorithm applied to data generated by this model in Table 4. We find that our algorithm can learn a 2-state HQMM (same size as the model that generated the data) with predictive power matched only by a 6-state HMM.

| Model | P | Train DA | Test DA |
|---|---|---|---|
| HQMM (T) | 24 | () | () |
| HQMM (L) | 24 | () | () |
| HMM (L) | 16 | () | () |
| HMM (L) | 27 | () | () |
| HMM (L) | 40 | () | () |
| HMM (L) | 55 | () | () |
| HMM (L) | 72 | () | () |

### 5.4 Synthetic Data from a hand-written HMM

We have shown that we can generate data using HQMMs that classical HMMs with the same number of hidden states struggle to model. In this section, we explore how well HQMMs can model data generated by a classical HMM. In general, randomly generated HMMs generate data that is hard to predict (i.e., DA closer to 0), so we hand-author an arbitrary, well-behaved HMM with full-rank transition matrix and full-rank emission matrix to compare HQMM learning with EM for HMMs:

(26) |

Our results are presented in Table 5. We find that small HQMMs outperform HMMs with the same number of hidden states, although the parameter count ends up being larger. However, as model size increases, training becomes quite slow, and our HQMMs are over-parameterized, becoming prone to local optima, and EM for HMMs may work better in practice on HMM-generated data. Interestingly, even though our scheme in Section 3.1 requires to simulate HMMs with HQMMs, empirically, we find that we are able to learn reasonable models with .

| Model | P | Train DA | Test DA |
|---|---|---|---|
| HMM (T) | 72 | () | () |
| HQMM (L) | 24 | () | () |
| HQMM (L) | 54 | () | () |
| HQMM (L) | 96 | () | () |
| HQMM (L) | 150 | () | () |
| HQMM (L) | 300 | () | () |
| HQMM (L) | 450 | () | () |
| HQMM (L) | 750 | () | () |
| HQMM (L) | 216 | () | () |
| HQMM (L) | 432 | () | () |
| HMM (L) | 16 | () | () |
| HMM (L) | 27 | () | () |
| HMM (L) | 40 | | |
