1 Introduction
The study of Markov models is important to control theory and machine learning. In optimal control, they are central to the theory of sequential decision making under uncertainty; seminal results in this area were proved by Blackwell [3], Sondik [12], and others (see [2]). In machine learning, they underpin the main problems of reinforcement learning, such as planning and optimization (see [10, 14]). The primary model used in both areas is the Markov decision process (partially observable or otherwise). For quantum computing, Markov models are useful for developing algorithms and for modeling quantum phenomena. In algorithm design, several fundamental algorithms, such as Grover search, arise from the quantization of Markov chains on graphs (see Szegedy
[13]). Further generalizations to quantum Markov processes have been studied in quantum information theory in the context of quantum channels (see [15]). In a recent work, Barry et al. [1] introduced a model of quantum observable Markov decision processes. This is a natural quantum analogue of a partially observable Markov decision process, although it is a strict generalization of a belief Markov decision process. The latter is a commonly used full-information model for representing a partially observable Markov decision process. The main component used to define a quantum observable Markov decision process is a set of superoperators. Using standard machinery from quantum information (see [17, 18]), we recast the set of superoperators as the composition of a conditional channel and a quantum instrument. We describe this in the context of quantum transducers.
A finite transducer is a finite automaton with output (see [6]). There are two standard models of transducers: the Moore machine and the Mealy machine. In the former, the output is generated from the current state; in the latter, the output is generated based on the action taken from the current state. Following this classical vein, we describe models of quantum Moore and quantum Mealy machines, and we prove that they are equivalent. The proof of this equivalence relies on decoupling a superoperator into the composition of a conditional channel and a quantum instrument. In earlier works, Wiesner et al. [16] and Monras et al. [8] also studied quantum transducers, but in the context of stochastic generators.
In [1], Barry et al. studied the goal-state perfect reachability problem (also called planning) for quantum observable Markov decision processes. Here, the objective is to find a sequence of actions that evolves the system from the initial state to a given target state with probability one. Their main result is that this perfect planning problem is undecidable in the quantum setting, whereas the classical version is known to be decidable.
Madani et al. [7] proved that most classical problems related to partially observable Markov decision processes, such as planning and optimization, are undecidable. Since the quantum problems are generalizations of the classical ones, this immediately implies that the quantum problems are also undecidable. The list of these problems does not include the above perfect planning problem. So, the result of Barry et al. [1] exhibits an interesting classical-to-quantum transition from decidable to undecidable. A similar phenomenon occurs for the perfect non-occurrence problem, where we ask if there is an output sequence that will never be observed. This problem is decidable in the classical setting, but its quantum version (called the quantum measurement occurrence problem) was proved undecidable by Eisert et al. [5].
The policy-existence problems over an infinite horizon that were proved undecidable by Madani et al. [7] include the problems under the total-reward, average-reward, and discounted-reward criteria. For approximating the optimal policy, the problems under the total- and average-reward criteria remain undecidable, but the problem under the discounted criterion is decidable. We focus on this latter problem and observe that its quantum analogue is still decidable. This is obtained by revisiting the work of Blackwell [3] on Markov decision processes over Borel spaces. We also adapt some observations of Sondik [12] on the compact representation of the optimal policy to the quantum setting.
Problem                       | Classical                       | Quantum
------------------------------|---------------------------------|----------------------
Reachability                  | no (Madani et al. [7])          | no
Perfect Reachability          | yes (folklore)                  | no (Barry et al. [1])
Non-Occurrence                | no (Proposition 5.4)            | no
Perfect Non-Occurrence        | yes (Eisert et al. [5])         | no (Eisert et al. [5])
Policy-Existence              | no (Madani et al. [7])          | no
Approximate Policy-Existence  | yes (Blackwell, Sondik [3, 12]) | yes (Theorem 6.8)
We summarize the complexities of the problems related to Markov decision processes in Figure 1.
2 Preliminaries
2.1 Markov models
Suppose , , and are finite sets (of states, input actions, and output signals, respectively). Let be a collection of Markov chains with a common state space , where each is a column-stochastic matrix. We view each column of as defining a conditional probability distribution over , where . Let be a column-stochastic matrix. We view each column of as defining a conditional probability distribution over , where and . Let be an initial probability distribution over . Suppose is a bounded real-valued reward function over and is a discount factor. The tuple is called a partially observable Markov decision process (POMDP). There are several well-known Markov models that can be derived from the above. We obtain a Markov chain if is a unary alphabet and is trivial (say, the constant zero function), and we let and for the observed chain, or let be unary (and trivial) for the unobserved chain. We derive a Markov decision process (MDP) if and , and denote this simply as . We get a hidden Markov model (HMM) if is a unary alphabet and is trivial, and denote this as .
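For concreteness, the tuple above can be stored directly as data. The following Python sketch is our own illustration (the names `POMDP`, `T`, `O`, `mu`, `R`, and `beta` are our assumptions, not notation from the paper); it checks the column-stochasticity conditions stated above.

```python
import numpy as np

class POMDP:
    """Minimal container for a finite POMDP (illustrative sketch).

    T[a] is an |S| x |S| column-stochastic matrix: T[a][s2, s1] = Pr(s2 | s1, a).
    O is a |Sigma| x |S| column-stochastic matrix: O[z, s] = Pr(z | s).
    mu is the initial distribution over states, R(s, a) the bounded reward,
    and beta in [0, 1) the discount factor.
    """
    def __init__(self, T, O, mu, R, beta):
        for a, Ta in T.items():
            # every column of T[a] must be a probability distribution over S
            assert np.allclose(Ta.sum(axis=0), 1.0), f"T[{a}] is not column-stochastic"
        assert np.allclose(O.sum(axis=0), 1.0), "O is not column-stochastic"
        assert np.isclose(mu.sum(), 1.0) and 0 <= beta < 1
        self.T, self.O, self.mu, self.R, self.beta = T, O, mu, R, beta
```

Setting the observation matrix to the identity recovers the fully observed (MDP) case; a single-column observation matrix recovers the unobserved case.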
The random process induced by a partially observable Markov decision process is defined as follows. For a sequence , where
, we consider the sequence of pairs of random variables
, and , defined so that:
(1) 
(2) 
(3) 
A policy is a sequence of maps, where and is the history at time , for each . Thus, if is the history at time , then is the action taken, for all . The value function of is given by
(4) 
The expectation is taken over an infinite product probability space (a brief discussion of the existence of this infinite probability space is given in Appendix A). The main objective is to compute the optimal value function , and to find for which the maximum is achieved. A policy is called stationary if there is a map so that for all .
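The value function in (4) can be estimated by straightforward Monte Carlo simulation of the random process just described. The sketch below is our own illustration (it assumes the column-stochastic `T`, `O` representation and the names from our earlier sketch; a policy maps the tuple of observations seen so far to an action); truncating at a finite horizon incurs an error of at most the discount factor raised to that horizon, times the reward bound over one minus the discount.

```python
import numpy as np

def estimate_value(T, O, mu, R, beta, policy, horizon=50, episodes=100, seed=0):
    """Monte Carlo estimate of the discounted value of `policy`.

    `policy` maps the tuple of observations seen so far to an action,
    matching the history-dependent policies defined above.
    """
    rng = np.random.default_rng(seed)
    n_states, n_obs = O.shape[1], O.shape[0]
    total = 0.0
    for _ in range(episodes):
        s = rng.choice(n_states, p=mu)
        history, ret, discount = (), 0.0, 1.0
        for _ in range(horizon):
            z = rng.choice(n_obs, p=O[:, s])        # emit an observation
            history += (z,)
            a = policy(history)                     # action from the history
            ret += discount * R(s, a)
            discount *= beta
            s = rng.choice(n_states, p=T[a][:, s])  # transition under action a
        total += ret
    return total / episodes
```

On a one-state chain with constant unit reward and discount 1/2, the estimate approaches the geometric sum 2, as expected.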
2.2 Quantum information
We briefly review some basic notation and background from quantum information (see Nielsen and Chuang [9], Watrous [17]). For a finite set , let denote the complex Euclidean space spanned by the unit vectors . We will use the standard Dirac notation throughout. It will often be convenient to view as a Hilbert space. Given complex Euclidean spaces and , let be the set of linear operators from to . For an operator , let denote the unique operator in for which for all and . For brevity, we use when . We say an operator is positive semidefinite if for some . The set of all positive semidefinite operators in is denoted . A positive semidefinite operator is called a density matrix if . The set of all density matrices over is denoted . These density matrices provide a convenient representation of quantum states.
Let be the set of linear operators from to . An operator is called positive if it maps a positive semidefinite to a positive semidefinite . An operator is called completely positive if is positive for every auxiliary space . The set of all linear operators in which are completely positive is denoted . An operator is called trace preserving if for all . The set of linear operators in that are completely positive and trace preserving is denoted . Each element of is called a quantum channel. These quantum channels will provide a representation of quantum operations.
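As a small numeric illustration (ours, not from the paper), the defining properties of density matrices and channels can be checked directly; here a channel is presented in Kraus form, which is one standard way of specifying a completely positive trace-preserving map.

```python
import numpy as np

def is_density_matrix(rho, tol=1e-9):
    """Positive semidefinite with unit trace."""
    hermitian = np.allclose(rho, rho.conj().T, atol=tol)
    psd = hermitian and np.linalg.eigvalsh(rho).min() > -tol
    return psd and np.isclose(np.trace(rho).real, 1.0, atol=tol)

def apply_channel(kraus, rho):
    """Apply a channel given by Kraus operators {K_i}: rho -> sum_i K_i rho K_i^dagger.
    Trace preservation corresponds to sum_i K_i^dagger K_i = identity."""
    return sum(K @ rho @ K.conj().T for K in kraus)
```

For example, a bit-flip channel with flip probability 1/4 has Kraus operators proportional to the identity and to the Pauli X matrix, and it maps density matrices to density matrices.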
For a finite alphabet , a conditional channel over is a collection of channels . The input to a conditional channel is a classical-quantum state , where and . First, the conditional channel prepares the following quantum state:
(5) 
It then applies the projection and returns the second register.
For a finite alphabet , a quantum instrument over is a collection of completely positive operators for which forms a channel. On input , the instrument first prepares the following quantum state:
(6) 
Note that is a quantum channel. Then, the quantum instrument applies the measurement where , for each . Here, an element is selected at random with probability and, conditioned on outcome having been observed, the post-measurement state is (otherwise, if the outcome was not observed, the resulting state is ).
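The behaviour of a quantum instrument just described can be sketched numerically. In this illustration (our own; each completely positive map is given by Kraus operators, an assumption of ours), an outcome occurs with probability equal to the trace of the corresponding branch, and the post-measurement state is that branch renormalized.

```python
import numpy as np

def instrument_outcomes(cp_maps, rho):
    """For a quantum instrument {Phi_z}, return {z: (probability, post-measurement state)}.

    cp_maps: dict mapping each outcome z to the Kraus operators of Phi_z;
    the sum of the Phi_z over all z is assumed to be trace-preserving.
    """
    result = {}
    for z, kraus in cp_maps.items():
        sigma = sum(K @ rho @ K.conj().T for K in kraus)
        p = np.trace(sigma).real
        result[z] = (p, sigma / p if p > 1e-12 else None)
    return result
```

For the complete measurement in the computational basis applied to the uniform-superposition state, both outcomes occur with probability 1/2 and the post-measurement states are the basis projectors.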
3 Quantum Transducers
A transducer is a finite automaton with output (see Hopcroft and Ullman [6]). We consider two standard models of transducers, namely Moore and Mealy machines, and propose their natural quantum analogues. We then show that the two quantum models are equivalent.
In this and subsequent sections, we assume is a finite set of states, is a finite set of input symbols, and is a finite set of output symbols. We also assume that is a complex Euclidean space associated with .
A quantum Moore machine is a tuple where is a conditional channel, is a quantum instrument, is the initial density matrix, and is an orthogonal projection (onto an accepting subspace). A single step of is a composition of the transition map followed by the output map . If the current state of is given by the density matrix and the current input is , then transforms to the intermediate state . Further, applies the instrument to and generates an output with probability . The final resulting state is .
For input and output , the probability that on input will output is denoted , while the final resulting state of is denoted . The function computed by on input and output is given by
(7) 
This is the probability that the final state belongs to the accepting subspace. The function computed by on input is given by
(8) 
where
(9) 
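A single step of a quantum Moore machine, as defined above, is the conditional channel for the current input followed by the shared instrument. The sketch below is our own illustration, with both pieces given in Kraus form (an assumption of ours, not a representation fixed by the paper).

```python
import numpy as np

def moore_step(cond_channel, instrument, rho, a, seed=0):
    """One step of a quantum Moore machine.

    cond_channel: dict input symbol -> Kraus operators of a channel.
    instrument:   dict output symbol -> Kraus operators of a CP map
                  (summing over outputs to a trace-preserving map).
    Returns (sampled output symbol, resulting density matrix).
    """
    rng = np.random.default_rng(seed)
    # transition map: the channel selected by input symbol a
    sigma = sum(K @ rho @ K.conj().T for K in cond_channel[a])
    # output map: each branch of the shared instrument
    symbols = list(instrument)
    branches = [sum(K @ sigma @ K.conj().T for K in instrument[z]) for z in symbols]
    probs = np.array([np.trace(t).real for t in branches])
    i = rng.choice(len(symbols), p=probs / probs.sum())
    return symbols[i], branches[i] / probs[i]
```

With an identity channel and a computational-basis instrument, a basis state is reproduced with certainty, matching the description above.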
3.1 Fact ().
Let be a quantum Moore machine. For input and output , we have
(10) 
Proof.
We prove this by induction on the length . For , the state after the conditional channel is . By the properties of the quantum instrument, the probability that outputs is given by
(11) 
and the post-measurement state is
(12) 
This proves the base case.
Assume the claim holds for . Suppose the input is and the output is . Let and . Then, the probability that outputs on input given that it has read input and emitted output is
(13) 
where the last step follows from the inductive hypothesis. Therefore,
(14) 
The resulting quantum state is
(15) 
This shows the claim holds for . ∎
A quantum Mealy machine is a tuple where is a set of quantum instruments over a common alphabet , is the initial density matrix, and is an orthogonal projection (onto an accepting subspace). Each quantum instrument is a composition of a channel and a complete measurement over . For notational convenience, we will identify this channel as well (it becomes an instrument after the first register is measured; see Watrous [17]). More specifically, we associate with a collection of completely positive operators satisfying . Then, we define the channel
(16) 
Upon measuring the first register, the output is with probability and the postmeasurement state is . Thus, if the current state of is and the current input is , applies to and generates output according to the above process.
For input and output , the probability that on input will output is denoted , while the final resulting state of is denoted . The function computed by on input with output is given by
(17) 
This is the probability that the final state belongs to the accepting subspace. The function computed by on input is given by
(18) 
where
(19) 
3.2 Fact ().
Let be a quantum Mealy machine, where so that for each , the collection forms an instrument. Then, for any input and output , we have
(20) 
3.3 Definition ().
Let be a quantum Moore machine and be a quantum Mealy machine sharing the same input alphabet and output alphabet . We say and are equivalent if for all inputs and outputs , we have and .
In what follows, we show that the two models of quantum transducers are equivalent. The idea behind this is the decoupling of the set of quantum instruments in a quantum Mealy machine into a composition of a conditional channel and a shared quantum instrument in a quantum Moore machine (see Figure 2).
3.4 Proposition ().
For any quantum Moore machine, there is an equivalent quantum Mealy machine.
Proof.
Let be a quantum Moore machine. Consider a quantum Mealy machine where the collection of instruments are defined as
(21) 
Since both and are completely positive, so is . Note that for each , we have
(22) 
is a channel since channels are closed under composition. Therefore, the following map is also a channel:
(23) 
This is the same state that is prepared by before the measurement. ∎
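At the level of Kraus representations, the construction in this proof is simply operator composition: the Kraus operators of an instrument branch composed after a channel are all products of one Kraus operator from each factor. A minimal check (our own sketch, with our own names) that composing a channel with a complete instrument again yields an instrument:

```python
import numpy as np

def compose_kraus(instrument_kraus, channel_kraus):
    """Kraus operators of (CP map with Kraus {K}) composed after (channel with Kraus {L})."""
    return [K @ L for K in instrument_kraus for L in channel_kraus]

def total_map_is_trace_preserving(cp_maps, tol=1e-9):
    """Check that the sum over all outcomes and Kraus operators of K^dagger K is the identity."""
    d = cp_maps[next(iter(cp_maps))][0].shape[0]
    total = sum(K.conj().T @ K for kraus in cp_maps.values() for K in kraus)
    return np.allclose(total, np.eye(d), atol=tol)
```

For instance, composing a Hadamard (unitary) channel with the computational-basis instrument produces a family of CP maps whose sum is still trace-preserving.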
3.5 Proposition ().
For any quantum Mealy machine, there is an equivalent quantum Moore machine.
Proof.
Let be a quantum Mealy machine where so that , for each .
Consider a quantum Moore machine defined as follows. We let be a channel given by
(24) 
and let be a linear operator given by
(25) 
Note that is completely positive and is the identity channel. Hence, forms a quantum instrument.
We see that
(26) 
and
(27) 
This is the same state that is prepared by before the measurement. ∎
3.6 Theorem ().
Quantum Mealy and Moore machines are equivalent.
Remark: The classical transducers are special cases of the quantum transducers. For example, a Moore machine can be defined as a tuple where is a set of Markov chains, is a set of probability distributions over , is the initial state and is a collection of accepting states. A Mealy machine is defined similarly except for one difference: the set is given by , that is, each probability distribution depends on both the current state and the action taken. These differ slightly from the models defined in [6] in that we retain the notion of accepting states (which is standard for finite automata).
Remark: Note that any Markov chain, hidden Markov model, or partially observable Markov decision process (and hence Markov decision process) can be simulated by a quantum Moore machine. Moreover, the decoupled structure of a quantum Moore machine allows us to either focus on the input side (conditional channel) or the output side (quantum instrument). Thus, it is easy to see that a quantum Moore machine can simulate any quantum Markov chain, any quantum automata [4], and any hidden quantum Markov model [16, 8].
4 Quantum observable Markov decision processes
We consider the quantum observable Markov decision process defined by Barry et al. [1]. Let be a finite set of states, be a finite set of input symbols, and be a finite set of output symbols. Let be the complex Euclidean space associated with the set of states.
4.1 Definition ().
We note that a superoperator is a quantum instrument.
4.2 Definition ().
(Barry et al. [1])
A quantum observable Markov decision process (QOMDP) is a tuple where
is a set of superoperators,
is a set of operators,
is the initial state,
and is a discount factor.
Moreover, we assume each superoperator is defined by the set of Kraus operators
satisfying .
The reward associated with taking action from state is given by
.
The set of superoperators in Definition 4.2 functionally acts as a “conditional quantum instrument.” Using Theorem 3.6, we formalize this using a conditional channel and a quantum instrument (as defined in Watrous [17] and in Wilde [18]).
Remark: The reward operator must be Hermitian, since we require that the rewards be real for the purpose of optimization.
4.3 Fact ().
The reward is a bounded realvalued function.
Proof.
Suppose , where is the set of states. So, is a Hermitian matrix for all , and , where is the maximum eigenvalue (in absolute value) of . ∎
The next proposition shows that we may assume, without loss of generality, that the reward function depends on the current state alone. We adopt this simplifying assumption from here on.
4.4 Proposition ().
Let be a quantum observable Markov decision process where the reward is defined as for all . There is an equivalent quantum observable Markov decision process where the reward is defined as for all .
Proof.
Suppose is the underlying finite set of states of . Let . Since , we have . Let , for some arbitrary .
Each quantum instrument of is assumed to act on product states and is defined as
(28) 
So, applies the quantum instrument of to its first component and applies the replacement channel to its second component.
We define . Then, . So, if is the state trajectory of with a sequence of actions , then the corresponding trajectory of is . Thus, . ∎
Suppose is a quantum observable Markov decision process. Given a sequence of actions and a sequence of observations , we have
(29) 
where is the state of on input and output and is the probability that outputs on input .
4.5 Definition ().
(Barry et al. [1])
A goal quantum observable Markov decision process is a tuple
where , , , , and are as defined for quantum observable Markov decision process,
and is the goal state. The goal state is absorbing, that is, for all and ,
if ,
then .
4.6 Fact ().
For every goal quantum observable Markov decision process, there are equivalent quantum Mealy and quantum Moore machines.
Proof.
(Sketch) We let the Mealy projector be a projection onto the onedimensional subspace spanned by the unique goal state of the goal quantum observable Markov decision process. The equivalent quantum Moore machine is obtained from the quantum Mealy machine using Theorem 3.6. ∎
Similar to Definition 4.5, we may also define a goal partially observable Markov decision process as , where is an absorbing state for each Markov chain , . Likewise, we can show that it is equivalent to a Moore machine. Note that the structure of a goal partially observable Markov decision process (used in, for example, Madani et al. [7]) is closer to the structure of a Moore machine than to a Mealy machine (see Hopcroft and Ullman [6], pages 42–43). This is due to the decoupled nature of the set of Markov chains and the stochastic output function in a goal partially observable Markov decision process.
5 Complexity
We explore some computational problems related to partially observable and quantum observable Markov decision processes. These include reachability (or planning) and occurrence problems.
Reachability
Instance: A Moore machine , where is an absorbing goal state, and a threshold .
Question: Is there so that ?
The above problem is called the plan-existence or probabilistic-planning problem in [7].
5.1 Theorem ().
(Madani, Hanks, and Condon [7], Theorem 3.7)
Reachability is undecidable.
In contrast, the reachability problem with is decidable (a proof was given in [1]).
Quantum Reachability
Instance: A quantum Moore machine and a threshold .
Question: Is there so that ?
5.2 Theorem ().
Quantum Reachability is undecidable.
Proof.
This follows from Theorem 5.1 since a quantum Moore machine can simulate a Moore machine. ∎
5.3 Theorem ().
(Barry, Barry, and Aaronson [1])
Quantum Reachability with is undecidable.
Blondel et al. [4] proved that the quantum reachability problem where the conditional channel consists of unitary channels with a trivial quantum instrument is decidable if and only if the threshold condition is a strict inequality.
In the occurrence problem, the goal is to determine whether, in a partially observable Markov decision process, the probability of observing a given sequence of outputs is less than or equal to a given threshold. Informally, this asks if the output sequence is a rare anomaly.
Non-Occurrence
Instance: A Moore machine and .
Question: Are there and so that ?
Although the following result is possibly folklore, for completeness, we provide a simple proof.
5.4 Proposition ().
Non-Occurrence is undecidable.
Proof.
We reduce goal-state reachability to non-occurrence. Let be a Moore machine and a threshold. Recall that is the probability of the transition to under action and is the probability of output from state .
Let be a Moore machine where , where , and , where . For each , we let
(30) 
Let if and , if and , and otherwise. We may set and arbitrarily.
We observe that reaches its goal state with probability at least if and only if outputs a string without with probability at most . ∎
Eisert et al. [5] showed that the classical NonOccurrence problem with is decidable.
Quantum Non-Occurrence
Instance: A quantum Moore machine , and .
Question: Are there and so that ?
5.5 Corollary ().
Quantum Non-Occurrence is undecidable.
Proof.
This follows from the undecidability of the classical problem (Proposition 5.4). ∎
For the special case when , it is known that the classical problem is decidable. In the quantum setting, however, we have the following result.
5.6 Theorem ().
(Eisert, Müller, and Gogolin [5])
Quantum Non-Occurrence with trivial conditional channel and is undecidable.
6 Policy Existence
In a Markov decision process with a finite action space, Blackwell [3] proved that there always exists an optimal policy that is deterministic and stationary. We describe Blackwell’s theorem along with the relevant background.
6.1 Definition ().
Let be a topological space. A collection of subsets of is called a σ-algebra of if it contains and is closed under complementation and countable unions. The smallest σ-algebra of which contains all open subsets of is called the Borel σ-algebra of . The elements of this Borel σ-algebra are called the Borel subsets of . A Borel set is a Borel subset of a complete separable metric space. The set of all bounded real-valued functions over a Borel set is denoted .
Let be a Markov decision process where is a Borel set, is a finite set of actions, is a conditional probability distribution over given , is a reward function, is the initial state, and is a discount factor.
For a positive integer , let be the history at time . A policy is a sequence where is a conditional distribution over given . A policy is Markov if each is a deterministic map. A policy is stationary if for all , , for some map . The value of a policy is given by
(31) 
The expectation is taken over a probability space that contains all infinite trajectories (see Appendix A). Here, , , is the sequence of states induced by . Note . A policy is optimal if for all policies . In this case, we denote the optimal value as .
6.2 Theorem ().
(Blackwell [3], Theorem 5)
Let be a Markov decision process (as defined above).
Suppose the operator is defined as
where
(32) 
Then, is a contraction, that is, , so that (by the Banach fixed-point theorem) has a unique fixed point and , for all and any .
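For a finite-state, finite-action MDP, this contraction property justifies plain value iteration: iterating the operator from any starting point converges to the unique fixed point at a geometric rate. The sketch below is our own numeric illustration (column-stochastic transition matrices and per-action reward vectors are our chosen encoding), not a construction from the paper.

```python
import numpy as np

def value_iteration(P, R, beta, tol=1e-10):
    """Fixed point of the Bellman operator
    (Tu)(s) = max_a [ R[a][s] + beta * sum_{s2} P[a][s2, s] * u(s2) ].

    P[a] is column-stochastic; since T is a beta-contraction in the sup norm,
    the iteration converges geometrically by the Banach fixed-point theorem.
    """
    n = next(iter(P.values())).shape[0]
    u = np.zeros(n)
    while True:
        new = np.max([R[a] + beta * P[a].T @ u for a in P], axis=0)
        if np.max(np.abs(new - u)) < tol:
            return new
        u = new
```

On a one-state example with actions paying 0 and 1 and discount 1/2, the optimal value is the geometric sum 2.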
6.3 Theorem ().
(Blackwell [3])
Let be a Markov decision process (as defined above).
Then:

is optimal if and only if (see Theorem 6(f)).

There is an optimal stationary policy for (see Theorem 7(b)).
In what follows, we apply Blackwell’s theorems to a quantum observable Markov decision process. Given that a quantum observable Markov decision process is a Markov decision process over the set of density matrices, it suffices to show that the latter forms a Borel space.
6.4 Proposition ().
Let be a complex Euclidean space over a finite set . Then, the set of all density matrices over is a Borel set.
Proof.
Let . Then, each element of is a unit trace, positive semidefinite matrix from . The latter is a complete separable metric space (with a norm induced by the inner product ). To show is Borel, it suffices to show it is closed. Let be a convergent sequence in where as . We need to show . Since trace is linear, it is continuous; thus, . To show , consider the map for some fixed but arbitrary . Since is linear, it is continuous. Thus, . This shows that . Since is arbitrary, this shows . Thus, . ∎
6.5 Theorem ().
Let be a quantum observable Markov decision process. Then, there is an optimal stationary policy for .
Under certain assumptions, the optimal value function has a compact representation as a piecewise linear and convex function over the probability simplex. This was originally observed by Sondik [12]. In the next theorem, we generalize this observation to the quantum setting.
6.6 Theorem ().
Let