Machine learning has gained huge popularity ever since Google's AI beat the world champion at Go. In this paper, we generalize the PAC (probably approximately correct) learning model [Val84], a well-studied learning model in classical computer science, to the quantum world. We do so by generalizing the concepts in a PAC learning model from classical functions to quantum processes, defining the problem of PAC learning quantum process.

The problem of PAC learning quantum process is detailed as follows: Let the concept class be a finite set of known to dimensional quantum processes. We are trying to learn an unknown quantum process, the target concept . In order to do this, we are given samples , where are inputs to the quantum process and are the corresponding quantum states output by . The inputs are drawn from an unknown distribution . Because of the no-cloning theorem, it is hard to justify holding both the inputs and outputs as unknown quantum states, so we restrict the inputs to computational basis states and keep the output states as unknown quantum states, meaning that we hold a copy of the quantum state rather than its full classical description.

A proper -PAC learner for the concept class of quantum processes is a quantum algorithm that takes the description of and samples as input and outputs a hypothesis that is -close to the target concept with probability for any target concept and input distribution , where the distance between two concepts depends on the input distribution and is defined as , the expected trace distance between the outputs averaged over . (Proper means that the hypothesis must be inside the concept class , unlike the improper case, where can be any density matrix. All learners in this paper are proper, and we may omit the term “proper”.) We show that the problem of PAC learning quantum process can be solved with
samples when the outputs are pure states and
samples if the outputs can be mixed.
Besides being a generalization of the classical PAC learning model, PAC learning quantum process can be viewed as an efficient way to do quantum process tomography when we know that the target quantum process comes from a finite set. For example, if we try to PAC-learn a polynomial-sized quantum circuit on qubits, then since there are only possible polynomial-sized circuits, our result shows that we can learn it with samples, an exponential improvement over full process tomography.
Since our samples consist of unknown quantum states, a challenging part of the problem is how to extract information from those quantum states to distinguish the concepts. In fact, this is most of the challenge of the problem, and we can isolate it by focusing on the special case of constant input. In this special case, the problem of PAC learning quantum process becomes an interesting hybrid of quantum state discrimination and quantum state tomography, which we call the approximate state discrimination problem. The approximate state discrimination problem is detailed as follows: Let be a finite set of -dimensional density matrices. We want to learn a target state using as few identical copies of as possible. A quantum algorithm is an -approximate discriminator of if it takes the description of and copies of as input and with probability outputs a state with , for any .
Since it is a special case of PAC learning quantum process, the approximate state discrimination problem can also be solved with
samples if consists of pure states and
samples if the states in can be mixed.
To the best of our knowledge, the approximate state discrimination problem has not been studied in the literature. It is illuminating to compare the approximate state discrimination problem to other well-studied problems in the literature that try to learn or distinguish quantum states given multiple copies of an unknown -dimensional state . In the following paragraphs we compare approximate state discrimination to quantum state tomography, quantum state discrimination, and quantum property testing.
In the problem of quantum state tomography, we want to get an -approximation of the unknown state . Compared to quantum state tomography, approximate state discrimination has the same goal of finding an -close output, but we are given a promise that the unknown state comes from a known finite set . As a result, the sample complexity of our algorithms is and independent of , the dimension of the target state. This means that we have a speedup if the concept class is not too large. For example, if the size of is exponential in the number of qubits, , the approximate state discrimination problem can be solved with samples, an exponential improvement over full state tomography, which uses samples.
In the quantum state discrimination problem [AM14, Mon08, Mon06, BK17, TADR18], which is also called the quantum detection problem [Mon06] or quantum hypothesis testing [TADR18], we are promised that the state comes from a known finite set and we want to find out exactly what is. Compared to the state discrimination problem, approximate state discrimination has the same promise of a finite input set, but allows an approximate output instead of requiring the exact answer. Therefore, the promise on the minimum distance between inputs can be removed. However, the simple state discrimination algorithm of taking several copies to amplify the minimum distance and then performing a PGM (pretty good measurement) does not work for the approximate state discrimination problem, since the error probability of the PGM is not bounded when some of the states are close to each other.
In the quantum property testing problem [MdW16, HLM17], we are given copies of an unknown quantum state and want to determine whether , where is a known (possibly infinite) set of quantum states, or is -far from anything in . Harrow, Lin, and Montanaro [HLM17] give an upper bound on the quantum property testing problem in the special case where is a finite set of pure states. Compared to [HLM17], our pure state algorithm has essentially the same sample complexity. Note that in quantum property testing, the unknown state does not always come from , and we only want a decision answer instead of finding a state, so the problem is quite different from approximate state discrimination. Also note that the quantum property testing result of [HLM17] cannot be generalized to arbitrary mixed states, as [BOW17] shows that certifying a mixed state requires samples. In the quantum state certification problem, we are given copies of an unknown quantum state and ask whether is equal to some known state or far from it, so it is a special case of quantum property testing with , and the lower bound of is much larger than the sample complexity of in [HLM17] unless is exponential or more in .
There are also several works in the literature that study the sample complexity of PAC learning under different generalizations to quantum computation. Aaronson [Aar07] studies the problem of PAC learning an arbitrary unknown quantum state, where the inputs are binary outcome measurements with full classical description and the outputs are the measurement outcomes. He shows that the sample complexity is linear in the number of qubits of the concepts. Cheng, Hsieh, and Yeh [CHY15] study the sample complexity of PAC learning arbitrary two-outcome measurements, where the inputs are quantum states and the learner has a complete classical description of them. They show an upper bound on the sample complexity that is linear in the dimension of the Hilbert space. Note that one can trivially get a lower bound of similar order by noticing that Boolean functions are a subset of two-outcome measurements. Arunachalam and de Wolf [AdW17b] study the sample complexity of PAC learning classical functions with quantum samples and show that there is no quantum speedup. See [AdW17a] for a survey of quantum learning theory.
Throughout this paper, $\log$ is base 2.
We use to denote the trace norm . We use or to denote the Frobenius norm .
Denote the trace distance and fidelity between two distributions as and . Denote the trace distance and fidelity between two quantum states as and . For a quantum state and a quantum measurement , denote
as the output probability distribution when applying on .
Note that fidelity and trace distance are related by
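A standard form of this relation is the Fuchs-van de Graaf inequalities, $1 - F \le D_{\mathrm{tr}} \le \sqrt{1 - F^2}$, assuming the unsquared fidelity convention (whether this matches the convention used here is an assumption). For classical distributions the two quantities reduce to the total variation distance and the Bhattacharyya coefficient, so the inequalities can be checked numerically. The following is a minimal sketch; the function names and the example distributions are illustrative only.

```python
import math

def trace_distance(p, q):
    """Trace distance of two classical distributions: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def fidelity(p, q):
    """Unsquared fidelity of two classical distributions (Bhattacharyya coefficient)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Illustrative distributions over three outcomes (not taken from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

d = trace_distance(p, q)
f = fidelity(p, q)

# Fuchs-van de Graaf bounds under the unsquared fidelity convention.
lower = 1.0 - f
upper = math.sqrt(1.0 - f * f)
print(lower, d, upper)
```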
For two quantum process concepts , define the distance between them as
We say that are -close if and -far if . For two sets of concepts and , define the distance between them as .
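The distance between two concepts, described in the introduction as the expected trace distance between their outputs averaged over the input distribution, can be sketched for the pure state case as follows. The sketch assumes that each concept is given as a table from computational basis inputs to output state vectors, and it uses the pure state identity that the trace distance equals $\sqrt{1 - |\langle\psi|\phi\rangle|^2}$. The names `concept_distance`, `identity`, and `rotated` are hypothetical.

```python
import math

def pure_trace_distance(psi, phi):
    """Trace distance between two pure states given as amplitude lists."""
    overlap = sum(a.conjugate() * b for a, b in zip(psi, phi))
    return math.sqrt(max(0.0, 1.0 - abs(overlap) ** 2))

def concept_distance(c1, c2, dist):
    """Expected trace distance between the outputs of c1 and c2 under dist.

    c1, c2: dict mapping each input x to an output state vector.
    dist:   dict mapping each input x to its probability.
    """
    return sum(prob * pure_trace_distance(c1[x], c2[x]) for x, prob in dist.items())

# Toy single-qubit concepts (illustrative only).
s = 1.0 / math.sqrt(2.0)
identity = {0: [1.0, 0.0], 1: [0.0, 1.0]}
rotated = {0: [s, s], 1: [0.0, 1.0]}  # differs from identity only on input 0
uniform = {0: 0.5, 1: 0.5}

dist_value = concept_distance(identity, rotated, uniform)
print(dist_value)
```

Because the distance is weighted by the input distribution, two concepts that differ only on rarely sampled inputs are close under this measure.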
2.1 Pretty Good Measurement
The pretty good measurement (PGM) is defined as follows:
Definition 1 (pretty good measurement).
Let be a set of density matrices and the set of corresponding probabilities. Define
The PGM associated with is a POVM with
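For reference, the textbook form of the pretty good measurement is given below. The symbols $\rho_i$, $p_i$, $\rho$, and $E_i$ are assumed notation and may differ from the notation of Definition 1.

```latex
% Textbook pretty good measurement for an ensemble {(p_i, \rho_i)}.
% The inverse square root is taken on the support of \rho.
\rho = \sum_{i} p_i \rho_i,
\qquad
E_i = \rho^{-1/2} \, p_i \rho_i \, \rho^{-1/2},
\qquad
\sum_{i} E_i = \Pi_{\mathrm{supp}(\rho)} .
```

Each $E_i$ is positive semidefinite, and the elements sum to the projector onto the support of $\rho$, so they form a POVM on that support.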
3 Definition of PAC Learning Quantum Process and Approximate State Discrimination
In this section we describe the model of PAC learning quantum process and approximate state discrimination.
3.1 PAC Learning Quantum Process
Let the concept class be a finite set of known to dimensional quantum processes. A learner trying to learn the target concept is given samples , where are inputs to the quantum process and are the corresponding quantum density matrices output by . The inputs are computational basis states drawn from an unknown distribution . The output states are unknown quantum states, meaning that we hold a copy of the quantum state rather than its full classical description.
A proper -learner for the concept class is a quantum algorithm that takes the description of and samples as input and outputs a hypothesis such that is with probability for any target concept and input distribution . The sample complexity of a learner is the maximum number of samples it takes over and . The sample complexity of a concept class is the minimum sample complexity over all learners.
3.2 Approximate State Discrimination
Let be a finite set of -dimensional density matrices. We want to learn a target state using as few identical copies of as possible. A quantum algorithm is an -approximate discriminator of if it takes the description of and copies of as input and with probability outputs a state with , for any .
4 PAC Learning Quantum Process with Pure State Output
The algorithm follows ideas by Sen [Sen05], who shows that a random orthonormal measurement preserves the trace distance between pure states. One can then apply random orthonormal measurements on each sampled output and take enough samples to amplify the distance between -far concepts to
and show that the probability for the maximum likelihood estimate to select a -far concept over the target concept is less than . Taking a union bound then gives a bounded error probability.
Algorithm 3 is a proper -PAC learner for any concept class of quantum processes with pure state outputs, using
Algorithm 3 (algorithm for pure state output).
do a random orthonormal measurement on each output state
output the concept that is most likely to give the measured result of step 2:
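The selection step of Algorithm 3 can be sketched classically as a maximum likelihood choice over a finite set of candidates. In the sketch below, each concept is represented by a hypothetical function `predict(x, basis)` that returns the outcome distribution the concept would induce when its output on input `x` is measured in the given basis. This representation, the function names, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import math

def max_likelihood_concept(concepts, samples):
    """Return the name of the concept that maximizes the likelihood of the data.

    concepts: dict mapping a name to predict(x, basis) -> list of outcome probabilities.
    samples:  list of (x, basis, outcome) triples, one per measured output state.
    """
    def log_likelihood(predict):
        total = 0.0
        for x, basis, outcome in samples:
            prob = predict(x, basis)[outcome]
            if prob == 0.0:
                return -math.inf  # this concept cannot have produced the data
            total += math.log(prob)
        return total

    return max(concepts, key=lambda name: log_likelihood(concepts[name]))

# Toy example with two candidate concepts and two-outcome measurements.
concepts = {
    "c1": lambda x, basis: [0.9, 0.1],
    "c2": lambda x, basis: [0.5, 0.5],
}
samples = [(0, "b0", 0), (1, "b1", 0), (0, "b2", 0), (1, "b3", 1)]

best = max_likelihood_concept(concepts, samples)
print(best)
```

Working with log-likelihoods avoids numerical underflow when the number of samples is large.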
Lemma 4 (random orthonormal measurement).
Let be two density matrices in . Define . There exists a universal constant such that if then with probability at least over the choice of a random orthonormal measurement basis in , .
Note that if are pure states, for large enough and so that .
The following lemma shows how trace distance of the measured result grows when we take multiple samples.
Lemma 5 (trace distance amplification).
Let be independent distributions, and likewise . Denote the joint distribution as and as . Suppose that
By the Cauchy-Schwarz inequality,
Then the joint fidelity is bounded by
where the last inequality is true because . And the joint trace distance is
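The mechanism behind this kind of amplification can be checked numerically: the fidelity of product distributions is the product of the coordinate fidelities, and the trace distance is at least one minus the fidelity (assuming the unsquared fidelity convention). The sketch below uses identical coordinates for simplicity, which is a special case of the lemma's setting; the distributions and the number of copies are illustrative.

```python
import itertools
import math

def trace_distance(p, q):
    """Half the L1 distance between two classical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def fidelity(p, q):
    """Unsquared fidelity (Bhattacharyya coefficient)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def product_distribution(dists):
    """Joint distribution of independent coordinates, flattened in a fixed order."""
    return [math.prod(probs) for probs in itertools.product(*dists)]

p = [0.6, 0.4]
q = [0.4, 0.6]
copies = 8

joint_p = product_distribution([p] * copies)
joint_q = product_distribution([q] * copies)

f_single = fidelity(p, q)
f_joint = fidelity(joint_p, joint_q)
d_single = trace_distance(p, q)
d_joint = trace_distance(joint_p, joint_q)

# Fidelity multiplies across independent coordinates, so it decays with more copies,
# and the joint trace distance is at least 1 - f_joint.
print(f_single ** copies, f_joint, d_single, d_joint)
```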
The following lemma analyzes the effectiveness of the maximum likelihood estimate.
For any two distributions with trace distance ,
Now we have everything we need to prove Theorem 2.
Let be the target concept and be the hypothesis guessed by Algorithm 3. Let . Recall that we took
samples. For all , applying Lemma 4 to the pair of states , we get that with probability over random orthonormal measurements ,
is a universal constant. Since one can pad ancilla states to increase without changing trace distances if is not small enough, we ignore this term. By the Chernoff bound, with probability at least over sampled from ,
So we can apply Lemma 5 to get that with probability at least ,
Finally we apply Lemma 6 and union bound to get
5 PAC Learning Quantum Process with Mixed State Output
The random orthonormal measurement approach in Section 4 does not work here, since two high-dimensional mixed states with constant trace distance between them can have negligible Frobenius distance between them. Instead, we show that if we apply PGM over a carefully chosen subset of , we can rule out the possibility of the target concept being inside some subset whose size is a constant fraction of . (Technically, the measurement we apply is not a PGM but a minimax measurement strategy whose worst-case error probability is upper bounded by the error probability of PGMs.) We then repeat this procedure times to pinpoint .
Before we describe the PGM-based procedure, let us first show that we can efficiently amplify the distance between concepts by taking more samples.
Lemma 7 (concept distance amplification).
Let be a quantum process concept -far from the target concept . Let be inputs drawn from the distribution . With probability over , we have
By the Chernoff bound, with probability ,
Then by the Cauchy-Schwarz inequality,
Then the amplified fidelity is bounded by
where the last inequality is true because . And the amplified trace distance is
Lemma 7 means that we can amplify the distance between tensor products of samples from quantum processes as efficiently as we do for samples of fixed quantum states. This means that PAC learning quantum process is very similar to approximate state discrimination even in the mixed state case.
Now back to the topic of PGM. When trying to apply PGM to approximate state discrimination, the main difficulty is that there is no restriction on the concept class, so the distance between two concepts can be arbitrarily small, and this poses a difficulty for PGM, which tries to distinguish every concept. We might even get a pathological case where two orthogonal concepts and are connected by a chain of close concepts, making PGM unable to distinguish those orthogonal concepts. To combat these issues, we partition the concept class into three subsets , chosen so that the distance (between two sets of concepts, as defined earlier) between and is , a number to be chosen later. The idea is that we give up gaining information about the concepts in , in exchange for a good “binary distinguishment” between and . We apply PGM (again, technically the measurement we apply is not a PGM) just to get an answer: if our measurement result is , we know the target concept is in ( or ) and thus not in , and vice versa, so we can either rule out the possibility that the target concept is in or the possibility that the target concept is in . We pick and so that their sizes are both a constant fraction of the size of the concept class except when an “extreme case” is found, so we can always rule out a constant fraction of the concept class after the PGM measurement. Repeating this times, we find the target concept.

A careful reader might have already recognized an extreme case: it is not possible to have constant-sized and separated by a gap if every concept in is literally on top of each other. But note that in this case, we can output anything in as the hypothesis and it will be -close to . More generally, our partition algorithm will not be able to reserve a constant-sized if a significant fraction of is clustered around a concept. In such an extreme case, we choose the cluster as with an -thick “shell” of around it.
If we measure , we can rule out , which is a constant fraction of . If we measure yes, we can output the center of the cluster as the hypothesis, and we tune so that everything in the cluster or the -shell is -close to the center.
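The control flow of this elimination procedure can be sketched as follows. The partition and the measurement are passed in as hypothetical callables, the set names `s_yes` and `s_no` are assumed labels for the two sets being distinguished, and the toy instance replaces the quantum measurement by a noise-free classical oracle. It is meant only to illustrate the loop structure under these assumptions, not the actual partition or measurement.

```python
def eliminate(concepts, partition, measure):
    """Repeatedly discard part of the concept class until one candidate remains.

    partition(remaining) -> (s_yes, s_no, center); center is None unless the
                            extreme (cluster) case was detected.
    measure(s_yes, s_no) -> "yes" or "no"; the answer may be arbitrary when the
                            target lies in neither set.
    """
    remaining = set(concepts)
    while len(remaining) > 1:
        s_yes, s_no, center = partition(remaining)
        if measure(s_yes, s_no) == "yes":
            if center is not None:
                return center      # cluster case: output the cluster center
            remaining -= s_no      # target is not in s_no
        else:
            remaining -= s_yes     # target is not in s_yes
    return next(iter(remaining))

# Toy instance: concepts are integers on a line, the target is known to the oracle.
TARGET = 11

def toy_partition(remaining):
    ordered = sorted(remaining)
    k = max(1, len(ordered) // 3)
    # Lowest third is s_yes, highest third is s_no, the middle is left undecided.
    # The cluster case never occurs in this toy instance, so center is None.
    return set(ordered[:k]), set(ordered[-k:]), None

def toy_measure(s_yes, s_no):
    return "no" if TARGET in s_no else "yes"

hypothesis = eliminate(range(16), toy_partition, toy_measure)
print(hypothesis)
```

Each round removes a constant fraction of the remaining candidates, so the number of rounds is logarithmic in the size of the concept class.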
The measurement we use to distinguish and is derived from the pretty good measurement and the minimax theorem. First, by slightly modifying results of [BK02] and [AM14], we obtain a lemma about the distinguishing power of PGM on two disjoint sets:
Lemma 8 (Binary distinguishment power of PGM).
Let be a set of density matrices and corresponding probabilities where and . (We will slightly abuse the notation and write or instead of or .) When we do a PGM on