1 Introduction
How many single-copy measurements are needed to “learn” an unknown $n$-qubit quantum state $\rho$? If we wish to reconstruct the full $2^n \times 2^n$ density matrix, even approximately, and if we make no assumptions about $\rho$, then it is straightforward to show that the number of measurements needed grows exponentially with $n$. In fact, even when we allow joint measurement of multiple copies of the state, an exponential number of copies of $\rho$ are required (see, e.g., O’Donnell and Wright (2016); Haah et al. (2017)).
Suppose, on the other hand, that there is some probability distribution $\mathcal{D}$ over possible yes/no measurements, where we identify the measurements with $2^n \times 2^n$ Hermitian matrices $E$ with eigenvalues in $[0,1]$. Further suppose we are only concerned about learning the state $\rho$ well enough to predict the outcomes of most measurements $E$ drawn from $\mathcal{D}$, where “predict” means approximately calculating the probability, $\mathrm{Tr}(E\rho)$, of a “yes” result. Then for how many (known) sample measurements $E_i$, drawn independently from $\mathcal{D}$, do we need to know the approximate value of $\mathrm{Tr}(E_i\rho)$, before we have enough data to achieve this?
Aaronson (2007) proved that the number of sample measurements needed, $m$, grows only linearly with the number of qubits $n$. What makes this surprising is that it represents an exponential reduction compared to full quantum state tomography. Furthermore, the prediction strategy is extremely simple. Informally, we merely need to find any “hypothesis state” $\omega$ that satisfies $\mathrm{Tr}(E_i\omega) \approx \mathrm{Tr}(E_i\rho)$ for all the sample measurements $E_1, \ldots, E_m$. Then with high probability over the choice of sample measurements, that hypothesis $\omega$ will necessarily “generalize,” in the sense that $\mathrm{Tr}(E\omega) \approx \mathrm{Tr}(E\rho)$ for most additional $E$’s drawn from $\mathcal{D}$. The learning theorem led to follow-up work including a full characterization of quantum advice (Aaronson and Drucker (2014)); efficient learning for stabilizer states (Rocchetto (2017)); the “shadow tomography” protocol (Aaronson (2018)); and recently, the first experimental demonstration of quantum state PAC-learning (Rocchetto et al. (2017)).
A major drawback of the learning theorem due to Aaronson is the assumption that the sample measurements are drawn independently from $\mathcal{D}$, and moreover, that the same distribution $\mathcal{D}$ governs both the training samples and the measurements on which the learner’s performance is later tested. It has long been understood, in computational learning theory, that these assumptions are often unrealistic: they fail to account for adversarial environments, or environments that change over time. This is precisely the state of affairs in current experimental implementations of quantum information processing. Not all measurements of quantum states may be available or feasible in a specific implementation; which measurements are feasible is dictated by Nature; and as we develop more control over the experimental setup, more sophisticated measurements become available. The task of learning a state prepared in the laboratory thus takes the form of a game, with the theorist on one side, and the experimentalist and Nature on the other: the theorist is repeatedly challenged to predict the behaviour of the state with respect to the next measurement that Nature allows the experimentalist to realize, with the opportunity to refine the hypothesis as more measurement data become available.
It is thus desirable to design learning algorithms that work in the more stringent online learning model. Here the learner is presented a sequence of input points, say $x_1, x_2, \ldots$, one at a time. Crucially, there is no assumption whatsoever about the $x_t$’s: the sequence could be chosen adversarially, and even adaptively, which means that the choice of $x_t$ might depend on the learner’s behavior on $x_1, \ldots, x_{t-1}$. The learner is trying to learn some unknown function $f(x)$, about which it initially knows only that $f$ belongs to some hypothesis class $\mathcal{H}$, or perhaps not even that; we also consider the scenario where the learner simply tries to compete with the best predictor in $\mathcal{H}$, which might or might not be a good predictor. The learning proceeds as follows: for each $t$, the learner first guesses a value $y_t$ for $f(x_t)$, and is then told the true value $f(x_t)$, or perhaps only an approximation of this value. Our goal is to design a learning algorithm with the following guarantee: regardless of the sequence of $x_t$’s, the learner’s guess, $y_t$, will be far from the true value $f(x_t)$ at most $k$ times (where $k$, of course, is as small as possible). The $x_t$’s on which the learner errs could be spaced arbitrarily; all we require is that they be bounded in number.
This leads to the following question: can the learning theorem established by Aaronson (2007) be generalized to the online learning setting? In other words: is it true that, given a sequence $E_1, E_2, \ldots$ of yes/no measurements, where each $E_t$ is followed shortly afterward by an approximation of $\mathrm{Tr}(E_t\rho)$, there is a way to anticipate the $\mathrm{Tr}(E_t\rho)$ values by guesses $y_t \in [0,1]$, in such a way that $|y_t - \mathrm{Tr}(E_t\rho)| > \varepsilon$ at most, say, $O(n)$ times (where $\varepsilon > 0$ is some constant, and again $n$ is the number of qubits)? The purpose of this paper is to provide an affirmative answer.
Throughout the paper, we specify a (two-outcome) measurement of an $n$-qubit mixed state $\rho$ by a $2^n \times 2^n$ “POVM element”: that is, a Hermitian matrix $E$ with eigenvalues in $[0,1]$, which “accepts” $\rho$ with probability $\mathrm{Tr}(E\rho)$ and “rejects” $\rho$ with probability $1 - \mathrm{Tr}(E\rho)$. We prove that:
Theorem 1.
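Let $\rho$ be an $n$-qubit mixed state, and let $E_1, E_2, \ldots$ be a sequence of two-outcome measurements that are revealed to the learner one by one, each followed by a value $b_t \in [0,1]$ such that $|\mathrm{Tr}(E_t\rho) - b_t| \le \varepsilon/3$. Then there is an explicit strategy for outputting hypothesis states $\omega_1, \omega_2, \ldots$ such that $|\mathrm{Tr}(E_t\omega_t) - \mathrm{Tr}(E_t\rho)| > \varepsilon$ for at most $O(n/\varepsilon^2)$ values of $t$.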
We also prove a theorem for the so-called regret minimization model (i.e., the “non-realizable case”), where we make no assumption about the input data arising from an actual quantum state, and our goal is simply to do not much worse than the best hypothesis state that could be found with perfect foresight. In this model, the measurements $E_1, E_2, \ldots$ are presented to a learner one by one. In iteration $t$, after seeing $E_t$, the learner is challenged to output a hypothesis state $\omega_t$, and then suffers a “loss” equal to $\ell_t(\mathrm{Tr}(E_t\omega_t))$, where $\ell_t$ is a real function that is revealed to the learner. Important examples of loss functions are $L_1$ loss, when $\ell_t(z) := |z - b_t|$, and $L_2$ loss, when $\ell_t(z) := (z - b_t)^2$, where $b_t \in [0,1]$. The number $b_t$ may be an approximation of $\mathrm{Tr}(E_t\rho)$ for some fixed but unknown quantum state $\rho$, but is allowed to be arbitrary in general. In particular, the pairs $(E_t, b_t)$ may not be consistent with any quantum state. Define the regret $R_T$, after $T$ iterations, to be the amount by which the actual loss of the learner exceeds the loss of the best single hypothesis:
$$R_T := \sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\omega_t)) - \min_{\varphi} \sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\varphi)).$$
The learner’s objective is to minimize regret. We show that:
Theorem 2.
Let $E_1, E_2, \ldots$ be a sequence of two-outcome measurements on an $n$-qubit state presented to the learner, and $\ell_1, \ell_2, \ldots$ be the corresponding loss functions revealed in successive iterations in the regret minimization model. Suppose $\ell_t$ is convex and $L$-Lipschitz; in particular, for every $x \in \mathbb{R}$, there is a sub-derivative $\ell_t'(x)$ such that $|\ell_t'(x)| \le L$. Then there is an explicit learning strategy that guarantees regret $R_T = O(L\sqrt{Tn})$ for all $T$. This is so even assuming the measurement $E_t$ and loss function $\ell_t$ are chosen adaptively, in response to the learner’s previous behavior.
Specifically, the algorithm applies to $L_1$ loss and $L_2$ loss, and achieves regret $O(\sqrt{Tn})$ for both.
The online strategies we present enjoy several advantages over full state tomography, and even over “state certification”, in which we wish to test whether a quantum register is close to a desired state or far from it. Optimal algorithms for state tomography (O’Donnell and Wright (2016); Haah et al. (2017)) or certification (Bădescu et al. (2017)) require joint measurements of an exponential number of copies of the quantum state, and assume the ability to perform noiseless, universal quantum computation. On the other hand, the algorithms implicit in Theorems 1 and 2 involve only single-copy measurements, allow for noisy measurements, and capture ground reality more closely. They produce a hypothesis state that mimics the unknown state with respect to measurements that can be performed in a given experimental setup, and the accuracy of prediction improves as the set of available measurements grows. For example, in the realizable case, i.e., when the data arise from an actual quantum state, the average loss tends to zero as the number of measurements becomes large. Finally, the algorithms have run time exponential in the number of qubits, but are entirely classical. (Exponential run time is unavoidable, as the length of the output is exponential in the number of qubits.)
It is natural to wonder whether Theorems 1 and 2 leave any room for improvement. Theorem 1 is asymptotically optimal in its mistake bound of $O(n/\varepsilon^2)$; this follows from the property that $n$-qubit quantum states, considered as a hypothesis class, have $\varepsilon$-fat-shattering dimension $\Theta(n/\varepsilon^2)$ (see for example Aaronson (2007)). On the other hand, there is room to improve Theorem 2. The bounds of which we are aware are $\Omega(\sqrt{Tn})$ for the $L_1$ loss (see, e.g., (Arora et al., 2012, Theorem 4.1)) in the non-realizable case and $\Omega(n)$ for the $L_2$ loss in the realizable case, when the feedback consists of the measurement outcomes. (The latter bound, as well as an $\Omega(\sqrt{Tn})$ bound for $L_1$ loss in the same setting, come from considering quantum mixed states that consist of $n$ independent classical coins, each of which could land heads with probability either $1/2$ or $1/2 + \epsilon$. The parameter $\epsilon$ is set to $\sqrt{n/T}$.)
We mention an application of Theorem 1, to appear in simultaneous work. Aaronson (2018) has given an algorithm for the so-called shadow tomography problem. Here we have an unknown $D$-dimensional pure state $\rho$, as well as known two-outcome measurements $E_1, \ldots, E_M$. Our goal is to approximate $\mathrm{Tr}(E_i\rho)$, for every $i$, to within additive error $\varepsilon$. We would like to do this by measuring $\rho^{\otimes k}$, where $k$ is as small as possible. Surprisingly, Aaronson (2018) showed that this can be achieved with $k = \tilde{O}((\log M)^4 (\log D)/\varepsilon^5)$, that is, a number of copies of $\rho$ that is only polylogarithmic in both $D$ and $M$. One component of his algorithm is essentially tantamount to online learning with $\tilde{O}(n/\varepsilon^3)$ mistakes, i.e., what we present in Section 4 of this paper. However, by using Theorem 1 from this paper in a black-box manner, we can improve the sample complexity of shadow tomography to $\tilde{O}((\log M)^4 (\log D)/\varepsilon^4)$. Details will appear in (Aaronson (2018)).
To maximize insight, in this paper we give three very different approaches to proving Theorems 1 and 2 (although we do not prove every statement with all three approaches). Our first approach is to adapt techniques from online convex optimization to the setting of density matrices, which in general may be over a complex Hilbert space. This requires extending standard techniques to cope with convexity and Taylor approximations, which are widely used for functions over the real domain, but not over the complex domain. We also give an efficient iterative algorithm to produce predictions. This approach connects our problem to the modern mainstream of online learning algorithms, and achieves the best parameters.
Our second approach is via a postselection-based learning procedure, which starts with the maximally mixed state as a hypothesis and then repeatedly refines it by simulating postselected measurements. This approach builds on earlier work due to Aaronson (2005), specifically the proof of $\mathsf{BQP/qpoly} \subseteq \mathsf{PostBQP/poly}$. The advantage is that it is almost entirely self-contained, requiring no “power tools” from convex optimization or learning theory. On the other hand, the approach does not give optimal parameters, and we do not know how to prove Theorem 2 with it.
Our third approach is via an upper bound on the so-called sequential fat-shattering dimension of quantum states, considered as a hypothesis class. In the original quantum PAC-learning theorem by Aaronson, the key step was to upper-bound the so-called $\varepsilon$-fat-shattering dimension of quantum states considered as a hypothesis class. Fat-shattering dimension is a real-valued generalization of VC dimension. One can then appeal to known results to get a sample-efficient learning algorithm. For online learning, however, bounding the fat-shattering dimension no longer suffices; one instead needs to consider a possibly larger quantity called sequential fat-shattering dimension. However, by appealing to a lower bound due to Nayak (1999); Ambainis et al. (2002) for a variant of quantum random access codes, we are able to upper-bound the sequential fat-shattering dimension of quantum states. Using known results (in particular, those due to Rakhlin et al. (2015)), this implies the regret bound in Theorem 2, up to a multiplicative factor of $\log^{3/2} T$. The statement that the hypothesis class of $n$-qubit states has $\varepsilon$-sequential fat-shattering dimension $O(n/\varepsilon^2)$ might be of independent interest: among other things, it implies that any online learning algorithm that works given bounded sequential fat-shattering dimension will work for online learning of quantum states. We also give an alternative proof for the lower bound due to Nayak for quantum random access codes, and extend it to codes that are decoded by what we call measurement decision trees. We expect these also to be of independent interest.
1.1 Structure of the paper
We start by describing background and the technical learning setting as well as notations used throughout. In Section 3 we give the algorithms and main theorems derived using convexity arguments and online convex optimization. In Section 4 we describe the postselection algorithm and state the main theorem using this argument. In Section 5 we give a sequential fat-shattering dimension bound for quantum states and its implication for online learning of quantum states.
2 Preliminaries and definitions
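We define the trace norm of a matrix $M$ as $\|M\|_{\mathrm{Tr}} := \mathrm{Tr}\sqrt{M M^\dagger}$, where $M^\dagger$ is the adjoint of $M$. We denote the $i$th eigenvalue of a Hermitian matrix $X$ by $\lambda_i(X)$, its minimum eigenvalue by $\lambda_{\min}(X)$, and its maximum eigenvalue by $\lambda_{\max}(X)$. By ‘$\log$’ we denote the natural logarithm, unless the base is explicitly mentioned.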
An $n$-qubit quantum state $\rho$ is an element of $C_{2^n}$, where $C_d$ is the set of all trace-1 positive semidefinite (PSD) complex matrices of dimension $d$:
$$C_d := \{ M \in \mathbb{C}^{d \times d} : M = M^\dagger,\ M \succeq 0,\ \mathrm{Tr}(M) = 1 \}.$$
Note that $C_d$ is a convex set. A two-outcome measurement of an $n$-qubit state is defined by a $2^n \times 2^n$ Hermitian matrix $E$ with eigenvalues in $[0,1]$. The measurement $E$ “accepts” $\rho$ with probability $\mathrm{Tr}(E\rho)$, and “rejects” with probability $1 - \mathrm{Tr}(E\rho)$. For the algorithms we present in this article, we assume that a two-outcome measurement is specified via a classical description of its defining matrix $E$. In the rest of the article, unless mentioned otherwise, a “measurement” refers to a “two-outcome measurement”.
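As a small illustration of these definitions, the following Python sketch computes the acceptance probability $\mathrm{Tr}(E\rho)$ for a single qubit. The helper names (`trace_product`, `is_density_matrix_2x2`) and the nested-list representation of matrices are assumptions made for this example only; they are not notation from the paper.

```python
def trace_product(a, b):
    """Return Tr(a b) for square matrices given as nested lists."""
    d = len(a)
    return sum(a[i][j] * b[j][i] for i in range(d) for j in range(d))

def is_density_matrix_2x2(m, tol=1e-12):
    """Check that a 2x2 matrix is Hermitian, PSD and has trace 1."""
    hermitian = all(abs(m[i][j] - complex(m[j][i]).conjugate()) <= tol
                    for i in range(2) for j in range(2))
    tr = complex(m[0][0] + m[1][1])
    det = complex(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    # For a 2x2 Hermitian matrix with trace 1, PSD is equivalent to det >= 0.
    return hermitian and abs(tr - 1.0) <= tol and det.real >= -tol

plus_state = [[0.5, 0.5], [0.5, 0.5]]       # rho = |+><+|
maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]  # rho = I/2
accept_zero = [[1.0, 0.0], [0.0, 0.0]]      # E = |0><0|

# Probability that E "accepts" each state.
p_plus = trace_product(accept_zero, plus_state)
p_mixed = trace_product(accept_zero, maximally_mixed)
```

Both example states are accepted by this particular $E$ with the same probability, even though they are different states; distinguishing them requires a different measurement, such as $E = |+\rangle\langle +|$.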
Online learning and regret.
In online learning of quantum states, we have a sequence of iterations $t = 1, 2, 3, \ldots$ of the following form. First, the learner constructs a state $\omega_t \in C_{2^n}$; we say that the learner “predicts” $\omega_t$. It then suffers a “loss” $\ell_t(\mathrm{Tr}(E_t\omega_t))$ that depends on a measurement $E_t$, both of which are presented by an adversary. Commonly used loss functions are $L_2$ loss (also called “mean square error”), given by
$$\ell_t(z) := (z - b_t)^2,$$
and $L_1$ loss (also called “absolute loss”), given by
$$\ell_t(z) := |z - b_t|,$$
where $b_t \in [0,1]$. The parameter $b_t$ may be an approximation of $\mathrm{Tr}(E_t\rho)$ for some fixed quantum state $\rho$ not known to the learner, but is allowed to be arbitrary in general.
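The two loss functions and their (sub-)derivatives, which the online algorithms later use as gradients, can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and the choice of sub-derivative $0$ at the kink of the absolute loss is one valid option among many.

```python
def l1_loss(z, b):
    """Absolute loss |z - b|."""
    return abs(z - b)

def l1_subderivative(z, b):
    """A sub-derivative of |z - b| in z; any value in [-1, 1] is valid at z == b."""
    if z > b:
        return 1.0
    if z < b:
        return -1.0
    return 0.0

def l2_loss(z, b):
    """Squared loss (z - b)^2."""
    return (z - b) ** 2

def l2_derivative(z, b):
    """Derivative of (z - b)^2 in z."""
    return 2.0 * (z - b)

# Example: predicted acceptance probability 0.8, feedback 0.5.
z, b = 0.8, 0.5
losses = (l1_loss(z, b), l2_loss(z, b))
slopes = (l1_subderivative(z, b), l2_derivative(z, b))
```

For $z, b \in [0,1]$ the slopes are bounded in absolute value by $1$ for the absolute loss and by $2$ for the squared loss, so both losses are Lipschitz on the relevant domain.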
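The learner then “observes” feedback from the measurement $E_t$; the feedback is also provided by the adversary. The simplest feedback is the realization of a binary random variable $Y_t$ such that
$$Y_t = \begin{cases} 1 & \text{with probability } \mathrm{Tr}(E_t\rho), \\ 0 & \text{with probability } 1 - \mathrm{Tr}(E_t\rho). \end{cases}$$
Another common feedback is the number $b_t$ as above, especially in the case that the learner suffers $L_1$ or $L_2$ loss.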
We would like to design a strategy for updating $\omega_t$ based on the loss, measurements, and feedback in all the iterations so far, so that the learner’s total loss is minimized in the following sense. We would like that over $T$ iterations (for a number $T$ known in advance), the learner’s total loss is not much more than that of the hypothetical strategy of outputting the same quantum state $\varphi$ at every time step, where $\varphi$ minimizes the total loss with perfect hindsight. Formally this is captured by the notion of regret $R_T$, defined as
$$R_T := \sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\omega_t)) - \min_{\varphi \in C_{2^n}} \sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\varphi)).$$
The sequence of measurements $E_t$ can be arbitrary, even adversarial, based on the learner’s previous actions. Note that if the loss function is given by a fixed state $\rho$ (as in the case of mean square error), the minimum total loss would be $0$. This is called the “realizable” case. However, in general, the loss function presented by the adversary need not be consistent with any quantum state. This is called the “non-realizable” case.
A special case of the online learning setting is called agnostic learning; here the measurements $E_t$ are drawn from a fixed and unknown distribution $\mathcal{D}$. The setting is called “agnostic” because we still do not assume that the losses correspond to any actual state $\rho$ (i.e., the setting may be non-realizable).
Online mistake bounds.
In some online learning scenarios the quantity of interest is not the mean square error, or some other convex loss, but rather simply the total number of “mistakes” made. For example, we may be interested in the number of iterations in which the predicted probability of acceptance $\mathrm{Tr}(E_t\omega_t)$ is more than $\varepsilon$ far from the actual value $\mathrm{Tr}(E_t\rho)$, where $\rho$ is again a fixed state not known to the learner. More formally, let
$$\ell_t(\mathrm{Tr}(E_t\omega_t)) := |\mathrm{Tr}(E_t\omega_t) - \mathrm{Tr}(E_t\rho)|$$
be the absolute loss function. Then the goal is to bound the number of iterations in which $\ell_t(\mathrm{Tr}(E_t\omega_t)) > \varepsilon$, regardless of the sequence of measurements $E_t$ presented by the adversary. We assume that in this setting, the adversary provides as feedback an approximation $b_t \in [0,1]$ that satisfies $|\mathrm{Tr}(E_t\rho) - b_t| \le \varepsilon/3$.
3 Online learning of quantum states
In this section, we use techniques from online convex optimization to minimize regret. The same algorithms may be adapted to also minimize the number of mistakes made.
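3.1 Regularized Follow-the-Leader
We first follow the template of the Regularized Follow-the-Leader algorithm (RFTL; see, for example, (Hazan, 2015, Chapter 5)). The algorithm below makes use of von Neumann entropy, which relates to the Matrix Exponentiated Gradient algorithm (Tsuda et al. (2005)).
Algorithm 1 (RFTL with von Neumann entropy regularization)
1: Input: $T$, $\mathcal{K} := C_{2^n}$, step size $\eta$
2: Set $\omega_1 := 2^{-n} I$.
3: for $t = 1, \ldots, T$ do
4: Predict $\omega_t$. Consider the convex and $L$-Lipschitz loss function $\ell_t$ given by the measurement $E_t$, namely $\varphi \mapsto \ell_t(\mathrm{Tr}(E_t\varphi))$. Let $\ell_t'(x)$ be a sub-derivative of $\ell_t$ with respect to $x$. Define $\nabla_t := \ell_t'(\mathrm{Tr}(E_t\omega_t)) E_t$.
5: Update the decision according to the RFTL rule with von Neumann entropy:
$$\omega_{t+1} := \operatorname*{arg\,min}_{\varphi \in \mathcal{K}} \Big\{ \eta \sum_{s=1}^{t} \mathrm{Tr}(\nabla_s \varphi) + \sum_{i=1}^{2^n} \lambda_i(\varphi) \log \lambda_i(\varphi) \Big\}. \tag{1}$$
6: end for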
Remark 1:
The mathematical program in Eq. (1) is convex, and thus can be solved in time polynomial in the dimension, which is $2^n$.
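As a side note, if the program in Eq. (1) has the usual entropy-regularized form, minimizing a linear term plus the negative von Neumann entropy over density matrices (an assumption of this note), then the Gibbs variational principle gives its minimizer in closed form:

```latex
\omega_{t+1}
  = \operatorname*{arg\,min}_{\varphi \in C_{2^n}}
    \Big\{ \eta \sum_{s=1}^{t} \mathrm{Tr}(\nabla_s \varphi)
           + \mathrm{Tr}(\varphi \log \varphi) \Big\}
  = \frac{\exp\!\big(-\eta \sum_{s=1}^{t} \nabla_s\big)}
         {\mathrm{Tr}\,\exp\!\big(-\eta \sum_{s=1}^{t} \nabla_s\big)} .
```

The right-hand side is a matrix exponential normalized to unit trace, so it is always positive definite, and it has the same shape as a matrix exponentiated gradient update.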
Theorem 3.
Setting $\eta = \sqrt{\frac{(\log 2)\, n}{2 T L^2}}$, the regret of Algorithm 1 is bounded by $2L\sqrt{(2 \log 2)\, T n}$.
Remark 2:
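In the case where the feedback is an independent random variable $Y_t$, where $Y_t = 0$ with probability $1 - \mathrm{Tr}(E_t\rho)$ and $Y_t = 1$ with probability $\mathrm{Tr}(E_t\rho)$ for a fixed but unknown state $\rho$, we define $\nabla_t$ in Algorithm 1 as $\nabla_t := 2(\mathrm{Tr}(E_t\omega_t) - Y_t) E_t$. Then $\mathbb{E}[\nabla_t]$ is the gradient of the $L_2$ loss function where we receive precise feedback $\mathrm{Tr}(E_t\rho)$ instead of $Y_t$. It follows from the proof of Theorem 3 that the expected $L_2$ regret of Algorithm 1, $\mathbb{E}\big[\sum_{t=1}^{T} (\mathrm{Tr}(E_t\omega_t) - \mathrm{Tr}(E_t\rho))^2\big]$, is bounded by $O(\sqrt{Tn})$.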
The proof of Theorem 3 appears in Appendix B. The proof is along the lines of (Hazan, 2015, Theorem 5.2), except that the loss function does not take a raw state as input, and our domain for optimization is complex. Therefore, the mean value theorem does not hold, which means we need to approximate the Bregman divergence instead of replacing it by a norm as in the original proof. Another subtlety is that convexity needs to be carefully defined with respect to the complex domain.
3.2 Matrix Multiplicative Weights
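The Matrix Multiplicative Weights (MMW) algorithm (Arora and Kale, 2016) provides an alternative means of proving Theorem 2. The algorithm follows the template of Algorithm 1 with step 5 replaced by the following update rule:
$$\omega_{t+1} := \frac{\exp\!\big(-\frac{\eta}{L} \sum_{\tau=1}^{t} \nabla_\tau\big)}{\mathrm{Tr}\big(\exp\!\big(-\frac{\eta}{L} \sum_{\tau=1}^{t} \nabla_\tau\big)\big)}. \tag{2}$$
In the notation of Arora and Kale (2016), this algorithm is derived using the loss matrices $M_t := \frac{1}{L}\nabla_t = \frac{1}{L}\ell_t'(\mathrm{Tr}(E_t\omega_t)) E_t$. Since $\|E_t\| \le 1$ and $|\ell_t'(\mathrm{Tr}(E_t\omega_t))| \le L$, we have $\|M_t\| \le 1$, as required in the analysis of the Matrix Multiplicative Weights algorithm. We have the following regret bound for the algorithm (proved in Appendix C):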
Theorem 4.
Setting $\eta = \sqrt{\frac{(\log 2)\, n}{4T}}$, the regret of the algorithm based on the update rule (2) is bounded by $2L\sqrt{(\log 2)\, T n}$.
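A matrix-multiplicative-weights style update can be sketched in a few lines of dependency-free Python. This is an illustrative sketch only, not the authors' implementation: the helper names are hypothetical, the matrix exponential is computed by a simple scaling-and-squaring Taylor scheme, and the step size is passed in directly, so any rescaling by the Lipschitz constant is left to the caller.

```python
import math

def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, c):
    return [[c * x for x in row] for row in a]

def mat_mul(a, b):
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def mat_exp(a, terms=30, squarings=10):
    """Matrix exponential by scaling-and-squaring with a truncated Taylor series."""
    d = len(a)
    scaled = mat_scale(a, 1.0 / 2 ** squarings)
    result, term = identity(d), identity(d)
    for k in range(1, terms + 1):
        term = mat_scale(mat_mul(term, scaled), 1.0 / k)
        result = mat_add(result, term)
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def mmw_hypothesis(dim, gradients, eta):
    """Return exp(-eta * sum(gradients)) normalized to unit trace."""
    total = [[0.0] * dim for _ in range(dim)]
    for g in gradients:
        total = mat_add(total, g)
    unnormalized = mat_exp(mat_scale(total, -eta))
    return mat_scale(unnormalized, 1.0 / trace(unnormalized))

# One round with absolute loss |Tr(E w) - b| on a single qubit.
eta = 0.5
E = [[1.0, 0.0], [0.0, 0.0]]          # measurement |0><0|
b = 1.0                               # feedback: E should accept
omega1 = mmw_hypothesis(2, [], eta)   # no data yet: maximally mixed state
pred1 = trace(mat_mul(E, omega1))
sign = 1.0 if pred1 > b else -1.0     # sub-derivative of |x - b| at pred1
omega2 = mmw_hypothesis(2, [mat_scale(E, sign)], eta)
pred2 = trace(mat_mul(E, omega2))     # prediction moves toward b
```

With no gradients the hypothesis is the maximally mixed state, and each observed gradient tilts the hypothesis toward states that would have incurred less loss on the measurements seen so far.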
3.3 Proof of Theorem 1
Consider either the RFTL or MMW based online learning algorithm described in the previous subsections, with the 1-Lipschitz convex absolute loss function $\ell_t(x) := |x - b_t|$. We run the algorithm in a subsequence of the iterations, using only the measurements presented in those iterations. The subsequence of iterations is determined as follows. Let $\omega_t$ denote the hypothesis maintained by the algorithm in iteration $t$. We run the algorithm in iteration $t$ if $\ell_t(\mathrm{Tr}(E_t\omega_t)) > 2\varepsilon/3$. Note that whenever $|\mathrm{Tr}(E_t\omega_t) - \mathrm{Tr}(E_t\rho)| > \varepsilon$, we have $\ell_t(\mathrm{Tr}(E_t\omega_t)) > 2\varepsilon/3$ (because $|\mathrm{Tr}(E_t\rho) - b_t| \le \varepsilon/3$), so we update the hypothesis according to the RFTL/MMW rule in that iteration.
As we explain next, the algorithm makes at most $O(n/\varepsilon^2)$ updates regardless of the number of measurements presented (i.e., regardless of the number of iterations), giving the required mistake bound. For the true quantum state $\rho$, we have $\ell_t(\mathrm{Tr}(E_t\rho)) \le \varepsilon/3$ for all $t$. Thus if the algorithm makes $T$ updates (i.e., we run the algorithm in $T$ of the iterations), the regret bound implies that $\frac{2\varepsilon}{3} T \le \frac{\varepsilon}{3} T + O(\sqrt{Tn})$. Simplifying, we get the bound $T = O(n/\varepsilon^2)$, as required.
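The mistake-driven wrapper described above can be illustrated in the commuting (diagonal) special case, where a state is a probability vector, a measurement is a vector with entries in $[0,1]$, and the matrix update reduces to classical multiplicative weights. This is a simplified sketch under those assumptions; the function name `mistake_driven_learner` and the parameter values are hypothetical.

```python
import math

def mistake_driven_learner(rounds, eps, eta):
    """Run a multiplicative-weights learner that updates only on large errors.

    rounds: list of (e, b), where e is a diagonal measurement (entries in [0, 1])
            and b is the feedback for that measurement.
    Returns the list of predictions and the number of updates performed.
    """
    dim = len(rounds[0][0])
    cumulative = [0.0] * dim          # running sum of gradient vectors
    w = [1.0 / dim] * dim             # start from the uniform (maximally mixed) state
    predictions, updates = [], 0
    for e, b in rounds:
        p = sum(wi * ei for wi, ei in zip(w, e))   # predicted acceptance probability
        predictions.append(p)
        if abs(p - b) > 2.0 * eps / 3.0:           # update only on a large error
            sign = 1.0 if p > b else -1.0          # sub-derivative of |p - b|
            cumulative = [c + sign * ei for c, ei in zip(cumulative, e)]
            unnormalized = [math.exp(-eta * c) for c in cumulative]
            z = sum(unnormalized)
            w = [u / z for u in unnormalized]
            updates += 1
    return predictions, updates

# The same measurement is presented 50 times; the true state always accepts it.
rounds = [([1.0, 0.0], 1.0)] * 50
predictions, updates = mistake_driven_learner(rounds, eps=0.3, eta=0.5)
```

In this toy run the learner stops updating as soon as its prediction is within $2\varepsilon/3$ of the feedback, so the number of updates stays small no matter how many rounds are played.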
4 Learning Using Postselection
In this section, we give a direct route to proving a slightly weaker version of Theorem 1: one that does not need the tools of convex optimization, but only tools intrinsic to quantum information.
We need a slight variant of a well-known result, which Aaronson called the “Quantum Union Bound” (see, for example, Aaronson (2006, 2016); Wilde (2013)). Given a two-outcome measurement $E$ on $n$-qubit states, we define an operator $\mathcal{M}$ that postselects on acceptance by $E$. Let $U$ be any unitary operation on $n+1$ qubits that maps states of the form $|\psi\rangle|0\rangle$ to $\sqrt{E}\,|\psi\rangle|0\rangle + \sqrt{I - E}\,|\psi\rangle|1\rangle$. Such a unitary operation always exists (see, e.g., (Watrous, 2018, Theorem 2.42)). Denote the register holding the $(n+1)$th qubit by $B$. Let $\Pi := I \otimes |0\rangle\langle 0|$ be the orthogonal projection onto states that equal $|0\rangle$ in register $B$. Then we define the operator $\mathcal{M}$ as
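$$\mathcal{M}(\varphi) := \frac{1}{\mathrm{Tr}(E\varphi)}\, \mathrm{Tr}_B\!\left( U^{-1}\Pi U \,(\varphi \otimes |0\rangle\langle 0|)\, U^{-1}\Pi U \right) \tag{3}$$
if $\mathrm{Tr}(E\varphi) \neq 0$, and $\mathcal{M}(\varphi) := 0$ otherwise. We emphasize that we use a fresh ancilla qubit initialized to $|0\rangle$ in register $B$ in every application of the operator $\mathcal{M}$. We say that the postselection succeeds with probability $\mathrm{Tr}(E\varphi)$. Note that the operator $\mathcal{M}$ is trace-preserving on states which are accepted by $E$ with non-zero probability.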
Theorem 5 (variant of Quantum Union Bound; Gao (2015)).
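Suppose we have a sequence of two-outcome measurements $E_1, \ldots, E_k$, such that each $E_i$ accepts a certain mixed state $\varphi$ with probability at least $1 - \varepsilon$. Consider the corresponding operators $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_k$ that postselect on acceptance by the respective measurements $E_1, E_2, \ldots, E_k$. Let $\tilde{\varphi}$ denote the state $(\mathcal{M}_k \mathcal{M}_{k-1} \cdots \mathcal{M}_1)(\varphi)$ obtained by applying each of the $k$ postselection operations in succession. Then the probability that all the postselection operations succeed, i.e., the $k$ measurements all accept $\varphi$, is at least $1 - 2\sqrt{k\varepsilon}$. Moreover, $\|\tilde{\varphi} - \varphi\|_{\mathrm{Tr}} \le 4\sqrt{k\varepsilon}$.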
We may infer the above theorem by applying Theorem 1 from (Gao (2015)) to the state $\varphi$ augmented with $k$ ancillary qubits $B_1, B_2, \ldots, B_k$ initialized to $|0\rangle$, and considering $k$ orthogonal projection operators $U_i^{-1}\Pi_i U_i$, where the unitary operator $U_i$ and the projection operator $\Pi_i$ are as defined for the postselection operation $\mathcal{M}_i$ for $E_i$. The $i$th projection operator $U_i^{-1}\Pi_i U_i$ acts on the register holding $\varphi$ and the $i$th ancillary qubit $B_i$.
We now state the main result of this section (the proof is in Appendix D):
Theorem 6.
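Let $\rho$ be an unknown $n$-qubit mixed state, let $E_1, E_2, \ldots$ be a sequence of two-outcome measurements, and let $\varepsilon > 0$. There exists a strategy for outputting hypothesis states $\omega_0, \omega_1, \ldots$, where $\omega_t$ depends only on $E_1, \ldots, E_t$ and real numbers $b_1, \ldots, b_t$ in $[0,1]$, such that as long as $|b_t - \mathrm{Tr}(E_t\rho)| \le \varepsilon/3$ for every $t$, we have
$$|\mathrm{Tr}(E_{t+1}\omega_t) - \mathrm{Tr}(E_{t+1}\rho)| > \varepsilon$$
for at most $O\!\left(\frac{n}{\varepsilon^3}\log\frac{n}{\varepsilon}\right)$ values of $t$. Here the $E_t$’s and $b_t$’s can otherwise be chosen adversarially.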
5 Learning Using Sequential Fat-Shattering Dimension
In this section, we prove regret bounds using the notion of sequential fat-shattering dimension. We begin with a bound for a generalization of “random access coding” (Nayak (1999); Ambainis et al. (2002)), also known as the Index function problem in communication complexity. The generalization was called “serial encoding” by Nayak (1999) and arose in the context of quantum finite automata. The serial encoding problem is also called Augmented Index in the literature on streaming algorithms.
The following theorem places a bound on how few qubits serial encoding may use. In other words, it bounds the number of bits $k$ we may encode in an $n$-qubit quantum state when an arbitrary bit out of the $k$ may be recovered well via a two-outcome measurement. The bound holds even when the measurement for recovering $y_i$ may depend adaptively on the previous bits $y_1 y_2 \cdots y_{i-1}$ of $y$, which we need not know.
Let $H(x) := -x \log_2 x - (1-x)\log_2(1-x)$ be the binary entropy function.
Theorem 7 (Nayak (1999)).
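Let $k$ and $n$ be positive integers. For each $k$-bit string $y := y_1 \cdots y_k$, let $\rho_y$ be an $n$-qubit mixed state such that for each $i \in \{1, 2, \ldots, k\}$, there is a two-outcome measurement $E'$ that depends only on $i$ and the prefix $v := y_1 y_2 \cdots y_{i-1}$, and has the following properties:
(i) if $y_i = 0$ then $\mathrm{Tr}(E'\rho_y) \le p$, and
(ii) if $y_i = 1$ then $\mathrm{Tr}(E'\rho_y) \ge 1 - p$,
where $p \in [0, 1/2]$ is the error in predicting the bit $y_i$ at vertex $v$. (We say $\rho_y$ “serially encodes” $y$.) Then $n \ge (1 - H(p))\,k$.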
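To get a feel for such a bound, the following sketch evaluates the binary entropy and the largest number of bits compatible with an inequality of the form $n \ge (1 - H(p))\,k$. The exact form of the inequality is an assumption of this example, and the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def max_serially_encoded_bits(n_qubits, p):
    """Largest k allowed by n >= (1 - H(p)) * k; unbounded when p = 1/2."""
    capacity = 1.0 - binary_entropy(p)
    if capacity <= 0.0:
        return math.inf
    return math.floor(n_qubits / capacity)

# With perfect recovery (p = 0), n qubits carry at most n bits.
k_perfect = max_serially_encoded_bits(10, 0.0)
# Allowing error probability 1/4 permits more bits, but still O(n) many.
k_noisy = max_serially_encoded_bits(10, 0.25)
```

The bound becomes vacuous only at $p = 1/2$, where the measurement outcome carries no information about the encoded bit.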
In Appendix F, we present a strengthening of this bound when the bits of $y$ may only be recovered in an adaptive order that is a priori unknown. The stronger bound may be of independent interest.
In the context of online learning, the measurements used in recovering bits from a serial encoding are required to predict the bits with probability bounded away from given “pivot points”. Theorem 7 may be specialized to this case as follows (proof in Appendix E).
Corollary 8.
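Let $k$ and $n$ be positive integers. For each $k$-bit string $y := y_1 \cdots y_k$, let $\rho_y$ be an $n$-qubit mixed state such that for each $i \in \{1, 2, \ldots, k\}$, there is a two-outcome measurement $E'$ that depends only on $i$ and the prefix $v := y_1 y_2 \cdots y_{i-1}$, and has the following properties:
(i) if $y_i = 0$ then $\mathrm{Tr}(E'\rho_y) \le a_v - \varepsilon$, and
(ii) if $y_i = 1$ then $\mathrm{Tr}(E'\rho_y) \ge a_v + \varepsilon$,
where $\varepsilon \in (0, 1/2]$ and $a_v \in [0,1]$ is a “pivot point” associated with the prefix $v$. Then
$$n \ge \left(1 - H\!\left(\frac{1 - \varepsilon}{2}\right)\right) k.$$
In particular, $k = O(n/\varepsilon^2)$.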
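For readers who want the missing arithmetic, a bound of the form $n \ge (1 - H(\frac{1-\varepsilon}{2}))\,k$ (the form assumed here) yields $k = O(n/\varepsilon^2)$ through the standard Pinsker-type estimate on the binary entropy in bits, applied with $\delta = \varepsilon/2$:

```latex
1 - H\!\left(\tfrac{1}{2} - \delta\right) \;\ge\; \frac{2\delta^{2}}{\ln 2}
\quad\Longrightarrow\quad
1 - H\!\left(\tfrac{1-\varepsilon}{2}\right) \;\ge\; \frac{\varepsilon^{2}}{2\ln 2}
\quad\Longrightarrow\quad
k \;\le\; \frac{n}{1 - H\!\left(\tfrac{1-\varepsilon}{2}\right)}
  \;\le\; \frac{(2\ln 2)\, n}{\varepsilon^{2}} .
```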
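Let $S$ be a set of functions $f : U \to [0,1]$, and $\varepsilon > 0$. Then, following Rakhlin et al. (2015), let the $\varepsilon$-sequential fat-shattering dimension of $S$, or $\mathrm{sfat}_\varepsilon(S)$, be the largest $k$ for which we can construct a complete binary tree $\mathcal{T}$ of depth $k$, such that
(i) each internal vertex $v \in \mathcal{T}$ has associated with it a point $x_v \in U$ and a real $a_v \in [0,1]$, and
(ii) for each leaf vertex $v \in \mathcal{T}$ there exists an $f \in S$ that causes us to reach $v$ if we traverse $\mathcal{T}$ from the root such that at any internal node $w$ we traverse the left subtree if $f(x_w) \le a_w - \varepsilon$ and the right subtree if $f(x_w) \ge a_w + \varepsilon$. If we view the leaf $v$ as a $k$-bit string, the function $f$ is such that for all ancestors $u$ of $v$, we have $f(x_u) \le a_u - \varepsilon$ if $v_{i+1} = 0$, and $f(x_u) \ge a_u + \varepsilon$ if $v_{i+1} = 1$, when $u$ is at depth $i$ from the root.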
Corollary 8 implies the following theorem:
Theorem 9.
Let $U$ be the set of two-outcome measurements $E$ on an $n$-qubit state, and let $S$ be the set of all functions $f : U \to [0,1]$ that have the form $f(E) := \mathrm{Tr}(E\rho)$ for some $\rho$. Then for all $\varepsilon > 0$, we have $\mathrm{sfat}_\varepsilon(S) = O(n/\varepsilon^2)$.
Theorem 9 strengthens an earlier result due to Aaronson (2007), which proved the same upper bound for the “ordinary” (non-sequential) fat-shattering dimension of quantum states considered as a hypothesis class.
Now we may use existing results from the literature, which relate sequential fat-shattering dimension to online learnability. In particular, in the non-realizable case, Rakhlin et al. (2015) recently showed the following:
Theorem 10 (Rakhlin et al. (2015)).
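Let $S$ be a set of functions $f : U \to [0,1]$ and for every integer $t \ge 1$, let $\ell_t : [0,1] \to \mathbb{R}$ be a convex, $L$-Lipschitz loss function. Suppose we are sequentially presented elements $x_1, x_2, \ldots \in U$, with each $x_t$ followed by the loss function $\ell_t$. Then there exists a learning strategy that lets us output a sequence of hypotheses $f_1, f_2, \ldots \in S$, such that the regret is upper-bounded as:
$$\sum_{t=1}^{T} \ell_t(f_t(x_t)) \le \min_{f \in S} \sum_{t=1}^{T} \ell_t(f(x_t)) + 2LT \inf_{\alpha} \left\{ 4\alpha + \frac{12}{\sqrt{T}} \int_{\alpha}^{1} \sqrt{\mathrm{sfat}_\beta(S) \log\!\left(\frac{2eT}{\beta}\right)}\; d\beta \right\}.$$
This follows from Theorem 8 in (Rakhlin et al. (2015)) as in the proof of Proposition 9 in the same article.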
Corollary 11.
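Suppose we are presented with a sequence of two-outcome measurements $E_1, E_2, \ldots$ of an $n$-qubit state, with each $E_t$ followed by a loss function $\ell_t$ as in Theorem 10. Then there exists a learning strategy that lets us output a sequence of hypothesis states $\omega_1, \omega_2, \ldots$ such that the regret after the first $T$ iterations is upper-bounded as:
$$\sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\omega_t)) \le \min_{\omega \in C_{2^n}} \sum_{t=1}^{T} \ell_t(\mathrm{Tr}(E_t\omega)) + O\!\left(L\sqrt{nT}\,\log^{3/2} T\right).$$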
Note that the result due to Rakhlin et al. (2015) is non-explicit. In other words, by following this approach, we do not derive any specific online learning algorithm for quantum states that has the stated upper bound on regret; we only prove non-constructively that such an algorithm exists.
We expect that the approach in this section, based on sequential fat-shattering dimension, could also be used to prove a mistake bound for the realizable case, but we leave that to future work.
6 Open Problems
We conclude with some questions arising from this work. The regret bound established in Theorem 2 for $L_1$ loss is tight. Can we similarly achieve optimal regret for other loss functions of interest, for example for $L_2$ loss? It would also be interesting to obtain regret bounds in terms of the loss of the best quantum state in hindsight, as opposed to $T$ (the number of iterations), using the techniques in this article. Such a bound has been shown by (Tsuda et al., 2005, Lemma 3.2) for $L_2$ loss using the Matrix Exponentiated Gradient method.
In what cases can one do online learning of quantum states, not only with few samples, but also with a polynomial amount of computation? What is the tight generalization of our results to measurements with $d$ outcomes? Is it the case, in online learning of quantum states, that any algorithm works, so long as it produces hypothesis states that are approximately consistent with all the data seen so far? Note that none of our three proof techniques seems to imply this general conclusion.
References
Aaronson [2005] S. Aaronson. Limitations of quantum advice and one-way communication. Theory of Computing, 1:1–28, 2005. Earlier version in CCC’2004. quant-ph/0402095.
Aaronson [2006] S. Aaronson. QMA/qpoly is contained in PSPACE/poly: de-Merlinizing quantum protocols. In Proc. Conference on Computational Complexity, pages 261–273, 2006. quant-ph/0510230.
Aaronson [2007] S. Aaronson. The learnability of quantum states. Proc. Roy. Soc. London, A463(2088):3089–3114, 2007. quant-ph/0608142.
Aaronson [2016] S. Aaronson. The complexity of quantum states and transformations: From quantum money to black holes, February 2016. Lecture Notes for the 28th McGill Invitational Workshop on Computational Complexity, Holetown, Barbados. With guest lectures by A. Bouland and L. Schaeffer. www.scottaaronson.com/barbados2016.pdf.
Aaronson [2018] S. Aaronson. Shadow tomography of quantum states. To appear in Proceedings of STOC’2018. arXiv:1711.01053, 2018.
Aaronson and Drucker [2014] S. Aaronson and A. Drucker. A full characterization of quantum advice. SIAM J. Comput., 43(3):1131–1183, 2014. Earlier version in STOC’2010. arXiv:1004.0377.
Ambainis et al. [2002] A. Ambainis, A. Nayak, A. Ta-Shma, and U. V. Vazirani. Quantum dense coding and quantum finite automata. J. of the ACM, 49:496–511, 2002. Combination of an earlier version in STOC’1999, pp. 376–383, arXiv:quant-ph/9804043 and (Nayak [1999]).
Arora and Kale [2016] S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. J. ACM, 63(2):12:1–12:35, 2016.
Arora et al. [2012] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
Audenaert and Eisert [2005] K. M. R. Audenaert and J. Eisert. Continuity bounds on the quantum relative entropy. Journal of Mathematical Physics, 46(10):102104, 2005. arXiv:quant-ph/0503218.
Bădescu et al. [2017] C. Bădescu, R. O’Donnell, and J. Wright. Quantum state certification. Technical Report arXiv:1708.06002 [quant-ph], arXiv.org, 2017.
Bhatia [1997] R. Bhatia. Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1997.
Carlen and Lieb [2014] E. A. Carlen and E. H. Lieb. Remainder terms for some quantum entropy inequalities. Journal of Mathematical Physics, 55(4), 2014. arXiv:1402.3840.
Gao [2015] J. Gao. Quantum union bounds for sequential projective measurements. Phys. Rev. A, 92:052331, Nov 2015. doi: 10.1103/PhysRevA.92.052331. URL https://link.aps.org/doi/10.1103/PhysRevA.92.052331.
Haah et al. [2017] J. Haah, A. W. Harrow, Z. Ji, X. Wu, and N. Yu. Sample-optimal tomography of quantum states. IEEE Transactions on Information Theory, 63(9):5628–5641, Sept 2017. ISSN 0018-9448. doi: 10.1109/TIT.2017.2719044.
Hazan [2015] E. Hazan. Introduction to Online Convex Optimization, volume 2 of Foundations and Trends in Optimization. 2015.
Nayak [1999] A. Nayak. Optimal lower bounds for quantum automata and random access codes. In Proc. IEEE FOCS, pages 369–376, 1999. quant-ph/9904093.
O’Donnell and Wright [2016] R. O’Donnell and J. Wright. Efficient quantum tomography. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 899–912, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4132-5. doi: 10.1145/2897518.2897544. URL http://doi.acm.org/10.1145/2897518.2897544.
Rakhlin et al. [2015] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. The Journal of Machine Learning Research, 16(1):155–186, 2015.
Rocchetto [2017] A. Rocchetto. Stabiliser states are efficiently PAC-learnable. arXiv:1705.00345, 2017.
Rocchetto et al. [2017] A. Rocchetto, S. Aaronson, S. Severini, G. Carvacho, D. Poderini, I. Agresti, M. Bentivegna, and F. Sciarrino. Experimental learning of quantum states. arXiv:1712.00127, 2017.
Tsuda et al. [2005] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005. URL www.jmlr.org/papers/v6/tsuda05a.html.
Watrous [2018] J. Watrous. Theory of Quantum Information. Cambridge University Press, May 2018. doi: 10.1017/9781316848142.
Wilde [2013] M. Wilde. Sequential decoding of a general classical-quantum channel. Proc. Roy. Soc. London, A469(2157):20130259, 2013. arXiv:1303.0808.
Appendix A Auxiliary Lemmas
The following lemma is from (Tsuda et al. [2005]), given here for completeness.
Lemma 12.
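For Hermitian matrices $A, B$ and Hermitian PSD matrix $X$, if $A \succeq B$, then $\mathrm{Tr}(AX) \ge \mathrm{Tr}(BX)$.
Proof.
Let $C := A - B$. By definition, $C \succeq 0$. It suffices to show that $\mathrm{Tr}(CX) \ge 0$. Let $VQV^\dagger$ be the eigendecomposition of $X$, and let $C = VPV^\dagger$, where $P := V^\dagger C V \succeq 0$. Then $\mathrm{Tr}(CX) = \mathrm{Tr}(VPQV^\dagger) = \mathrm{Tr}(PQ) = \sum_i P_{ii} Q_{ii}$. Since $P \succeq 0$ and all the eigenvalues of $X$ are non-negative, $P_{ii} \ge 0$, $Q_{ii} \ge 0$. Therefore $\mathrm{Tr}(CX) \ge 0$. ∎
Lemma 13.
If $A, B$ are Hermitian matrices, then $\mathrm{Tr}(AB) \in \mathbb{R}$.
Proof.
The proof is similar to that of Lemma 12. Let $VQV^\dagger$ be the eigendecomposition of $A$. Then $Q$ is a real diagonal matrix. We have $B = VPV^\dagger$, where $P := V^\dagger B V$. Note that $P^\dagger = V^\dagger B^\dagger V = P$, so $P$ has a real diagonal. Then $\mathrm{Tr}(AB) = \mathrm{Tr}(VQPV^\dagger) = \mathrm{Tr}(QP) = \sum_i Q_{ii} P_{ii}$. Since $Q_{ii}, P_{ii} \in \mathbb{R}$ for all $i$, $\mathrm{Tr}(AB) \in \mathbb{R}$. ∎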
Appendix B Proof of Theorem 3
Proof of Theorem 3.
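Since $\ell_t$ is convex, for all $\varphi \in \mathcal{K}$,
$$\ell_t(\mathrm{Tr}(E_t\omega_t)) - \ell_t(\mathrm{Tr}(E_t\varphi)) \le \ell_t'(\mathrm{Tr}(E_t\omega_t)) \left[ \mathrm{Tr}(E_t\omega_t) - \mathrm{Tr}(E_t\varphi) \right] = \nabla_t \bullet (\omega_t - \varphi),$$
where ‘$\bullet$’ denotes the trace inner-product on complex matrices. Summing over $t$,
$$\sum_{t=1}^{T} \left[ \ell_t(\mathrm{Tr}(E_t\omega_t)) - \ell_t(\mathrm{Tr}(E_t\varphi)) \right] \le \sum_{t=1}^{T} \left[ \mathrm{Tr}(\nabla_t\omega_t) - \mathrm{Tr}(\nabla_t\varphi) \right].$$
Define $g_t(X) := \nabla_t \bullet X$, and $g_0(X) := \frac{1}{\eta} R(X)$, where $R(X)$ is the negative von Neumann entropy of $X$ (in nats). Denote $D_R^2 := \max_{\varphi, \varphi' \in \mathcal{K}} \{ R(\varphi) - R(\varphi') \}$. By [Hazan, 2015, Lemma 5.2], for any $\varphi \in \mathcal{K}$, we have
$$\sum_{t=1}^{T} \left[ g_t(\omega_t) - g_t(\varphi) \right] \le \sum_{t=1}^{T} \nabla_t \bullet (\omega_t - \omega_{t+1}) + \frac{1}{\eta} D_R^2. \tag{4}$$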
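Define $\Phi_t(X) := \eta \sum_{s=1}^{t} \nabla_s \bullet X + R(X)$; then the convex program in line 5 of Algorithm 1 finds the minimizer of $\Phi_t(X)$ in $\mathcal{K}$. The following claim shows that the minimizer is always positive definite (proof provided later in this section):
Claim 14.
For all $t \in \{1, 2, \ldots, T\}$, we have $\omega_t \succ 0$.
For $X \succ 0$, we can write $R(X) = \mathrm{Tr}(X \log X)$, and define
$$\nabla\Phi_t(X) := \eta \sum_{s=1}^{t} \nabla_s + I + \log X.$$
The definition of $\nabla\Phi_t(X)$ is analogous to the gradient of $\Phi_t(X)$ if the function is defined over real symmetric matrices. Moreover, the following condition, similar to the optimality condition over a real domain, is satisfied (proof provided later in this section).
Claim 15.
For all $t \in \{1, 2, \ldots, T - 1\}$,
$$\nabla\Phi_t(\omega_{t+1}) \bullet (\omega_t - \omega_{t+1}) \ge 0. \tag{5}$$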
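Denote
$$B_{\Phi_t}(\omega_t \| \omega_{t+1}) := \Phi_t(\omega_t) - \Phi_t(\omega_{t+1}) - \nabla\Phi_t(\omega_{t+1}) \bullet (\omega_t - \omega_{t+1}).$$
Then by the Pinsker inequality (see, for example, Carlen and Lieb [2014] and the references therein),
$$\frac{1}{2} \|\omega_t - \omega_{t+1}\|_{\mathrm{Tr}}^2 \le \mathrm{Tr}(\omega_t \log \omega_t) - \mathrm{Tr}(\omega_t \log \omega_{t+1}) = B_{\Phi_t}(\omega_t \| \omega_{t+1}).$$
We have
$$\begin{aligned} B_{\Phi_t}(\omega_t \| \omega_{t+1}) &= \Phi_t(\omega_t) - \Phi_t(\omega_{t+1}) - \nabla\Phi_t(\omega_{t+1}) \bullet (\omega_t - \omega_{t+1}) \\ &\le \Phi_t(\omega_t) - \Phi_t(\omega_{t+1}) \\ &= \Phi_{t-1}(\omega_t) - \Phi_{t-1}(\omega_{t+1}) + \eta \nabla_t \bullet (\omega_t - \omega_{t+1}) \\ &\le \eta \nabla_t \bullet (\omega_t - \omega_{t+1}), \end{aligned} \tag{6}$$
where the first inequality follows from Claim 15, and the second because $\Phi_{t-1}(\omega_t) \le \Phi_{t-1}(\omega_{t+1})$ ($\omega_t$ minimizes $\Phi_{t-1}(X)$). Therefore
$$\frac{1}{2} \|\omega_t - \omega_{t+1}\|_{\mathrm{Tr}}^2 \le \eta \nabla_t \bullet (\omega_t - \omega_{t+1}). \tag{7}$$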
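Let $\|M\|_{\mathrm{Tr}}^{*}$ denote the dual of the trace norm, i.e., the spectral norm of the matrix $M$. By Generalized Cauchy-Schwarz [Bhatia, 1997, Exercise IV.1.14, page 90],
$$\nabla_t \bullet (\omega_t - \omega_{t+1}) \le \|\nabla_t\|_{\mathrm{Tr}}^{*} \, \|\omega_t - \omega_{t+1}\|_{\mathrm{Tr}} \le \|\nabla_t\|_{\mathrm{Tr}}^{*} \sqrt{2\eta \nabla_t \bullet (\omega_t - \omega_{t+1})} \quad \text{by Eq. (7).}$$
Rearranging,
$$\nabla_t \bullet (\omega_t - \omega_{t+1}) \le 2\eta \left( \|\nabla_t\|_{\mathrm{Tr}}^{*} \right)^2 \le 2\eta G_R^2,$$
where $G_R$ is an upper bound on $\|\nabla_t\|_{\mathrm{Tr}}^{*}$. Combining with Eq. (4), we arrive at the following bound
$$\sum_{t=1}^{T} \nabla_t \bullet (\omega_t - \varphi) \le \sum_{t=1}^{T} \nabla_t \bullet (\omega_t - \omega_{t+1}) + \frac{1}{\eta} D_R^2 \le 2\eta T G_R^2 + \frac{1}{\eta} D_R^2.$$
Taking $\eta = \frac{D_R}{G_R \sqrt{2T}}$, we get $\sum_{t=1}^{T} \nabla_t \bullet (\omega_t - \varphi) \le 2 D_R G_R \sqrt{2T}$. Going back to the regret bound,
$$\sum_{t=1}^{T} \left[ \ell_t(\mathrm{Tr}(E_t\omega_t)) - \ell_t(\mathrm{Tr}(E_t\varphi)) \right] \le \sum_{t=1}^{T} \nabla_t \bullet (\omega_t - \varphi) \le 2 D_R G_R \sqrt{2T}.$$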
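We proceed to show that $D_R = \sqrt{(\log 2)\, n}$. Let $\Delta_{2^n}$ denote the set of probability distributions over $[2^n]$. By definition,
$$D_R^2 = \max_{\varphi, \varphi' \in \mathcal{K}} \{ R(\varphi) - R(\varphi') \} = \max_{\varphi \in \mathcal{K}} \left( -R(\varphi) \right) = \max_{\lambda \in \Delta_{2^n}} \sum_{i=1}^{2^n} \lambda_i \log \frac{1}{\lambda_i} = n \log 2.$$
Since the dual norm of the trace norm is the spectral norm, we have
$$\|\nabla_t\|_{\mathrm{Tr}}^{*} = \|\ell_t'(\mathrm{Tr}(E_t\omega_t)) E_t\| \le L \|E_t\| \le L.$$
Therefore $\sum_{t=1}^{T} \left[ \ell_t(\mathrm{Tr}(E_t\omega_t)) - \ell_t(\mathrm{Tr}(E_t\varphi)) \right] \le 2L\sqrt{(2 \log 2)\, n T}$. ∎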
Proof of Claim 14.
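Let $P \in \mathcal{K}$ be such that $\lambda_{\min}(P) = 0$. Suppose $P = VQV^\dagger$, where $Q$ is a diagonal matrix with real values on the diagonal. Assume that $Q_{1,1} = \lambda_{\max}(P)$ and $Q_{2^n,2^n} = \lambda_{\min}(P) = 0$. Let $P' = VQ'V^\dagger$ such that $Q'_{1,1} = Q_{1,1} - \varepsilon$, $Q'_{2^n,2^n} = \varepsilon$ for $\varepsilon < \lambda_{\max}(P)$, and $Q'_{i,i} = Q_{i,i}$ for $i \in \{2, 3, \ldots, 2^n - 1\}$, so $P' \in \mathcal{K}$. We show that there exists $\varepsilon > 0$ such that $\Phi_t(P') \le \Phi_t(P)$. Expanding both sides of the inequality, we see that it is equivalent to showing that for some $\varepsilon$,
$$\eta \sum_{s=1}^{t} \nabla_s \bullet (P' - P) \le \lambda_{\max}(P) \log \lambda_{\max}(P) - (\lambda_{\max}(P) - \varepsilon) \log (\lambda_{\max}(P) - \varepsilon) - \varepsilon \log \varepsilon.$$
Let $\alpha := \lambda_{\max}(P) = Q_{1,1}$, and $A := \eta \sum_{s=1}^{t} \nabla_s$. The inequality then becomes
$$A \bullet (P' - P) \le \alpha \log \alpha - (\alpha - \varepsilon) \log (\alpha - \varepsilon) - \varepsilon \log \varepsilon.$$
Observe that $\|P' - P\|_{\mathrm{Tr}} = 2\varepsilon$. So by the Generalized Cauchy-Schwarz inequality,
$$A \bullet (P' - P) \le 2\varepsilon \|A\|.$$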
Since are finite and as , there exists small such that . We have
So there exists such that