Online Learning of Quantum States

Suppose we have many copies of an unknown n-qubit state ρ. We measure some copies of ρ using a known two-outcome measurement E_1, then other copies using a measurement E_2, and so on. At each stage t, we generate a current hypothesis σ_t about the state ρ, using the outcomes of the previous measurements. We show that it is possible to do this in a way that guarantees that |Tr(E_iσ_t) - Tr(E_iρ) |, the error in our prediction for the next measurement, is at least ε at most O(n / ε^2 ) times. Even in the "non-realizable" setting---where there could be arbitrary noise in the measurement outcomes---we show how to output hypothesis states that do significantly worse than the best possible states at most O(√(Tn)) times on the first T measurements. These results generalize a 2007 theorem by Aaronson on the PAC-learnability of quantum states, to the online and regret-minimization settings. We give three different ways to prove our results---using convex optimization, quantum postselection, and sequential fat-shattering dimension---which have different advantages in terms of parameters and portability.



1 Introduction

How many single-copy measurements are needed to “learn” an unknown n-qubit quantum state ρ?  If we wish to reconstruct the full density matrix, even approximately, and if we make no assumptions about ρ, then it is straightforward to show that the number of measurements needed grows exponentially with n. In fact, even when we allow joint measurement of multiple copies of the state, an exponential number of copies of ρ are required (see, e.g., O’Donnell and Wright (2016); Haah et al. (2017)).

Suppose, on the other hand, that there is some probability distribution D over possible yes/no measurements, where we identify the measurements with 2^n × 2^n Hermitian matrices E with eigenvalues in [0,1].  Further suppose we are only concerned about learning the state ρ well enough to predict the outcomes of most measurements E drawn from D, where “predict” means approximately calculating the probability, Tr(Eρ), of a “yes” result.  Then for how many (known) sample measurements E_i, drawn independently from D, do we need to know the approximate value of Tr(E_i ρ), before we have enough data to achieve this?

Aaronson (2007) proved that the number of sample measurements needed, m, grows only linearly with the number of qubits n.  What makes this surprising is that it represents an exponential reduction compared to full quantum state tomography.  Furthermore, the prediction strategy is extremely simple.  Informally, we merely need to find any “hypothesis state” σ that satisfies Tr(E_i σ) ≈ Tr(E_i ρ) for all the sample measurements E_1, …, E_m.  Then with high probability over the choice of sample measurements, that hypothesis σ will necessarily “generalize,” in the sense that Tr(Eσ) ≈ Tr(Eρ) for most additional E’s drawn from D.  The learning theorem led to followup work including a full characterization of quantum advice (Aaronson and Drucker (2014)); efficient learning for stabilizer states (Rocchetto (2017)); the “shadow tomography” protocol (Aaronson (2018)); and recently, the first experimental demonstration of quantum state PAC-learning (Rocchetto et al. (2017)).

A major drawback of the learning theorem due to Aaronson is the assumption that the sample measurements are drawn independently from D, and moreover, that the same distribution D governs both the training samples and the measurements on which the learner’s performance is later tested.  It has long been understood, in computational learning theory, that these assumptions are often unrealistic: they fail to account for adversarial environments, or environments that change over time. This is precisely the state of affairs in current experimental implementations of quantum information processing. Not all measurements of quantum states may be available or feasible in a specific implementation; which measurements are feasible is dictated by Nature, and as we develop more control over the experimental set-up, more sophisticated measurements become available. The task of learning a state prepared in the laboratory thus takes the form of a game, with the theorist on one side, and the experimentalist and Nature on the other: the theorist is repeatedly challenged to predict the behaviour of the state with respect to the next measurement that Nature allows the experimentalist to realize, with the opportunity to refine the hypothesis as more measurement data become available.

It is thus desirable to design learning algorithms that work in the more stringent online learning model.  Here the learner is presented a sequence of input points, say x_1, x_2, …, one at a time.  Crucially, there is no assumption whatsoever about the x_t’s: the sequence could be chosen adversarially, and even adaptively, which means that the choice of x_t might depend on the learner’s behavior on x_1, …, x_{t−1}.  The learner is trying to learn some unknown function f(x), about which it initially knows only that f belongs to some hypothesis class H (or perhaps not even that; we also consider the scenario where the learner simply tries to compete with the best predictor in H, which might or might not be a good predictor).  The learning proceeds as follows: for each t, the learner first guesses a value y_t for f(x_t), and is then told the true value f(x_t), or perhaps only an approximation of this value.  Our goal is to design a learning algorithm with the following guarantee: regardless of the sequence of x_t’s, the learner’s guess, y_t, will be far from the true value f(x_t) at most k times (where k, of course, is as small as possible).  The x_t’s on which the learner errs could be spaced arbitrarily; all we require is that they be bounded in number.

This leads to the following question: can the learning theorem established by Aaronson (2007) be generalized to the online learning setting?  In other words: is it true that, given a sequence E_1, E_2, … of yes/no measurements, where each E_t is followed shortly afterward by an approximation of Tr(E_t ρ), there is a way to anticipate the Tr(E_t ρ) values by guesses y_t ∈ [0,1], in such a way that |y_t − Tr(E_t ρ)| > ε at most, say, O(n) times (where ε > 0 is some constant, and n again is the number of qubits)? The purpose of this paper is to provide an affirmative answer.

Throughout the paper, we specify a (two-outcome) measurement of an n-qubit mixed state ρ by a “POVM element”: that is, a 2^n × 2^n Hermitian matrix E with eigenvalues in [0,1], which “accepts” ρ with probability Tr(Eρ) and “rejects” ρ with probability 1 − Tr(Eρ). We prove that:

Theorem 1.

Let ρ be an n-qubit mixed state, and let E_1, E_2, … be a sequence of two-outcome measurements that are revealed to the learner one by one, each followed by a value b_t ∈ [0,1] such that |Tr(E_t ρ) − b_t| ≤ ε/3.  Then there is an explicit strategy for outputting hypothesis states ω_1, ω_2, … such that |Tr(E_t ω_t) − Tr(E_t ρ)| > ε for at most O(n/ε²) values of t.

We also prove a theorem for the so-called regret minimization model (i.e., the “non-realizable case”), where we make no assumption about the input data arising from an actual quantum state, and our goal is simply to do not much worse than the best hypothesis state that could be found with perfect foresight. In this model, the measurements E_1, E_2, … are presented to a learner one-by-one. In iteration t, after seeing E_t, the learner is challenged to output a hypothesis state ω_t, and then suffers a “loss” equal to ℓ_t(Tr(E_t ω_t)), where ℓ_t is a real function that is revealed to the learner. Important examples of loss functions are L_1 loss, when ℓ_t(z) := |z − b_t|, and L_2 loss, when ℓ_t(z) := (z − b_t)², where b_t ∈ [0,1]. The number b_t may be an approximation of Tr(E_t ρ) for some fixed but unknown quantum state ρ, but is allowed to be arbitrary in general. In particular, the pairs (E_t, b_t) may not be consistent with any quantum state. Define the regret R_T, after T iterations, to be the amount by which the actual loss of the learner exceeds the loss of the best single hypothesis:

R_T := Σ_{t=1}^{T} ℓ_t(Tr(E_t ω_t)) − min_φ Σ_{t=1}^{T} ℓ_t(Tr(E_t φ)).

The learner’s objective is to minimize regret. We show that:

Theorem 2.

Let E_1, E_2, … be a sequence of two-outcome measurements on an n-qubit state presented to the learner, and ℓ_1, ℓ_2, … be the corresponding loss functions revealed in successive iterations in the regret minimization model. Suppose ℓ_t is convex and L-Lipschitz; in particular, for every x ∈ ℝ, there is a sub-derivative ℓ′_t(x) such that |ℓ′_t(x)| ≤ L. Then there is an explicit learning strategy that guarantees regret R_T = O(L√(Tn)) for all T.  This is so even assuming the measurement E_t and loss function ℓ_t are chosen adaptively, in response to the learner’s previous behavior.

Specifically, the algorithm applies to L_1 loss and L_2 loss, and achieves regret O(√(Tn)) for both.

The online strategies we present enjoy several advantages over full state tomography, and even over “state certification”, in which we wish to test whether a quantum register is close to a desired state or far from it. Optimal algorithms for state tomography (O’Donnell and Wright (2016); Haah et al. (2017)) or certification (Bădescu et al. (2017)) require joint measurements of an exponential number of copies of the quantum state, and assume the ability to perform noiseless, universal quantum computation. On the other hand, the algorithms implicit in Theorems 1 and 2 involve only single-copy measurements, allow for noisy measurements, and capture ground reality more closely. They produce a hypothesis state that mimics the unknown state with respect to measurements that can be performed in a given experimental set-up, and the accuracy of prediction improves as the set of available measurements grows. For example, in the realizable case, i.e., when the data arise from an actual quantum state, the average loss tends to zero, as the number of measurements becomes large. Finally, the algorithms have run time exponential in the number of qubits, but are entirely classical. (Exponential run time is unavoidable, as the length of the output is exponential in the number of qubits.)

It is natural to wonder whether Theorems 1 and 2 leave any room for improvement.  Theorem 1 is asymptotically optimal in its mistake bound of O(n/ε²); this follows from the property that n-qubit quantum states, considered as a hypothesis class, have ε-fat-shattering dimension Θ(n/ε²) (see for example Aaronson (2007)). On the other hand, there is room to improve Theorem 2. The bounds of which we are aware are Ω(√(Tn)) for the L_1 loss (see, e.g., (Arora et al., 2012, Theorem 4.1)) in the non-realizable case and Ω(n) for the L_2 loss in the realizable case, when the feedback consists of the measurement outcomes.  (The latter bound, as well as an Ω(√(Tn)) bound for L_1 loss in the same setting, come from considering quantum mixed states that consist of n independent classical coins, each of which could land heads with probability either 1/2 or 1/2 + ε. The parameter ε is set to √(n/T).)

We mention an application of Theorem 1, to appear in simultaneous work.  Aaronson (2018) has given an algorithm for the so-called shadow tomography problem.  Here we have an unknown D-dimensional pure state ρ, as well as known two-outcome measurements E_1, …, E_M.  Our goal is to approximate Tr(E_i ρ), for every i, to within additive error ε.  We would like to do this by measuring ρ^{⊗k}, where k is as small as possible.  Surprisingly, Aaronson (2018) showed that this can be achieved with k = Õ((log M)^4 (log D)/ε^5), that is, a number of copies of ρ that is only polylogarithmic in both D and M.  One component of his algorithm is essentially tantamount to online learning with Õ(n/ε³) mistakes, i.e., what we present in Section 4 of this paper.  However, by using Theorem 1 from this paper in a black-box manner, we can improve the sample complexity of shadow tomography to Õ((log M)^4 (log D)/ε^4).  Details will appear in (Aaronson (2018)).

To maximize insight, in this paper we give three very different approaches to proving Theorems 1 and 2 (although we do not prove every statement with all three approaches). Our first approach is to adapt techniques from online convex optimization to the setting of density matrices, which in general may be over a complex Hilbert space.  This requires extending standard techniques to cope with convexity and Taylor approximations, which are widely used for functions over the real domain, but not over the complex domain.  We also give an efficient iterative algorithm to produce predictions.  This approach connects our problem to the modern mainstream of online learning algorithms, and achieves the best parameters.

Our second approach is via a postselection-based learning procedure, which starts with the maximally mixed state as a hypothesis and then repeatedly refines it by simulating postselected measurements.  This approach builds on earlier work due to Aaronson (2005), specifically the proof of BQP/qpoly ⊆ PostBQP/poly.  The advantage is that it is almost entirely self-contained, requiring no “power tools” from convex optimization or learning theory.  On the other hand, the approach does not give optimal parameters, and we do not know how to prove Theorem 2 with it.

Our third approach is via an upper-bound on the so-called sequential fat-shattering dimension of quantum states, considered as a hypothesis class.  In the original quantum PAC-learning theorem by Aaronson, the key step was to upper-bound the so-called ε-fat-shattering dimension of quantum states considered as a hypothesis class.  Fat-shattering dimension is a real-valued generalization of VC dimension.  One can then appeal to known results to get a sample-efficient learning algorithm.  For online learning, however, bounding the fat-shattering dimension no longer suffices; one instead needs to consider a possibly-larger quantity called sequential fat-shattering dimension.  However, by appealing to a lower bound due to Nayak (1999); Ambainis et al. (2002) for a variant of quantum random access codes, we are able to upper-bound the sequential fat-shattering dimension of quantum states.  Using known results (in particular, those due to Rakhlin et al. (2015)), this implies the regret bound in Theorem 2, up to a multiplicative factor of log^{3/2} T.  The statement that the hypothesis class of n-qubit states has ε-sequential fat-shattering dimension O(n/ε²) might be of independent interest: among other things, it implies that any online learning algorithm that works given bounded sequential fat-shattering dimension, will work for online learning of quantum states. We also give an alternative proof for the lower bound due to Nayak for quantum random access codes, and extend it to codes that are decoded by what we call measurement decision trees. We expect these also to be of independent interest.

1.1 Structure of the paper

We start by describing background and the technical learning setting as well as notations used throughout. In Section 3 we give the algorithms and main theorems derived using convexity arguments and online convex optimization.  In Section 4 we describe the postselection algorithm and state the main theorem using this argument.  In Section 5 we give a sequential fat-shattering dimension bound for quantum states and its implication for online learning of quantum states.

2 Preliminaries and definitions

We define the trace norm of a matrix M as ‖M‖_Tr := Tr √(M M†), where M† is the adjoint of M. We denote the i-th eigenvalue of a Hermitian matrix X by λ_i(X), its minimum eigenvalue by λ_min(X), and its maximum eigenvalue by λ_max(X). By ‘log’ we denote the natural logarithm, unless the base is explicitly mentioned.

An n-qubit quantum state ρ is an element of C_n, where C_n is the set of all trace-1 positive semi-definite (PSD) complex matrices of dimension 2^n:

C_n = { M ∈ ℂ^{2^n × 2^n} : M = M†, M ⪰ 0, Tr(M) = 1 }.

Note that C_n is a convex set. A two-outcome measurement of an n-qubit state is defined by a 2^n × 2^n Hermitian matrix E with eigenvalues in [0,1]. The measurement E “accepts” ρ with probability Tr(Eρ), and “rejects” ρ with probability 1 − Tr(Eρ). For the algorithms we present in this article, we assume that a two-outcome measurement is specified via a classical description of its defining matrix E. In the rest of the article, unless mentioned otherwise, a “measurement” refers to a “two-outcome measurement”.
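As a concrete illustration of these definitions, the following Python sketch checks whether a matrix is a valid state or a valid two-outcome measurement, and evaluates the acceptance probability Tr(Eρ) for a single qubit. It assumes NumPy; the helper names (is_density_matrix, is_measurement, accept_probability), the tolerance, and the example matrices are illustrative choices and do not come from the paper.

```python
import numpy as np

def is_density_matrix(rho, tol=1e-9):
    """True if rho is square, Hermitian, PSD, and has trace 1 (up to tol)."""
    rho = np.asarray(rho, dtype=complex)
    if rho.ndim != 2 or rho.shape[0] != rho.shape[1]:
        return False
    if not np.allclose(rho, rho.conj().T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(rho)
    return bool(eigenvalues.min() >= -tol and abs(np.trace(rho).real - 1.0) <= tol)

def is_measurement(E, tol=1e-9):
    """True if E is square, Hermitian, with all eigenvalues in [0, 1] (up to tol)."""
    E = np.asarray(E, dtype=complex)
    if E.ndim != 2 or E.shape[0] != E.shape[1]:
        return False
    if not np.allclose(E, E.conj().T, atol=tol):
        return False
    eigenvalues = np.linalg.eigvalsh(E)
    return bool(eigenvalues.min() >= -tol and eigenvalues.max() <= 1.0 + tol)

def accept_probability(E, rho):
    """Probability Tr(E rho) that measurement E accepts state rho."""
    return float(np.real(np.trace(np.asarray(E) @ np.asarray(rho))))

# Hypothetical one-qubit example.
rho = np.array([[0.75, 0.25], [0.25, 0.25]])   # a mixed state
E = np.array([[1.0, 0.0], [0.0, 0.0]])         # projector onto |0>
p_accept = accept_probability(E, rho)           # Tr(E rho)
p_reject = 1.0 - p_accept                       # 1 - Tr(E rho)
```

The learner in the rest of the paper only ever interacts with a state through numbers of this form, which is why predicting Tr(Eρ) accurately is a weaker goal than reconstructing ρ itself.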

Online learning and regret.

In online learning of quantum states, we have a sequence of iterations t = 1, 2, 3, … of the following form. First, the learner constructs a state ω_t; we say that the learner “predicts” ω_t. It then suffers a “loss” ℓ_t(Tr(E_t ω_t)) that depends on a measurement E_t, both of which are presented by an adversary.  Commonly used loss functions are L_2 loss (also called “mean square error”), given by

ℓ_t(z) := (z − b_t)²,

and L_1 loss (also called “absolute loss”), given by

ℓ_t(z) := |z − b_t|,

where b_t ∈ [0,1]. The parameter b_t may be an approximation of Tr(E_t ρ) for some fixed quantum state ρ not known to the learner, but is allowed to be arbitrary in general.

The learner then “observes” feedback from the measurement E_t; the feedback is also provided by the adversary. The simplest feedback is the realization of a binary random variable Y_t such that

Y_t = 1 with probability Tr(E_t ρ), and Y_t = 0 with probability 1 − Tr(E_t ρ).

Another common feedback is the number b_t, especially in case that the learner suffers L_1 or L_2 loss.

We would like to design a strategy for updating ω_t based on the loss, measurements, and feedback in all the iterations so far, so that the learner’s total loss is minimized in the following sense. We would like that over T iterations (for a number T known in advance), the learner’s total loss is not much more than that of the hypothetical strategy of outputting the same quantum state φ at every time step, where φ minimizes the total loss with perfect hindsight. Formally this is captured by the notion of regret R_T, defined as

R_T := Σ_{t=1}^{T} ℓ_t(Tr(E_t ω_t)) − min_φ Σ_{t=1}^{T} ℓ_t(Tr(E_t φ)),

where the minimum is over all n-qubit states φ. The sequence of measurements E_t can be arbitrary, even adversarial, based on the learner’s previous actions. Note that if the loss function is given by a fixed state ρ (as in the case of mean square error), the minimum total loss would be 0. This is called the “realizable” case. However, in general, the loss function presented by the adversary need not be consistent with any quantum state. This is called the “non-realizable” case.

A special case of the online learning setting is called agnostic learning; here the measurements E_t are drawn from a fixed and unknown distribution D.  The setting is called “agnostic” because we still do not assume that the losses correspond to any actual state (i.e., the setting may be non-realizable).

Online mistake bounds.

In some online learning scenarios the quantity of interest is not the mean square error, or some other convex loss, but rather simply the total number of “mistakes” made.  For example, we may be interested in the number of iterations in which the predicted probability of acceptance Tr(E_t ω_t) is more than ε-far from the actual value Tr(E_t ρ), where ρ is again a fixed state not known to the learner.  More formally, let

ℓ_t(Tr(E_t ω_t)) := |Tr(E_t ω_t) − Tr(E_t ρ)|

be the absolute loss function. Then the goal is to bound the number of iterations in which ℓ_t(Tr(E_t ω_t)) > ε, regardless of the sequence of measurements E_t presented by the adversary.  We assume that in this setting, the adversary provides as feedback an approximation b_t ∈ [0,1] that satisfies |Tr(E_t ρ) − b_t| ≤ ε/3.

3 Online learning of quantum states

In this section, we use techniques from online convex optimization to minimize regret. The same algorithms may be adapted to also minimize the number of mistakes made.

3.1 Regularized Follow-the-Leader

We first follow the template of the Regularized Follow-the-Leader algorithm (RFTL; see, for example, (Hazan, 2015, Chapter 5)). The algorithm below makes use of von Neumann entropy, which relates to the Matrix Exponentiated Gradient algorithm (Tsuda et al. (2005)).

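1:  Input: T, K := C_n (the set of n-qubit states), η < 1/2
2:  Set ω_1 := 2^{−n} I.
3:  for t = 1, …, T do
4:     Predict ω_t. Consider the convex and L-Lipschitz loss function ℓ_t : ℝ → ℝ given by measurement E_t : ℓ_t(Tr(E_t φ)). Let ℓ′_t(x) be a sub-derivative of ℓ_t with respect to x. Define ∇_t := ℓ′_t(Tr(E_t ω_t)) E_t.
5:     Update decision according to the RFTL rule with von Neumann entropy:
          ω_{t+1} := argmin_{φ ∈ K} { η Σ_{s=1}^{t} Tr(∇_s φ) + Σ_{i=1}^{2^n} λ_i(φ) log λ_i(φ) }.   (1)
6:  end for
Algorithm 1 RFTL for Quantum Tomography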

Remark 1:

The mathematical program in Eq. (1) is convex, and thus can be solved in polynomial time in the dimension, which is 2^n.
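For intuition, an entropy-regularized step of this kind admits a closed form: by the Gibbs variational principle, the minimizer over density matrices φ of η·Tr(Gφ) + Tr(φ log φ) is exp(−ηG)/Tr exp(−ηG), where G stands for the accumulated sum of the matrices ∇_s. The Python sketch below assumes NumPy and uses hypothetical helper names (rftl_step, rftl_objective, random_density_matrix) and an arbitrary random G; it computes this state through an eigendecomposition and spot-checks it against random density matrices. It is a numerical sanity check under these assumptions, not the authors' implementation.

```python
import numpy as np

def rftl_step(G, eta):
    """Closed-form minimizer exp(-eta*G) / Tr exp(-eta*G) for Hermitian G."""
    vals, vecs = np.linalg.eigh(eta * G)
    weights = np.exp(-(vals - vals.min()))    # shift exponents for stability
    weights /= weights.sum()                  # normalize to trace 1
    return (vecs * weights) @ vecs.conj().T

def rftl_objective(phi, G, eta):
    """eta * Tr(G phi) + sum_i lambda_i(phi) log lambda_i(phi)."""
    lam = np.linalg.eigvalsh(phi)
    lam = lam[lam > 1e-15]                    # convention: 0 log 0 = 0
    return float(eta * np.real(np.trace(G @ phi)) + np.sum(lam * np.log(lam)))

def random_density_matrix(dim, rng):
    """Random full-rank density matrix (used only for the spot check)."""
    A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    M = A @ A.conj().T
    return M / np.trace(M).real

rng = np.random.default_rng(0)
dim, eta = 4, 0.3                             # two qubits, illustrative step size
H = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
G = (H + H.conj().T) / 2                      # stand-in for the gradient sum
omega = rftl_step(G, eta)
best = rftl_objective(omega, G, eta)
# Objective gap of each random candidate relative to the closed-form state.
gaps = [rftl_objective(random_density_matrix(dim, rng), G, eta) - best
        for _ in range(200)]
```

Every entry of gaps should be non-negative up to rounding. The closed form also makes the exponential cost visible: each step diagonalizes a 2^n × 2^n matrix.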

Theorem 3.

Setting η = √((log 2) n / (2 T L²)), the regret of Algorithm 1 is bounded by 2L√((2 log 2) T n).

Remark 2:

In the case where the feedback is an independent random variable Y_t, where Y_t = 0 with probability 1 − Tr(E_t ρ) and Y_t = 1 with probability Tr(E_t ρ) for a fixed but unknown state ρ, we define ∇_t in Algorithm 1 as ∇_t := 2(Tr(E_t ω_t) − Y_t) E_t. Then E[∇_t] is the gradient of the L_2 loss function where we receive precise feedback Tr(E_t ρ) instead of Y_t. It follows from the proof of Theorem 3 that the expected regret of Algorithm 1, E[R_T], is bounded by O(√(Tn)).
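To see why a single 0/1 outcome suffices in expectation, note that the derivative of (z − b)² with respect to z is 2(z − b), which is linear in b. The following Python sketch averages the stochastic gradient 2(Tr(Eω) − Y)E over the two outcomes of Y and compares it with the gradient obtained from the exact value Tr(Eρ). It assumes NumPy; the function names and the example state, hypothesis, and measurement are hypothetical.

```python
import numpy as np

def trace_inner(E, state):
    """Real number Tr(E state) for Hermitian E and a density matrix."""
    return float(np.real(np.trace(E @ state)))

def stochastic_gradient(E, omega, outcome):
    """Squared-loss gradient using the 0/1 outcome in place of Tr(E rho)."""
    return 2.0 * (trace_inner(E, omega) - outcome) * E

def exact_gradient(E, omega, rho):
    """Squared-loss gradient using the exact acceptance probability."""
    return 2.0 * (trace_inner(E, omega) - trace_inner(E, rho)) * E

rho = np.array([[0.9, 0.0], [0.0, 0.1]])      # unknown state (illustrative)
omega = np.eye(2) / 2                          # current hypothesis
E = np.array([[0.8, 0.1], [0.1, 0.3]])         # eigenvalues lie in [0, 1]

p = trace_inner(E, rho)                        # Pr[Y = 1]
# Average over the two possible outcomes, weighted by their probabilities.
expected = (p * stochastic_gradient(E, omega, 1)
            + (1 - p) * stochastic_gradient(E, omega, 0))
```

Here expected agrees with exact_gradient(E, omega, rho), so the noisy update is an unbiased estimate of the noiseless one.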

The proof of Theorem 3 appears in Appendix B. The proof is along the lines of (Hazan, 2015, Theorem 5.2), except that the loss function does not take a raw state as input, and our domain for optimization is complex. Therefore, the mean value theorem does not hold, which means we need to approximate the Bregman divergence instead of replacing it by a norm as in the original proof. Another subtlety is that convexity needs to be carefully defined with respect to the complex domain.

3.2 Matrix Multiplicative Weights

The Matrix Multiplicative Weights (MMW) algorithm (Arora and Kale, 2016) provides an alternative means of proving Theorem 2. The algorithm follows the template of Algorithm 1 with step 5 replaced by the following update rule:

ω_{t+1} := exp(−(η/L) Σ_{τ=1}^{t} ∇_τ) / Tr(exp(−(η/L) Σ_{τ=1}^{t} ∇_τ)).   (2)

In the notation of Arora and Kale (2016), this algorithm is derived using the loss matrices M_t = (1/L) ∇_t = (1/L) ℓ′_t(Tr(E_t ω_t)) E_t. Since ‖E_t‖ ≤ 1 and |ℓ′_t(Tr(E_t ω_t))| ≤ L, we have ‖M_t‖ ≤ 1, as required in the analysis of the Matrix Multiplicative Weights algorithm. We have the following regret bound for the algorithm (proved in Appendix C):

Theorem 4.

Setting , the regret of the algorithm based on the update rule (2) is bounded by .

3.3 Proof of Theorem 1

Consider either the RFTL or MMW based online learning algorithm described in the previous subsections, with the 1-Lipschitz convex absolute loss function ℓ_t(x) = |x − b_t|. We run the algorithm in a sub-sequence of the iterations, using only the measurements presented in those iterations. The subsequence of iterations is determined as follows. Let ω_t denote the hypothesis maintained by the algorithm in iteration t. We run the algorithm in iteration t if ℓ_t(Tr(E_t ω_t)) > 2ε/3. Note that whenever |Tr(E_t ω_t) − Tr(E_t ρ)| > ε, we have ℓ_t(Tr(E_t ω_t)) > 2ε/3 (by the triangle inequality, since |Tr(E_t ρ) − b_t| ≤ ε/3), so we update the hypothesis according to the RFTL/MMW rule in that iteration.

As we explain next, the algorithm makes at most O(n/ε²) updates regardless of the number of measurements presented (i.e., regardless of the number of iterations), giving the required mistake bound. For the true quantum state ρ, we have ℓ_t(Tr(E_t ρ)) ≤ ε/3 for all t. Thus if the algorithm makes T updates (i.e., we run the algorithm in T of the iterations), the regret bound implies that (2ε/3) T ≤ (ε/3) T + O(√(Tn)). Simplifying, we get the bound T = O(n/ε²), as required.
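The gating argument can be sketched in Python as follows. The learner keeps a matrix-exponential hypothesis and updates it only in rounds where its guess is more than 2ε/3 away from the feedback, which is the threshold that feedback accurate to within ε/3 suggests. This is a minimal sketch assuming NumPy; the function names, the step size eta, and the toy instance are hypothetical, and the feedback is taken to be exact (b_t = Tr(E_t ρ)), a special case of the approximate feedback allowed above.

```python
import numpy as np

def gibbs_state(H):
    """exp(-H) / Tr exp(-H) for Hermitian H, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(H)
    weights = np.exp(-(vals - vals.min()))           # shifted for stability
    weights /= weights.sum()
    return (vecs * weights) @ vecs.conj().T

def mistake_driven_learner(rho, measurements, eps, eta):
    """Run the gated learner; return (number of updates, number of mistakes)."""
    dim = rho.shape[0]
    grad_sum = np.zeros((dim, dim), dtype=complex)
    updates = mistakes = 0
    for E in measurements:
        omega = gibbs_state(eta * grad_sum)          # current hypothesis
        guess = float(np.real(np.trace(E @ omega)))
        truth = float(np.real(np.trace(E @ rho)))
        b = truth                                    # exact feedback (assumption)
        if abs(guess - truth) > eps:                 # a "mistake" of size > eps
            mistakes += 1
        if abs(guess - b) > 2 * eps / 3:             # update only on large loss
            grad_sum += np.sign(guess - b) * E       # sub-gradient of |x - b|
            updates += 1
    return updates, mistakes

rho = np.array([[1.0, 0.0], [0.0, 0.0]])             # unknown state |0><0|
E = np.array([[1.0, 0.0], [0.0, 0.0]])               # same measurement each round
updates, mistakes = mistake_driven_learner(rho, [E] * 50, eps=0.3, eta=0.5)
```

Every mistake triggers an update, so the number of mistakes never exceeds the number of updates, and it is the latter that the regret bound controls. On this toy instance the learner stops updating after a few rounds, however long the measurement sequence is.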

4 Learning Using Postselection

In this section, we give a direct route to proving a slightly weaker version of Theorem 1: one that does not need the tools of convex optimization, but only tools intrinsic to quantum information.

We need a slight variant of a well-known result, which Aaronson called the “Quantum Union Bound” (see, for example, Aaronson (2006, 2016); Wilde (2013)). Given a two-outcome measurement  on -qubits states, we define an operator  that post-selects on acceptance by . Let  be any unitary operation on  qubits that maps states of the form  to . Such a unitary operation always exists (see, e.g., (Watrous, 2018, Theorem 2.42)). Denote the register holding the th qubit by . Let  be the orthogonal projection onto states that equal  in register . Then we define the operator  as


if , and  otherwise. We emphasize that we use a fresh ancilla qubit initialized to  in register  in every application of the operator . We say that the post-selection succeeds with probability . Note that the operator is trace-preserving on states which are accepted by  with non-zero probability.

Theorem 5 (variant of Quantum Union Bound; Gao (2015)).

Suppose we have a sequence of two-outcome measurements , such that each  accepts a certain mixed state  with probability at least .  Consider the corresponding operators  that post-select on acceptance by the respective measurements . Let  denote the state  obtained by applying each of the  post-selection operations in succession. Then the probability that all the post-selection operations succeed, i.e., the  measurements all accept , is at least . Moreover, .

We may infer the above theorem by applying Theorem 1 from (Gao (2015)) to the state  augmented with  ancillary qubits  initialized to , and considering  orthogonal projection operators , where the unitary operator  and the projection operator  are as defined for the postselection operation  for . The th projection operator  acts on the register holding  and the th ancillary qubit .

We now state the main result of this section (the proof is in Appendix D):

Theorem 6.

Let  be an unknown -qubit mixed state, let  be a sequence of two-outcome measurements, and let .  There exists a strategy for outputting hypothesis states , where  depends only on and real numbers in , such that as long as for every , we have

for at most  values of .  Here the ’s and ’s can otherwise be chosen adversarially.

5 Learning Using Sequential Fat-Shattering Dimension

In this section, we prove regret bounds using the notion of sequential fat-shattering dimension.  We begin with a bound for a generalization of “random access coding” (Nayak (1999); Ambainis et al. (2002)), also known as the Index function problem in communication complexity.  The generalization was called “serial encoding” by Nayak (1999) and arose in the context of quantum finite automata.  The serial encoding problem is also called Augmented Index in the literature on streaming algorithms.

The following theorem places a bound on how few qubits serial encoding may use. In other words, it bounds the number of bits we may encode in an -qubit quantum state when an arbitrary bit out of the  may be recovered well via a two-outcome measurement. The bound holds even when the measurement for recovering  may depend adaptively on the previous bits  of , which we need not know.

Let  be the binary entropy function.

Theorem 7 (Nayak (1999)).

Let  and  be positive integers. For each -bit string , let  be an -qubit mixed state such that for each , there is a two-outcome measurement  that depends only on  and the prefix , and has the following properties

  1. if  then , and

  2. if  then ,

where  is the error in predicting the bit  at vertex . (We say  “serially encodes” .)  Then .

In Appendix F, we present a strengthening of this bound when the bits of the encoded string may only be recovered in an adaptive order that is a priori unknown. The stronger bound may be of independent interest.

In the context of online learning, the measurements used in recovering bits from a serial encoding are required to predict the bits with probability bounded away from given “pivot points”. Theorem 7 may be specialized to this case as follows (proof in Appendix E).

Corollary 8.

Let  and  be positive integers. For each -bit string , let  be an -qubit mixed state such that for each , there is a two-outcome measurement  that depends only on  and the prefix , and has the following properties

  1. if  then , and

  2. if  then ,

where  and  is a “pivot point” associated with the prefix . Then

In particular, .

Let S be a set of functions f : U → [0,1], and ε > 0.  Then, following Rakhlin et al. (2015), let the ε-sequential fat-shattering dimension of S, or sfat_ε(S), be the largest k for which we can construct a complete binary tree T of depth k, such that

  • each internal vertex v ∈ T has associated with it a point x_v ∈ U and a real a_v ∈ [0,1], and

  • for each leaf vertex v ∈ T there exists an f ∈ S that causes us to reach v if we traverse T from the root such that at any internal node w we traverse the left subtree if f(x_w) ≤ a_w − ε and the right subtree if f(x_w) ≥ a_w + ε. If we view the leaf v as a k-bit string, the function f is such that for all ancestors u of v, we have f(x_u) ≤ a_u − ε if v_{i+1} = 0, and f(x_u) ≥ a_u + ε if v_{i+1} = 1, when u is at depth i from the root.

Corollary 8 implies the following theorem:

Theorem 9.

Let U be the set of two-outcome measurements on an n-qubit state, and let S be the set of all functions f : U → [0,1] that have the form f(E) := Tr(Eρ) for some n-qubit state ρ.  Then for all ε > 0, we have sfat_ε(S) = O(n/ε²).

Theorem 9 strengthens an earlier result due to Aaronson (2007), which proved the same upper bound for the “ordinary” (non-sequential) fat-shattering dimension of quantum states considered as a hypothesis class.

Now we may use existing results from the literature, which relate sequential fat-shattering dimension to online learnability.  In particular, in the non-realizable case, Rakhlin et al. (2015) recently showed the following:

Theorem 10 (Rakhlin et al. (2015)).

Let be a set of functions and for every integer , let  be a convex, -Lipschitz loss function.  Suppose we are sequentially presented elements , with each followed by the loss function .  Then there exists a learning strategy that lets us output a sequence of hypotheses , such that the regret is upper-bounded as:

This follows from Theorem 8 in (Rakhlin et al. (2015)) as in the proof of Proposition 9 in the same article.

Combining Theorem 9 with Theorem 10 gives us the following:

Corollary 11.

Suppose we are presented with a sequence of two-outcome measurements of an -qubit state, with each followed by a loss function  as in Theorem 10.  Then there exists a learning strategy that lets us output a sequence of hypothesis states such that the regret after the first iterations is upper-bounded as:

Note that the result due to Rakhlin et al. (2015) is non-explicit.  In other words, by following this approach, we do not derive any specific online learning algorithm for quantum states that has the stated upper bound on regret; we only prove non-constructively that such an algorithm exists.

We expect that the approach in this section, based on sequential fat-shattering dimension, could also be used to prove a mistake bound for the realizable case, but we leave that to future work.

6 Open Problems

We conclude with some questions arising from this work. The regret bound established in Theorem 2 for L_1 loss is tight. Can we similarly achieve optimal regret for other loss functions of interest, for example for L_2 loss? It would also be interesting to obtain regret bounds in terms of the loss of the best quantum state in hindsight, as opposed to T (the number of iterations), using the techniques in this article. Such a bound has been shown by (Tsuda et al., 2005, Lemma 3.2) for L_2 loss using the Matrix Exponentiated Gradient method.

In what cases can one do online learning of quantum states, not only with few samples, but also with a polynomial amount of computation? What is the tight generalization of our results to measurements with d outcomes? Is it the case, in online learning of quantum states, that any algorithm works, so long as it produces hypothesis states that are approximately consistent with all the data seen so far?  Note that none of our three proof techniques seems to imply this general conclusion.


Appendix A Auxiliary Lemmas

The following lemma is from (Tsuda et al. [2005]), given here for completeness.

Lemma 12.

For Hermitian matrices and Hermitian PSD matrix , if , then .


Let . By definition, . It suffices to show that . Let be the eigen-decomposition of , and let , where . Then Since and all the eigenvalues of are nonnegative, , . Therefore . ∎

Lemma 13.

If are Hermitian matrices, then .


The proof is similar to Lemma 12. Let be the eigendecomposition of . Then is a real diagonal matrix. We have , where . Note that , so has a real diagonal. Then . Since for all , . ∎

Appendix B Proof of Theorem 3

Proof of Theorem 3.

Since is convex, for all ,

where ‘’ denotes the trace inner-product on  complex matrices. Summing over ,

Define , and , where is the negative von Neumann Entropy of  (in nats). Denote . By [Hazan, 2015, Lemma 5.2], for any , we have


Define , then the convex program in line 5 of Algorithm 1 finds the minimizer of in . The following claim shows that that the minimizer is always positive definite (proof provided later in this section):

Claim 14.

For all , we have .

For , we can write , and define

The definition of is analogous to the gradient of if the function is defined over real symmetric matrices. Moreover, the following condition, similar to the optimality condition over a real domain, is satisfied (proof provided later in this section).

Claim 15.

For all ,



Then by the Pinsker inequality (see, for example, Carlen and Lieb [2014] and the references therein),

We have


where the first inequality follows from Claim 15, and the second because ( minimizes ). Therefore


Let ‖M‖*_Tr denote the dual of the trace norm, i.e., the spectral norm of the matrix M. By the generalized Cauchy-Schwarz inequality [Bhatia, 1997, Exercise IV.1.14, page 90],

by Eq. (7).


where is an upper bound on . Combining with Eq. (4), we arrive at the following bound

Taking , we get . Going back to the regret bound,

We proceed to show that . Let  denote the set of probability distributions over . By definition,

Since the dual norm of the trace norm is the spectral norm, we have

Therefore . ∎

Proof of Claim 14.

Let be such that . Suppose , where is a diagonal matrix with real values on the diagonal. Assume that  and . Let such that , for , and for , so . We show that there exists such that . Expanding both sides of the inequality, we see that it is equivalent to showing that for some ,

Let , and . The inequality then becomes

Observe that . So by the Generalized Cauchy-Schwartz inequality,

Since are finite and as , there exists small such that . We have

So there exists such that