# Provable quantum state tomography via non-convex methods

With nowadays steadily growing quantum processors, it is required to develop new quantum tomography tools that are tailored for high-dimensional systems. In this work, we describe such a computational tool, based on recent ideas from non-convex optimization. The algorithm excels in the compressed-sensing-like setting, where only a few data points are measured from a low-rank or highly-pure quantum state of a high-dimensional system. We show that the algorithm can practically be used in quantum tomography problems that are beyond the reach of convex solvers, and, moreover, is faster than other state-of-the-art non-convex approaches. Crucially, we prove that, despite being a non-convex program, under mild conditions, the algorithm is guaranteed to converge to the global minimum of the problem; thus, it constitutes a provable quantum state tomography protocol.

## Authors

• 55 publications
• 4 publications
• 1 publication
• 25 publications
• 47 publications
• 50 publications
04/14/2021

### Fast quantum state reconstruction via accelerated non-convex programming

We propose a new quantum state reconstruction method that combines ideas...
06/16/2021

### Momentum-inspired Low-Rank Coordinate Descent for Diagonally Constrained SDPs

We present a novel, practical, and provable approach for solving diagona...
06/29/2021

### On exploring practical potentials of quantum auto-encoder with advantages

Quantum auto-encoder (QAE) is a powerful tool to relieve the curse of di...
06/08/2020

### The Snake Optimizer for Learning Quantum Processor Control Parameters

High performance quantum computing requires a calibration system that le...
06/04/2020

### Semi-device-dependent blind quantum tomography

Extracting tomographic information about quantum states is a crucial tas...
02/23/2017

### Horseshoe Regularization for Feature Subset Selection

Feature subset selection arises in many high-dimensional applications of...
04/22/2016

### Non-convex Global Minimization and False Discovery Rate Control for the TREX

The TREX is a recently introduced method for performing sparse high-dime...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Like any other processor, the behavior of a quantum information processor must be characterized, verified, and certified. Quantum state tomography (QST) is one of the main tools for that purpose (Altepeter et al., 2005). Yet, it is generally an inefficient procedure, since the number of parameters that specify quantum states, grows exponentially with the number of sub-systems. This inefficiency has two practical manifestations: without any prior information, a vast number of data points needs to be collected (Altepeter et al., 2005); once the data is gathered, a numerical procedure should be executed on an exponentially-high dimensional space, in order to infer the quantum state that is most consistent with the observations. Thus, to perform QST on nowadays steadily growing quantum processors (Zhang et al., 2017; IBM, ), we must introduce novel, more efficient, techniques for its completion.

Since often the aim in quantum information processing is to coherently manipulate pure quantum states (i.e., states that can be equivalently described with rank-1, positive semi-definite (PSD) density matrices), the use of such prior information is the modus operandi towards making QST manageable, with respect to the amount of data required (Flammia et al., 2005; Gross et al., 2010; Heinosaari et al., 2013; Baldwin et al., 2016). Compressed sensing (CS) (Donoho, 2006; Baraniuk, 2007) –and its extension to guaranteed low-rank approximation (Recht et al., 2010; Candès and Plan, 2011; Candès and Recht, 2009)– has been applied to QST (Gross et al., 2010; Kalev et al., 2015) within this context. In particular, it has been proven (Gross et al., 2010; Flammia and Liu, 2011; Liu, 2011; Kalev et al., 2015)

that convex programming guarantees robust estimation of pure

-qubit states from much less information than common wisdom dictates, with overwhelming probability.

These advances, however, leave open the question of how efficiently one can estimate exponentially large-sized quantum states, from a limited set of observations. Since convex programming is susceptible of provable performance, typical QST protocols rely on convex programs Gross et al. (2010); Kalev et al. (2015); Baldwin et al. (2016). Nevertheless, their Achilles’ heel remains the high computational and storage complexity. In particular, due to the PSD nature of density matrices, a key step is the repetitive application of Hermitian eigenproblem solvers. Such solvers include the well-established family of Lanczos methods (Kokiopoulou et al., 2004; Baglama and Reichel, 2005, 2006; Cullum et al., 1983), the Jacobi-Davinson SVD type of methods (Hochstenbach, 2001), as well as preconditioned hybrid schemes (Wu and Stathopoulos, 2015), among others; see also the recent article in (Stathopoulos et al., 2017)

for a more complete overview. Since –at least once per iteration– a full eigenvalue decomposition is required in such convex programs, these eigensolvers contribute a

computational complexity, where is the number of qubits of the quantum processor. It is obvious that the recurrent application of such eigensolvers makes convex programs impractical, even for quantum systems with a relatively small number of qubits Haffner et al. (2006); Gross et al. (2010).

Ergo, to improve the efficiency of QST, we need to complement it with numerical algorithms that can efficiently handle large search spaces using limited amount of data, while having rigorous performance guarantees. This is the purpose of this work. Inspired by the recent advances on finding the global minimum in non-convex problems (Sun and Luo, 2015; Zhao et al., 2015; Chen and Wainwright, 2015; Jain et al., 2015; Tu et al., 2016; Bhojanapalli et al., 2016; Park et al., 2017; Ge et al., 2016; Park et al., 2016a, b; Li et al., 2016a, b; Tran-Dinh and Zhang, 2016; Wang et al., 2017; Ge et al., 2017), we propose the application of alternating gradient descent in QST, that operates directly on the assumed low-rank structure of the density matrix. The algorithm –named Projected Factored Gradient Decent (ProjFGD) and described below in detail– is based on the recently analyzed non-convex method in (Bhojanapalli et al., 2016) for PSD matrix factorization problems. The added twist is the inclusion of further constraints in the optimization program, that makes it applicable for tasks such as QST.

In general, finding the global minimum in non-convex problems is a hard problem. However, our approach assumes certain regularity conditions –that are, however, satisfied by common CS-inspired protocols in practice Gross et al. (2010); Kalev et al. (2015); Baldwin et al. (2016)– and a good initialization –which we make explicit in the text; both lead to a fast and provable estimation of the state of the system, even with limited amount of data. Our numerical experiments show that our scheme outperforms in practice state-of-the-art approaches for QST.

Apart from the QST application, our aim is to broaden the results on efficient, non-convex recovery within the set of constrained low-rank matrix problems. Our developments maintain a connection with analogous results in convex optimization, where standard assumptions are made. However, this work goes beyond convexity, in an attempt to justify recent findings that non-convex methods show significant acceleration, as compared to state-of-the-art convex analogs.

## Ii Quantum state tomography setup

We begin by describing the problem of QST. We are focusing here on QST of a highly-pure -qubit state from Pauli measurements. In particular, let

be the measurement vector with elements

, for some measurement error . Here, denotes the unknown -qubit density matrix, associated with the pure quantum state; is a randomly chosen Pauli observable; and the normalization is chosen to follow the results of Liu (2011). For brevity, we denote as the linear “sensing” map, such that , for .

For the above setting, we assume that the data is given in the form of expectation values of -qubit Pauli observables. An -qubit Pauli observable is given by where . There are such observables in total. In general, one needs to have the expectation values of all Pauli observables to uniquely reconstruct ; i.e., performing for many repetitions/shots the same experiment and then taking the expectation of the results (e.g., counts of qubit registers). Since is a highly-pure density matrix, we apply the CS result on Pauli measurements (Gross et al., 2010; Liu, 2011), that guarantees robust estimation, with high probability, from just randomly chosen Pauli observables (in expectation). Key property to achieve this is the restricted isometry property (Liu, 2011):

###### Definition 1 (Restricted Isometry Property (RIP) for Pauli measurements).

Let be a linear map, such that , for . Then, with high probability over the choice of Pauli observables , where is an absolute constant, satisfies the -RIP with constant , ; i.e.,

 (1−δr)∥ρ∥2F≤∥M(ρ)∥22≤(1+δr)∥ρ∥2F,

where denote the Frobenius norm, is satisfied such that .

An accurate estimation of is obtained by solving, essentially, a convex optimization problem constrained to the set of quantum states (Kalev et al., 2015), consistent with the measured data. Two such convex program examples are:

 minimizeρ∈C2n×2n Tr(ρ) (1) subject to ρ⪰0, ∥y−M(ρ)∥2≤ϵ,

and,

 minimizeρ∈C2n×2n 12⋅∥y−M(ρ)∥22 (2) subject to ρ⪰0, Tr(ρ)≤1,

where captures the positive semi-definite assumption, is the vector Euclidean -norm, and is a parameter related to the error level in the model. Key in both programs is the combination of the PSD and the trace constraints: combined, they constitute the tightest convex relaxation to the low-rank, PSD structure of the unknown ; see also (Recht et al., 2010).

As was discussed in the introduction, the problem with convex programs, such as (1) and (2), is their inefficiency when applied in high-dimensional systems: most practical solvers for (1)-(2) are iterative and handling PSD constraints adds an immense complexity overhead per iteration, especially when is large; see also Section V.

In this work, we propose to use non-convex programming for QST of low-rank density matrices, which leads to higher efficiency than typical convex programs. We achieve this by restricting the optimization over the intrinsic non-convex structure of rank- PSD matrices. This allow us to “describe” an PSD matrix with only space, as opposed to the ambient space. Even more substantially, our program has theoretical guarantees of global convergence, similar to the guarantees of convex programming, while maintaining faster preformace than the latter. These properties make our scheme ideal to complement the CS methodology for QST in practice.

## Iii Projected Factored Gradient Decent algorithm

Optimization criterion recast: At its basis, the Projected Factored Gradient Decent (ProjFGD) algorithm transforms convex programs, such as in (1)-(2), by imposing the factorization of a PSD matrix such that . This factorization, popularized by Burer and Monteiro Burer and Monteiro (2003, 2005) for solving semi-definite convex programming instances, naturally encodes the PSD constraint, removing the expensive eigen-decomposition projection step. For concreteness, we focus here on the convex program (2), where . In order to encode the trace constraint, ProjFGD enforces additional constraints on . In particular, the requirement that is translated to the convex constraint , where is the Frobenius norm. The above recast the program (2) as a non-convex program:

 minimizeA∈Cd×r 12⋅∥y−M(AA†)∥22 (3) subject to ∥A∥2F≤1.

Observe that, while the constraint set is convex, the objective is no longer convex due to the bilinear transformation of the parameter space

. Such criteria have been studied recently in machine learning and signal processing applications

(Sun and Luo, 2015; Zhao et al., 2015; Chen and Wainwright, 2015; Jain et al., 2015; Tu et al., 2016; Bhojanapalli et al., 2016; Park et al., 2017; Ge et al., 2016; Park et al., 2016a, b; Li et al., 2016a, b; Tran-Dinh and Zhang, 2016; Wang et al., 2017; Ge et al., 2017). Here, the added twist is the inclusion of further matrix norm constraints, that makes it proper for tasks such as QST; as we show in Appendices A and B, such addition complicates the algorithmic analysis.

The prior knowledge that is imposed in the program by setting . In real experiments, the state of the system, , could be full rank, but often is highly-pure with only few dominant eigenvalues. In this case, is well-approximated by a low-rank matrix of rank , which can be much smaller than , similar to the CS methodology. Therefore in the ProjFGD protocol we set . In this form, contains much less variables to maintain and optimize than a PSD matrix, and thus it is easier to update and to store its iterates.

An important issue in optimizing (3) over the factored space is the existence of non-unique possible factorizations for a given . To see this, if

, then for any unitary matrix

such that , we have , where . Since we are interested in obtaining a low-rank solution in the original space, we need a notion of distance to the low-rank solution over the factors. We use the following unitary-invariant distance metric:

###### Definition 2.

Let matrices . Define:

 \textscDist(A,A⋆):=minR:R∈U∥A−A⋆R∥F,

where is the set of unitary matrices.

The ProjFGD algorithm: At heart, ProjFGD is a projected gradient descent algorithm over the variable . The pseudocode is provided in Algorithm 1.

First, some properties of the objective in (3). Denote and . Due to the symmetry of , i.e., , the gradient of with respect to variable is given by

 ∇g(A)=(∇f(ρ)+∇f(ρ)†)⋅A=2∇f(ρ)⋅A,

where , and is the adjoint operator for . For the Pauli measurements case we consider in this paper, the adjoint operator for an input vector is .

Let denote the projection a matrix onto the set , as in (3). For this particular , , where ; in the case where , already. As an initialization, we compute , which denotes the projection onto the set of PSD matrices with trace bound ; we discuss later in the text how to complete this step in practice.

The main iteration of ProjFGD is in line 7 of Algorithm 1, where it applies a simple update rule over the factors:

 At+1=ΠC(At−η∇f(AtA†t)⋅At),

Observe that the input argument in is:

 At−η∇f(AtA†t)⋅At=At−η∇g(At);

i.e., it performs gradient descent over variable, with step size . Any constants are “absorbed” in the step size selection, for clarity.

Two vital components of our algorithm are: the initialization step and, step size selection.

### iii.1 Initialization ρ0

Due to the bilinear structure in (3), at first glance it is not clear whether the factorization introduces spurious local minima, i.e., local minima that do not exist in (1)-(2), but are “created” after the substitution . This necessitates careful initialization, in order to obtain the global minimum.

Before describing our initialization procedure, we find it helpful to first discuss an initialization procedure for an altered version of (3) where trace constraints are excluded. In this case, (3) transforms to:

 minimizeA∈Cd×r 12⋅∥y−M(AA†)∥22. (4)

Under this setting, the following theory stems from (Park et al., 2017):

###### Theorem 3.

Suppose the unknown is a rank- density matrix with a non-unique factorization , for . Under the noiseless model, the observations satisfy . Assuming that the linear map satisfies the restricted isometry property in Definition 1, with constant , any critical point satisfying first- and second-order optimality conditions is a global minimum.

###### Corollary 4.

Suppose the unknown is a full rank density matrix and let denote its best rank- approximation, in the Eckart-Young-Minsky-Steward sense (Eckart and Young, 1936; Stewart, 1993). Let have a non-unique factorization , for . Under the noiseless model, the observations satisfy . Assuming that the linear map satisfies the restricted isometry property in Definition 1, with constant , any critical point satisfying first- and second-order optimality conditions satisfy:

 \textscDist(A,A⋆)≤12503σr(ρ⋆)⋅∥M(ρ⋆−ρ⋆,r)∥2.

In plain words, under the noiseless model and with high probability (that depends on the random structure of the sensing map ), Theorem 3 states that the non-convex change of variables does not introduce any spurious local minima in the low-rank case, and random initialization is sufficient for Algorithm 1 to find the global minimum, assuming a proper step size selection. Further, when is not low-rank, there are low-rank solutions , close to ; how much close is a function of the spectrum of and its best rank- approximation residual (Corollary 4). In these cases, Step 3 in Algorithm boils down to random initialization.

The above cases hold when corresponds to a POVM; then the explicit trace constraint is redundant. In the more general case where the trace constraint is present, a different approach is followed. In that case, the initial point is set as , where denotes the projection onto the set of PSD matrices that satisfy . Here, represents an approximation of , where is such that for all rank- matrices :

 ∥∇f(ρ)−∇f(ζ)∥F≤L⋅∥ρ−ζ∥F. (5)

(This also means that is restricted gradient Lipschitz continuous with parameter . We defer the reader to the Appendix A for more information). In practice, we set .

This is the only place in the algorithm where eigenvalue-type calculation is required. The projection is given in (Gonçalves et al., 2016). Given , it is described with the following criterion:

 minimizeρ0∈Cd×d 12⋅∥ρ0−M†(y)∥2F (6) subject to ρ0⪰0, Tr(ρ0)≤1,

To solve this problem, we first compute its eigen-decomposition , where

is a unitary matrix containing the eigenvectors of the input matrix. Due to the fact that the Frobenius norm is invariant under unitary transformations,

(Gonçalves et al., 2016) proves that , where is a diagonal matrix, computed via:

 minimizeˆΛ 12⋅∥ˆΛ−Λ∥2F (7) subject to ∑iˆΛii≤1, ˆΛ⪰0.

The last part can be easily solved using the projection onto the unit simplex (Michelot, 1986; Duchi et al., 2008; Kyrillidis et al., 2013).

Alternatively, in practice, we could just use a standard projection onto the set of PSD matrices ; our experiments show that it is sufficient and can be implemented by any off-the-shelf eigenvalue solver. In that case, the algorithm generates an initial matrix by truncating the computed eigen-decomposition, followed by a projection onto the convex set, , defined by set of constraints in the program, . In our case . Note again that the projection operation is a simple entry-wise scaling, for , , where .

Apart from the procedure mentioned above, we could also use more specialized spectral methods for initialization Chen and Wainwright (2015); Zheng and Lafferty (2015) or, alternatively, run convex algorithms, such as (2) for only a few iterations. However, this choice often leads to an excessive number of full or truncated eigenvalue decompositions Tu et al. (2016), which constitutes it a non-practical approach.

The discussion regarding the step size and what type of guarantees we obtain is discussed next.

### iii.2 Step size selection and theoretical guarantees

Focusing on (3), we provide theoretical guarantees for ProjFGD. Our theory dictates a specific constant step size selection that guarantees convergence to the global minimum, assuming a satisfactory initial point is provided.

Let us first describe the local convergence rate guarantees of ProjFGD.

###### Theorem 5 (Local convergence rate for QST).

Let be a rank- quantum state density matrix of an -qubit system with a non-unique factorization , for . Let be the measurement vector of random -qubit Pauli observables, and be the corresponding sensing map, such that . Let the step in ProjFGD satisfy:

 η≤1128(ˆLσ1(ρ0)+σ1(∇f(ρ0))), (8)

where

denotes the leading singular value of

. Here, and is the initial point such that:

 \textscDist(A0,A⋆)≤γ′σr(A⋆),

for , where is the RIP constant. Let be the estimate of ProjFGD at the -th iteration.; then, the new estimate satisfies

 \textscDist(At+1,A⋆)2≤α⋅\textscDist(At,A⋆)2, (9)

where . Further, satisfies , .

The above theorem provides a local convergence guarantee: given an initialization point close enough to the optimal solution –in particular, where is satisfied– our algorithm converges locally with linear rate. In particular, in order to obtain , ProjFGD requires number of iterations. We conjecture that this further translates into linear convergence in the infidelity metric, .

The per-iteration complexity of ProjFGD is dominated by the application of the linear map and by matrix-matrix multiplications. We note that, while both eigenvalue decomposition and matrix multiplication are known to have complexity in Big-Oh notation, the latter is at least two-orders of magnitude faster than the former on dense matrices (Park et al., 2016b).

The proof of Theorem 5 is provided in the Appendix A. We believe that our result, as stated in its most generality, complements recent results from the machine learning and optimization communities, where different assumptions were made Chen and Wainwright (2015), or where constraints on cannot be accommodated Bhojanapalli et al. (2016).

So far, we assumed is provided such that . The next theorem shows that our initialization could achieve this guarantee (under assumptions) and turn the above local convergence guarantees to convergence to the global minimum.

###### Lemma 6.

Let be such that . Consider the problem in (3) where satisfies the RIP property for some constant . Further, assume the optimum point satisfies . Then, computed as above satisfies:

 \textscDist(A0,A⋆)≤γ′⋅σr(A⋆),

where and .

This initialization introduces further restrictions on the condition number of , , and the condition number of the objective function, which is proportional to . In particular, the initialization assumptions in Theorem 5 are satisfied by Lemma 6 if and only if satisfies RIP with constant fulfilling the following expression:

 1+δ4r1−δ4r⋅√1−1−δ4r1+δ4r≤√2(√2−1)200⋅1√r⋅τ2(ρ⋆).

While such conditions are hard to check a priori, our experiments showed that our initialization, as well as the random initialization, work well in practice, and this behavior has been observed repeatedly in all the experiments we conducted. Thus, the method returns the exact solution of the convex programming problem, while being orders of magnitude faster.

## Iv Related work

We focus on efficient methods for QST; for a broader set of citations that go beyond QST, we defer the reader to (Park et al., 2016b) and references therein.

The use of non-convex algorithms in QST is not new (Řeháček et al., 2007; Teo et al., 2013a), and dates before the introduction of the CS protocol in QST settings (Gross et al., 2010). Assuming a multinomial distribution, (Řeháček et al., 2007) focus on the normalized negative log-likelihood objective (see Eq. (2) in (Řeháček et al., 2007)) and propose an diluted non-convex iterative algorithm for its solution. The suggested algorithm exhibits good convergence and monotonic increase of the likelihood objective in practice; despite its success, there is no argument that guarantees its performance, neither a provable setup for its execution.

(Banaszek et al., 1999; Paris et al., 2001) use the reparameterization in a Lagrange augmented maximum log-likelihood (ML) objective (see Eq. (3) in (Banaszek et al., 1999) and Eq. (9) in (Paris et al., 2001)), under the multinomial distribution assumption. The authors state that such a problem can be solved by standard numerical procedures for searching the maximum of the ML objective (Banaszek et al., 1999), and use the downhill simplex method for its solution, over the parameters of the matrix (Nelder and Mead, 1965). Albeit (Banaszek et al., 1999; Paris et al., 2001) rely on the uniqueness of the ML solution before the reformulation (due to the convexity of the original problem), there are no theoretical results on the non-convex nature of the transformed objective (e.g., the presence of spurious local minima).

(Smolin et al., 2012) consider the case of maximum likelihood quantum state tomography, under additive Gaussian noise, in the informationally complete case. Assuming the measurement operators are traceless, simple linear inversion techniques are shown to work accurately to infer the constrained ML state in a single projection step, from the unconstrained ML state. As an extension, (Hou et al., 2016) present a GPU implementation of the algorithm that recovers a simulated 14-qubit density matrix within four hours; however, implementations of a linear system inversion could increase dramatically the computational and storage complexity, as the dimension of the problem grows.

Based on the extremal equations for the multinomial ML objective, (Teo et al., 2013a) propose a fixed-point iteration steepest-ascent method on

(with user-defined hyperparameters, such as the step size of the ascent). How many iterations required and how to set up initial conditions are heuristically defined. Typically these methods, as discussed in

Shang et al. (2017), lead to ill-conditioned optimization problems, resulting in slow convergence.

(Shang et al., 2017) propose a hybrid algorithm that starts with a conjugate-gradient (CG) algorithm in the space, in order to get initial rapid descent, and switch over to accelerated first-order methods in the original space, provided one can determine the switchover point cheaply. Under the multinomial ML objective, in the initial CG phase, the Hessian of the objective is computed per iteration (i.e., a matrix), along with its eigenvalue decomposition. Such an operation is costly, even for moderate values of , and heuristics are proposed for its completion. In the later phase, the authors exploit “momentum” techniques from convex optimization, that lead to provable acceleration when the objective is convex; as we state in the Conclusions section, such acceleration techniques have not considered in the factored space , and constitute an interesting research direction. From a theoretical perspective, (Shang et al., 2017) provide no convergence or convergence rate guarantees.

(Teo et al., 2013b) use in practice the general parameterization of density matrices, , that ensures jointly positive definiteness and unity in the trace. There, in order to attain the maximum value of the log-likehoood objective, a steepest ascent method is proposed over variables, where the step size is an arbitrarily selected but sufficient small parameter. There is no discussion regarding convergence and convergence rate guarantees, as well as any specific set up of the algorithm (step size, initialization, etc.).

(Gonçalves et al., 2016) study the QST problem in the original parameter space, and propose a projected gradient descent algorithm. The proposed algorithm applies both in convex and non-convex objectives, and convergence only to stationary points could be expected. (Bolduc et al., 2017) extend the work in (Gonçalves et al., 2016) with two first-order variants, using momentum motions, similar to the techniques proposed by Polyak and Nesterov for faster convergence in convex optimization (Nesterov, 1983). The above algorithms operate in the informationally complete case. Similar ideas in the informationally incomplete case can be found in (Kyrillidis and Cevher, 2014; Becker et al., 2013).

Very recently, (Riofrío et al., 2017) presented an experimental implementation of CS tomography of a qubit system, where only Pauli basis measurements are available. To achieve recovery in practice –within a reasonable time frame over hundreds of problem instances– the authors proposed a computationally efficient estimator, based on the factorization . The resulting method resembles the gradient descent on the factors , as the one presented in this paper. However, the authors focus only on the experimental efficiency of the method and provide no specific results on the optimization efficiency of the algorithm, what are its theoretical guarantees, and how its components (such as initialization and step size) affect its performance (e.g., the step size is set to a sufficiently small constant).

One of the first provable algorithmic solutions for the QST problem was through convex approximations Recht et al. (2010): this includes nuclear norm minimization approaches Gross et al. (2010), as well as proximal variants, as the one that follows:

 minimizeρ⪰0 ∥M(ρ)−y∥2F+λTr(ρ). (10)

See also Gross et al. (2010) for the theoretical analysis. Within this context, we mention the work of Yurtsever et al. (2015): there, the AccUniPDGrad algorithm is proposed –a universal primal-dual convex framework with sharp operators, in lieu of proximal low-rank operators– where QST is considered as an application. AccUniPDGrad combines the flexibility of proximal primal-dual methods with the computational advantages of conditional gradient (Frank-Wolfe-like) methods. We will use this algorithm for comparisons in the experimental section.

Hazan (2008) presents SparseApproxSDP algorithm that solves the QST problem in (2), when the objective is a generic gradient Lipschitz smooth function, by updating a putative low-rank solution with rank-1 refinements, coming from the gradient. This way, SparseApproxSDP avoids computationally expensive operations per iteration, such as full eigen-decompositions. In theory, at the -th iteration, SparseApproxSDP is guaranteed to compute a -approximate solution, with rank at most , i.e., achieves a sublinear convergence rate. However, depending on , SparseApproxSDP might not return a low rank solution.

Finally, Becker et al. (2013) propose Randomized Singular Value Projection (RSVP), a projected gradient descent algorithm for QST, which merges gradient calculations with truncated eigen-decompositions, via randomized approximations for computational efficiency.

Overall, our program is tailored for tomography of highly-pure quantum states, by incorporating this constraint into the structure of . This has two advantages. First, it results in a faster algorithm that enables us to deal many-qubit state reconstruction in a reasonable time; and, second, it allows us prove the accuracy of the ProjFGD estimator under model errors and experimental noise, similar to the CS results.

## V Numerical experiments

We conducted experiments in an Matlab environment, installed in a Linux-based system with 256 GB of RAM, and equipped with two Intel Xeon E5-2699 v3 2.3GHz, 45M Cache, 9.60GT/s. In all the experiments, the error reported in the Frobenius metric, , where is the estimation of the true state . Note that for a pure state , . For some experiments we also report the infidelity metric . We will also use to denote the set of density matrices .

### v.1 Comparison of ProjFGD with second-order methods

As a first set of experiments, we compare the efficiency of ProjFGD with second-order cone convex programs. State of the art solvers within this class of solvers are the SeDuMi (Sturm, 1999) and SDPT3 (Tütüncü et al., 2003) methods; for their use, we rely on the off-the-shelf Matlab wrapper CVX (CVX Research, 2012). In our experiments, we observed that SDPT3 was faster and we select it for our comparison.

The setting is as described in Section II: we consider rank-1 normalized density matrices , from we which we obtain Pauli measurements such that , for some i.i.d. Gaussian measurement error

, with variance

, i.e., . We consider both convex formulations (1)-(2) and compare it to the ProjFGD estimator with ; in figures we use the notation CVX 1 and CVX 2 for simplicity.

We consider two cases: , and . Table 1 shows median values of 10 independent experimental realizations for ; this selection of was made so that all algorithms return a solution close to the optimum . Empirically, we have observed that ProjFGD succeeds even for cases . We consider both noiseless and noisy settings.

In order to accelerate the execution of convex programs, we set the solvers in CVX to low precision. From Table 1, we observe that our method is two orders of magnitude faster than second-order methods for : ProjFGD achieves better performance (in both error metrics, and in both noisy/noiseless cases), faster. For the higher qubit case, we could not complete the experiments for (1)-(2) due to system crash (RAM overflow). Contrariwise, our method was able to complete the task with success within about minutes of CPU time.

Figures 1-2 show graphically how second-order convex vs. our first-order non-convex schemes scale, as a function of time. In Figure 1, we fix the dimension to and study how increasing the number of observations affects the performance of the algorithms. We observe that, while in the ProjFGD, more observations lead to faster convergence (Chandrasekaran and Jordan, 2013), the same does not hold for the second-order cone programs. In Figure 2, we fix the number data points to , and we scale the dimension . It is obvious that the convex solvers do not scale easily beyond , whereas our method handles cases up to , within reasonable time.

### v.2 Comparison of ProjFGD with first-order methods

Here, we compare our method with more efficient first-order methods, both convex (AccUniPDGrad Yurtsever et al. (2015)) and non-convex (SparseApproxSDP Hazan (2008) and RSVP Becker et al. (2013)).

We consider two settings: is a pure state (i.e., ) and, a nearly low-rank state. In the latter case, we construct , where is a rank-deficient PSD satisfying , and is a full-rank PSD noise term with a fast decaying eigen-spectrum, significantly smaller than the leading eigenvalues of . In other words, we can well-approximate with . For all cases, we model the measurement vector as ; here, the noise is such that . The number of data points satisfy , for various values of .

For all algorithms, we assumed is known and use it to reconstruct a rank- approximation of . All methods that require an SVD routine use lansvd from the PROPACK software package. Experiments and algorithms are implemented in a Matlab environment; we used non-specialized and non-mexified code parts for all algorithms. For initialization, we use the same starting point for all algorithms, which is either specific (Section III.1) or random. As a stopping criterion, we use ; we set the tolerance parameter to .

Convergence plots. Figure 3 (two-leftmost plots) illustrates the iteration and timing complexities of each algorithm under comparison, for a pure state recovery setting () of a highly-pure . Here, which corresponds to a dimensional problem; moreover, we assume and thus the number of data points are . For initialization, we use the proposed initialization in Section III.1 for all algorithms: we compute , extract factor as the best- PSD approximation of , and project onto .

It is apparent that ProjFGD converges faster to a vicinity of , compared to the rest of the algorithms; observe also the sublinear rate of SparseApproxSDP in the inner plots, as reported in Hazan (2008).

Table 2 contains recovery error and execution time results for the case (); in this case, we solve a dimensional problem. For this case, RSVP and SparseApproxSDP algorithms were excluded from the comparison, due to excessive execution time. Appendix C provides extensive results, where similar performance is observed for other values of and .

Figure 3 (rightmost plot) considers the more general case where is nearly low-rank: i.e., it can be well-approximated by a density matrix where (low-rank density matrix). In this case, , for . As the rank in the model, , increases, algorithms that utilize an SVD routine spend more CPU time on singular value/vector calculations. Certainly, the same applies for matrix-matrix multiplications; however, in the latter case, the complexity scale is milder than that of the SVD calculations. Further metadata are also provided in Table 3.

For completeness, in Appendix C we provide results that illustrate the effect of random initialization: Similar to above, ProjFGD shows competitive behavior by finding a better solution faster, irrespective of initialization point.

Timing evaluation (total and per iteration). Figure 4 highlights the efficiency of our algorithm in terms of time complexity, for various problem configurations. Our algorithm has fairly low per iteration complexity (where the most expensive operation for this problem is matrix-matrix and matrix-vector multiplications). Since our algorithm shows also fast convergence in terms of the number of iterations, this overall results into faster convergence towards a good approximation of , even as the dimension increases. Figure 4 shows how the total execution time scales with parameters and .

Overall performance. ProjFGD shows a substantial improvement in performance, as compared to the state-of-the-art algorithms; we would like to emphasize also that projected gradient descent schemes, such as in Becker et al. (2013), are also efficient in small- to medium-sized problems, due to their fast convergence rate. Further, convex approaches might show better sampling complexity performance (i.e., as decreases). Nevertheless, one can perform accurate MLE reconstruction for larger systems in the same amount of time using our methods for such small- to medium-sized problems. We defer the reader to Appendix C, due to space restrictions.

## Vi Summary and conclusions

In this work, we propose a non-convex algorithm, dubbed as ProjFGD, for estimating a highly-pure quantum state, in a high-dimensional Hilbert space, from relatively small number of data points. We showed empirically that ProjFGD is orders of magnitude faster than state-of-the-art convex and non-convex programs, such as (Yurtsever et al., 2015),(Hazan, 2008), and (Becker et al., 2013). More importantly, we prove that under proper initialization and step-size, the ProjFGD is guaranteed to converge to the global minimum of the problem, thus ensuring a provable tomography procedure; see Theorem 5 and Lemma 6.

In our setting, we model the state as a low-rank PSD matrix. This, in turn, means that the estimator is biased towards low-rank states. However, such bias is inherent to all CS-like QST protocols by the imposition of the positivity constraint (Kalev et al., 2015).

Our techniques and proofs can be applied –in some cases under proper modifications– to scenaria beyond the ones considered in this work. We restricted our discussions to a measurement model of random Pauli observables, that satisfies RIP. We conjecture that our results apply for other “sensing” settings, that are informationally complete for low-rank states; see e.g., (Baldwin et al., 2016). The results presented here are independent of the noise model and could be applied for non-Gaussian noise models, such as those stemming from finite counting statistics. Lastly, while here we focus on state tomography, it would be interesting to explore similar techniques for the problem of process tomography.

We conclude with a short list of interesting future research directions. Our immediate goal is the application of ProjFGD in real-world scenaria; this could be completed by utilizing the infrastructure at the IBM T.J. Watson Research Center (IBM, ). This could complement the results found in (Riofrío et al., 2017) for a different quantum system.

Beyond this practical implementation, we identify the following interesting open questions. First, the ML estimator is one of the most frequently-used methods for QST experiments. Beyond its use as point estimator, it is also used as a basis for inference around the point estimate, via confidence intervals

(Christandl and Renner, 2012) and credible regions (Shang et al., 2013). However, there is still no rigorous analysis when the factorization is used.

The work of (Shang et al., 2017) considers accelerated gradient descent methods for QST in the original parameter space : Based on the seminal work of Polyak and Nesterov (Nesterov, 1983) on convex optimization first-order methods, one can achieve orders of magnitude acceleration (both in theory and practice), by exploiting the momentum from previous iterates. It remains an open question how our approach could exploit acceleration techniques that lead to faster convergence in practice, along with rigorous approximation and convergence guarantees. Further, distributed/parallel implementations, like the one in (Hou et al., 2016), remain widely open using our approach, in order to accelerate further the execution of the algorithm. Research along these directions is very interesting and is left for future work.

Finally, while we saw numerically that a random initialization under noisy and constrained settings works well, a careful theoretical treatment for this case is an open problem.

###### Acknowledgements.
Anastasios Kyrillidis is supported by the IBM Goldstine Fellowship and Amir Kalev is supported by the Department of Defense.

## Appendix A Theory

Notation. For matrices , represents their inner product. We use and for the Frobenius and spectral norms of a matrix, respectively. We denote as the -th singular value of . denotes the best rank- approximation of .

### a.1 Problem generalization, notation and definitions

To expand the generality of our scheme, we re-state the problem setting for a broader set of objectives. The following arguments hold for both real and complex matrices. We consider criteria of the following form:

 minimizeρ∈Cd×d f(ρ)subject toρ⪰0, ρ∈C′. (11)

To make connection with the QST objective, set , and . Apart from the least-squares objective in QST, our theory extends to applications that can be described by strongly convex functions with gradient Lipschitz continuity. Further, our ideas can be applied in a similar fashion to the case of restricted smoothness and restricted strong convexity Agarwal et al. (2010). We state these standard definitions below for the square case.

###### Definition 7.

Let be convex and differentiable. is restricted -strongly convex if for all rank- matrices ,

 (12)
###### Definition 8.

Let be a convex differentiable function. is restricted gradient Lipschitz continuous with parameter (or -smooth) if for all rank- matrices ,

 ∥∇f(ρ)−∇f(ζ)∥F≤L⋅∥ρ−ζ∥F. (13)

To shed some light on the notions of (restricted) strong convexity and smoothness and how they relate to the QST objective, consider the restricted isometry property, which holds with high probability under Pauli measurements for low rank Candès and Plan (2011); Liu (2011); here, we present a simplified version of the definition in the main text:

###### Definition 9 (Restricted Isometry Property (RIP)).

A linear map satisfies the -RIP with constant , if

 (1−δr)∥ρ∥2F≤∥M(ρ)∥22≤(1+δr)∥ρ∥2F,

is satisfied for all matrices such that .

According to the quadratic loss function in QST:

 f(ρ)=12∥y−M(ρ)∥22,

its Hessian is given by . Then, (restricted) strong convexity suggests that:

 ∥M(ρ)∥22≥C⋅∥ρ∥2F,ρ∈Cd×d,

for a restricted set of directions , where is a small constant. Then, the correspondence of restricted strong convexity and smoothness with the RIP is obvious: both lower and upper bound the quantity , where is drawn from a restricted (low-rank) set. It turns out that linear maps that satisfy the RIP for low rank matrices, also satisfy the restricted strong convexity; see Theorem 2 in Chen and Sanghavi (2010).

By assuming RIP, the condition number of depends on the RIP constants of the linear map ; in particular, one can show that , since the eigenvalues of lie between and , when restricted to low-rank matrices. Observe that for sufficiently small and dimension sufficiently large, , with high probability.

We assume the optimum satisfies . For our analysis, we further assume we know and set .

As suggested in the main text, we solve (11) in the factored space, as follows:

 minimizeA∈Cd×r f(AA†)subject toA∈C. (14)

In the QST setting, .

In our theory we mostly focus on sets that satisfy the following assumptions.

###### Assumption 1.

For , there is and such that . Then, is endowed with a constraint set that for each , there is an subset in where each satisfies and its projection operator, say for , is an entrywise scaling operation on the input .

We also require the following faithfulness assumption Chen and Wainwright (2015):

###### Assumption 2.

Let denote the set of equivalent factorizations that lead to a rank- matrix ; i.e., Then, we assume , i.e., the resulting convex set in (14) (from in (11)) respects the structure of .

Summarizing, by faithfulness of (Assumption 2), we assume that . This means that the feasible set in (14) contains all matrices that lead to in (11). Moreover, we assume both are convex sets and there exists a “mapping” from to , such that the two constraints are “equivalent”: i.e., , we are guaranteed that . We restrict our discussion on norm-based sets for such that Assumption 1 is satisfied. As a representative example, consider the QST case where, for any , .

For our analysis, we will use the following step sizes:

 ˆη =1128(Lσ1(ρt)+σ1(QAtQ†At∇f(ρt))), η⋆ =1128(Lσ1(ρ⋆)+σ1(∇f(ρ⋆))).

Here, is the Lipschitz constant in (5) and represents the projection onto the column space of . In our algorithm, as described in the main text, we use the following step size:

 η≤1128(Lσ1(ρ0)+σ1(∇f(ρ0)))for given initial ρ0.

While different, by Lemma A.5 in (Bhojanapalli et al., 2016), we know that and . Thus, in our proof, we will work with step size , which is equivalent –up to constants– to the original step size in the proposed algorithm.

For ease of exposition, we re-define the sequence of updates: is the current estimate in the factored space, is the putative solution after the gradient step (observe that might not belong in ), and is the projection step onto . Observe that for the constraint cases we consider in this paper, , where ; in the case , the algorithm simplifies to the algorithm in (Bhojanapalli et al., 2016). For simplicity, we drop the subscript and the parenthesis of the parameter; these values are apparent from the context.

An important issue in optimizing over the factored space is the existence of non-unique possible factorizations. We use the following rotation invariant distance metric:

###### Definition 10.

Let matrices . Define:

 \textscDist(A,B):=minR:R∈U∥A−BR∥F,