DeepAI

# Alternating minimization, scaling algorithms, and the null-cone problem from invariant theory

Alternating minimization heuristics seek to solve a (difficult) global optimization task through iteratively solving a sequence of (much easier) local optimization tasks on different parts (or blocks) of the input parameters. While popular and widely applicable, very few examples of this heuristic are rigorously shown to converge to optimality, and even fewer to do so efficiently. In this paper we present a general framework which is amenable to rigorous analysis, and expose its applicability. Its main feature is that the local optimization domains are each a group of invertible matrices, together naturally acting on tensors, and the optimization problem is minimizing the norm of an input tensor under this joint action. The solution of this optimization problem captures a basic problem in Invariant Theory, called the null-cone problem. This algebraic framework turns out to encompass natural computational problems in combinatorial optimization, algebra, analysis, quantum information theory, and geometric complexity theory. It includes and extends to high dimensions the recent advances on (2-dimensional) operator scaling. Our main result is a fully polynomial time approximation scheme for this general problem, which may be viewed as a multi-dimensional scaling algorithm. This directly leads to progress on some of the problems in the areas above, and a unified view of others. We explain how faster convergence of an algorithm for the same problem will allow resolving central open problems. Our main techniques come from Invariant Theory, and include its rich non-commutative duality theory, and new bounds on the bitsizes of coefficients of invariant polynomials. They enrich the algorithmic toolbox of this very computational field of mathematics, and are directly related to some challenges in geometric complexity theory (GCT).

• 11 publications
• 17 publications
• 22 publications
• 12 publications
• 18 publications
08/29/2018

### Recent progress on scaling algorithms and applications

Scaling problems have a rich and diverse history, and thereby have found...
04/12/2018

### Efficient algorithms for tensor scaling, quantum marginals and moment polytopes

We present a polynomial time algorithm to approximately scale tensors of...
09/02/2019

### Singular tuples of matrices is not a null cone (and, the symmetries of algebraic varieties)

The following multi-determinantal algebraic variety plays a central role...
02/15/2021

### Polynomial time algorithms in invariant theory for torus actions

An action of a group on a vector space partitions the latter into a set ...
04/03/2018

### Operator Scaling via Geodesically Convex Optimization, Invariant Theory and Polynomial Identity Testing

We propose a new second-order method for geodesically convex optimizatio...
11/12/2017

### On the ERM Principle with Networked Data

Networked data, in which every training example involves two objects and...
09/26/2021

### The Connes Embedding Problem: A guided tour

The Connes Embedding Problem (CEP) is a problem in the theory of tracial...

## 1 Introduction and summary of results

Alternating minimization refers to a large class of heuristics commonly used in optimization, information theory, statistics and machine learning. It addresses optimization problems of the following general form. Given a function

 f:X1×⋯×Xd→R (1.1)

the goal is to find a global optimum

 infx1∈X1,…,xd∈Xdf(x1,…,xd). (1.2)

While both the function and its domain may be extremely complicated, in particular non-convex, the decomposition of the domain to blocks is such that the local optimization problems are all feasible. More precisely, for every , and for every choice of with , computing

 infxi∈Xif(α1,…,αi−1,xi,αi+1,…,αd)

is easy.

A natural heuristic in such cases is to repeatedly make local improvements to different coordinates. Namely, we start from some arbitrary vector

, and generate a sequence , such that is obtained by solving the above local optimization problem for some , freeing up this variable while fixing all other coordinates according to . The argmin of the optimum replaces the th coordinate in to create .

There is a vast number of situations which are captured by this natural framework. For one example (we’ll see more), consider the famous Lemke-Howson [LH64] algorithm for finding a Nash-equilibrium in a 2-player game. Here

, and starting from an arbitrary pair of strategies, proceed by alternatingly finding a “best response” strategy for one player given the strategy of the other. This local optimization is simply a linear program that can be efficiently computed. As is well known, this algorithm always converges, but in some games it requires exponential time!

So, the basic questions which arise in some settings are under which conditions does this general heuristic converge at all, and even better, when does it converge efficiently. Seemingly the first paper to give provable sufficient conditions (the “5-point property”) for convergence in a general setting was Csiszár and Tusnády [CT84]. An application is computing the distance (in their case, KL-divergence, but other metrics can be considered) between two convex sets in , as well as the two closest points achieving that distance. Again, the local problem (fixing one point and finding the closest to it in the other convex set) is a simple convex program that has a simple efficient algorithm. For two affine subspaces and the -distance, von Neumann’s alternating projection method [Neu50] is an important special case with numerous applications. As mentioned, in the past three decades numerous papers studied various variations and gave conditions for convergence, especially in cases where (often called also “block-coordinate descent”) – we cite only a few recent ones [XY13, RHL13, WYZ15] which address a variety of problems and have many references. Much much fewer cases are known in which convergence is fast, namely requires only a polynomial number of local iterations. Some examples include the recent works on matrix completion [JNS13, Har14]. Our aim is to develop techniques that expand efficient convergence results in the following algebraic setting, which as we shall see is surprisingly general and has many motivations.

We will consider the case of minimizing (1.1) with very specific domain and function, that we in turn explain and then motivate (so please bear with this necessary notation). First, each of the blocks  is a special linear group, (which we will also abbreviate by ), namely the invertible complex matrices of some size and determinant one. Note that this domain is quite complex, as these groups are certainly not convex, and not even compact. To explain the function

to be minimized over this domain, consider the natural linear transformation (basis change) of the vector space

by matrices . Thus, the product group acts on tensors of order 111To better capture past work and some mathematical motivations, we add a -th component (on which there is no action at all) and make the tensor of order . Equivalently, this means that we study tuples of tensors of order ., where (the basis change) is applied to all vectors (slices, “fibers”) of along the th dimension. Now the objective function  depends on an input tensor ; for any vector of matrices , is defined simply as the -norm squared222Namely, the sum of squares of the absolute values of the entries. of , the tensor resulting by the action of on . Summarizing this description, for input we would like to compute (or approximate) the quantity we call capacity of , denoted defined by

 cap(X):=infA∈G∥A⋅X∥22. (1.3)

In particular, we would like to decide the null-cone problem: is the capacity of zero or not?

While this formulation readily lends itself to an alternating minimization procedure (the local problems have an easily computed closed-form solution), the algorithms we will develop will be related to a dual optimization (scaling) problem that also has a similar form. Our main result will be a fully polynomial time approximation scheme (FPTAS) for that dual problem! This will also allow us to solve the null-cone problem. However we defer this discussion and turn to motivations.

Now where do such problems and this notion of capacity naturally arise? Let us list a few sources, revealing how fundamental this framework really is. Some of these connections go back over a century and some were discovered in the past couple of years. Also, some are less direct than others, and some overlap. Some of them, and others, will be elaborated on throughout the paper.

• Combinatorial Optimization. Computing matroid intersection (for linear matroids over ) easily reduces to checking zeroness of capacity for for very “simple” inputs . This was discovered in [Gur04], who showed it remains true for an implicit version of this problem for which there was no known polynomial time algorithm. His alternating minimization algorithm (called today operator scaling) is efficient for such simple inputs. It also generalizes similar algorithms for such problems as perfect matchings in bipartite graphs and maximum flow in networks, where again zeroness of capacity captures the decision versions of the problems. The associated alternating minimization algorithm gives a very different and yet efficient way than is taught in algorithms texts for these classical problems (these go by the name of matrix scaling, originated in [Sin64] and described explicitly in [LSW98]).

• Non-commutative algebra. The most basic computational problem in this field is the word problem

, namely, given an expression over the free skew field (namely, a formula over

in non-commutative variables), is it identically zero333This is an analog of the PIT problem in algebraic complexity, for non-commutative formulas with division.. As shown in [GGOW16] this problem is reducible to testing if the capacity is 0, for . [GGOW16] significantly extends [Gur04], proving that his algorithm is actually an FPTAS for all inputs, resolving the of our general problem above, and providing the first polynomial time algorithm for this basic word problem (the best previous algorithm, deterministic or probabilistic, required exponential time)444This problem was solved for finite characteristics by different methods in [IQS17]..

• Analysis. We move from efficiently testing identities to efficiently testing inequalities. The Brascamp-Lieb inequalities [BL76, Lie90]

are an extremely broad class capturing many famous inequalities in geometry, probability, analysis and information theory (including Hölder, Loomis-Whitney, Shearer, Brunn-Minkowski, …). A central theorem in this field

[Lie90] expresses feasibility, and the tightest possible constant if feasible, for every instantiation of such inequality in terms of capacity (defined somewhat differently, but directly related to ours). In [GGOW17] a direct alternating minimization algorithm555Which can be viewed also as a reduction to the operator scaling algorithm in [GGOW16]. is given which efficiently approximates it on every input! Previous techniques to even bound capacity were completely ineffective, using compactness arguments.

• Quantum Information Theory and Geometric Complexity Theory. We finally move to arbitrary . The quantum marginal problem (generalizing the classical marginal problem

in probability theory) asks if a collection of marginals (namely, density matrices) on

subsystems of a given quantum system are consistent, namely is there a density matrix on the whole system whose partial traces on the given subsystems result in these marginals [Kly06]. When the state of the whole system is pure and the marginals are proportional to identity matrices, each subsystem is maximally entangled with the others. It is an important task to distill entanglement, that is, given a tensor, to transform it into such a locally maximally entangled state by SLOCC666Acronym for Stochastic Local Operations and Classical Communication; these are the natural actions when each of the subsystems is owned by a different party. operations [BPR00, Kly02]. It turns out that this amounts precisely to the capacity optimization problem, and the matrices which minimize it solve this distillation problem. One can thus view our FPTAS as achieving approximate entanglement distillation (see, e.g., [VDB03, WDGC13, Wal14]). We note that for , the non-zeroness of capacity captures a refinement of the asymptotic positivity of special Kronecker coefficients, a central notion in representation theory and geometric complexity theory (GCT) of [MS01, BCMW17, IMW17]. We refer to Section 5 for more detail.

• Invariant Theory. We will be brief, and soon discuss this area at length, as it underlies most of the work in this paper. Invariant theory studies symmetries, namely group actions on sets, and their invariants. The action of our particular group above on the particular linear space of tensors is an example of the main object of study in invariant theory: actions of reductive777A somewhat technical term which includes all classical linear groups. groups on linear spaces . A central notion in the (geometric) study of such actions is the null cone; it consists of all elements in which maps arbitrarily close to . Here the connection to our problem is most direct: the null cone of the action in our framework is precisely the input tensors whose capacity is 0!

We stress that previous algorithms for the null cone (for general reductive actions) required doubly exponential time (e.g., [Stu08, Algorithm 4.6.7]). In contrast, our algorithm (for the general framework above) decides if a given tensor is in the null cone in singly exponential time! We will also discuss special cases where our FPTAS achieves polynomial time. The same null-cone problem is a direct challenge among the algorithmic problems suggested by Mulmuley towards the GCT program in [Mul17].

We now turn to discuss at more length the relevant features of invariant theory, explaining in particular the dual optimization problem that our algorithm actually solves. Then we will state our main results.

### 1.1 Invariant theory

Invariant theory is an area of mathematics in which computational methods and efficient algorithms are extremely developed (and sought). The reader is referred to the excellent texts [DK02, Stu08] for general background, focused on algorithms. Using invariant theory was key for the recent development of efficient algorithms for operator scaling mentioned above [GGOW16, IQS17], while these algorithms in turn solved basic problems in this field itself. This paper proceeds to expand the applications of invariant theory for algorithms (and computational complexity!) and to improve the efficiency of algorithms for basic problems in the field.

Invariant theory is an extremely rich theory, starting with seminal works of Cayley [Cay46]888Which describes his ingenious algorithm, the Omega-process., to compute invariants of , of Hilbert [Hil90, Hil93]999In which he proves his celebrated Nullstellensatz and Finite Basis theorem. and others in the 19th century. Again, we focus on linear actions of reductive groups on vector spaces . It turns out that in this linear setting the relevant invariants are polynomial functions of the variables (entries of vectors in with respect to some fixed basis), and the central problem is to understand the invariant ring, namely all polynomials which are left invariant (compute the same value) under the action of . Two familiar examples, which illustrate how good an understanding we can hope for, are the following:

• When , the group of permutations of the coordinates of an -vector in , the ring of invariants are (naturally) all symmetric polynomials, which is (finitely!) generated by the elementary symmetric polynomials.

• When 101010Here is the group of invertible matrices of determinant 1. acts on matrices by changing basis of the rows and columns respectively, the ring of invariants is generated by one polynomial in the entries of this matrix: the determinant.

One of Hilbert’s celebrated results in [Hil90, Hil93] was that the invariant ring is always finitely generated. Further understanding calls for listing generators, finding their algebraic relations, making them as low degree as possible, as few as possible, and as easily computable as possible. Many new motivations for these arose from the GCT program mentioned above (and below).

The action of naturally carves the space into orbits, namely sets of the form consisting of all images of under the action of . Understanding when are two vectors in the same orbit captures numerous natural problems, e.g., graph isomorphism (when is a graph in the orbit of another, under vertex renaming), module isomorphism (when is a vector of matrices in the orbit of another, under simultaneous conjugation), and others. Over the complex numbers it is more natural to consider orbit closures and answer similar problems regarding membership of vectors in such orbit closures. For an important example, the central approach of Mulmuley and Sohoni’s GCT program [MS01] to Valiant’s conjecture that VPVNP is to translate it to the question of whether the permanent is in the orbit closure of (a polynomially larger) determinant under a linear action on the matrix entries. Understanding orbit closures is the subject of Geometric Invariant Theory, starting with the seminal work of Mumford [Mum65] in the middle of the last century. He proved that answering such questions is equivalent to computing invariants. That is, the orbit closures of two vectors intersect if and only if for all invariant polynomials .

An important special case is to understand the orbit closures that contain the zero vector . Note that by the theorem above, these are all vectors for which holds for all invariant polynomials without constant term. This is an algebraic variety (nothing more than a zero set a system of polynomials), which is called the null cone of the action of on . Understanding the null cone is also central to geometric complexity theory from a topological viewpoint when projectivizing the space of orbits (making a point of each one, “modding out” by the group action).

A key to understanding the null cone is a duality theory for the group action, which may be viewed as a non-commutative analog of the duality theory for linear programming. We will elaborate on this duality theory, essential to our work, in Sections 4 and 3 and only sketch the essentials here. It is clear that a vector is in the null cone if its capacity is equal to zero, and so can be “certified” by exhibiting a sequence of group elements such that approaches zero. How can we “certify” that a vector is not in the null cone? The capacity formulation means that there must be some fixed such that, for every element in the orbit of , . Consider a point attaining this minimum distance to  (assume it exists). There is a way to write down a non-commutative analog of the “Lagrange multiplier” conditions giving equations saying essentially that the derivative of the group action on in any direction is zero. It turns out that the distance to satisfying these equations is another, dual optimization problem. In particular, to “certify” non-membership of in the null cone, it suffices to exhibit a sequence of group elements such that approaches distance 0 from satisfying these equations. Again, we give plenty more detail on that in Sections 4 and 3.

What is remarkable is that in our setting, when is a product of groups as above, the new optimization has the exact same form as the original one we started with – only with a different function to minimize! Moreover, the set of equations decouples to subsets, and optimizing each local one is efficiently computable. In short, this makes our new optimization problem again amenable to alternating minimization. Further, the conditions that we need to satisfy may be viewed as “scaling conditions”, which for can be shown to be equivalent to the matrix scaling conditions in the commutative case, and to the operator scaling conditions in the non-commutative case. Thus, these scaling algorithms in combinatorial optimization to “doubly stochastic” position naturally arise from general considerations of duality in geometric invariant theory. We will continue to borrow this name and call a tensor that satisfies the minimization conditions -stochastic", and we quantify how well a tensor  satisfies -stochasticity by a distance measure denoted . For an input tensor , we then seek to compute the minimal distance to -stochasticity, denoted , which is the infimum of over all in the orbit of (and formally defined in Definition 3.4). Thus, summarizing, with and as before eq. 1.3, our new, dual optimization problem is

 dds(X):=infA∈Gds(A⋅X). (1.4)

In our framework, has a very simple form. Taking the quantum information theory view of the the problem (described above in the motivation), the tensor captures a quantum system of local parts; then is simply the total -distance squared of the subsystems to being maximally mixed (that is, proportional to normalized identity density matrices).

We now proceed to describe our main results: the problems we address, our algorithms for them, and their complexities. More formal statements of all will appear in the technical sections.

### 1.2 Our results

We start by describing our technical results, and then discussing the conceptual contribution of our paper.

#### Technical results.

We fix and as above. While we work over the complex numbers, we will think of an input tensor as an array of integers (one can consider rational numbers, but this is no gain in generality) represented in binary. The input size parameters will be the number of tensor entries , and the maximum binary length of each entry, which will be denoted by . So the total input length may be thought of as .

Recall again that we have considered two dual optimization problems111111While we do not know of any precise relationship between the value of these optimization problems, the vanishing behaviour of these two problems, as explained above, is dual to each other. iff . above for which the input is :

 cap(X)=infA∈G∥A⋅X∥22, (1.3)
 dds(X)=infA∈Gds(A⋅X). (1.4)

These lead to both exact and approximate computational problems for which we give new algorithms.

The exact null-cone problem we will consider is to test if the input tensor is in the null cone of the group action . As we discussed, is in the null cone iff and iff (!).

We will give two different exponential time algorithms for this problem. We note that the previous best algorithms (which work in the greater generality of all reductive group actions) required doubly exponential time! These algorithms may be very useful for the study of invariants for the actions of “small”, specific groups121212Invariant theory started with the study of invariants of binary, quadratic forms under the action of . Constructing complete sets of invariants for slightly larger groups is still open.. We will discuss these algorithms soon.

The approximate problem we will consider will be to approximate 131313One could also consider the same for , but unlike , we have no faster algorithm for approximating than computing it exactly.. Our approximation of runs in polynomial time in the input length and the (additive) approximation – thus a FPTAS is our main result141414Some readers might object to our use of the term FPTAS for our algorithm since we have an additive approximation guarantee whereas the term is commonly used for multiplicative guarantees. However, note that a multiplicative guarantee for either or will need to determine their vanishing behaviour. This captures the null-cone problem, a polynomial time algorithm for which is the main open problem left open by this work. Also note that we don’t have any guarantee on approximating if , which is also left as an interesting open problem. However, approximating when it is is more fundamental because of the connection to the null cone., and we state it first.

###### Theorem 1.1 (Main theorem).

There is a time deterministic algorithm (Algorithm 1) that, given a tensor with integer coordinates of bit size bounded by , either identifies that is in the null cone or outputs a “scaling” such that .

We present the algorithm right below in Section 3.1 and analyze it in Section 3.2. It is a realization of the alternating minimization scheme discussed before, and may be viewed as a scaling algorithm. Indeed, it generalizes the FPTAS given in [GGOW16] for , which is the operator scaling case. While the analysis is similar in spirit, and borrows many ideas from [GGOW16], we note that for higher dimensions , the understanding of invariant polynomials was much poorer. Thus, new ideas are needed, and in particular we give explicit151515The generators are explicitly defined but may not be efficiently computable by algebraic circuits. generators (via classical Invariant Theory techniques) of the invariant ring (for all degrees), all of which have small coefficients, and we bound the running time in a way which is independent of degree bounds.

Moving to the algorithms for the exact null-cone problem, as discussed, we have two exponential time ones, improving the known doubly exponential ones.

The first algorithm is a natural instantiation of the FPTAS above, with the right choice of . Specifically, we prove that when if is not in the null cone, . So picking suffices. This algorithm is described in Theorem 3.8.

Our second algorithm is much more general, and applies to the action of a product of general linear groups on any vector space , not just the the set of tensors. Here, instead of giving explicit generators, we simply prove their existence, via Cayley’s famous Omega-process. Namely, we prove that in this very general case there are always generating invariants which have both exponential degree and whose coefficients have exponential bit length. These bounds turns out to be sufficient to carry out our time analysis. The bound on the size of coefficients is (to the best of our knowledge) a new structural result in Invariant Theory, and what our analysis demonstrates is its usefulness for solving computational problems.

The corollaries of these main results to specific areas, which we described in some of the motivation bullets earlier in the introduction, are presented in the different technical sections.

#### Conceptual contributions.

We believe that the general framework we present, namely an algebraic framework of alternating minimization, establishes an important connection between the fields of optimization and invariant theory. As we saw, this framework is very general and encompasses many natural problems in several fields. It exposes the importance of symmetry in many of these problems, and so invites the use of tools from invariant theory which studies symmetry to the analysis of algorithms. This is akin to the GCT program, which connects algebraic complexity theory with invariant theory and representation theory, exposing the importance of using symmetry to proving lower bounds.

At the same time we expose basic computational problems of invariant theory, which are in great need of better algorithms and algorithmic techniques from computer science and optimization to solve them. The advent of operating scaling already pointed out to this power in their utility of alternate minimization and scaling numerical algorithms to a field in which most work was symbolic, and we expand on this theme in this paper. But there are many other optimization methods which can be useful there, and we note only one other example, that may in particular be useful for the very null-cone problem we study. While the problem of capacity optimization is not convex under the usual Euclidean metric, it is actually geodesically convex [KN79, NM84, Woo11] if one moves to the natural (hyperbolic) metric on the group. This opens up this problem (and related ones) to avenues of attack from techniques of classical optimization (gradient and other descent methods, interior point methods, etc.) when generalized and adapted to such metrics. We believe that the area of geodesic convex optimization, which itself in its nascent stages (from an algorithmic standpoint), is likely to have a significant impact on algorithmic invariant theory. See more on algorithmic techniques for geodesic convexity in the work of [ZS16] and the references therein.

### 1.3 Plan of the paper

In the remainder of the paper, we explain the problems that we study in more detail; we discuss our results and techniques and point out various connections and interesting open problems. In Section 2, we expand on the null-cone problem and motivate a natural alternating minimization algorithm for it. In Section 3, we explain the Kempf-Ness theorem, which can be seen as a form of noncommutative duality, and we present our scaling algorithm for solving the null cone problem and its analysis. We first give an overview of the techniques that go into its analysis. We then construct a generating set of invariants via Schur-Weyl duality and lower bound the capacity of tensors that are not in the null cone. Lastly, we give a precise mathematical analysis of our algorithm. In Section 4, we describe the Hilbert-Mumford criterion for the null cone and quantify the instability of a tensor in the null cone, which gives a way of characterizing when our algorithm can decide the null-cone problem in polynomial time. Section 5 explores connections to the quantum marginal problem and Kronecker coefficients and Section 6 connects the null cone to the notion of slice rank of tensors. In Section 7, we describe an explicit construction of Reynolds operators via Cayley’s Omega process, and we present alternative, weaker bounds on the capacity and a resulting algebraic exponential-time algorithm for the null-cone problem, which generalizes to arbitrary polynomial actions of products of special linear groups. In Section 8, we mention some open problems. To make the paper self-contained, we supply concrete proofs of well-known statements from geometric invariant theory in Appendix A.

## 2 Null-cone problem as an optimization problem

In the introduction, we defined the null cone as the set of tensors  such that the zero vector lies in the orbit closure , i.e., there exists a sequence of group elements sending the vector to zero, . Thus the null-cone problem amounts to the optimization problem (1.3): A tensor is in the null cone if and only if the capacity is equal to zero, namely:

 cap(X)=infA∈G\normA⋅X22=minY∈¯¯¯¯¯¯¯¯¯G⋅X\normY22=0.
###### Remark 2.1.

We are denoting the above optimization problem by , short for capacity, to remain consistent with similar notions defined in previous papers like [GY98, Gur04, Gur06, GGOW16]. In most of these cases, the capacity notion that they consider is also looking at the optimization problem: “minimum -norm in an orbit closure" for specific group actions. As phrased in these papers, they might look different (e.g., they involve determinants in place of

norms) but there is a tight connection between the two via the AM-GM inequality (i.e., the inequality of arithmetic and geometric means) in one direction and a single alternating minimization step in the other direction.

A fundamental theorem of Hilbert [Hil93] and Mumford [Mum65] gives an alternative characterization of the null cone. It states that the null cone is precisely the set of tensors on which all invariant polynomials  (without constant term) vanish. This is the starting point to the field of Geometric Invariant Theory.

It is instructive to see why must hold for any tensor  in the null cone: Since is -invariant, for all in the -orbit of , and, by continuity, also for all in the closure of the orbit. If the tensor  is in the null cone, then is in its closure. But then , since the polynomial  has no constant term. In Proposition 3.11 we derive a more subtle, quantitative version of this observation. It will be a fundamental ingredient to the analysis of our algorithm.

Since the group consists of tuples , an alternating minimization algorithm suggests itself for solving the optimization problem : Fix an index  and optimize only over a single , leaving the unchanged. This gives rise to an optimization problem of the following form:

 infAS∈SL(nS)\norm(AS⊗InT)⋅Y22

Here , and

. This optimization problem has a closed form solution in terms of the following quantum mechanic analog of the marginals of a probability distribution:

###### Definition 2.2 (Quantum marginal).

Given a matrix acting on , where and are Hilbert spaces of dimensions and , respectively (e.g., and ). Then there is a unique matrix acting on , called the quantum marginal (or reduced density matrix) of , such that

 \tr[(MS⊗IT)ρ]=\tr[MSρS] (2.1)

for every acting on .

This point of view gives rise to an important interpretation of our result (Section 5).

It can be easily seen that if is positive semidefinite (PSD) then is PSD, and that it has the same trace as . It is given by the following explicit formula

 ρS=nT∑k=1(IS⊗ek)†ρ(IS⊗ek).

Here the ’s are the elementary column vectors of dimension .

Now observe that161616Note that we can also view tensors in as column vectors in the space .

 ∥(AS⊗IT)⋅Y∥22=\tr[(A†SAS⊗IT)YY†].

Let be the partial trace of obtained by tracing out all the Hilbert spaces except the one, and rename . Then, using eq. 2.1,

 \tr[(A†SAS⊗I)YY†]=\tr[A†iAiρi].

Hence we are left with the optimization problem:

 infAi∈SL(ni)\tr[AiA†iρi]. (2.2)

We can see by the AM-GM inequality that the optimum of the program (2.2) is , which is achieved for (if is not invertible then we define the inverse on the support of ; the infimum will at any rate be zero).

We thus obtain the core of an alternating minimization algorithm for the capacity . At each step , select an index , compute the quantum marginal of the current tensor , and update by performing the following scaling step:

 X(t+1)←det(ρi)1/2niρ−1/2i⋅X(t). (2.3)

Here and in the following we use the abbreviation

 Ai⋅Y:=(In0⊗In1⊗⋯⊗Ini−1⊗Ai⊗Ini+1⊗⋯⊗Ind)⋅Y,

where we act nontrivially on the -th tensor factor only.

We will analyze essentially this algorithm, augmented by an appropriate method for selecting the index at each step and a stopping criterion (see Algorithm 1 in the next section).

## 3 Noncommutative duality and the tensor scaling algorithm

To obtain a scaling algorithm with rigorous guarantees, we will now derive a dual formulation of the optimization problem . This will achieved by a theorem from Geometric Invariant Theory, due to Kempf and Ness, which can be understood as part of a noncommutative duality theory.

We briefly return to the setting of a general action of a group on a vector space to explain this result. Fix a vector and consider the function . The moment map at , , measures how much changes when we perturb around the identity. So is just the derivative of the function at the identity element (in a precise sense which we do not need to define here). Now suppose is not in the null cone. Then there exists a nonzero vector of minimal norm in the orbit closure. Since the norm is minimal, perturbing by the action of group elements close to the identity element does not change the norm to first order. Hence . To summarize, if is not in the null cone, then there exists such that . This condition can be understood as a vast generalization of the notion of “doubly stochastic”, which one might be tempted to call -stochastic”.

A remarkable theorem due to Kempf and Ness [KN79] asserts that this is not only a necessary but in fact also sufficient condition, i.e., is not in the null cone if and only if there exists such that . This is a duality theorem and (along with the Hilbert-Mumford criterion discussed below) should be thought of as a non-commutative version of linear programming duality, which arises when the group is commutative (see Remark 3.5 below, [Sur00], and the proof of Theorem 3.2 in Appendix A). For a detailed discussion of moment maps, we refer the reader to [Woo11, Wal14].

We now return to our setting where acts on tensors in . We first define the notion of a scaling and then instantiate the Kempf-Ness theorem in the situation at hand:

###### Definition 3.1 (Scaling).

A tensor is called a (tensor) scaling of if .

This notion of scaling generalizes the notions of operator scaling [Gur04, GGOW16] and matrix scaling [Sin64]. It is immediate from the definition of the capacity that a tensor is in the null cone if and only if any of its scalings lies in the null cone.

We now state the Kempf-Ness theorem for tensors:

###### Theorem 3.2 (Duality, special case of [Kn79]).

A tensor is not in the null cone if and only if there exists such that the quantum marginals of of the last tensor factors are given by .

To be self-contained, we give a proof of Theorem 3.2 in Appendix A.

We note that the trace of is one. This normalization is very natural from the quantum viewpoint. In quantum theory, PSD matrices of unit trace describe quantum states of multi-partite systems and the quantum marginals describe the state of subsystems. The condition that is proportional to the identity means that the state of the subsystem is maximally mixed or, for as above, that the system is maximally entangled with the rest. We discuss applications of our result to quantum information theory in Section 5.

The condition that each

is proportional to the identity matrix generalizes the notion of “doubly stochastic”, which arises in the study of the left-right action/operator scaling

[Gur04, GGOW16]. We will refer to it as -stochastic”, which we define formally below.

###### Definition 3.3 (d-stochastic).

A tensor is called -stochastic if the quantum marginals of of the last tensor factors are given by for .

More explicitly, we can state the condition that as follows: We want that the slices of the tensor in the direction are pairwise orthogonal and that their norm squares are equal to . That is, for each consider the order tensors defined as

 Y(j)(j0,…,ji−1,ji+1,…,jd):=Y(j0,…,ji−1,j,ji+1,…,jd);

then the following should hold:

 ⟨Y(j),Y(j′)⟩=1niδj,j′.

Here is the Kronecker delta function and the inner product is the Euclidean inner product of tensors.

Theorem 3.2 implies that is not in the null cone if and only if we can find scalings whose last quantum marginals are arbitrarily close to . We can therefore rephrase the null-cone problem as another, dual optimization problem, where we seek to minimize the distance of the marginals to being maximally mixed. This is captured by the following definitions:

###### Definition 3.4 (Distance to d-stochasticity).

Let be a tensor and , with quantum marginals on the last systems. Then we define the distance to -stochasticity as the (squared) distance between the marginals and -stochastic marginals,

 ds(Y):=d∑i=1\normρi−Inini2F,

where is the Frobenius norm. Following eq. 1.4, we further define the minimal distance to -stochasticity as

 dds(X):=infA∈Gds(A⋅X).

Using this language, Theorem 3.2 states that is in the null cone if and only if the minimal distance to -stochasticity is nonzero. We summarize the duality between the two optimization problems:

 X is in the null cone⟺cap(X)=0⟺dds(X)>0 (3.1)
###### Remark 3.5.

According to eq. 3.1, for any tensor  exactly one of the following two statements is true: is in the null cone () or its orbit closure contains a -stochastic tensor (

). Such dichotomies are well-known from the duality theory of linear programming (Farkas’ lemma, Gordan’s lemma, the duality between membership in a convex hull and the existence of separating hyperplanes, etc.). In the case of the (commutative) group of diagonal matrices, one recovers precisely these linear programming dualities from the general noncommutative framework.

### 3.1 The tensor scaling algorithm

If is not in the null cone then there exist scalings  such that is arbitrarily small. The main technical result of this paper is an algorithm that finds such scalings for any fixed . It is a generalization of the results obtained for matrix and operator scaling in [LSW98, GY98, Gur04, GGOW16]. Recall that we use  to denote the total number of coordinates of the tensor, .

###### Theorem 3.6.

Let be a (nonzero) tensor whose entries are integers of bitsize no more than . Let and suppose that . Then Algorithm 1 with iterations either identifies that is in the null cone or outputs a scaling such that .

To motivate Algorithm 1, note that, for every , we would like the quantum marginal of the system to be proportional to the identity matrix. Suppose the quantum marginal on the system is . Then by acting on this system by any proportional to we can precisely arrange for the quantum marginal on system to be proportional to the identity matrix171717It is not hard to see that if is not in the null cone then is invertible (Lemma 3.13).. (But this might disturb the quantum marginals on the other systems!) An appropriate choice of that ensures that it has determinant one is . Thus we find that the alternating minimization heuristics in eq. 2.3 that we previously derived for is equally natural from the perspective of the dual problem of minimizing ! Remarkably, iterating this operation in an appropriate order of subsystems does ensure that all the quantum marginals get arbitrarily close to the normalized identity matrices – provided that is not in the null cone. (If is in the null cone, then it will lead to a sequence of scalings of norm arbitrarily close to zero, certifying that is in the null cone.)

###### Remark 3.7 (Rational Entries).

Notice that we required our input tensor to have integer entries. In general, an input tensor could have rational entries. If this is the case, we can simply multiply by the denominators to obtain an integral tensor , on which we can then apply our algorithm above. Since multiplying a tensor by a nonzero number does not change the property of being in the null cone, our algorithm will still be correct. The following remarks discuss the changes in bit complexity of our algorithm. In this case, the size of is at most , and and therefore the bound for the number of iterations is still , as we wanted.

When analyzing bit complexity, Algorithm 1 is not directly applicable, since it computes inverse square roots of matrices. Even if the entries of the square roots were always rational, the algorithm would also iteratively multiply rationals with very high bit complexity. We can handle these numerical issues by truncating the entries of the scalings of , as well as the quantum marginals , to many bits after the decimal point. Then, in a similar way as done in [Gur04, GGOW16], we can prove that there is essentially no change in the convergence required in the number of iterations . Since each arithmetic operation will be done with numbers of many bits, this truncation ensures that we run in polynomial time. As a consequence, we obtain our main theorem, which we had already announced in the introduction:

See 1.1

We remark that for , our algorithm in essence reduces to the algorithm of [GY98, Gur04, Gur06, GGOW16]; for , it refines the algorithm proposed previously in [VDB03] without a complexity analysis. Indeed, Theorem 1.1 is a generalization of the main theorem of [GGOW16] on operator scaling (or the left-right action). There it was enough to take to be sure that the starting tensor is not in the null cone.

In our much more general setting it is still true that there exists some such that for any in the null cone. Unfortunately, in our general setting, will be exponentially small (see Lemma 4.9 in the next section). Therefore, Theorem 1.1 with this choice of gives an exponential time algorithm for deciding the null-cone problem:

###### Theorem 3.8 (Null-cone problem).

There is a time deterministic algorithm Algorithm 1 with that, given a tensor with integer coordinates of bit size at most , decides whether is in the null cone.

Thus our algorithm does not solve the null-cone problem in polynomial time for general tensor actions, and this remains an excellent open question!

Nevertheless, Theorem 1.1 as such, with its promise that , is of interest beyond merely solving the null-cone problem. It has applications to quantitative notions of instability in geometric invariant theory, to a form of multipartite entanglement distillation in quantum information theory, and to questions about the slice rank of tensors, which underpinned recent breakthroughs in additive combinatorics. We will discuss these in Sections 6, 5 and 4 below.

### 3.2 Analysis sketch

To analyze our algorithm and prove Theorem 3.6, we follow a three-step argument similar to the analysis of the algorithms for matrix scaling and operator scaling in [LSW98, Gur04, GGOW16] once we have identified the appropriate potential function and distance measure. In the following we sketch the main ideas and we refer to Sections 3.4 and 3.3 for all technical details.

As potential function, we will use the norm squared of the tensor, denoted , and the distance measure is defined above in Definition 3.4. Note that these the two functions are exactly dual to each other in our noncommutative optimization framework (see eqs. 1.4, 1.3 and 3.1)!

The following two properties of the capacity will be crucial for the analysis:

1. if is a tensor with integral entries that is not in the null cone (Proposition 3.11).

2. is invariant under the action of the group .

Using the above properties, a three-step analysis follows the following outline:

1. An upper bound on from the input size.

2. As long as , the norm squared decreases by a factor in each iteration: , where we recall that .

3. A lower bound on for all . This follows from properties A and B above.

The above three steps imply that we must achieve for some .

Step is immediate and step follows from the -invariance of the capacity (property B) and a quantative form the AM-GM inequality (Lemma 3.12).

Step , or rather the lower bound on  (property A) is the technically most challenging part of the proof. It is achieved by quantifying the basic argument given on p. 2. The main tool required to carry out this analysis is the existence of a set of invariant polynomials with “small" integer coefficients that spans the space of invariant polynomials. We do so by giving an “explicit" construction of such a set via Schur-Weyl duality (Section 3.3). Remarkably, we do not need to use the known exponential bounds on the degree of generating invariants in [Der01].

We give an alternative lower bound on using Cayley’s Omega-process (Section 7). This proof is more general (it applies to arbitrary actions of product of ’s) but gives a weaker bound (although sufficient to yield a polynomial time analysis for the algorithm). Here we do need to rely on recent exponential bounds on the degree of generating invariants by Derksen.

The size of the integer coefficients of a spanning set of invariant polynomials is an interesting invariant theoretic quantity that appears to not have been studied in the invariant theory literature. It is crucial for our analysis, and we believe it could be of independent interest in computational invariant theory.

### 3.3 Capacity bound from invariant polynomials

If a tensor  is not in the null cone then its -orbit remains bounded away from zero, i.e., . In this section we will derive a quantitative version of this statement (Proposition 3.11). As mentioned in the preceding section, this bound features as an important ingredient to the analysis our scaling algorithm (see Section 3.4 below). To prove the bound, we use the concrete description of invariants in a crucial way.

We start by deriving a concrete description of the space of -invariant polynomials on of fixed degree , denoted ; see, e.g., [Pro07, BI13b] for similar analyses. Any such polynomial can be written as , where is a -invariant linear form on .181818In general, is not unique. It would be unique if we required to be an element in the symmetric subspace. We are thus interested in the vector space

 ((V⊗m)∗)G=((Cn0)⊗m)∗⊗(((Cn1)⊗m)∗)SL(n1)⊗…⊗(((Cnd)⊗m)∗)SL(nd).

Here is the dual vector space of , i.e., the space of linear functions . The action of on induces an action of on as follows: . The notation denotes the subspace of invariant under the specified action of the group on . We first focus on a single tensor factor. Clearly, the by determinant is an -invariant linear form in . More generally, if is a multiple of , then for each permutation ,

 pn,π(v1⊗⋯⊗vm)=det(vπ(1),…,vπ(n))det(vπ(n+1),…,vπ(2n))⋯det(vπ(m−n+1),…,vπ(m)) (3.2)

defines an -invariant linear form in . In fact:

###### Lemma 3.9.

If divides then the linear forms for span the space of -invariant forms in . If does not divide then no nonzero -invariant linear forms exist.

###### Proof.

The space is not only a representation of , which acts by -fold tensor powers, but also of the permutation group , which acts by permuting the tensor factors. Both actions clearly commute, so the subspace of -invariant forms is again a representation of . A fundamental result in representation theory known as Schur-Weyl duality implies that the space of -invariant forms is in fact an irreducible representation of if divides (and that otherwise it is zero). This implies that the space of -invariant forms is spanned by the -orbit of any single nonzero -invariant form. It is clear that the linear forms form a single -orbit; since this establishes the lemma. ∎

Lastly, we denote by the dual basis of the standard product basis of , i.e., . As a direct consequence of Lemma 3.9 and our preceding discussion, we find that the polynomials

 P(X)=(εi1,…,im⊗pn1,π1⊗⋯⊗pnd,πd)(X⊗m), (3.3)

where and , span the space of -invariant polynomials of degree on if (otherwise there exist no -invariant polynomials).

This concrete description of allows us to establish a fundamental bound on invariants:

###### Proposition 3.10.

The vector space of -invariant homogeneous polynomials is nonzero only if . In this case, it is spanned by the polynomials defined in (3.3), where and . These are polynomials with integer coefficients and they satisfy the bound

 \absP(X)≤(n1⋯nd)m\normXm2.
###### Proof.

The determinant is a homogeneous polynomial with integer coefficients and it is clear that this property is inherited by the polynomials , so it remains to verify the bound. Let be an arbitrary tensor. We expand , where the are vectors in . Note that the euclidean norm of is bounded by . We obtain

and so

 P(X) =(εi1,…,im⊗pn1,π1⊗⋯⊗pnd,πd)(X⊗m) =∑J1,…,Jdεi1,…,im(m⨂α=1vJ1(α),…,Jd(α))pn1,π1(m⨂α=1eJ1(α))⋯pnd,πd(m⨂α=1eJd(α)).

The first factor is bounded in absolute value by . In view of eq. 3.2, the remaining factors are determinants of matrices whose columns are standard basis vectors; hence they are equal to zero or . Altogether, we find that

 |P(X)|≤∑J1,…,Jd\normXm2=(n1⋯nd)m\normXm2.\qed

We remark that similar results can be established for covariants. We now use Proposition 3.10 to deduce the following lower bound on the capacity:

###### Proposition 3.11.

Let be a tensor with integral entries that is not in the null cone. Then,

 cap(X)=infg∈G∥g⋅X∥22≥1(n1⋯nd)2≥1n2,

where we recall that .

###### Proof.

Since is not in the null cone, there exists some -invariant homogeneous polynomial  as in Proposition 3.10 such that . Let denote the degree of . Then it holds that

 1≤\absP(X)=\absP(g⋅X)≤(n1⋯nd)m∥g⋅X∥m2.

The first inequality holds because both and have integer coefficients, the subsequent equality because is a -invariant polynomial, and the last inequality is the bound from Proposition 3.10. From this the claim follows immediately. ∎

In Section 7 we explain how rather worse bounds, that are nevertheless sufficient to establish Theorem 1.1, can be established by analyzing the Cayley Omega-process [DK02], a classical tool from invariant theory, and Derksen’s degree bounds [Der01].

### 3.4 Analysis of the algorithm

In this section, we formally analyze Algorithm 1 and prove Theorem 3.6. As mentioned in Section 3.2, key to our analysis is the bound on the norm of vectors in a -orbit established in Proposition 3.11, as well the following result from [LSW98] (see [GGOW16, Lemma 5.1]), which will allow us to make progress in each step of the scaling algorithm.

###### Lemma 3.12.

Let be positive real numbers such that and , where . Then, .

We will also need the following easy lemmas which are well-known in the quantum information literature. We supply proofs for sake of completeness:

###### Lemma 3.13.

Let be a nonzero tensor with a singular reduced density matrix for some . Then is in the null cone for the -action.

###### Proof.

Let denote the orthogonal projection onto the support of and . Note that since is singular. Then the following block diagonal matrix is in ,

 gi:=ε1/rk(Pi)Pi+ε−1/rk(Qi)Qi,

and it is clear that

 gi⋅X=ε1/rk(Pi)X→0

as . Thus is in the null cone. ∎

###### Lemma 3.14 ([Dvc00]).

The rank of the quantum marginals associated with a tensor is invariant under scaling.

###### Proof.

Let and be one of its quantum marginals. Then the rank of is equal to the rank of considered as a 2-tensor (namely, a matrix) in . But the latter is well-known to be invariant under . In particular, it is invariant when we apply for arbitrary invertible matrices . Thus the rank of is a scaling invariant. ∎

We now prove Theorem 3.6, restated for convenience, following the outline given in Section 3.2.

See 3.6

###### Proof.

If we return in step 1, then is correctly identified to be in the null cone by Lemma 3.13. Now assume that the quantum marginals are all nonsingular so that the algorithm proceeds to step 2. Let us denote by the indices selected at the algorithm in steps and by the corresponding marginals.

We first note that the inverse matrix square roots are always well-defined and that each is a scaling of . Indeed, this is clear for , and since , where

 A(t)i(t):=det(ρ(t)i(t))1/2ni(t)(ρ(t)i(t))−1/2∈SL(ni(t)),

we have that is a scaling , which in turn implies that its quantum marginals are nonsingular (Lemma 3.14). In particular, it follows that if Algorithm 1 returns in step 2 then it returns a scaling such that .

To finish the proof of the theorem, let us assume Algorithm 1 does not return in step 2 but instead proceeds to step 3. We follow the sketch in Section 3.2 and track the norm squared of the tensors . Initially,

 \normY(1)22=∥∥X∥∥22≤n22b (3.4)

since is a bound on the bitsize of each entry in the tensor. Next, we note in each step of the algorithm,

 \normY(t+1)22=det(ρ(t)i(t))