Tensor Balancing on Statistical Manifold

02/27/2017 · Mahito Sugiyama, et al. · The University of Tokyo

We solve tensor balancing, rescaling an Nth order nonnegative tensor by multiplying N tensors of order N - 1 so that every fiber sums to one. This generalizes a fundamental process of matrix balancing used to compare matrices in a wide range of applications from biology to economics. We present an efficient balancing algorithm with quadratic convergence using Newton's method and show in numerical experiments that the proposed algorithm is several orders of magnitude faster than existing ones. To theoretically prove the correctness of the algorithm, we model tensors as probability distributions in a statistical manifold and realize tensor balancing as projection onto a submanifold. The key to our algorithm is that the gradient of the manifold, used as a Jacobian matrix in Newton's method, can be analytically obtained using the Moebius inversion formula, the heart of combinatorial mathematics. Our model is not limited to tensor balancing; it has wide applicability, as it includes various statistical and machine learning models such as weighted DAGs and Boltzmann machines.


1 Introduction

Matrix balancing is the problem of rescaling a given square nonnegative matrix to a doubly stochastic matrix, where every row and column sums to one, by multiplying two diagonal matrices $R$ and $S$. This is a fundamental process for analyzing and comparing matrices in a wide range of applications, including input-output analysis in economics, called the RAS approach [Parikh, 1979, Miller and Blair, 2009, Lahr and de Mesnard, 2004], seat assignments in elections [Balinski, 2008, Akartunalı and Knight, 2016], Hi-C data analysis [Rao et al., 2014, Wu and Michor, 2016], the Sudoku puzzle [Moon et al., 2009], and the optimal transportation problem [Cuturi, 2013, Frogner et al., 2015, Solomon et al., 2015]. An excellent review of this theory and its applications is given by Idel [2016].

The standard matrix balancing algorithm is the Sinkhorn-Knopp algorithm [Sinkhorn, 1964, Sinkhorn and Knopp, 1967, Marshall and Olkin, 1968, Knight, 2008], a special case of Bregman’s balancing method [Lamond and Stewart, 1981] that iterates rescaling of each row and column until convergence. The algorithm is widely used in the above applications due to its simple implementation and theoretically guaranteed convergence. However, the algorithm converges linearly [Soules, 1991], which is prohibitively slow for recently emerging large and sparse matrices. Although Livne and Golub [2004] and Knight and Ruiz [2013] tried to achieve faster convergence by approximating each step of Newton’s method, the exact Newton’s method with quadratic convergence has not been intensively studied yet.

Another open problem is tensor balancing, a generalization of balancing from matrices to higher-order multidimensional arrays, or tensors. The task is to rescale an $N$th order nonnegative tensor to a multistochastic tensor, in which every fiber sums to one, by multiplying $N$ tensors of order $N - 1$. There are some results on mathematical properties of multistochastic tensors [Cui et al., 2014, Chang et al., 2016, Ahmed et al., 2003]. However, to date there has been no tensor balancing algorithm with guaranteed convergence that transforms a given tensor into a multistochastic tensor.

Here we show that Newton's method with quadratic convergence can be applied to tensor balancing while avoiding solving a linear system on the full tensor. Our strategy is to realize matrix and tensor balancing as projection onto a dually flat Riemannian submanifold (Figure 1), which is a statistical manifold and known to be the essential structure for probability distributions in information geometry [Amari, 2016]. Using a partially ordered outcome space, we generalize the log-linear model [Agresti, 2012] used to model higher-order combinations of binary variables [Amari, 2001, Ganmor et al., 2011, Nakahara and Amari, 2002, Nakahara et al., 2003], which allows us to model tensors as probability distributions in the statistical manifold. The remarkable property of our model is that the gradient of the manifold can be analytically computed using the Möbius inversion formula [Rota, 1964], the heart of combinatorial mathematics [Ito, 1993], which enables us to directly obtain the Jacobian matrix in Newton's method. Moreover, we show that a large number of the entries of a tensor are invariant with respect to one of the two coordinate systems of the statistical manifold, which substantially reduces the number of equations that Newton's method has to solve.

Figure 1: Overview of our approach.

The remainder of this paper is organized as follows: We begin with a low-level description of our matrix balancing algorithm in Section 2 and demonstrate its efficiency in numerical experiments in Section 3. To guarantee the correctness of the algorithm and extend it to tensor balancing, we provide theoretical analysis in Section 4. In Section 4.1, we introduce a generalized log-linear model associated with a partially ordered outcome space, followed by introducing the dually flat Riemannian structure in Section 4.2. In Section 4.3, we show how to use Newton's method to compute the projection of a probability distribution onto a submanifold. Finally, we formulate the matrix and tensor balancing problem in Section 5 and summarize our contributions in Section 6.

2 The Matrix Balancing Algorithm

Given a nonnegative square matrix $A = (a_{ij}) \in \mathbb{R}_{\ge 0}^{n \times n}$, the task of matrix balancing is to find vectors $r, s \in \mathbb{R}^n$ that satisfy

$$ (RAS)\mathbf{1} = \mathbf{1}, \qquad (RAS)^{\mathrm{T}}\mathbf{1} = \mathbf{1}, \tag{1} $$

where $R = \operatorname{diag}(r)$ and $S = \operatorname{diag}(s)$. The balanced matrix $A' = RAS$ is called doubly stochastic, in which each entry $a'_{ij} = r_i a_{ij} s_j$ and all the rows and columns sum to one. The most popular algorithm is the Sinkhorn-Knopp algorithm, which repeats updating $r$ and $s$ as $r = 1 / (As)$ and $s = 1 / (A^{\mathrm{T}} r)$. We denote $[n] = \{1, 2, \dots, n\}$ hereafter.
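As a point of reference, the following is a minimal NumPy sketch of the Sinkhorn-Knopp iteration described above; the function name `sinkhorn_knopp`, the stopping rule, and the iteration cap are our own choices rather than part of the paper.

```python
import numpy as np

def sinkhorn_knopp(A, max_iter=10000, tol=1e-8):
    """Alternately rescale rows and columns of a nonnegative matrix A until
    diag(r) @ A @ diag(s) is (approximately) doubly stochastic."""
    n = A.shape[0]
    r, s = np.ones(n), np.ones(n)
    for _ in range(max_iter):
        r = 1.0 / (A @ s)        # rebalance rows given the current column scaling
        s = 1.0 / (A.T @ r)      # rebalance columns given the current row scaling
        B = A * r[:, None] * s[None, :]
        residual = np.sum((B.sum(axis=1) - 1) ** 2) + np.sum((B.sum(axis=0) - 1) ** 2)
        if residual < tol:
            break
    return r, s
```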

In our algorithm, instead of directly updating $r$ and $s$, we update two parameters $\theta$ and $\eta$ defined as

$$ \log p_{ij} = \sum_{i' \le i} \sum_{j' \le j} \theta_{i'j'}, \qquad \eta_{ij} = \sum_{i' \ge i} \sum_{j' \ge j} p_{i'j'} \tag{2} $$

for each $(i, j) \in [n] \times [n]$, where we normalized entries as $p_{ij} = a_{ij} / \sum_{i'j'} a_{i'j'}$ so that $\sum_{ij} p_{ij} = 1$. We assume for simplicity that each entry $a_{ij}$ is strictly larger than zero. The assumption will be removed in Section 5.
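To make Equation (2) concrete, here is a small sketch that computes $\theta$ and $\eta$ for a strictly positive matrix with cumulative sums; `matrix_eta_theta` is a hypothetical helper name introduced here for illustration, not a function from the authors' code.

```python
import numpy as np

def matrix_eta_theta(A):
    """Return the (theta, eta) representation of Equation (2) for a positive matrix A."""
    p = A / A.sum()                              # normalize so that the entries sum to one
    # eta_ij = sum of p_i'j' over i' >= i and j' >= j (reversed cumulative sums)
    eta = np.flip(np.flip(p, 0).cumsum(0), 0)
    eta = np.flip(np.flip(eta, 1).cumsum(1), 1)
    # theta is the two-dimensional Moebius inversion (finite difference) of log p,
    # so that log p_ij = sum of theta_i'j' over i' <= i and j' <= j
    logp = np.log(p)
    theta = logp.copy()
    theta[1:, :] -= logp[:-1, :]
    theta[:, 1:] -= logp[:, :-1]
    theta[1:, 1:] += logp[:-1, :-1]
    return theta, eta
```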

Figure 2: Matrix balancing with the two parameters $\theta$ and $\eta$.

The key to our approach is that we update $\theta_{ij}$ with $i = 1$ or $j = 1$ by Newton's method at each iteration while fixing $\theta_{ij}$ with $i, j \ne 1$, so that $\eta_{ij}$ satisfies the following condition (Figure 2):

$$ \eta_{i1} = \frac{n - i + 1}{n}, \qquad \eta_{1j} = \frac{n - j + 1}{n}. $$

Note that the rows and columns sum not to $1$ but to $1/n$ due to the normalization. The update formula is described as

$$ \theta^{(t+1)} = \theta^{(t)} - J^{-1} \bigl( \eta^{(t)} - \eta^{*} \bigr), \tag{3} $$

where $\theta^{(t)}$ and $\eta^{(t)}$ collect the entries $\theta_{ij}$ and $\eta_{ij}$ with $i = 1$ or $j = 1$ at step $t$, $\eta^{*}$ collects the target values in the condition above, and $J$ is the Jacobian matrix given as

$$ J_{(ij),(kl)} = \frac{\partial \eta_{ij}}{\partial \theta_{kl}} = \eta_{\max(i,k)\,\max(j,l)} - \eta_{ij}\,\eta_{kl}, \tag{4} $$

which is derived from our theoretical result in Theorem 3. Since $J$ has $O(n)$ rows and columns, the time complexity of each update is $O(n^3)$, which is needed to compute the inverse of $J$.

After updating $\theta^{(t)}$ to $\theta^{(t+1)}$, we can compute $p^{(t+1)}$ and $\eta^{(t+1)}$ by Equation (2). Since this update does not ensure the condition $\sum_{ij} p^{(t+1)}_{ij} = 1$, we again update $\theta^{(t+1)}_{11}$ as

$$ \theta^{(t+1)}_{11} \leftarrow \theta^{(t+1)}_{11} - \log \sum_{ij} p^{(t+1)}_{ij} $$

and recompute $p^{(t+1)}_{ij}$ and $\eta^{(t+1)}_{ij}$ for each $(i, j)$.

By iterating the above update process in Equation (3) until convergence, the matrix $A' = (a'_{ij})$ with $a'_{ij} = n\,p_{ij}$ becomes doubly stochastic.
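The whole procedure can be sketched as follows, reusing the `matrix_eta_theta` helper from the sketch after Equation (2). As a simplification made here, the bottom entry $(1,1)$ is handled by renormalizing $p$ after each step rather than being treated as a Newton unknown, and no damping or step-size control is included, so this illustrates the update in Equations (3) and (4) under those assumptions rather than reproducing the authors' implementation.

```python
import numpy as np
# Reuses matrix_eta_theta from the sketch after Equation (2).

def newton_balance(A, max_iter=100, tol=1e-10):
    """Matrix balancing by Newton's method in the (theta, eta) coordinates of Section 2."""
    n = A.shape[0]
    p = A / A.sum()
    # Balancing conditions in 0-based indices: eta[i,0] = (n-i)/n and eta[0,j] = (n-j)/n.
    border = [(i, 0) for i in range(1, n)] + [(0, j) for j in range(1, n)]
    target = np.array([(n - i) / n for i in range(1, n)] + [(n - j) / n for j in range(1, n)])
    for _ in range(max_iter):
        theta, eta = matrix_eta_theta(p)
        res = np.array([eta[i, j] for i, j in border]) - target
        if np.abs(res).sum() < tol:
            break
        # Equation (4): J_{(ij),(kl)} = eta_{max(i,k), max(j,l)} - eta_ij * eta_kl
        J = np.array([[eta[max(i, k), max(j, l)] - eta[i, j] * eta[k, l]
                       for (k, l) in border] for (i, j) in border])
        delta = np.linalg.solve(J, res)          # Newton step of Equation (3)
        for a, (i, j) in enumerate(border):
            theta[i, j] -= delta[a]
        # Recover log p as the cumulative sum of theta (Equation (2)), then renormalize.
        p = np.exp(theta.cumsum(axis=0).cumsum(axis=1))
        p /= p.sum()
    return n * p                                  # rows and columns now sum to (approximately) one
```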

Figure 3: Results on Hessenberg matrices. The BNEWT algorithm (green) failed to converge for larger $n$.

Figure 4: Convergence graph on a Hessenberg matrix.

3 Numerical Experiments

We evaluate the efficiency of our algorithm compared to two prominent balancing methods, the standard Sinkhorn-Knopp algorithm [Sinkhorn, 1964] and the state-of-the-art algorithm BNEWT [Knight and Ruiz, 2013], which uses Newton's method-like iterations with conjugate gradients. All experiments were conducted on Amazon Linux AMI release 2016.09 with a single core of a 2.3 GHz Intel Xeon CPU E5-2686 v4 and 256 GB of memory. All methods were implemented in C++ with the Eigen library and compiled with gcc 4.8.3. An implementation of the algorithms for matrices and third order tensors is available at: https://github.com/mahito-sugiyama/newton-balancing. We carefully implemented BNEWT by directly translating the MATLAB code provided in [Knight and Ruiz, 2013] into C++ with the Eigen library for a fair comparison, and used the default parameters. We measured the residual of a matrix $A' = (a'_{ij})$ by the squared norm $\lVert A'\mathbf{1} - \mathbf{1} \rVert^2 + \lVert A'^{\mathrm{T}}\mathbf{1} - \mathbf{1} \rVert^2$, where each entry $a'_{ij}$ is obtained as $n\,p_{ij}$ in our algorithm, and ran each of the three algorithms until the residual fell below a fixed tolerance threshold.
Hessenberg Matrix. The first set of experiments used a Hessenberg matrix, which has been a standard benchmark for matrix balancing [Parlett and Landis, 1982, Knight and Ruiz, 2013]. Each entry $a_{ij}$ of an $n \times n$ Hessenberg matrix is given as $a_{ij} = 0$ if $j < i - 1$ and $a_{ij} = 1$ otherwise. We varied the size $n$ and measured the running time (in seconds) and the number of iterations of each method.
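For illustration, the benchmark matrix can be generated and balanced with the sketches from Section 2 as follows; `hessenberg_ones` is a hypothetical helper name and the size 20 is an arbitrary choice, not one of the sizes used in the experiments. Since the Newton sketch above assumes strictly positive entries, this quick check on a zero-containing matrix uses the Sinkhorn-Knopp sketch instead.

```python
import numpy as np

def hessenberg_ones(n):
    """n x n benchmark matrix with a_ij = 1 if j >= i - 1 and a_ij = 0 otherwise."""
    return np.triu(np.ones((n, n)), k=-1)

A = hessenberg_ones(20)
r, s = sinkhorn_knopp(A, max_iter=100000)        # may need many iterations (cf. Figure 3)
B = A * r[:, None] * s[None, :]
print(np.abs(B.sum(axis=0) - 1).max(), np.abs(B.sum(axis=1) - 1).max())
```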

Results are plotted in Figure 3. Our balancing algorithm with Newton's method (plotted in blue in the figures) is clearly the fastest: it is three to five orders of magnitude faster than the standard Sinkhorn-Knopp algorithm (plotted in red). Although the BNEWT algorithm (plotted in green) is competitive when $n$ is small, it suddenly fails to converge for larger $n$, which is consistent with the results in the original paper [Knight and Ruiz, 2013], where no result is reported for such settings on the same matrix. Moreover, our method converges within tens of steps, which is about three and seven orders of magnitude fewer than BNEWT and Sinkhorn-Knopp, respectively, on the largest matrices.

Figure 5: Results on Trefethen matrices. The BNEWT algorithm (green) failed to converge on the larger matrices.

To see the behavior of the rate of convergence in detail, we plot the convergence graph in Figure 4 for one of the Hessenberg matrices, where we observe the slow convergence rate of the Sinkhorn-Knopp algorithm and the unstable convergence of the BNEWT algorithm, in contrast to the rapid convergence of our method.
Trefethen Matrix. Next, we collected a set of Trefethen matrices from a sparse matrix collection website (http://www.cise.ufl.edu/research/sparse/matrices/), which are nonnegative matrices whose diagonal entries are primes. Results are plotted in Figure 5, where we observe the same trend as before: our algorithm is the fastest, about four orders of magnitude faster than the Sinkhorn-Knopp algorithm. Note that the largest matrices in this collection do not have total support, which is a necessary condition for matrix balancing [Knight and Ruiz, 2013], while the BNEWT algorithm also failed to converge on some of the remaining matrices.

4 Theoretical Analysis

In the following, we provide theoretical support to our algorithm by formulating the problem as a projection within a statistical manifold, in which a matrix corresponds to an element, that is, a probability distribution, in the manifold.

We show that a balanced matrix forms a submanifold and matrix balancing is projection of a given distribution onto the submanifold, where the Jacobian matrix in Equation (4) is derived from the gradient of the manifold.

4.1 Formulation

We introduce our log-linear probabilistic model, where the outcome space is a partially ordered set, or a poset [Gierz et al., 2003]. We prepare basic notations and the key mathematical tool for posets, the Möbius inversion formula, followed by formulating the log-linear model.

4.1.1 Möbius Inversion

A poset $(S, \le)$, a set $S$ of elements equipped with a partial order $\le$ on $S$, is a fundamental structured space in computer science. A partial order "$\le$" is a relation between elements in $S$ that satisfies the following three properties: for all $x, y, z \in S$, (1) $x \le x$ (reflexivity), (2) $x \le y$ and $y \le x$ imply $x = y$ (antisymmetry), and (3) $x \le y$ and $y \le z$ imply $x \le z$ (transitivity). In what follows, $S$ is always finite and includes the least element (bottom) $\bot \in S$; that is, $\bot \le x$ for all $x \in S$. We denote $S \setminus \{\bot\}$ by $S^{+}$.

Rota [1964] introduced the Möbius inversion formula on posets by generalizing the inclusion-exclusion principle. Let $\zeta \colon S \times S \to \{0, 1\}$ be the zeta function defined as

$$ \zeta(s, u) = \begin{cases} 1 & \text{if } s \le u, \\ 0 & \text{otherwise}. \end{cases} $$

The Möbius function $\mu \colon S \times S \to \mathbb{Z}$ serves as the inverse of $\zeta$ and is inductively defined for all $s, u$ with $s \le u$ as

$$ \mu(s, u) = \begin{cases} 1 & \text{if } s = u, \\ -\sum_{s \le x < u} \mu(s, x) & \text{if } s < u, \\ 0 & \text{otherwise}. \end{cases} $$

From the definition, it follows that

$$ \sum_{x \in S} \zeta(s, x)\,\mu(x, u) = \sum_{s \le x \le u} \mu(x, u) = \delta_{su}, \qquad \sum_{x \in S} \mu(s, x)\,\zeta(x, u) = \sum_{s \le x \le u} \mu(s, x) = \delta_{su} \tag{5} $$

with the Kronecker delta $\delta$ such that $\delta_{su} = 1$ if $s = u$ and $\delta_{su} = 0$ otherwise. Then for any functions $f$, $g$, and $h$ with the domain $S$ such that

$$ g(s) = \sum_{u \in S} \zeta(s, u)\,f(u) = \sum_{u \ge s} f(u), \qquad h(s) = \sum_{u \in S} \zeta(u, s)\,f(u) = \sum_{u \le s} f(u), $$

$f$ is uniquely recovered with the Möbius function:

$$ f(s) = \sum_{u \in S} \mu(s, u)\,g(u), \qquad f(s) = \sum_{u \in S} \mu(u, s)\,h(u). $$

This is called the Möbius inversion formula and is at the heart of enumerative combinatorics [Ito, 1993].
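As a quick illustration (not from the paper), the following sketch builds the zeta and Möbius functions on the power set of $\{1, 2, 3\}$ ordered by inclusion and numerically checks the identity in Equation (5).

```python
import numpy as np
from itertools import combinations

# Power set of {1, 2, 3} ordered by inclusion; the bottom element is the empty set.
V = (1, 2, 3)
S = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def zeta(s, u):
    return 1 if s <= u else 0            # for frozensets, s <= u means s is a subset of u

def moebius(s, u):
    if s == u:
        return 1
    if not s < u:
        return 0
    return -sum(moebius(s, x) for x in S if s <= x and x < u)

Z = np.array([[zeta(s, u) for u in S] for s in S])
M = np.array([[moebius(s, u) for u in S] for s in S])
print(np.array_equal(Z @ M, np.eye(len(S), dtype=int)))   # Equation (5): prints True
```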

4.1.2 Log-Linear Model on Posets

We consider a probability vector $p$ on $S$ that gives a discrete probability distribution with the outcome space $S$. A probability vector is treated as a mapping $p \colon S \to (0, 1)$ such that $\sum_{x \in S} p(x) = 1$, where every entry $p(x)$ is assumed to be strictly larger than zero.

Using the zeta and the Möbius functions, let us introduce two mappings $\theta \colon S \to \mathbb{R}$ and $\eta \colon S \to \mathbb{R}$ as

$$ \log p(s) = \sum_{u \in S} \zeta(u, s)\,\theta(u) = \sum_{u \le s} \theta(u), \tag{6} $$
$$ \eta(s) = \sum_{u \in S} \zeta(s, u)\,p(u) = \sum_{u \ge s} p(u). \tag{7} $$

From the Möbius inversion formula, we have

$$ \theta(s) = \sum_{u \in S} \mu(u, s) \log p(u), \tag{8} $$
$$ p(s) = \sum_{u \in S} \mu(s, u)\,\eta(u). \tag{9} $$

They are a generalization of the log-linear model [Agresti, 2012] that gives the probability $p(x)$ of an $n$-dimensional binary vector $x = (x_1, \dots, x_n) \in \{0, 1\}^n$ as

$$ \log p(x) = \sum_{i} \theta_i x_i + \sum_{i < j} \theta_{ij} x_i x_j + \dots + \theta_{1 \dots n} x_1 x_2 \cdots x_n - \psi, $$

where $\theta = (\theta_1, \dots, \theta_{12}, \dots, \theta_{1 \dots n})$ is a parameter vector, $\psi$ is a normalizer, and $\eta = (\eta_1, \dots, \eta_{12}, \dots, \eta_{1 \dots n})$ represents the expectations of variable combinations such that

$$ \eta_i = \mathbb{E}[x_i], \qquad \eta_{ij} = \mathbb{E}[x_i x_j], \qquad \dots, \qquad \eta_{1 \dots n} = \mathbb{E}[x_1 x_2 \cdots x_n]. $$

They coincide with Equations (8) and (7) when we let $S = 2^V$ with $V = \{1, 2, \dots, n\}$, each $s \in S$ be the set of indices of "1" of $x$, and the order $\le$ be the inclusion relationship, that is, $s \le u$ if and only if $s \subseteq u$. Nakahara et al. [2006] have pointed out that $\theta$ can be computed from $p$ using the inclusion-exclusion principle in the log-linear model. We exploit this combinatorial property of the log-linear model using the Möbius inversion formula on posets and extend the log-linear model from the power set $2^V$ to any kind of poset $(S, \le)$. Sugiyama et al. [2016] studied a relevant log-linear model, but the relationship with the Möbius inversion formula has not been analyzed yet.
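Continuing the power-set sketch from Section 4.1.1 (again an illustration under our own setup rather than code from the paper), the following computes $\theta$ and $\eta$ from a random positive distribution via Equations (7) and (8) and verifies that Equations (6) and (9) recover $\log p$ and $p$.

```python
import numpy as np
# Reuses S and moebius from the sketch in Section 4.1.1 (power set of {1, 2, 3}).

rng = np.random.default_rng(0)
p = rng.random(len(S))
p /= p.sum()                                   # strictly positive distribution on S
logp = np.log(p)

# Equation (8): theta(s) = sum_u mu(u, s) log p(u)
theta = np.array([sum(moebius(u, s) * logp[i] for i, u in enumerate(S)) for s in S])
# Equation (7): eta(s) = sum of p(u) over u >= s
eta = np.array([sum(p[i] for i, u in enumerate(S) if s <= u) for s in S])

# Moebius inversion: Equations (6) and (9) recover log p and p, respectively.
logp_rec = np.array([sum(theta[i] for i, u in enumerate(S) if u <= s) for s in S])
p_rec = np.array([sum(moebius(s, u) * eta[i] for i, u in enumerate(S)) for s in S])
print(np.allclose(logp_rec, logp), np.allclose(p_rec, p))   # prints: True True
```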

4.2 Dually Flat Riemannian Manifold

We theoretically analyze our log-linear model introduced in Equations (6), (7) and show that they form dual coordinate systems on a dually flat manifold, which has been mainly studied in the area of information geometry [Amari, 2001, Nakahara and Amari, 2002, Amari, 2014, 2016]. Moreover, we show that the Riemannian metric and connection of our model can be analytically computed in closed forms.

In the following, we denote by $\xi$ the function $\theta$ or $\eta$ and by $\nabla$ the gradient operator with respect to $S^{+} = S \setminus \{\bot\}$, i.e., $(\nabla f(\xi))(s) = \partial f(\xi) / \partial \xi(s)$ for $s \in S^{+}$, and denote by $\mathcal{S}$ the set of probability distributions specified by probability vectors, which forms a statistical manifold. We use uppercase letters $P, Q, R, \dots$ for points (distributions) in $\mathcal{S}$ and their lowercase letters $p, q, r, \dots$ for the corresponding probability vectors treated as mappings. We write $\theta_P$ and $\eta_P$ if they are connected with $P$ by Equations (6) and (7), respectively, and abbreviate subscripts if there is no ambiguity.

4.2.1 Dually Flat Structure

We show that $\mathcal{S}$ has the dually flat Riemannian structure induced by the two functions $\theta$ and $\eta$ in Equations (6) and (7). We define $\psi(\theta)$ as

$$ \psi(\theta) = -\theta(\bot) = -\log p(\bot), \tag{10} $$

which corresponds to the normalizer of $p$. It is a convex function since we have

$$ \psi(\theta) = \log \sum_{x \in S} \exp\Bigl( \sum_{u \in S^{+}} \zeta(u, x)\,\theta(u) \Bigr) $$

from Equation (6) and $\sum_{x \in S} p(x) = 1$. We apply the Legendre transformation to $\psi(\theta)$ given as

$$ \varphi(\eta) = \max_{\theta'} \bigl( \theta' \cdot \eta - \psi(\theta') \bigr), \qquad \theta' \cdot \eta = \sum_{s \in S^{+}} \theta'(s)\,\eta(s). \tag{11} $$

Then $\varphi(\eta)$ coincides with the negative entropy.

Theorem 1 (Legendre dual).

$$ \varphi(\eta) = \sum_{s \in S} p(s) \log p(s), $$

where $p$ is the probability vector corresponding to $\eta$ by Equation (7).

Proof.

Let $Q$ be the distribution whose natural parameters are given by $\theta'$ in Equation (11) and $P$ the distribution corresponding to the fixed $\eta$. From Equation (5) together with Equations (6) and (7), we have

$$ \theta' \cdot \eta - \psi(\theta') = \sum_{s \in S^{+}} \theta'(s)\,\eta(s) + \theta'(\bot) = \sum_{x \in S} p(x) \sum_{s \in S} \zeta(s, x)\,\theta'(s) = \sum_{x \in S} p(x) \log q(x). $$

Thus it holds that

$$ \theta' \cdot \eta - \psi(\theta') = \sum_{x \in S} p(x) \log p(x) - \sum_{x \in S} p(x) \log \frac{p(x)}{q(x)}. \tag{12} $$

Hence it is maximized with $q = p$, since the second term is nonnegative and vanishes if and only if $q = p$. ∎

Since $\psi(\theta)$ and $\varphi(\eta)$ are connected with each other by the Legendre transformation, they form a dual coordinate system $\nabla\psi(\theta)$ and $\nabla\varphi(\eta)$ of $\mathcal{S}$ [Amari, 2016, Section 1.5], which coincides with $\eta$ and $\theta$ as follows.

Theorem 2 (dual coordinate system).

$$ \nabla\psi(\theta) = \eta \quad \text{and} \quad \nabla\varphi(\eta) = \theta. \tag{13} $$

Proof.

They can be directly derived from our definitions (Equations (6) and (11)) as

$$ \frac{\partial \psi(\theta)}{\partial \theta(s)} = \sum_{x \in S} \zeta(s, x)\,p(x) = \eta(s), \qquad \frac{\partial \varphi(\eta)}{\partial \eta(s)} = \sum_{x \in S} \mu(x, s)\bigl( \log p(x) + 1 \bigr) = \theta(s), $$

where the second equality uses $\sum_{x \in S} \mu(x, s) = \sum_{x \in S} \zeta(\bot, x)\,\mu(x, s) = \delta_{\bot s} = 0$ for $s \in S^{+}$.

Moreover, we can confirm the orthogonality of $\theta$ and $\eta$ as

$$ \mathbb{E}\biggl[ \frac{\partial \log p(x)}{\partial \theta(s)}\,\frac{\partial \log p(x)}{\partial \eta(u)} \biggr] = \sum_{x \in S} p(x)\bigl( \zeta(s, x) - \eta(s) \bigr)\,\frac{\mu(x, u)}{p(x)} = \sum_{x \in S} \zeta(s, x)\,\mu(x, u) - \eta(s)\sum_{x \in S} \mu(x, u) = \delta_{su}. $$

The last equation holds from Equation (5); hence the Möbius inversion formula directly leads to the orthogonality. ∎

The Bregman divergence is known to be the canonical divergence [Amari, 2016, Section 6.6] to measure the difference between two distributions $P$ and $Q$ on a dually flat manifold, which is defined as

$$ D[P, Q] = \psi(\theta_P) + \varphi(\eta_Q) - \theta_P \cdot \eta_Q. $$

In our case, since we have $\varphi(\eta_Q) = \sum_{s \in S} q(s) \log q(s)$ and $\theta_P \cdot \eta_Q - \psi(\theta_P) = \sum_{s \in S} q(s) \log p(s)$ from Theorem 1 and Equation (12), it is given as

$$ D[P, Q] = \sum_{s \in S} q(s) \log \frac{q(s)}{p(s)}, $$

which coincides with the Kullback–Leibler divergence (KL divergence) from $Q$ to $P$: $D[P, Q] = D_{\mathrm{KL}}(Q, P)$.

4.2.2 Riemannian Structure

Next we analyze the Riemannian structure on $\mathcal{S}$ and show that the Möbius inversion formula enables us to compute the Riemannian metric of $\mathcal{S}$ in a closed form.

Theorem 3 (Riemannian metric).

The manifold $(\mathcal{S}, g(\xi))$ is a Riemannian manifold with the Riemannian metric $g(\xi)$ such that for all $s, u \in S^{+}$,

$$ g_{su}(\xi) = \begin{cases} \displaystyle\sum_{x \in S} \zeta(s, x)\,\zeta(u, x)\,p(x) - \eta(s)\,\eta(u) & \text{if } \xi = \theta, \\[8pt] \displaystyle\sum_{x \in S} \mu(x, s)\,\mu(x, u)\,p(x)^{-1} & \text{if } \xi = \eta. \end{cases} $$

Proof.

Since the Riemannian metric is defined as

$$ g(\theta) = \nabla\nabla\psi(\theta), \qquad g(\eta) = \nabla\nabla\varphi(\eta), $$

when $\xi = \theta$ we have

$$ g_{su}(\theta) = \frac{\partial}{\partial \theta(u)}\,\eta(s) = \frac{\partial}{\partial \theta(u)} \sum_{x \in S} \zeta(s, x)\,p(x) = \sum_{x \in S} \zeta(s, x)\,\zeta(u, x)\,p(x) - \eta(s)\,\eta(u), $$

using $\partial p(x) / \partial \theta(u) = p(x)\bigl( \zeta(u, x) - \eta(u) \bigr)$. When $\xi = \eta$, it follows that

$$ g_{su}(\eta) = \frac{\partial}{\partial \eta(u)}\,\theta(s) = \frac{\partial}{\partial \eta(u)} \sum_{x \in S} \mu(x, s) \log p(x) = \sum_{x \in S} \mu(x, s)\,\mu(x, u)\,p(x)^{-1}. $$

Since $g(\theta)$ coincides with the Fisher information matrix, it is positive definite and the metric is well defined. ∎
Then the Riemannian (Levi–Civita) connection $\Gamma(\xi)$ with respect to $\xi$, which is defined as

$$ \Gamma_{stu}(\xi) = \frac{1}{2}\biggl( \frac{\partial g_{tu}(\xi)}{\partial \xi(s)} + \frac{\partial g_{su}(\xi)}{\partial \xi(t)} - \frac{\partial g_{st}(\xi)}{\partial \xi(u)} \biggr) $$

for all $s, t, u \in S^{+}$, can be analytically obtained.

Theorem 4 (Riemannian connection).

The Riemannian connection $\Gamma(\theta)$ on the manifold $(\mathcal{S}, g(\theta))$ is given in the following for all $s, t, u \in S^{+}$,

$$ \Gamma_{stu}(\theta) = \frac{1}{2}\Bigl( \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,\zeta(u, x)\,p(x) - \eta(s)\,g_{tu}(\theta) - \eta(t)\,g_{su}(\theta) - \eta(u)\,g_{st}(\theta) - \eta(s)\,\eta(t)\,\eta(u) \Bigr). $$

Proof.

We have for all $s, t, u \in S^{+}$,

$$ \frac{\partial g_{st}(\theta)}{\partial \theta(u)} = \frac{\partial}{\partial \theta(u)} \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,p(x) - \frac{\partial}{\partial \theta(u)}\,\eta(s)\,\eta(t), $$

where

$$ \frac{\partial p(x)}{\partial \theta(u)} = p(x)\bigl( \zeta(u, x) - \eta(u) \bigr) $$

and

$$ \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,p(x) = g_{st}(\theta) + \eta(s)\,\eta(t). $$

It follows that

$$ \frac{\partial}{\partial \theta(u)} \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,p(x) = \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,\zeta(u, x)\,p(x) - \eta(u)\bigl( g_{st}(\theta) + \eta(s)\,\eta(t) \bigr). $$

On the other hand,

$$ \frac{\partial}{\partial \theta(u)}\,\eta(s)\,\eta(t) = g_{su}(\theta)\,\eta(t) + g_{tu}(\theta)\,\eta(s). $$

Therefore, $\partial g_{st}(\theta) / \partial \theta(u)$ is symmetric in $s$, $t$, and $u$, and from the definition of $\Gamma_{stu}(\theta)$ it follows that

$$ \Gamma_{stu}(\theta) = \frac{1}{2}\,\frac{\partial g_{st}(\theta)}{\partial \theta(u)} = \frac{1}{2}\Bigl( \sum_{x \in S} \zeta(s, x)\,\zeta(t, x)\,\zeta(u, x)\,p(x) - \eta(s)\,g_{tu}(\theta) - \eta(t)\,g_{su}(\theta) - \eta(u)\,g_{st}(\theta) - \eta(s)\,\eta(t)\,\eta(u) \Bigr). \;\; ∎ $$

4.3 The Projection Algorithm

Projection of a distribution onto a submanifold is essential; several machine learning algorithms are known to be formulated as projection of a distribution empirically estimated from data onto a submanifold that is specified by the target model [Amari, 2016]. Here we define projection of distributions on posets and show that Newton's method can be applied to perform projection as the Jacobian matrix can be analytically computed.

4.3.1 Definition

Let $\mathcal{S}(\beta)$ be a submanifold of $\mathcal{S}$ such that

$$ \mathcal{S}(\beta) = \bigl\{ P \in \mathcal{S} \mid \theta_P(s) = \beta(s) \text{ for all } s \in \operatorname{dom}(\beta) \bigr\}, \tag{14} $$

specified by a function $\beta$ with $\operatorname{dom}(\beta) \subseteq S^{+}$. Projection of $P$ onto $\mathcal{S}(\beta)$, called $m$-projection, which is defined as the distribution $P_\beta \in \mathcal{S}(\beta)$ such that

$$ \theta_{P_\beta}(s) = \beta(s) \;\; \text{if } s \in \operatorname{dom}(\beta), \qquad \eta_{P_\beta}(s) = \eta_P(s) \;\; \text{if } s \in S^{+} \setminus \operatorname{dom}(\beta), $$

is the minimizer of the KL divergence from $P$ to $\mathcal{S}(\beta)$:

$$ P_\beta = \mathop{\mathrm{argmin}}_{Q \in \mathcal{S}(\beta)} D_{\mathrm{KL}}(P, Q). $$

The dually flat structure with the coordinate systems $\theta$ and $\eta$ guarantees that the projected distribution $P_\beta$ always exists and is unique [Amari, 2009, Theorem 3]. Moreover, the Pythagorean theorem holds in the dually flat manifold, that is, for any $Q \in \mathcal{S}(\beta)$ we have

$$ D_{\mathrm{KL}}(P, Q) = D_{\mathrm{KL}}(P, P_\beta) + D_{\mathrm{KL}}(P_\beta, Q). $$

We can switch $\theta$ and $\eta$ in the submanifold by changing Equation (14) to $\mathcal{S}(\beta) = \{ P \in \mathcal{S} \mid \eta_P(s) = \beta(s) \text{ for all } s \in \operatorname{dom}(\beta) \}$, where the projected distribution $P_\beta$ of $P$ is given as

$$ \eta_{P_\beta}(s) = \beta(s) \;\; \text{if } s \in \operatorname{dom}(\beta), \qquad \theta_{P_\beta}(s) = \theta_P(s) \;\; \text{if } s \in S^{+} \setminus \operatorname{dom}(\beta). $$

This projection is called $e$-projection.

Example 1 (Boltzmann machine).

Suppose that a Boltzmann machine is represented as an undirected graph $G = (V, E)$ with a vertex set $V = \{1, 2, \dots, n\}$ and an edge set $E \subseteq \{\{i, j\} \mid i, j \in V\}$. The set of probability distributions that can be modeled by the Boltzmann machine coincides with the submanifold

$$ \mathcal{S}(\beta) = \bigl\{ P \in \mathcal{S} \mid \theta_P(s) = 0 \text{ for every } s \in S^{+} \text{ that is neither a vertex nor an edge of } G \bigr\} $$

with the outcome space $S = 2^V$, where each vertex $i$ is identified with the singleton $\{i\}$. Let $\hat{P}$ be an empirical distribution estimated from a given dataset. The learned model is the $m$-projection of the empirical distribution $\hat{P}$ onto $\mathcal{S}(\beta)$, where the resulting distribution $P_\beta$ is given as

$$ \theta_{P_\beta}(s) = 0 \;\; \text{if } s \notin V \cup E, \qquad \eta_{P_\beta}(s) = \eta_{\hat{P}}(s) \;\; \text{if } s \in V \cup E. $$

4.3.2 Computation

Here we show how to compute the projection of a given probability distribution. We show that Newton's method can be used to efficiently compute the projected distribution $P_\beta$ by iteratively updating a sequence $P^{(0)}, P^{(1)}, P^{(2)}, \dots$, initialized as $P^{(0)} = P$, until it converges to $P_\beta$.

Let us start with the $m$-projection, initializing $P^{(0)} = P$. In each iteration $t$, we update $\eta_{P^{(t)}}(s)$ for all $s \in \operatorname{dom}(\beta)$ while fixing $\eta_{P^{(t)}}(s)$ for all $s \in S^{+} \setminus \operatorname{dom}(\beta)$, which is possible from the orthogonality of $\theta$ and $\eta$. Using Newton's method, $\eta_{P^{(t+1)}}$ should satisfy

$$ \sum_{u \in \operatorname{dom}(\beta)} J_{su}\bigl( \eta_{P^{(t+1)}}(u) - \eta_{P^{(t)}}(u) \bigr) = -\bigl( \theta_{P^{(t)}}(s) - \beta(s) \bigr) $$

for every $s \in \operatorname{dom}(\beta)$, where $J_{su}$ is an entry of the Jacobian matrix and given as

$$ J_{su} = \frac{\partial \theta_{P^{(t)}}(s)}{\partial \eta_{P^{(t)}}(u)} = \sum_{x \in S} \mu(x, s)\,\mu(x, u)\,p^{(t)}(x)^{-1} $$

from Theorem 3. Therefore, we have the update formula for all $s \in \operatorname{dom}(\beta)$ as

$$ \eta_{P^{(t+1)}}(s) = \eta_{P^{(t)}}(s) - \sum_{u \in \operatorname{dom}(\beta)} J^{-1}_{su}\bigl( \theta_{P^{(t)}}(u) - \beta(u) \bigr). $$

In $e$-projection, we update $\theta_{P^{(t)}}(s)$ for all $s \in \operatorname{dom}(\beta)$ while fixing $\theta_{P^{(t)}}(s)$ for all $s \in S^{+} \setminus \operatorname{dom}(\beta)$. To ensure $\sum_{x \in S} p^{(t)}(x) = 1$, we also treat the bottom element $\bot$ as a parameter to be updated and renormalize at each step as

$$ \theta_{P^{(t+1)}}(\bot) = -\log \sum_{x \in S} \exp\Bigl( \sum_{u \in S^{+}} \zeta(u, x)\,\theta_{P^{(t+1)}}(u) \Bigr). $$

In this case, we also need to update $\eta_{P^{(t)}}(s)$ for $s \in S^{+} \setminus \operatorname{dom}(\beta)$, as it is not guaranteed to be fixed. The Newton update for every $s \in \operatorname{dom}(\beta)$ becomes

$$ \theta_{P^{(t+1)}}(s) = \theta_{P^{(t)}}(s) - \sum_{u \in \operatorname{dom}(\beta)} J^{-1}_{su}\bigl( \eta_{P^{(t)}}(u) - \beta(u) \bigr), $$

where the Jacobian entries are obtained from Theorem 3 as

$$ J_{su} = \frac{\partial \eta_{P^{(t)}}(s)}{\partial \theta_{P^{(t)}}(u)} = \sum_{x \in S} \zeta(s, x)\,\zeta(u, x)\,p^{(t)}(x) - \eta_{P^{(t)}}(s)\,\eta_{P^{(t)}}(u), $$

which yields Equation (4) in the matrix balancing case.

The time complexity of each iteration is $O(|\operatorname{dom}(\beta)|^3)$, which is required to compute the inverse of the Jacobian matrix.
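Combining the pieces above, a compact sketch of the $e$-projection by Newton's method on the power-set poset is given below. It reuses `S` and `zeta` from the sketches in Section 4.1 and, as in the matrix sketch of Section 2, handles the normalization by simply renormalizing $p$ at each step; the function name `e_project` and the example target values are ours.

```python
import numpy as np
# Reuses S and zeta from the sketches in Section 4.1 (power set of {1, 2, 3}).

def e_project(p, dom, beta, max_iter=50, tol=1e-12):
    """e-projection of p onto {Q : eta_Q(s) = beta(s) for s in dom}; dom must not contain the bottom."""
    idx = [S.index(s) for s in dom]
    Z = np.array([[zeta(s, u) for u in S] for s in S])       # zeta matrix, upper triangular
    for _ in range(max_iter):
        eta = Z @ p                                          # Equation (7)
        res = eta[idx] - beta
        if np.abs(res).sum() < tol:
            break
        # Jacobian from Theorem 3: g(s,u) = sum_x zeta(s,x) zeta(u,x) p(x) - eta(s) eta(u)
        G = (Z * p) @ Z.T - np.outer(eta, eta)
        delta = np.linalg.solve(G[np.ix_(idx, idx)], res)
        theta = np.linalg.solve(Z.T, np.log(p))              # Equation (8) via the zeta matrix
        theta[idx] -= delta                                  # Newton step on theta over dom(beta)
        p = np.exp(Z.T @ theta)                              # Equation (6)
        p /= p.sum()                                         # renormalize via the bottom element
    return p

# Example: force the marginals eta({1}), eta({2}), eta({3}) to 0.5 (p is from the earlier sketch).
q = e_project(p, [frozenset({1}), frozenset({2}), frozenset({3})], np.array([0.5, 0.5, 0.5]))
```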

Global convergence of the projection algorithm is always guaranteed by the convexity of the submanifold $\mathcal{S}(\beta)$ defined in Equation (14). Since $\mathcal{S}(\beta)$ is always convex with respect to the $\theta$- and $\eta$-coordinates, it is straightforward to see that our projection is an instance of the Bregman algorithm onto a convex region, which is well known to always converge to the global solution [Censor and Lent, 1981].

5 Balancing Matrices and Tensors

Now we are ready to solve the problem of matrix and tensor balancing as projection on a dually flat manifold.

5.1 Matrix Balancing

Recall that the task of matrix balancing is to find $r, s \in \mathbb{R}^n$ that satisfy $(RAS)\mathbf{1} = \mathbf{1}$ and $(RAS)^{\mathrm{T}}\mathbf{1} = \mathbf{1}$ with $R = \operatorname{diag}(r)$ and