Deep Compression of Sum-Product Networks on Tensor Networks

11/09/2018 ∙ by Ching-Yun Ko, et al. ∙ University of Southern California The University of Hong Kong Delft University of Technology 0

Sum-product networks (SPNs) represent an emerging class of neural networks with clear probabilistic semantics and superior inference speed over graphical models. This work reveals a strikingly intimate connection between SPNs and tensor networks, thus leading to a highly efficient representation that we call tensor SPNs (tSPNs). For the first time, through mapping an SPN onto a tSPN and employing novel optimization techniques, we demonstrate remarkable parameter compression with negligible loss in accuracy.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Since the inception of sum-product networks (SPNs) [1], a multitude of works have emerged with respect to their structure and weight learning, e.g., [2, 3, 4], as well as their application in image completion, speech modeling, semantic mapping and robotics, e.g., [5], just to name a few. An SPN exhibits a clear semantics of mixtures (sum nodes) and features (product nodes).

In short, given a high-dimensional dataset (

), an SPN learns and encodes a probability distribution over the data and implicit latent (hidden) variables. Compared to other probabilistic graphical models like Bayesian and Markov networks with #P or NP-hard computation, an SPN enjoys a tractable exact inference cost, and its learning is relatively simple and fast.

On the other hand, there has been an exploding number of works on tensors (a multilinear operator rooted in physics) [6] including their connection and utilization in various engineering fields such as signal processing [7]

, and lately also in neural networks and machine learning 

[8, 9, 10]

. The power of tensors lies in their ability to lift the curse of dimensionality, reducing computational and storage complexity from exponential to linear cost.

In this work, we establish a natural tensor representation of an SPN which we call a tensor SPN (tSPN). By utilizing the tensor notion, we essentially expand the set of tractable SPNs to match that of compact and deep SPNs. We focus on structure compression rather than the learning of SPNs. Indeed, our work leverages on the wealth of existing SPN learning algorithms such as [11, 2, 3], and attempts to turn their inherently wide SPN tree outputs (owing to the intrinsic way of learning through partitioning the data matrix) into a “deep” tree via tensor network techniques subject to a non-negativity constraint. This is in-line with the recent trend on pruning neural networks (NNs) [12], but operates quite differently in the regime of tensor networks with the imposition of a non-negative constraint not needed in the NN counterparts.

In particular, sparsifying an SPN and transforming it into a tSPN brings about the following advantages:

  • It automatically enables sharing of weights through the tensor network cores, often resulting in significant parameter compression.

  • Fewer parameters can facilitate faster computation, lower power consumption, and less stringent cooling requirements in hardware implementations.

  • The above is crucial in downstream edge computing, which translates into better user privacy, lower reliance on the cloud and lower wireless bandwidth for over-the-air network update on edge devices.

To our knowledge, this paper is the first to systematically transform an SPN into a tensor format, specifically, a tensor network counterpart named tSPN. The following sections cover SPN and tensor basics, and then introduce the tSPN conversion. Numerical experiments then demonstrate the remarkable compression in the number of network parameters, with negligible loss in the probabilistic modeling accuracy.

Ii Preliminaries

Ii-a SPN basics

Fig. 1: (a) An example SPN with boolean variables and (b) its equivalent leaf-node representation where, e.g., the lower left ball represents the Bernoulli function . The bold edges in (a) denote an induced tree example (see Definition 3).
Fig. 2: The scopes, denoted by , for every node in the example SPN.

We use a slightly modified SPN example from [1, 13], shown in Fig. 1(a), to motivate some important concepts and operations of SPNs. Boolean variables are chosen for the ease of illustration, but their generalization to multi-nominal or continuous variables are straightforward [1]

. To begin with, an SPN is a directed acyclic graph with alternating layers of sum and product (internal) nodes and a root node on top. The edges emanating downwards from sum nodes have non-negative weights, while the edges emanating downwards from product nodes are all of unit weight. The leaves contain the set of random variables

. For boolean variables, the indicator functions and are 1 when and are 1, respectively, and 0 otherwise. We adopt the abbreviations and for and , respectively.

Definition 1

The scope of an SPN is the set of variables appearing in its leaves. The scope of an internal sum or product node is the scope of the corresponding sub-SPN rooted at that node, as illustrated in Fig. 2.

Definition 2

An SPN is called complete when all children of a sum node have identical scope. It is called decomposable when all children of a product node have disjoint scopes. An SPN is called valid when all its sum nodes are complete and all its product nodes are decomposable. The sum nodes have the semantics of a mixture of components, while product nodes represent features. An SPN is called a normalized SPN when the edges emanating from a sum node have a total weight of one, and is an unnormalized SPN otherwise.

Consequently, the SPN in Fig. 1(a) is a valid and normalized SPN. We use to denote the value of an SPN where

is a vector containing all (non-negative) weights in the network, and

is the vector of all random variables. A distribution is called tractable if any marginal probability can be computed in linear time.

Definition 3

An induced tree [3] is a sub-tree of an SPN originating from the root following two rules: i) only one edge out of a sum node is selected at a time; ii) all edges out of a product node are selected. It can be readily checked that the total number of induced trees arising from an SPN can be computed by , i.e., by setting and where is the all-ones vector.

For instance, the bolded edges in Fig. 1(a) denotes an induced tree by selecting the left route out of each sum node. The notion of a network polynomial [14] comes in handy at this point.

Definition 4


be the probability mass function of a set of discrete random variables

. The network polynomial of is the multilinear polynomial , where is the product of evidence indicators that has a value of 1 in the state .

Any joint probability function of -valued discrete random variables is represented by probabilities. The corresponding network polynomial has therefore terms. For example, the joint probability function of the SPN in Fig. 1 has a network polynomial that consists of terms:


Therefore an equivalence is drawn between an SPN and its network polynomial representation, which in turn encodes a probability function.

The beauty of an SPN lies in its exact and tractable inference. Equation (II-A) is an instance of a normalized SPN. For an unnormalized SPN, there are two ways to build its probability function. One is to scale the edge weights out of each sum node such that they add up to one, i.e., turning it back into a normalized SPN. Alternatively, we can compute the partition function in one bottom-up pass by setting :

such that is a probability function.

Example 1

Assuming a normalized SPN, the probability of a fully specified state (also called a complete evidence) , e.g., , , in Fig. 1(a), is easily computed through a bottom-up pass by setting and for and and .

Example 2

Assuming a normalized SPN, the probability of an evidence, e.g., in Fig. 1(a), can be computed by marginalizing over and . This is computed through a bottom-up pass by setting and , and and for .

The above two examples are readily verified by comparing with (II-A). Similar operations allow us to compute the conditional probability and most probable explanation (MPE) by replacing sum nodes with max nodes, called a max-product network, in a similarly efficient manner [15].

To articulate with tensor-based reformulation of the SPN, we will need to slightly modify the induced tree (Definition 3) by terminating at the leaf nodes, i.e., the bottom nodes of the SPN in Fig. 1(b) instead of down to individual or . With reference to Fig. 1(b), the SPN contains two such induced trees corresponding to the left and right branches originating from the root sum node.

Fig. 3: The learnSPN operations: (a) slicing (b) chopping.

In fact, prevailing SPN learning algorithms or alike, e.g, LearnSPN, SPN-B and SPN-BT [11, 2], all produce SPN trees terminating at leaf nodes. Although SPN illustrations often utilize networks with shared weights (e.g., the two top branches in Fig. 1 are shared among many induced trees), conventional learning algorithms are all based on the “slice” and “chop” operations on the dataset matrix [4], or their variants with additional regularization constraints. A toy example illustrates the basic learnSPN flow. Referring to Fig. 3

(a), the slicing operation constructs children of a sum node by clustering similar sample instances. This is often done via k-means or expectation-maximization (EM) for Gaussian mixture models (GMMs). In Fig. 


(b), the chopping operation constructs children of a product node by grouping dependent variables. This is often done by the G-test or mutual information methods wherein a scoring formula is used to determine variables belonging to the same group, if any. These so-called hierarchical divisive clustering steps are surprisingly simple and effective, but they never look back and this leads to inherently wide SPN tree structures. For example, in the standard NLTCS benchmark, learnSPN (with default hyperparameters) generates an SPN with 19 layers and 1420 leaf nodes, even though there are only 16 variables. This demonstrates that existing learning algorithms do not readily produce shared edges (and weights) across different induced trees, and do not generate SPNs that can otherwise be represented compactly.

Ii-B Tensor basics

In this article, tensors are high-dimensional arrays that generalize vectors and matrices to higher orders. A -way or -order tensor is an array where each entry is indexed by indices . We use the convention for . When the tensor is called cubical. MATLAB notation is used to denote entries of tensors. Boldface capital calligraphic letters are used to denote tensors, boldface capital letters denote matrices, boldface letters denote vectors, and Roman letters denote scalars. A set of tensors, like that of a tensor train (TT), is denoted as . The notion of a rank-1 matrix is generalized to tensors in the following manner.

Definition 5

[6, p. 460] For a given set of vectors , the corresponding rank-1 tensor is defined such that

Alternatively, this is written as

where denotes the outer product.

A rank- tensor is then defined as the sum of rank-1 tensors. The generalization of the matrix-vector multiplication to tensors involves a multiplication of a vector with a -way tensor along one of its modes.

Definition 6

[6, p. 458]) The -mode product of a tensor  with a vector is denoted and defined by

where .

The proposed method also requires the knowledge of the matrix Khatri-Rao product as defined below.

Definition 7

If and , then their Khatri-Rao product is the matrix

where denotes the standard Kronecker product.

The storage of a -way tensor with dimensions requires elements. Tensor decompositions are crucial in reducing the exponential storage requirement of a given tensor. In this article we will make use of the tensor train (TT) decomposition [16], which is a particular kind of tensor network.

Definition 8

[16, p. 2296] A tensor train (TT) representation of a tensor is a set of 3-way tensors such that can be computed from

The dimensions are called the TT-ranks, and the three-way tensors are called the TT-cores.

When , the TT is said to have periodic boundary conditions and is called a tensor loop or tensor ring. In this article, we will always set , i.e., a standard TT structure. The key idea now is to represent the network polynomial of any joint probability function by a low-rank TT. In this way, all (if ) probabilities can be computed from numbers, where is the maximal TT-rank. Without loss of generality, we consider only boolean variables from now on.

Definition 9

For a given network polynomial of binary random variables we define the corresponding TT consisting of 3-way tensors such that the evaluation of for a given state can be computed from

Note that the and factors are a column and row vector, respectively. The other factors are matrices, such that the whole product results in the scalar . Small TT-ranks for then imply small matrix factors, and a massive reduction in the required number of parameters can be achieved.

Iii Putting an SPN on a tSPN

Fig. 4: The tSPN equivalence of the SPN in Fig. 1(b), where the two terms correspond to the two induced trees.
Fig. 5: The normalized tSPN corresponding to a normalized SPN, wherein the shaded parts within a core have entries summing up to unity. The vertical, cross, and horizontal lines in the lower tensor diagram denote left normalized, mixed and right normalized cores.

The key motivation of this work comes from the important observation that an SPN induced tree terminated at leaf nodes is in fact a rank-1 tensor. Using Fig. 1(b) again as an example, there are two such induced trees that can be regarded as the addition of two rank-1 tensor terms with mode products onto their th mode, as shown in Fig. 4. Consequently, summing all the rank-1 terms (gray shaded terms in Fig. 4) produces a -way cubical tensor of dimension . This tensor can then hopefully be sufficiently approximated by a low-rank TT as a particular kind of tSPN. We aim at building a tSPN based on the TT structure, depicted in Fig. 5, that satisfies the following constraints:

  • The TT-cores have non-negative entries.

  • There is a mixed core, whose position is arbitrary, with all its entries summing up to 1.

  • Every core to the left of the mixed core is a left normalized core, which means that each of its vertical slices sums up to 1.

  • Every core to the right of the mixed core is a right normalized core, which means that each slice sums up to 1.

  • When we encounter a slice that contains all zeros, then it means the two slices (one vertical and one horizontal) in two adjacent cores corresponding to the same can be removed and the dimension is shrunk by one.

The first constraint ensures that a tSPN has non-negative weights. The remaining constraints ensure that the partition function when for all ’s. A tSPN obeying the above constraints is called a normalized tSPN in analogy to a normalized SPN. The left/right normalized cores and the mixed core are strongly analogous to the mixed-canonical form of a TT which consists of left/right orthogonalized cores and a mixed core. We remark that a tSPN having a TT structure automatically enforces the desired weight parameter sharing as well as a deep network. This is because each scalar (viz. probability) evaluation of a TT-based tSPN, when contracted with onto its th mode, , results in a matrix product using all TT-cores (cf. Definition 9).

Recalling from Fig. 4, a tSPN is fully captured by a -way cubical tensor through summing all rank-1 terms (induced trees) extracted from the learned SPN. The conversion of such a full-tensor tSPN into a TT-based tSPN then boils down to converting into its TT format .

A direct way to obtain the TT form of is by regarding each rank-1 tensor term, corresponding to an induced tree, as a rank-1 TT and sum them all up into a new TT [16]. However, in this case, the TT ranks

equal the number of rank-1 terms and are therefore impractically high. Although TT-rounding by singular value decomposition (SVD) between successive cores may reduce the TT-ranks, it will generally destroy the non-negativity of weights and result in cores with negative values. Similar issues arise when we use non-negative tensor factorization (NTF) algorithms 

[17] on which also produce a large number of rank-1 tensor terms. In fact, constructing the full tensor explicitly is computationally prohibitive when the number of variables go beyond 17 on our computers. This motivates us to develop an SPN-to-tSPN construction algorithm, called spn2tspn, through a recently proposed tensor-network based nonlinear system identification method [18] elaborated below.

Iii-a Algorithm: spn2tspn

Starting from a learned SPN trained by a given dataset, we compute the probabilities of a set of training input samples as well as non-samples through exact inference. This step has linear complexity due to the SPN nature, and generates a set of multi-input single-output (MISO) data used for the identification of the underlying TT. On the one hand, training samples are meaningful data (positive samples) used in the SPN learning and therefore correspond to higher probabilities. On the other hand, we randomly generate some non-samples (negative samples) outside the dataset and feed them into the SPN for their probabilities which are mostly close to zero. We then utilize these MISO data to identify a TT-based tSPN by adapting the approach in [18].

Now having a set of training samples and non-samples together with their probabilities, the goal is to obtain a tensor in a TT format such that the probability distribution inferred from this TT-based tSPN closely tracks that of the SPN. We first collect the column vectors into the matrix for . Next, we formulate the optimization problem


such that is a TT with non-negative cores , and is computed from


and is the collection of probabilities of samples and non-samples.

Following from [18], (2) is broken into least-squares subproblems of small sizes solved by the alternating linear scheme (ALS). To obtain non-negative ’s, we further impose a non-negative constraint into each subproblem which is then solved by the non-negative least-squares (NNLS) method [19] within each ALS iteration. This problem formulation also resembles that of a tensor completion work that employs a TT format [20] but without the non-negativity constraint in place. We therefore refer the reader to the paper for details. In short, to solve for , one solves the following least-squares subproblem by NNLS


where denotes the th column of , whereas and are auxiliary notations defined as:

The proposed algorithm is summarized in Algorithm 1.

Dataset name # variables # layers # leaf nodes # weights # parameters # parameters reduction TV
(SPN) (tSPN) () distance
TABLE I: SPN and tSPN information for various datasets.
Algorithm 1

SPN-to-tSPN Conversion (spn2tspn)
Input: A trained SPN and training samples and non-samples.
Output: A compressed TT-based tSPN.

1:Construct input matrices as described above
2:Obtain the probabilities of samples and non-samples through inference by the trained SPN
3:Collect probabilities to construct the vector
4:Randomly initialize non-negative TT-cores
5:while stopping criteria not satisfied  do
6:     for k=1,…,d-1 do
7:          solve (4) using NNLS
8:          sum over the first and second indices of
9:         Identify zero entries find()
12:     end for
13:     for k=d,…,2 do
14:          solve (4) using NNLS
15:          sum over the second and third indices of
16:         Identify zero entries find()
19:     end for
20:end while

Once a tSPN is built, a purposedly reserved portion of the dataset input samples (not used in the SPN learning) are used as test inputs and their SPN probability outputs are used as test outputs to check the quality of the tSPN. As will be shown later in experiments, almost exact matches are observed between the probabilities derived from the learned SPN and its tSPN couterparts, with the latter having substantially fewer parameters.

We note that SPN operations such as inference and most probable explanation (MPE) etc. can be easily mapped to the tSPN scenario which, being not central to the theme and also due to space limitation, are put into the Supplemental Material of this paper.

Iv Experiments

We use fifteen benchmarking datasets from publicly available website111 and learn their corresponding SPNs by using the methods proposed in [2]. The Python implementation222 of the algorithms are used throughout our experiments. We then apply Algorithm 1 to transform the SPNs into tSPNs, and the results are summarized in Table I. All experiments were run on a desktop computer with an Intel i5 quad-core processor at 3.2GHz and 16GB RAM. Implementations of the proposed algorithm, together with necessary data files for reproducing experimental results will be made publicly available after the peer review. To measure the “closeness” of the probability distributions between an SPN and its tSPN counterpart, we use the total variation (TV) distance [21]. However, we stress that this metric (essentially an L1 norm) is only handy when the size of the problem is relatively small, say, # variables . When there are more variables like , our desktop computer fails to construct a tensor and compute the TV distance. That said, to still evaluate the tSPN conversion, we plot the probability profiles of training and testing data both in an SPN and its corresponding tSPN to visually compare their similarity.

Fig. 6: Probability profiles of the learned SPN for 20Newsgroup dataset.
Fig. 7: Probability profiles of the constructed tSPN, at fewer parameters, for 20Newsgroup dataset.

As shown in Table I, the number of parameters in tSPNs are significantly fewer than those in the original SPNs, while excellent match is still observed in their probability distributions as indicated by the TV distances and as depicted in Figs. 6 and 7. In the first two experiments, the number of variables are both no bigger than , thus we are still able to compute the TV distances from reference SPNs to tSPNs. Among other experiments, we highlight that when converting the SPN of the 20Newsgroup dataset into a tSPN, a storage reduction is witnessed. Although explicitly computing probabilities for the TV distance evaluation is prohibitive, we can compare the probabilities of only parts of the samples and non-samples. In particular, in each of Figs. 6 and 7, we plot the probabilities of four sets of input samples, namely, training/testing samples/non-samples. Comparing the upper and lower figures, we observe essentially indistinguishable difference between the probability outputs of the SPN and tSPN, and similarly for other examples not plotted here. It is interesting to note that the testing samples/non-samples are never used in the spn2tspn conversion, but the resultant tSPN still manages to well capture the probabilities for these unseen data. This demonstrates the success of putting SPNs on tSPNs, which leads to substantial reduction in modeling parameters while having virtually no loss in accuracy.

Apparently, other tensor factorizations such as the CANDECOMP/PARAFAC canonical polyadic decomposition or Tucker decomposition [6] may be used for tSPN, but according to our tests these are not as effective as the TT decomposition in preserving the depth and width semantics of SPNs. Furthermore, the identification algorithms for other tensor formats are not as simple and scalable as spn2tspn, and are therefore omitted.

Backed by the experimental results, we highlight some additional advantages of TT-based tSPNs further to those already stated in the introduction:

  • They often have small sizes via controlling the TT ranks, which translates into smaller network sizes and faster inference.

  • Depth of a tSPN is inherently high (corresponding to the number of TT-cores), while its width (corresponding to TT-ranks) is usually set to be low. In terms of neural network language, this means higher expressive efficiency.

  • The structural simplicity of a tSPN acts as a regularizer. In terms of neural network language, this implies a better inductive bias.

Of course, a natural question is whether data learning can be directly carried out on the tSPN structure, rather than going through the SPN learning followed by tSPN conversion. Research is underway along this direction and results will be reported in our upcoming work.

V Conclusions

This paper is the first to propose a tensor implementation of an SPN called tSPN. An equivalence is drawn between an SPN with boolean variables to a -way cubical tensor of dimension 2. The transformation of the latter into a tensor train (TT) representation then allows inherent sharing of originally distributed weights in an SPN tree, thereby leading to an often dramatic reduction of the number of network parameters as seen in various numerical experiments, at little or negligible loss of modeling accuracy. The TT-based tSPN also automatically guarantees a deep and narrow neural-network architecture. These promising new results have demonstrated tSPN to be a more natural form for realizing an SPN compared to its conventional tree structure.


  • [1] H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in

    Proc. Uncertainty in Artificial Intelligence (UAI)

    , 2011, pp. 337–346.
  • [2] A. Vergari, N. D. Mauro, and F. Esposito, “Simplifying, regularizing and strengthening sum-product network structure learning,” in Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2015, pp. 343–358.
  • [3] H. Zhao, P. Poupart, and G. J. Gordon, “A unified approach for learning the parameters of sum-product networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2016, pp. 433–441.
  • [4] C. J. Butz, J. S. Oliveira, and A. E. dos Santos, “On learning the structure of sum-product networks,” in Proc. IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1–8.
  • [5] K. Zheng, A. Pronobis, and R. P. N. Rao, “Learning graph-structured sum-product networks for probabilistic semantic maps,” in Proc. AAAI Conference on Artificial Intelligence, 2018, pp. 4547–4555.
  • [6] T. Kolda and B. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
  • [7] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. PHAN, “Tensor decompositions for signal processing applications: From two-way to multiway component analysis,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 145–163, March 2015.
  • [8] N. Cohen, O. Sharir, Y. Levine, R. Tamari, D. Yakira, and A. Shashua, “Analysis and design of convolutional networks via hierarchical tensor decompositions,” CoRR, vol. abs/1705.02302, 2017.
  • [9]

    V. Khrulkov, A. Novikov, and I. Oseledets, “Expressive power of recurrent neural networks,” in

    Proc. Intl. Conf. Learning Representations (ICLR), 2018.
  • [10]

    Z. Chen, K. Batselier, J. A. K. Suykens, and N. Wong, “Parallelized tensor train learning of polynomial classifiers,”

    IEEE Trans. Neural Networks and Learning Systems, pp. 1–12, 2018.
  • [11] R. Gens and P. Domingos, “Learning the structure of sum-product networks,” in Proc. Int. Conf. Machine Learning (ICML), 2013, pp. 873–880.
  • [12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
  • [13] R. Gens and P. Domingos, “Discriminative learning of sum-product networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2012, pp. 3239–3247.
  • [14]

    A. Darwiche, “A differential approach to inference in bayesian networks,”

    J. ACM, vol. 50, no. 3, pp. 280–305, May 2003.
  • [15] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos, “On the latent variable interpretation in sum-product networks,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 10, pp. 2030–2044, Oct 2017.
  • [16] I. V. Oseledets, “Tensor-train decomposition,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295–2317, 2011.
  • [17] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework,” Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, Feb 2014.
  • [18] K. Batselier, Z. Chen, and N. Wong, “Tensor Network alternating linear scheme for MIMO Volterra system identification,” Automatica, vol. 84, pp. 26–35, 2017.
  • [19] C. L. Lawson and R. J. Hanson, Solving least squares problems.   Siam, 1995, vol. 15.
  • [20] C. Y. Ko, K. Batselier, W. Yu, and N. Wong, “Fast and accurate tensor completion with tensor trains: A system identification approach,” CoRR, vol. abs/1804.06128, 2018. [Online]. Available:
  • [21] A. L. Gibbs and F. E. Su, “On choosing and bounding probability metrics,” International statistical review, vol. 70, no. 3, pp. 419–435, 2002.