Learning Signed Determinantal Point Processes through the Principal Minor Assignment Problem

11/01/2018 ∙ by Victor-Emmanuel Brunel, et al. ∙ MIT

Symmetric determinantal point processes (DPP's) are a class of probabilistic models that encode the random selection of items exhibiting a repulsive behavior. They have attracted a lot of attention in machine learning, where returning diverse sets of items is sought for. Sampling and learning these symmetric DPP's is fairly well understood. In this work, we consider a new class of DPP's, which we call signed DPP's, where we break the symmetry and allow attractive behaviors. We set the ground for learning signed DPP's through a method of moments, by solving the so-called principal minor assignment problem for the class of matrices K that satisfy K_{i,j} = ±K_{j,i}, i ≠ j, in polynomial time.


1 Introduction

Random point processes on finite spaces are probability distributions that model the random selection of sets of items from a finite collection. For example, the basket of a random customer in a store is a random subset of items selected from that store. In some contexts, random point processes are encoded as random binary vectors, where the coordinates equal to one indicate the selected items. A very famous subclass of random point processes, much used in statistical mechanics, is the Ising model, where the log-likelihood function is a quadratic polynomial in the coordinates of the binary vector. More generally, Markov random fields encompass models of random binary vectors where stochastic dependence between the coordinates of the random vector is encoded in an undirected graph. In recent years, a different family of random point processes has attracted a lot of attention, mainly for its computational tractability: determinantal point processes (DPP's). DPP's were first studied and used in statistical mechanics [21]. Then, following the seminal work [17], discrete DPP's have been used increasingly in various applications such as recommender systems [11, 12], document and timeline summarization [20, 29], image search [17, 1] and segmentation [19], audio signal processing [28], bioinformatics [5] and neuroscience [26].

A DPP on a finite space is a random subset of that space whose inclusion probabilities are determined by the principal minors of a given matrix. More precisely, encode the finite space with the labels [N] = {1, …, N}, where N is the size of the space. A DPP is a random subset Y ⊆ [N] such that P[J ⊆ Y] = det(K_J), for all fixed J ⊆ [N], where K is an N × N matrix with real entries, called the kernel of the DPP, and K_J is the square submatrix of K associated with the set J. In the applications cited above, it is assumed that K is a symmetric matrix. In that case, it is shown (e.g., see [18]) that a necessary and sufficient condition for K to be the kernel of a DPP is that all its eigenvalues lie between 0 and 1. In addition, if 1 is not an eigenvalue of K, then the DPP with kernel K is also known as an L-ensemble, whose probability mass function is proportional to the principal minors of the matrix L = K(I − K)^{−1}, where I is the N × N identity matrix. DPP's with symmetric kernels, which we refer to as symmetric DPP's, model repulsive interactions: Indeed, they imply a strong negative dependence between items, called negative association [7].

Recently, symmetric DPP's have become popular in recommender systems, e.g., automated systems that seek good recommendations for users of online shopping websites [11]. The main idea is to model a random basket as a DPP and learn the kernel based on previous observations. Then, for a new customer with current basket J, predict which items are the most likely to be selected next by maximizing the conditional probability over all items that are not already in the current basket J. One very attractive feature of DPP's is that this conditional probability is tractable and can be computed in time polynomial in N. However, if the kernel K is symmetric, this procedure enforces diversity in the baskets that are modeled, because of the negative association property. Yet, in general, not all items should be modeled as repelling each other. For instance, on a website that sells household goods, ground coffee and coffee filters should rather be modeled as attracting each other, since a user who buys ground coffee is more likely to also buy coffee filters. In this work, we extend the class of symmetric DPP's in order to account for possible attractive interactions, by considering nonsymmetric kernels. From the learning perspective, this extended model poses a question: How can the kernel be estimated from past observations? For symmetric kernels, this problem has been tackled in several works [13, 1, 22, 4, 10, 11, 12, 23, 9, 27]. Here, we assume that K is nonparametric, in the sense that it is not parametrized by a low-dimensional parameter. As explained in [9] in the symmetric case, the maximum likelihood approach requires solving a highly nonconvex optimization problem, and even though some algorithms have been proposed, such as fixed point algorithms [23], Expectation-Maximization [13] and MCMC [1], no statistical guarantees are given for these algorithms. The method of moments proposed in [27] provides a polynomial time algorithm based on the estimation of a small number of principal minors of K, and on finding a symmetric matrix whose principal minors approximately match the estimated ones. This algorithm is closely related to the principal minor assignment problem. Here, we are interested in learning a nonsymmetric kernel given available estimates of its principal minors; in order to simplify the exposition, we always assume that the available list of principal minors is exact, not approximate.

In Section 2, we recall the definition of DPP's together with some general properties, characterize the set of admissible kernels in the absence of symmetry, and define a new class of nonsymmetric kernels that we call signed kernels. We address the question of identifiability of the kernel of a signed DPP and show that this question, together with the problem of learning the kernel, is related to the principal minor assignment problem. In Section 3, we propose a solution to the principal minor assignment problem for signed kernels, which yields a polynomial time learning algorithm for the kernel of a signed DPP.

2 Determinantal Point Processes

2.1 Definitions

Definition 1 (Discrete Determinantal Point Process).

A Determinantal Point Process (DPP) on the finite set [N] = {1, …, N} is a random subset Y ⊆ [N] for which there exists a matrix K ∈ ℝ^{N×N} such that the following holds:

P[J ⊆ Y] = det(K_J), for all J ⊆ [N], (1)

where K_J = (K_{i,j})_{i,j∈J} is the submatrix of K obtained by keeping the columns and rows of K whose indices are in J. The matrix K is called the kernel of the DPP, and we write Y ∼ DPP(K).

In short, the inclusion probabilities of a DPP are given by the principal minors of some matrix K. As we will see below, K is not uniquely determined for a given DPP, even though, for simplicity, we say “the kernel” instead of “a kernel”.
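As an illustration, here is a minimal numerical sketch (not from the paper) of how inclusion probabilities are read off the kernel; the matrix K below is an arbitrary example and indices are 0-based.

```python
import numpy as np

K = np.array([[0.6, 0.2],
              [0.3, 0.5]])   # an arbitrary example kernel

def inclusion_probability(K, J):
    """P[J ⊆ Y] = det(K_J), the principal minor of K indexed by J."""
    J = list(J)
    if not J:
        return 1.0            # the empty principal minor is 1 by convention
    return float(np.linalg.det(K[np.ix_(J, J)]))

print(inclusion_probability(K, []))      # 1.0
print(inclusion_probability(K, [0]))     # K[0,0] = 0.6
print(inclusion_probability(K, [0, 1]))  # 0.6*0.5 - 0.2*0.3 = 0.24
```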

Definition 2 (L-ensembles).

An L-ensemble on the finite set [N] is a random subset Y ⊆ [N] for which there exists a matrix L ∈ ℝ^{N×N} such that the following holds:

P[Y = J] ∝ det(L_J), for all J ⊆ [N]. (2)

In this definition, the symbol ∝ means an equality up to a multiplicative constant that does not depend on J. By simple linear algebra, it is easy to see that the multiplicative constant must be det(I + L)^{−1}. Similarly as above, L is not uniquely determined for a given L-ensemble.
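The normalization can be checked numerically; the short sketch below (an illustration, not taken from the paper, with an arbitrary matrix L) verifies that summing det(L_J) over all subsets J gives det(I + L).

```python
import itertools
import numpy as np

L = np.array([[1.0, 0.4],
              [-0.2, 0.8]])   # an arbitrary example matrix

def minor(M, J):
    J = list(J)
    return 1.0 if not J else float(np.linalg.det(M[np.ix_(J, J)]))

N = L.shape[0]
total = sum(minor(L, J) for r in range(N + 1)
            for J in itertools.combinations(range(N), r))
print(np.isclose(total, np.linalg.det(np.eye(N) + L)))   # True
```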

Proposition 1.

An L-ensemble is a DPP, with kernel K = L(I + L)^{−1}, where L is defined in (2). Conversely, a DPP with kernel K is an L-ensemble if and only if I − K is invertible. In that case, the matrix L is given by L = K(I − K)^{−1}.

The proof of this proposition follows the lines of [18, Theorem 2.2], which actually does not use the symmetry of K and L. Even when I − K is not invertible, the probability mass function of DPP(K) has a simple closed form formula. For all J ⊆ [N], we denote by I_J the N × N diagonal matrix whose i-th diagonal entry is 1 if i ∈ J, 0 otherwise, and denote by J̄ the complement of the set J in [N].

Lemma 1.

Let Y ∼ DPP(K), for some K ∈ ℝ^{N×N}. Then,

P[Y = J] = (−1)^{|J̄|} det(K − I_{J̄}), for all J ⊆ [N].

Proof.

Let J ⊆ [N]. Then, by the inclusion-exclusion principle,

P[Y = J] = Σ_{A : J ⊆ A ⊆ [N]} (−1)^{|A \ J|} P[A ⊆ Y] = Σ_{A : J ⊆ A ⊆ [N]} (−1)^{|A \ J|} det(K_A) = (−1)^{|J̄|} det(K − I_{J̄}), (3)

where the last equality is a consequence of the multilinearity of the determinant. ∎
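The last equality in (3) is easy to check numerically; the sketch below (an illustration with an arbitrary test matrix, not from the paper) verifies it for every subset J of a small ground set.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
K = 0.4 * rng.standard_normal((N, N)) + 0.5 * np.eye(N)   # arbitrary test matrix

def minor(M, J):
    J = list(J)
    return 1.0 if not J else float(np.linalg.det(M[np.ix_(J, J)]))

for r in range(N + 1):
    for J in itertools.combinations(range(N), r):
        Jbar = [i for i in range(N) if i not in J]
        # left-hand side: inclusion-exclusion over all supersets A of J
        lhs = sum((-1) ** (len(A) - len(J)) * minor(K, A)
                  for s in range(len(J), N + 1)
                  for A in itertools.combinations(range(N), s)
                  if set(J) <= set(A))
        # right-hand side: (-1)^{|Jbar|} det(K - I_{Jbar})
        I_Jbar = np.diag([1.0 if i in Jbar else 0.0 for i in range(N)])
        rhs = (-1) ** len(Jbar) * np.linalg.det(K - I_Jbar)
        assert np.isclose(lhs, rhs)
print("identity (3) verified on all subsets")
```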

2.2 Admissibility of a kernel

Note that not all matrices K ∈ ℝ^{N×N} give rise to a DPP since, for instance, the numbers det(K_J) from (1) must all lie in [0, 1] and be nonincreasing as the set J grows. We call a matrix K admissible if there exists a DPP with kernel K. When K is symmetric, it is well known that it is admissible if and only if all its eigenvalues are between 0 and 1 (see, e.g., [18]). As a straightforward consequence of Lemma 1, we have the following proposition.

Proposition 2.

For all matrices K ∈ ℝ^{N×N}, K is admissible if and only if (−1)^{|J̄|} det(K − I_{J̄}) ≥ 0, for all J ⊆ [N].

Proof.

Let K be admissible and let Y ∼ DPP(K). By Lemma 1, (−1)^{|J̄|} det(K − I_{J̄}) = P[Y = J] ≥ 0 for all J ⊆ [N]. Conversely, assume that (−1)^{|J̄|} det(K − I_{J̄}) ≥ 0 for all J ⊆ [N]. Denote by p_J = (−1)^{|J̄|} det(K − I_{J̄}), for all J ⊆ [N]. By a standard computation, Σ_{J ⊆ [N]} p_J = 1, hence, one can define a random subset Y with P[Y = J] = p_J for all J ⊆ [N]. The same application of the inclusion-exclusion principle as in the proof of Lemma 1 yields that P[J ⊆ Y] = det(K_J) for all J ⊆ [N], hence, Y ∼ DPP(K). ∎
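Proposition 2 suggests a brute-force admissibility test, exponential in N but usable on small examples; the sketch below is an illustration only, not an algorithm from the paper.

```python
import itertools
import numpy as np

def is_admissible(K, tol=1e-12):
    """Check Proposition 2: (-1)^{|Jbar|} det(K - I_{Jbar}) >= 0 for all J."""
    N = K.shape[0]
    for r in range(N + 1):
        for J in itertools.combinations(range(N), r):
            Jbar = [i for i in range(N) if i not in J]
            I_Jbar = np.diag([1.0 if i in Jbar else 0.0 for i in range(N)])
            if (-1) ** len(Jbar) * np.linalg.det(K - I_Jbar) < -tol:
                return False
    return True

print(is_admissible(np.array([[0.5, 0.3], [-0.3, 0.5]])))   # True (a signed kernel)
print(is_admissible(np.array([[0.5, 0.9], [0.9, 0.5]])))    # False (det(K) < 0)
```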

Let K ∈ ℝ^{N×N}. Assume that I − K is invertible. Then, by Proposition 1, K is admissible if and only if the matrix L = K(I − K)^{−1} defines an L-ensemble. In that case, Lemma 1 yields that det(L_J) ≥ 0, for all J ⊆ [N]. Hence, K is admissible if and only if L is a P_0-matrix, i.e., all its principal minors are nonnegative. If, in addition, K is invertible, then it is admissible if and only if L is a P-matrix, i.e., all its principal minors are positive, if and only if I − D + DL is invertible for all diagonal matrices D with entries in [0, 1] (see [16, Theorem 3.3]). In particular, it is easy to see that any matrix of the form K = D + A, where D is a diagonal matrix with δ ≤ D_{j,j} ≤ 1 − δ for all j, for some δ ∈ (0, 1/2), and A has small enough off diagonal entries, is admissible.

Remark 1.

For symmetric kernels, admissibility is equivalent to all the eigenvalues lying in [0, 1]. In general, the (complex) eigenvalues of an admissible kernel need not even lie in the band {z ∈ ℂ : 0 ≤ ℜ(z) ≤ 1}. For instance, if both K and I − K are invertible, admissibility of K is equivalent to L = K(I − K)^{−1} being a P-matrix (see Section 2.2). By [14, Theorem 2.5.9.a], the eigenvalues of a P-matrix can lie anywhere in the wedge {z ∈ ℂ : |arg(z)| < π − π/N}. Since the eigenvalues of K are given by μ/(1 + μ), for all eigenvalues μ of L, we conclude that an admissible kernel can have eigenvalues arbitrarily far away from that band.

2.3 General properties

DPP's are known to be stable under simple operations such as marginalization and conditioning: We review some of these properties, which can also be found in [18] in the case of symmetric kernels, or in [8] for L-ensembles.

Proposition 3.

Let K ∈ ℝ^{N×N} be an admissible kernel and Y ∼ DPP(K).

  • Marginalization: For all J ⊆ [N], K_J is admissible and Y ∩ J is a DPP with kernel K_J;

  • Complement: I − K is admissible and [N] \ Y ∼ DPP(I − K);

  • Conditioning: Let J ⊆ [N] be such that P[J ⊆ Y] ≠ 0. Then, K_J is invertible, the matrix K^J = K_{J̄} − K_{J̄,J} (K_J)^{−1} K_{J,J̄} is admissible and, conditionally on the event {J ⊆ Y}, Y ∩ J̄ ∼ DPP(K^J).

Proof.
  • Marginalization: This property is straightforward, after noticing that for all A ⊆ J, P[A ⊆ Y ∩ J] = P[A ⊆ Y] = det(K_A) = det((K_J)_A).

  • Complement: That I − K is admissible follows from the fact that for all J ⊆ [N],

    (−1)^{|J̄|} det((I − K) − I_{J̄}) = (−1)^{|J|} det(K − I_J) = P[Y = J̄] ≥ 0,

    by admissibility of K. Then, for all J ⊆ [N], P[[N] \ Y = J] = P[Y = J̄] = (−1)^{|J̄|} det((I − K) − I_{J̄}), so [N] \ Y ∼ DPP(I − K) by Lemma 1.

  • Conditioning: Note that det(K_J) = P[J ⊆ Y] ≠ 0, hence, K_J is invertible.

    Now, for all A ⊆ J̄, by Bayes’ rule, plus the fact that P[A ∪ J ⊆ Y] = det(K_{A∪J}), and finally by a column reduction of the determinant (the Schur complement identity det(K_{A∪J}) = det(K_J) det((K^J)_A)),

    P[A ⊆ Y | J ⊆ Y] = det(K_{A∪J}) / det(K_J) = det((K^J)_A),

    where K^J = K_{J̄} − K_{J̄,J} (K_J)^{−1} K_{J,J̄}. Hence, conditionally on {J ⊆ Y}, the inclusion probabilities of Y ∩ J̄ are the principal minors of K^J, so K^J is admissible and Y ∩ J̄ ∼ DPP(K^J). ∎

For a DPP with a symmetric kernel K, it is known that the eigenstructure of K plays a role, e.g., in sampling [18, Section 2.4.4]. For a general DPP, with not necessarily symmetric kernel K, the eigenstructure of K does not seem to play a significant role, either in learning or in sampling. Indeed, the eigenvalues of K are complex numbers and K may not be diagonalizable. However, we show that the eigenvalues of K, even if they may be nonreal, completely characterize the distribution of the size of the DPP. In the sequel, we denote by ℜ(z) (resp. ℑ(z)) the real part (resp. imaginary part) of a complex number z.

Lemma 2.

Let K be an admissible kernel and let Y ∼ DPP(K). Let λ_1, …, λ_p be the real eigenvalues of K, repeated according to their multiplicity, and let μ_1, …, μ_q be the eigenvalues of K that have positive imaginary part, also accounting for their multiplicity. Then, for all complex numbers z,

E[z^{|Y|}] = ∏_{i=1}^{p} (1 − λ_i + λ_i z) ∏_{j=1}^{q} (1 − μ_j + μ_j z)(1 − μ̄_j + μ̄_j z).

Using the same notation as in the lemma, we note that, since K is a real matrix, its eigenvalues are exactly λ_1, …, λ_p, μ_1, …, μ_q, μ̄_1, …, μ̄_q (repeated according to their multiplicity). In particular, p + 2q = N.

Proof.

Assume first that I − K is invertible, so that Y is an L-ensemble and P[Y = J] = det(L_J)/det(I + L) for all J ⊆ [N], with L = K(I − K)^{−1}. Then, for all z ∈ ℂ,

E[z^{|Y|}] = Σ_{J ⊆ [N]} z^{|J|} det(L_J)/det(I + L) = det(I + zL)/det(I + L) = det(I + (z − 1)K), (4)

which equals the product of the factors 1 − λ + λz over all eigenvalues λ of K, since I + zL = (I + (z − 1)K)(I − K)^{−1} and I + L = (I − K)^{−1}. The conclusion of the lemma follows by extending this computation to the case when I − K is not invertible, by continuity. ∎
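The identity in (4) is easy to check numerically; the sketch below (illustration only, with an arbitrary small kernel) compares the brute-force value of E[z^{|Y|}] with the product over the eigenvalues of K.

```python
import itertools
import numpy as np

K = np.array([[0.5, 0.3, 0.0],
              [-0.3, 0.5, 0.2],
              [0.0, -0.2, 0.4]])    # an arbitrary example kernel
N = K.shape[0]

def pmf(K, J):
    """P[Y = J] via Lemma 1: (-1)^{|Jbar|} det(K - I_{Jbar})."""
    Jbar = [i for i in range(K.shape[0]) if i not in J]
    I_Jbar = np.diag([1.0 if i in Jbar else 0.0 for i in range(K.shape[0])])
    return (-1) ** len(Jbar) * np.linalg.det(K - I_Jbar)

z = 1.7
brute = sum(pmf(K, J) * z ** r
            for r in range(N + 1) for J in itertools.combinations(range(N), r))
spectral = np.prod([1 - lam + lam * z for lam in np.linalg.eigvals(K)])
print(np.isclose(brute, spectral.real), abs(spectral.imag) < 1e-10)   # True True
```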

In particular, we have the following corollary.

Corollary 1.

With all the same notation as in Lemma 2, if all the nonreal eigenvalues of K lie in the complex disk with center 1/2 and radius 1/2, then |Y| has the same distribution as Σ_{i=1}^{p} B_i + Σ_{j=1}^{q} (B'_j + B''_j), where:

  • B_i is a Bernoulli random variable with parameter λ_i, for all i = 1, …, p;

  • B'_j and B''_j are Bernoulli random variables, for all j = 1, …, q;

  • B'_j + B''_j has probability generating function (1 − μ_j + μ_j z)(1 − μ̄_j + μ̄_j z), for all j = 1, …, q;

  • The random variables B_1, …, B_p and the pairs (B'_1, B''_1), …, (B'_q, B''_q) are all mutually independent.

For example, let K = D + A, where D is a real diagonal matrix with δ ≤ D_{j,j} ≤ 1 − δ for all j = 1, …, N, for some δ ∈ (0, 1/2), and A has zero diagonal entries. If the off diagonal entries of A are small enough, then K is admissible (see above) and, if moreover Σ_{i ≠ j} |A_{j,i}| ≤ δ for all j, then by Gershgorin’s circle theorem, all the eigenvalues of K lie in one of the complex disks with center D_{j,j} and radius Σ_{i ≠ j} |A_{j,i}|, hence, in the complex disk with center 1/2 and radius 1/2.

Proof.

First, recall that by Proposition 3, both K and I − K are admissible, yielding that K and I − K are P_0-matrices, i.e., all their principal minors are nonnegative. By [14, Theorem 2.5.6], all the real eigenvalues of a P-matrix are positive. Since for all ε > 0, K + εI (resp. I − K + εI) is a P-matrix, its real eigenvalues are all positive; its real eigenvalues are exactly the λ_i + ε (resp. 1 − λ_i + ε), for i = 1, …, p. By taking the limit as ε goes to zero, all real eigenvalues of K (resp. I − K) are nonnegative, hence, λ_i ∈ [0, 1] for all i = 1, …, p.

Moreover, note that a complex number μ lies in the disk with center 1/2 and radius 1/2 if and only if |μ|^2 ≤ ℜ(μ). Hence, for all j = 1, …, q, the polynomial (in z)

(1 − μ_j + μ_j z)(1 − μ̄_j + μ̄_j z) = |1 − μ_j|^2 + 2(ℜ(μ_j) − |μ_j|^2) z + |μ_j|^2 z^2

has real and nonnegative coefficients; so, by Lemma 2, the probability generating function of |Y| is the probability generating function of the sum of independent random variables described in the corollary, namely, Σ_{i=1}^{p} B_i + Σ_{j=1}^{q} (B'_j + B''_j). ∎
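Returning to the example K = D + A given before the proof, here is a quick numerical check of the Gershgorin argument; the constants below are arbitrary choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, delta = 5, 0.2
D = np.diag(rng.uniform(delta, 1 - delta, size=N))      # diagonal entries in [delta, 1-delta]
A = rng.uniform(-1, 1, size=(N, N)) * delta / (N - 1)   # row sums of |A| are at most delta
np.fill_diagonal(A, 0.0)
K = D + A

eig = np.linalg.eigvals(K)
print(np.all(np.abs(eig - 0.5) <= 0.5 + 1e-12))          # True: eigenvalues in the disk |z - 1/2| <= 1/2
```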

It is easy to see that if Y ∼ DPP(K) for some admissible kernel K, then E[|Y|] = tr(K). If K is symmetric, this yields E[|Y|] = Σ_{i=1}^{N} λ_i, where λ_1, …, λ_N are the eigenvalues of K. It is well known that all the eigenvalues of a symmetric admissible kernel are in [0, 1] (see [18, Section 2.1]). Therefore, if K is symmetric, then Y has constant size if and only if 0 and 1 are the only eigenvalues of K, i.e., K is an orthogonal projection. The following corollary shows that this still holds true for general DPP's, except that even if 0 and 1 are the only eigenvalues of an admissible kernel, it does not have to be a projection matrix in general.

Corollary 2.

Let K be an admissible kernel and Y ∼ DPP(K). Then, Y has almost surely constant size if and only if 0 and 1 are the only eigenvalues of K.

Proof.

Assume that Y has constant size, i.e., |Y| = k almost surely, for some k ∈ {0, 1, …, N}. Then, E[z^{|Y|}] = z^k for all z ∈ ℂ. Let μ be an eigenvalue of K. If μ is not real, then by Lemma 2, the polynomial (1 − μ + μz)(1 − μ̄ + μ̄z) must divide z^k, hence it must be a monomial, i.e., of the form cz^m for some c ∈ ℂ and m ∈ {0, 1, 2}. Note that (1 − μ + μz)(1 − μ̄ + μ̄z) = |1 − μ|^2 + 2(ℜ(μ) − |μ|^2)z + |μ|^2 z^2. Since μ is not real, |μ|^2 ≠ 0, yielding that m = 2, |1 − μ|^2 = 0 and ℜ(μ) − |μ|^2 = 0. Since |1 − μ|^2 = 0, this yields that μ = 1, which contradicts the fact that μ is not real. Hence, all eigenvalues of K must be real and, again by Lemma 2, for each eigenvalue λ of K, the polynomial 1 − λ + λz must divide z^k, i.e., λ ∈ {0, 1}.

Conversely, if 0 and 1 are the only eigenvalues of K, it is straightforward to see that Lemma 2 yields E[z^{|Y|}] = z^m, where m is the multiplicity of the eigenvalue 1 in K, yielding that |Y| = m almost surely. ∎

2.4 Special classes of DPP’s

2.4.1 Symmetric DPP’s

Most commonly, DPP's are defined with a real symmetric kernel K. In that case, it is well known ([18]) that admissibility is equivalent to K lying in the intersection of two copies of the cone of positive semidefinite matrices: K ⪰ 0 and I − K ⪰ 0. DPP's with symmetric kernels possess a very strong negative dependence property called negative association. A simple observation is that if Y ∼ DPP(K) for some symmetric K, then Cov(1_{i∈Y}, 1_{j∈Y}) = −K_{i,j}^2 ≤ 0, for all i ≠ j. Moreover, if J and J' are two disjoint subsets of [N], then Cov(1_{J⊆Y}, 1_{J'⊆Y}) ≤ 0. Negative association is the property that, more generally, Cov(f(Y ∩ J), g(Y ∩ J')) ≤ 0 for all disjoint subsets J, J' ⊆ [N] and for all nondecreasing functions f, g : 2^{[N]} → ℝ (i.e., functions such that f(A) ≤ f(B) whenever A ⊆ B), where 2^{[N]} is the power set of [N]. We refer to [6] for more details on negative association. For their computational appeal, it is very tempting to apply DPP's in order to model interactions, e.g., as an alternative to Ising models. However, the negative association property of DPP's with symmetric kernels is unreasonably restrictive in several contexts, for it forces repulsive interactions between items. Next, we extend the class of DPP's with symmetric kernels in a simple way that also allows for attractive interactions.

2.4.2 Signed DPP’s

We introduce the class of signed kernels, i.e., matrices K ∈ ℝ^{N×N} such that for all i, j ∈ [N] with i ≠ j, K_{j,i} = ±K_{i,j}, i.e., K_{j,i} = ε_{i,j} K_{i,j} for some ε_{i,j} ∈ {−1, 1}. We call a signed DPP any DPP with a signed kernel K. In particular, if K is an admissible signed kernel and Y ∼ DPP(K), then for all i ≠ j, Cov(1_{i∈Y}, 1_{j∈Y}) = −K_{i,j} K_{j,i} = −ε_{i,j} K_{i,j}^2, which is of the sign of −ε_{i,j}. In particular, when ε_{i,j} = −1, this covariance is nonnegative, which breaks the negative association property of symmetric kernels.
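For a concrete toy illustration (numbers chosen arbitrarily, not from the paper), the sketch below uses a 2 × 2 signed kernel with ε_{1,2} = −1 and checks that the two items are positively correlated.

```python
import numpy as np

K = np.array([[0.5, 0.3],
              [-0.3, 0.5]])     # signed kernel: K[1,0] = -K[0,1]
p1  = K[0, 0]                   # P[item 1 in Y]
p2  = K[1, 1]                   # P[item 2 in Y]
p12 = np.linalg.det(K)          # P[both items in Y] = 0.25 + 0.09 = 0.34
print(p12 - p1 * p2)            # covariance = -K[0,1]*K[1,0] = 0.09 > 0: attraction
```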

2.4.3 Signed block DPP’s

Of particular interest, one can also consider signed block DPP's, with kernels K for which there is a partition of [N] into pairwise disjoint, nonempty groups such that K_{j,i} = −K_{i,j} if i ≠ j are in the same group (hence, i and j attract each other) and K_{j,i} = K_{i,j} if i and j are in different groups (hence, i and j repel each other). As particular cases of signed block DPP's, consider those with block diagonal kernels K, where, within each diagonal block, K_{j,i} = −K_{i,j} for all i ≠ j. It is easy to see that such DPP's can be written as the union of disjoint and independent DPP's, each corresponding to a diagonal block of K.

2.5 Learning DPP’s

The main purpose of this work is to understand how to learn the kernel of a nonsymmetric DPP, given i.i.d. copies of that DPP. Namely, if Y_1, …, Y_n ∼ DPP(K) are i.i.d., for some unknown K, how to estimate K from the observation of Y_1, …, Y_n? First comes the question of identifiability of K: two matrices can give rise to the same DPP. To be more specific, DPP(K) = DPP(K') if and only if K and K' have the same list of principal minors. Hence, the kernel of a DPP is not necessarily unique. It is actually easy to see that it is unique if and only if it is diagonal. A first natural question that arises in learning the kernel of a DPP is the following:

What is the collection of all matrices that produce a given DPP?

Given that the kernel of DPP(K) is not uniquely defined, the goal is no longer to estimate K exactly, but to estimate one possible kernel that would give rise to the same DPP as K. The route that we follow is similar to that of [27], which is based on a method of moments. However, the lack of symmetry of K requires significantly different ideas. The idea is based on the fact that only few principal minors of K are necessary in order to completely recover K, up to identifiability. Moreover, each principal minor det(K_J) = P[J ⊆ Y] can be estimated from the samples by the empirical frequency n^{−1} Σ_{t=1}^{n} 1_{J ⊆ Y_t}. Since this last step is straightforward, we only focus on the problem of complete recovery of K, up to identifiability, given a list of few of its principal minors. In other words, we will ask the following question:

Given an available list of prescribed principal minors, how to recover a matrix whose principal minors are given by that list, using as few queries from that list as possible?

This question, together with the one we asked for identifiability, is known as the principal minor assignment problem, which we state precisely in the next section.
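Before turning to the principal minor assignment problem, we note that the moment-estimation step mentioned above (estimating det(K_J) by an empirical frequency) is straightforward to implement. The sketch below is an illustration only, with an arbitrary tiny signed kernel and sampling done by brute-force enumeration of the pmf.

```python
import itertools
import numpy as np

def pmf_table(K):
    """All subsets of [N] with their probabilities, via Lemma 1."""
    N = K.shape[0]
    subsets, probs = [], []
    for r in range(N + 1):
        for J in itertools.combinations(range(N), r):
            Jbar = [i for i in range(N) if i not in J]
            I_Jbar = np.diag([1.0 if i in Jbar else 0.0 for i in range(N)])
            subsets.append(frozenset(J))
            probs.append((-1) ** len(Jbar) * np.linalg.det(K - I_Jbar))
    return subsets, np.array(probs)

rng = np.random.default_rng(0)
K = np.array([[0.5, 0.3], [-0.3, 0.5]])     # an arbitrary signed kernel
subsets, probs = pmf_table(K)
samples = rng.choice(len(subsets), size=20000, p=probs)

J = frozenset({0, 1})
estimate = np.mean([J <= subsets[s] for s in samples])
print(estimate, np.linalg.det(K))            # empirical frequency ≈ det(K_J) = 0.34
```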

2.6 The principal minor assignment problem

The principal minor assignment problem (PMA) is a well known problem in linear algebra that consists of finding a matrix with a prescribed list of principal minors [25]. Let 𝒦 be a collection of N × N matrices. Typically, 𝒦 is the set of Hermitian matrices, or of real symmetric matrices or, in this work, the set of signed matrices defined in Section 2.4.2. Given a list (a_J)_{J⊆[N]} of complex numbers, (PMA) asks the following two questions:

  A. Find a matrix K ∈ 𝒦 such that det(K_J) = a_J, for all J ⊆ [N].

  B. Describe the set of all solutions of A.

A third question, which we do not address here, is to decide whether A has a solution at all. It is known that this would require the a_J’s to satisfy polynomial equations [24]. Here, we assume that a solution exists, i.e., the list (a_J)_{J⊆[N]} is a valid list of prescribed principal minors, and we aim to answer A efficiently, i.e., output a solution in polynomial time in the size of the problem, and to answer B at a purely theoretical level. In the framework of DPP's, A is related to the problem of estimating K by a method of moments and B concerns the identifiability of K, since the set of all solutions of A is the identifiable set of K.

3 Solving the principal minor assignment problem for nonsymmetric DPP’s

3.1 Preliminaries: PMA for symmetric matrices

Here, we briefly describe the PMA problem for symmetric matrices, i.e., 𝒦 is the set of real symmetric N × N matrices. This will give some intuition for the next section.

Fact 1.

The principal minors of order one and two of a symmetric matrix completely determine its diagonal entries and the magnitudes of its off diagonal entries.

The adjacency graph G_K of a matrix K ∈ ℝ^{N×N} is the undirected graph on the vertices 1, …, N where, for all i ≠ j, {i, j} is an edge if and only if K_{i,j} ≠ 0. As a consequence of Fact 1, we have:

Fact 2.

The adjacency graph of any symmetric solution of A can be learned by querying the principal minors of order one and two. Moreover, any two symmetric solutions of A have the same adjacency graph.
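Concretely, writing Δ_S for the principal minor indexed by S, the identities K_{i,i} = Δ_{{i}} and K_{i,j}^2 = Δ_{{i}}Δ_{{j}} − Δ_{{i,j}} recover the diagonal, the off-diagonal magnitudes, and hence the adjacency graph. The sketch below (an illustration on an arbitrary symmetric matrix, not the paper's code) does exactly that.

```python
import numpy as np

K = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.6, -0.3],
              [0.0, -0.3, 0.4]])     # an arbitrary symmetric example
N = K.shape[0]

diag = np.array([K[i, i] for i in range(N)])                  # order-1 minors
abs_off = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        minor_ij = np.linalg.det(K[np.ix_([i, j], [i, j])])   # order-2 minor
        abs_off[i, j] = abs_off[j, i] = np.sqrt(max(diag[i] * diag[j] - minor_ij, 0.0))

edges = [(i, j) for i in range(N) for j in range(i + 1, N) if abs_off[i, j] > 1e-12]
print(diag, edges)    # [0.5 0.6 0.4] [(0, 1), (1, 2)]
```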

Then, the signs of the off diagonal entries of a symmetric solution of A should be determined using queries of higher order principal minors, and the idea is based on the next fact. For a matrix K and a cycle C in G_K, denote by π_K(C) the product of the entries of K along the cycle C, i.e., π_K(C) = ∏_{{i,j}∈C} K_{i,j}.

Fact 3.

For all matrices K and all J ⊆ [N], det(K_J) only depends on the diagonal entries of K, the magnitudes of its off diagonal entries and the π_K(C), for all cycles C in the subgraph of G_K where all vertices outside J have been deleted.

Fact 3 is a simple consequence of the fundamental formula:

det(K_J) = Σ_{σ ∈ S_J} sgn(σ) ∏_{i ∈ J} K_{i,σ(i)}, (5)

where S_J is the group of permutations of J. Moreover, every permutation can be decomposed as a product of cyclic permutations. Finally, every undirected graph has a cycle basis made of induced cycles, i.e., there is a small family ℬ of induced cycles such that every cycle (seen as a collection of edges) in the graph can be decomposed as the symmetric difference of cycles that belong to ℬ. Then, it is easy to see that for all cycles C in the graph G_K, π_K(C) can be written as the product of some π_K(C')’s, for cycles C' ∈ ℬ, and of some factors of the form K_{i,j}^{±2}, i ≠ j. Moreover, for all induced cycles C in G_K, π_K(C) can be determined from det(K_V), where V is the set of vertices of C. Since, by Fact 2, G_K can be learned, what remains is to find a cycle basis of G_K made of induced cycles only, which can be performed in polynomial time (see [15, 2]), and, for each cycle C in that basis, query the corresponding principal minor of K in order to learn π_K(C). Finally, in order to determine the signs of the off diagonal entries of K, find a sign assignment that matches the signs of the π_K(C), for C in the aforementioned basis. Finding such a sign assignment consists of solving a linear system over GF(2) (see Section 1 in the Supplementary Material).
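The sign-assignment step can be phrased as follows: give each edge a binary unknown (1 if the entry is negative), and give each basis cycle C an equation saying that the parity of negative edges along C matches the sign of π_K(C). The sketch below illustrates this with a small hand-made graph and made-up cycle products (none of it taken from the paper), solving the system by Gaussian elimination modulo 2.

```python
import numpy as np

# Hypothetical example data: an adjacency graph, a basis of induced cycles, and the
# cycle products learned from principal minors. Only the signs of the products matter.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
cycles = [[(0, 1), (1, 2), (0, 2)], [(1, 2), (2, 3), (1, 3)]]
cycle_products = [-0.006, 0.024]

A = np.zeros((len(cycles), len(edges)), dtype=int)
b = np.zeros(len(cycles), dtype=int)
for r, (C, p) in enumerate(zip(cycles, cycle_products)):
    for e in C:
        A[r, edges.index(e)] = 1      # edge e appears in cycle r
    b[r] = 1 if p < 0 else 0          # odd number of negative edges iff product is negative

def solve_gf2(A, b):
    """Gaussian elimination mod 2; returns one particular solution (free bits set to 0),
    assuming the system is consistent (it is, for a valid list of principal minors)."""
    A, b = A.copy() % 2, b.copy() % 2
    m, n = A.shape
    x, row, pivots = np.zeros(n, dtype=int), 0, []
    for col in range(n):
        hit = [r for r in range(row, m) if A[r, col] == 1]
        if not hit:
            continue
        A[[row, hit[0]]] = A[[hit[0], row]]
        b[[row, hit[0]]] = b[[hit[0], row]]
        for r in range(m):
            if r != row and A[r, col] == 1:
                A[r] = (A[r] + A[row]) % 2
                b[r] = (b[r] + b[row]) % 2
        pivots.append((row, col))
        row += 1
    for r, c in pivots:
        x[c] = b[r]
    return x

signs = [(-1) ** bit for bit in solve_gf2(A, b)]
# one consistent sign pattern; solutions are not unique (e.g., D K D with D = diag(±1) gives another)
print(dict(zip(edges, signs)))
```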

3.2 PMA for signed matrices, general case

We now turn to the case where 𝒦 is the set of signed matrices. First, as in the symmetric case, the diagonal entries of any matrix K ∈ 𝒦 are given by its principal minors of order 1. Now, let i ≠ j and consider the principal minor of K corresponding to the set {i, j}:

det(K_{{i,j}}) = K_{i,i} K_{j,j} − K_{i,j} K_{j,i} = K_{i,i} K_{j,j} − ε_{i,j} K_{i,j}^2.

Hence, |K_{i,j}| and ε_{i,j} (whenever K_{i,j} ≠ 0) can be learned from the principal minors of K corresponding to the sets {i}, {j} and {i, j}.
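A small sketch of this order-two step (with an arbitrary example matrix, not the paper's code):

```python
import numpy as np

K = np.array([[0.5, 0.3, 0.1],
              [-0.3, 0.5, 0.2],
              [0.1, 0.2, 0.4]])     # an arbitrary signed kernel
N = K.shape[0]

for i in range(N):
    for j in range(i + 1, N):
        minor_ij = np.linalg.det(K[np.ix_([i, j], [i, j])])   # order-2 principal minor
        prod = K[i, i] * K[j, j] - minor_ij                    # = K_{i,j} K_{j,i}
        magnitude = np.sqrt(abs(prod))                         # = |K_{i,j}|
        epsilon = int(np.sign(prod))                           # = eps_{i,j} (0 if no edge)
        print((i, j), round(magnitude, 3), epsilon)
```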

Note that if K ∈ 𝒦, one can still define its adjacency graph G_K as in the symmetric case, since K_{i,j} ≠ 0 if and only if K_{j,i} ≠ 0, for all i ≠ j. Recall that we identify a cycle of a graph with its edge set. For all K ∈ 𝒦 and for all cycles C in G_K, let ε_C be the product of the ε_{i,j}’s along the edges of C, where, for i < j, ε_{i,j} ∈ {−1, 1} is such that K_{j,i} = ε_{i,j} K_{i,j}. Note that the condition “i < j” in the definition of ε_C is only to ensure no repetition in the product. Now, unlike in the symmetric case, we need to be more careful when defining the product of the entries of K along a cycle C of G_K, since the direction in which C is traveled matters.

Definition 3.

A signed graph is an undirected graph where each edge is assigned a sign, + or −.

In the sequel, we make the adjacency graph of any matrix K ∈ 𝒦 signed by assigning the sign ε_{i,j} to each edge {i, j} of the graph. As we noticed above, the signed adjacency graph of K can be learned from its principal minors of orders one and two. Unlike in the symmetric case, induced cycles might be of no help to determine the signs of the off diagonal entries of K.

Figure 1: A signed graph.

Definition 4.

Let G be an undirected graph and C a cycle of G. A traveling of C is an oriented cycle of G whose vertex set coincides with that of C. The set of travelings of C is denoted by T(C).

For instance, an induced cycle C has exactly two travelings, corresponding to the two possible orientations of C.

In Figure 1, the cycle shown there has six travelings.

Formally, while we identify a cycle with its edge set (e.g., C = {{1, 2}, {2, 3}, {1, 3}} for a triangle), we identify its travelings with sets of ordered pairs corresponding to their oriented edges (e.g., C⃗ = {(1, 2), (2, 3), (3, 1)}). Also, for simplicity, we always denote oriented cycles with an arrow (e.g., C⃗, as opposed to C, which would stand for an unoriented cycle).

Definition 5.

Let K ∈ 𝒦 and C be a cycle in G_K. We denote by Π_K(C) the sum of the contributions of all travelings of C:

Π_K(C) = Σ_{C⃗ ∈ T(C)} ∏_{(i,j) ∈ C⃗} K_{i,j}.

For example, if the graph in Figure 1 is the adjacency graph of some K ∈ 𝒦 and C is the cycle mentioned above, then Π_K(C) is the sum of the contributions of the six travelings of C, each contribution being the product of the entries K_{i,j} over the oriented edges (i, j) of the corresponding traveling.

In the same example, there are only two triangles (i.e., cycles of size 3) C that satisfy Π_K(C) ≠ 0.

The following result, though only a simple consequence of (5), is fundamental.

Lemma 3.

For all J ⊆ [N], det(K_J) can be written as a function of the K_{i,i}’s, for i ∈ J, of the K_{i,j} K_{j,i}’s, for i ≠ j ∈ J, and of the Π_K(C)’s, for all cycles C in G_K^J, the subgraph of G_K where all vertices outside J are removed.

Proof.

Write a permutation σ ∈ S_J as a product of cyclic permutations σ = σ_1 ∘ ⋯ ∘ σ_m. For each t = 1, …, m, assume that σ_t corresponds to an oriented cycle of G_K, otherwise the contribution of σ to the sum (5) is zero. Then, the lemma follows by grouping all permutations in the sum (5) that can be decomposed as a product of cyclic permutations σ'_1 ∘ ⋯ ∘ σ'_m where, for all t, σ'_t has the same support as σ_t. ∎

As a consequence, we note that, unlike in the symmetric case, the signs of the off diagonal entries can no longer be determined using a cycle basis of induced cycles, since such a basis may contain only cycles which have no contribution to the principal minors of K. In the same example as above, the only induced cycles of the graph are triangles, and any cycle basis must contain at least three cycles. However, there are only four triangles in that graph and two of them have a zero contribution to the principal minors of K. Hence, in that case, it is necessary to query principal minors that do not correspond to induced cycles in order to find a solution to A.

In order to summarize, we state the following theorem.

Theorem 1.

Let K, K' ∈ 𝒦. The following statements are equivalent.

  • K and K' have the same list of principal minors.

  • K_{i,i} = K'_{i,i} and K_{i,j} K_{j,i} = K'_{i,j} K'_{j,i}, for all i, j ∈ [N] with i ≠ j, K and K' have the same signed adjacency graph and, for all cycles C in that graph, Π_K(C) = Π_{K'}(C).

Theorem 1 does not provide any insight on how to solve B efficiently, since the number of cycles in a graph can be exponentially large in the size of the graph. A refinement of this theorem, where we would characterize a minimal set of cycles that could be found efficiently and that would characterize the principal minors of K (such as a basis of induced cycles, in the symmetric case), is an open problem. However, in the next section, we refine this result for a smaller class of nonsymmetric kernels.

3.3 PMA for signed matrices, dense case

In this section, we only consider matrices K ∈ 𝒦 such that K_{i,j} ≠ 0 for all i, j ∈ [N] with i ≠ j. The adjacency graph of such a matrix is a signed version of the complete graph on [N]. We also assume that, for all pairwise distinct indices,

(6)

Note that Condition (6) only depends on the magnitudes of the entries of K. Hence, if one solution of A satisfies (6), then all the solutions must satisfy it too. Condition (6) is not a strong condition: Indeed, any generic matrix with large enough rank is very likely to satisfy it.

For the sake of simplicity, we restate A and B in the following way. Let K* ∈ 𝒦 be a ground kernel satisfying the two conditions above (i.e., K* is dense and satisfies Condition (6)), and assume that K* is unknown, but that its principal minors are available.

  A. Find a matrix K ∈ 𝒦 such that det(K_J) = det(K*_J), for all J ⊆ [N].

  B. Describe the set of all solutions of A.

Moreover, recall that we would like to find a solution to A that uses few queries from the available list of principal minors of K*, in order to design an algorithm that is not too costly computationally.

Since K is assumed to be dense, every subset of [N] of size at least 3 is the vertex set of a cycle. Moreover, for all cycles C of the complete graph, Π_K(C) only depends on the vertex set of C, not on its edge set. Therefore, in the sequel, for ease of notation, we write Π_K(S) for the common value of Π_K(C) over all cycles C with vertex set S.

The main result of this section is stated in the following theorem.

Theorem 2.

A matrix