Correlated Equilibria for Approximate Variational Inference in MRFs

04/10/2016 ∙ by Luis E. Ortiz, et al. ∙ Arizona State University ∙ University of Michigan

Almost all of the work in graphical models for game theory has mirrored previous work in probabilistic graphical models. Our work considers the opposite direction: taking advantage of recent advances in equilibrium computation for probabilistic inference. We present formulations of inference problems in Markov random fields (MRFs) as computation of equilibria in a certain class of game-theoretic graphical models. We concretely establish the precise connection between variational probabilistic inference in MRFs and correlated equilibria. No previous work exploits recent theoretical and empirical results from the literature on algorithmic and computational game theory on the tractable, polynomial-time computation of exact or approximate correlated equilibria in graphical games with arbitrary, loopy graph structure. We discuss how to design new algorithms with equally tractable guarantees for the computation of approximate variational inference in MRFs. Also, inspired by a previously stated game-theoretic view of state-of-the-art tree-reweighted (TRW) message-passing techniques for belief inference as a zero-sum game, we propose a different, general-sum potential game to design approximate fictitious-play techniques. We perform synthetic experiments evaluating our proposed approximation algorithms against standard methods and TRW on several classes of classical Ising models (i.e., with binary random variables). We also evaluate the algorithms using Ising models learned from the MNIST dataset. Our experiments show that our global approach is competitive, particularly shining in a class of Ising models with constant, "highly attractive" edge-weights, in which it is often better than all other alternatives we evaluated. With a notable exception, our more local approach was not as effective. Yet, in fairness, almost all of the alternatives are often no better than a simple baseline: estimate 0.5.

1 Introduction

Almost all of the work in graphical games has borrowed heavily from analogies to probabilistic graphical models. Yet, over-reliance on those analogies and previous standard approaches to exact inference might have led that approach to face the same computational roadblocks that plagued most exact-inference techniques.

As an example of work that heavily exploits previous work in probabilistic graphical models (PGMs), Kakade et al. (2003) designed polynomial-time algorithms based on linear programming for computing correlated equilibria (CE) in standard graphical games with tree graphs. The approach and polynomial-time results extend to graphical games with bounded-treewidth graphs and graphical polymatrix games with tree graphs. Exact inference is tractable in PGMs whose graphs have bounded treewidth, but intractable in general (Cooper, 1990; Shimony, 1994; Istrail, 2000). In 2005, Papadimitriou and Roughgarden showed the intractability of computing the “social-welfare” optimum CE in arbitrary graphical games (see also Papadimitriou and Roughgarden (2008)). Everything seemed to point toward an eventual resignation that the approach of Kakade et al. (2003), along with any other approach to the problem for that matter, had hit the “bounded-treewidth-threshold wall.”

Yet, soon after, Papadimitriou (2005) took a radically different approach to the problem, and surprised the community with an efficient algorithm for computing CE not only in graphical games, but also in almost all known compactly representable games. Jiang and Leyton-Brown (2015a) built upon Papadimitriou’s idea to provide what most people would consider an improved polynomial-time algorithm, because of the simplification of the CE that their algorithm outputs (see also Jiang and Leyton-Brown, 2011, for a summary). (Footnote 1: Papadimitriou’s work has an interesting history, which Jiang and Leyton-Brown (2015a) nicely summarize. Some questions arose at the time about the technical soundness in the description of some steps in Papadimitriou’s algorithm. Jiang and Leyton-Brown (2015a) provided clarifications to those steps.)

An immediate question that arises from the algorithmic results just described is, what is so fundamentally different between the problem of exact inference in graphical models and equilibrium computation that made this result possible in the context of graphical games? Of course, CE, probabilistic inference, and their variants are different problems, even within the same framework of graphical models. The question is, how different are they?

It is well-known that pure strategy Nash equilibrium (PSNE) is inherently a classical/standard discrete constraint satisfaction problem (CSP). It is also well-known that any CSP can be cast as a most-likely, or equivalently, a maximum a posteriori (MAP) assignment estimation problem in Markov random fields (MRFs). (Footnote 2: Assuming a solution exists, of course; otherwise the resulting MRF is not well-defined.) Through this connection, it is clear that there exists a MAP formulation of PSNE. But what about other, more general forms of equilibria?

We present here a formulation of the problem of equilibrium computation as a kind of local conditions for different approximations to belief inference. Similarly, we show how one can view some special games, called graphical potential games (Ortiz, 2015), as defining an equivalent MRF whose “locally optimal” solutions correspond to arbitrary equilibria of the game. Hence, Papadimitriou’s result, and later that of Jiang and Leyton-Brown, open up the possibility that at least new classes of problems in probabilistic graphical models could be solved exactly and efficiently. The question is, which classes?

While we provide specific connections between the two fields that yield immediate theoretical and computational implications, we also provide practical alternatives that result from those connections. The foundation of both Papadimitriou’s and Jiang and Leyton-Brown’s algorithms is the ellipsoid method, which is one approach that leads to a polynomial-time algorithm for linear programming. This approach, while provably efficient in theory, is often seen as less practical than alternatives such as so-called interior-point methods. This is in contrast to the simple linear programs that are possible for certain classes of graphical games (Kakade et al., 2003). Are there simpler and practically effective variants of Papadimitriou’s or Jiang and Leyton-Brown’s algorithms? While this is an important open question, we do not address it directly in this paper. Instead, we employ ideas from the literature on learning in games (Fudenberg and Levine, 1999), particularly no-regret algorithms and fictitious play, to propose two specific instances of game-theoretic inspired, practical, and effective heuristics for belief inference in MRFs. One heuristic takes a local approach, and the other takes a global approach. We evaluate our proposed algorithms within the context of the most popular, standard, and state-of-the-art techniques from the literature in probabilistic graphical models.

This manuscript describes our work, which starts to address some of the questions above, and reports on our progress.

1.1 Overview of the Paper

Section 2 provides preliminary material, introducing basic notation, terminology, and concepts from graphical models and game theory.

Section 3 is the main technical section of the paper. It shows reductions of different problems in belief inference in MRFs to computing equilibria in graphical potential games compactly represented as Gibbs potential games (Ortiz, 2015). The reductions vary in generality: MAP assignment, marginals, and full-joint estimation reduce to computing pure-strategy Nash equilibria (PSNE), mixed-strategy Nash equilibria (MSNE), and correlated equilibria (CE), respectively. We briefly discuss a connection between Papadimitriou’s algorithm, as well as Jiang and Leyton-Brown’s, and the work of Jaakkola and Jordan (1997) on variational approximations to the problem of probabilistic inference in MRFs via mean-field mixtures. The paper also includes a discussion on the connections to previous work in computer vision on the problem of relaxation labeling, and work on game-theoretic approaches to (Bayesian) statistical estimation. We then present an alternative approach based on a more global view of the problem, in contrast to the more local approach of the formulations mentioned above. More specifically, we formulate the inference problem using a two-player potential game, inspired by the work on tree-reweighted (TRW) message-passing (Wainwright et al., 2005). We propose a special type of sequential, “hybrid” standard and stochastic fictitious play algorithm for belief inference.

Section 4 reports on our experimental evaluation. We compare our proposed algorithms to the popular, most commonly used, standard, and easily implementable approximation techniques in use today.

Section 5 discusses future work and suggests new opportunities for other potential research directions, beyond those already discussed in the main technical sections of the paper.

Section 6 concludes the paper with a summary of our contributions.

2 Preliminaries

This section introduces basic notation and concepts in graphical models and game theory used throughout the paper. It also includes brief statements on current state-of-the-art mathematical and computational results in the area.

Basic Notation.

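Denote by $x \equiv (x_1, x_2, \ldots, x_n)$ an $n$-dimensional vector and by $x_{-i} \equiv (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ the same vector without component $i$. Similarly, for every set $S \subset [n] \equiv \{1, \ldots, n\}$, denote by $x_S \equiv (x_i : i \in S)$ the (sub-)vector formed from $x$ using only components in $S$, such that, letting $S^c \equiv [n] - S$ denote the complement of $S$, we can denote $x \equiv (x_S, x_{S^c}) \equiv (x_i, x_{-i})$ for every $i$. If $A_1, \ldots, A_n$ are sets, denote by $A \equiv \times_{i \in [n]} A_i$, $A_{-i} \equiv \times_{j \in [n] - \{i\}} A_j$, and $A_S \equiv \times_{j \in S} A_j$.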

Graph Terminology and Notation.

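Let $G = (V, E)$ be an undirected graph, with finite set of vertices or nodes $V$ and a set of (undirected) edges $E$. For each node $i$, let $\mathcal{N}(i) \equiv \{ j \mid (i, j) \in E \}$ be the set of neighbors of $i$ in $G$, not including $i$, and $N(i) \equiv \mathcal{N}(i) \cup \{i\}$ the set including $i$. A clique $C$ of $G$ is a set of nodes with the property that they are all mutually connected: for all $i, j \in C$, $(i, j) \in E$; in addition, $C$ is maximal if there is no other node $k$ outside $C$ that is also connected to each node in $C$, i.e., for all $k \in V - C$, $(k, i) \notin E$ for some $i \in C$.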

Another useful concept in the context of this paper is that of hypergraphs, which are generalizations of regular graphs. A hypergraph $\mathcal{G} = (V, \mathcal{E})$ is defined by a set of nodes $V$ and a set of hyperedges $\mathcal{E} \subset 2^V$. We can think of the hyperedges as cliques in a regular graph. Indeed, the primal graph of the hypergraph is the graph induced by the node set $V$ in which there is an edge between two nodes if they both belong to the same hyperedge; in other words, the primal graph is the graph induced by taking each hyperedge and forming cliques of nodes in a regular graph.
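As a small illustration of the primal-graph construction just described, the following Python sketch builds the edge set of the primal graph from a list of hyperedges. The function name `primal_graph` and the example hypergraph are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def primal_graph(hyperedges):
    """Return the edge set of the primal graph of a hypergraph.

    Two nodes are adjacent iff they appear together in some hyperedge,
    so every hyperedge becomes a clique of the primal graph.
    """
    edges = set()
    for hyperedge in hyperedges:
        for i, j in combinations(sorted(hyperedge), 2):
            edges.add((i, j))
    return edges

# Hypothetical hypergraph on nodes {1, 2, 3, 4} with two hyperedges.
hyperedges = [{1, 2, 3}, {3, 4}]
edges = primal_graph(hyperedges)
print(sorted(edges))  # [(1, 2), (1, 3), (2, 3), (3, 4)]
```

Each edge is stored once as a sorted pair, which is enough for an undirected graph.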

2.1 Probabilistic Graphical Models

Probabilistic graphical models are an elegant marriage of probability and graph theory that has had tremendous impact on the theory and practice of modern artificial intelligence, machine learning, and statistics. They have permitted effective modeling of large, structured, high-dimensional complex systems found in the real world. The language of probabilistic graphical models allows us to capture the structure of complex interactions between individual entities in the system within a single model. The core component of the model is a graph in which each node $i$ corresponds to a random variable $X_i$ and the edges express conditional independence assumptions about those random variables in the probabilistic system.

2.1.1 Markov Random Fields, Gibbs Distributions, and the Hammersley-Clifford Theorem

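By definition, a joint probability distribution $P$ is a Markov random field (MRF) with respect to (wrt) an undirected graph $G$ if for all $x$, for every node $i$, $P(X_i = x_i \mid X_{-i} = x_{-i}) = P(X_i = x_i \mid X_{\mathcal{N}(i)} = x_{\mathcal{N}(i)})$, where $\mathcal{N}(i)$ is the set of neighbors of $i$ in $G$. In that case, the neighbors/variables $X_{\mathcal{N}(i)}$ form the Markov blanket of node/variable $X_i$.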

Also by definition, a joint distribution $P$ is a Gibbs distribution wrt an undirected graph $G$ if it can be expressed as $P(X = x) \propto \prod_{C \in \mathcal{C}} \Phi_C(x_C)$ for some functions $\Phi_C$ indexed by a clique $C \in \mathcal{C}$, the set of all (maximal) cliques in $G$, and mapping every possible value $x_C$ that the random variables $X_C$ associated with the nodes in $C$ can take to a non-negative number.

We say that a joint probability distribution $P$ is positive if it has full support (i.e., $P(X = x) > 0$ for all $x$). (Footnote 3: The positivity constraint is only necessary for the proof of the “only if” case of the theorem.)

Theorem 1.

(Hammersley-Clifford (Hammersley and Clifford, 1971)) Let $P$ be a positive joint probability distribution. Then, $P$ is an MRF with respect to $G$ if and only if $P$ is a Gibbs distribution with respect to $G$.

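In the context of the theorem, the functions $\Phi_C$ are positive, which allows us to define MRFs in terms of local potential functions $\{\phi_C\}$ over each clique $C$ in the graph, with $\phi_C(x_C) \equiv \log \Phi_C(x_C)$. Define the function $\Psi(x) \equiv \sum_{C \in \mathcal{C}} \phi_C(x_C)$. Let us refer to any function of this form as a Gibbs potential with respect to $G$. A more familiar expression of an MRF is $P(X = x) \propto \exp\left( \sum_{C \in \mathcal{C}} \phi_C(x_C) \right) = \exp(\Psi(x))$.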

2.1.2 Some Inference-Related Problems in MRFs

One problem of interest in an MRF is to compute a most likely assignment $x^* \in \arg\max_x P(X = x)$; that is, the most likely outcome with respect to the MRF $P$. Another problem is to compute the individual marginal probabilities $P(X_i = x_i)$ for each variable $X_i$. A related problem is to compute the normalizing constant $Z \equiv \sum_x \exp(\Psi(x))$, where $\Psi$ is the Gibbs potential (also known as the partition function of the MRF).
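To make the three inference problems concrete, here is a brute-force Python sketch on a tiny model. The 3-node chain Ising model, its weights, and all names are hypothetical choices for illustration; exhaustive enumeration is exponential in the number of variables and is only feasible for toy instances like this one.

```python
import math
from itertools import product

# Hypothetical 3-node chain Ising model with x_i in {-1, +1} and
# Gibbs potential Psi(x) = sum_{(i,j)} w_ij x_i x_j + sum_i b_i x_i.
weights = {(0, 1): 1.0, (1, 2): 1.0}
biases = [0.5, 0.0, 0.0]

def gibbs_potential(x):
    pair = sum(w * x[i] * x[j] for (i, j), w in weights.items())
    single = sum(b * xi for b, xi in zip(biases, x))
    return pair + single

states = list(product([-1, +1], repeat=3))

# Partition function: Z = sum_x exp(Psi(x)).
Z = sum(math.exp(gibbs_potential(x)) for x in states)

# Full joint distribution: P(x) = exp(Psi(x)) / Z.
prob = {x: math.exp(gibbs_potential(x)) / Z for x in states}

# MAP assignment: a maximizer of P(x), equivalently of Psi(x).
map_assignment = max(states, key=gibbs_potential)

# Marginals: P(X_i = +1) for each node i.
marginals = [sum(p for x, p in prob.items() if x[i] == 1) for i in range(3)]

print(map_assignment)
print(marginals)
```

In this example the only bias sits on node 0, so flipping every sign changes the potential by exactly 1 whenever $x_0 = +1$, which gives $P(X_0 = +1) = e / (1 + e)$ in closed form.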

Another set of problems concerns so-called “belief updating”: computing information related to the posterior probability distribution after having observed the outcome of some of the variables, also known as the evidence. For MRFs, this problem is computationally equivalent to that of computing prior marginal probabilities.

2.1.3 Brief Overview of Computational Results in Probabilistic Graphical Models

Both the exact and approximate versions of most inference-related problems in MRFs are in general intractable (e.g., NP-hard), although polynomial-time algorithms do exist for some special cases (see, e.g., Dagum and Luby, 1993, Roth, 1996, Istrail, 2000, Wang et al., 2013, and the references therein). The complexity of exact algorithms is usually characterized by structural properties of the graph, and the typical statement is that running times are polynomial only for graphs with bounded treewidth (see, e.g., Russell and Norvig, 2003 for more information). Several deterministic and randomized approximation approaches exist (see, e.g., Jordan et al., 1999; Jaakkola, 2000; Geman and Geman, 1984). An approximation approach of particular interest in this paper is variational inference (Jordan et al., 1999; Jaakkola, 2000). Roughly speaking, the general idea is to approximate an intractable MRF $P$ by a “closest” probability distribution $Q^*$ within a “computationally tractable” class $\mathcal{Q}$: formally, $Q^* \in \arg\min_{Q \in \mathcal{Q}} \mathrm{KL}(Q \,\|\, P)$, where $\mathrm{KL}(Q \,\|\, P) \equiv \sum_x Q(x) \log \frac{Q(x)}{P(x)}$ is the Kullback-Leibler (KL) divergence between probability distributions $Q$ and $P$ wrt $Q$. The simplest example is the so-called mean-field (MF) approximation, in which $\mathcal{Q} = \{ Q \mid Q(x) = \prod_i Q_i(x_i) \text{ for all } x \}$ consists of all possible product distributions. Even if $P$ is an Ising model (IM), no closed-form solution exists for its mean-field approximation, and the most common computational scheme is based on simple axis-parallel optimizations, leading to individual local conditions of optimality and potential local minima: that is, the problem is essentially reduced to finding $Q^*$ such that for all $i$, we have $Q_i^* \in \arg\max_{Q_i} \sum_{x_i} Q_i(x_i) \sum_{x_{-i}} Q_{-i}^*(x_{-i}) \Psi(x_i, x_{-i}) + H(Q_i)$, where $Q_{-i}^*(x_{-i}) \equiv \prod_{j \neq i} Q_j^*(x_j)$ and $H(Q_i) \equiv -\sum_{x_i} Q_i(x_i) \log Q_i(x_i)$ is the (Shannon) entropy of random variable $X_i$ under $Q_i$.
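The axis-parallel mean-field scheme can be sketched in a few lines of Python for an Ising model. This is the textbook naive mean-field update, not an algorithm proposed in this paper, and the model (a 3-node chain with $x_i \in \{-1, +1\}$ and the weights shown) is a hypothetical example. For this parameterization, the optimal $Q_i$ given the others satisfies $Q_i(x_i) \propto \exp(x_i h_i)$ with $h_i = b_i + \sum_j w_{ij} \mathbf{E}_Q[X_j]$, so the mean of $X_i$ becomes $\tanh(h_i)$.

```python
import math

# Hypothetical 3-node chain Ising model with x_i in {-1, +1} and
# Gibbs potential Psi(x) = sum_{(i,j)} w_ij x_i x_j + sum_i b_i x_i.
weights = {(0, 1): 1.0, (1, 2): 1.0}
biases = [0.5, 0.0, 0.0]

def local_field(i, means):
    """Expected field on node i: b_i + sum_j w_ij * E_Q[X_j]."""
    field = biases[i]
    for (a, b), w in weights.items():
        if a == i:
            field += w * means[b]
        elif b == i:
            field += w * means[a]
    return field

def mean_field(num_sweeps=200):
    """Axis-parallel (coordinate-wise) mean-field updates."""
    means = [0.0] * len(biases)  # E_Q[X_i]; start from uniform Q_i
    for _ in range(num_sweeps):
        for i in range(len(means)):
            # Optimal Q_i given the others has mean tanh(local field).
            means[i] = math.tanh(local_field(i, means))
    return means

means = mean_field()
marginals = [(1.0 + m) / 2.0 for m in means]  # Q_i(X_i = +1)
print(marginals)
```

The result is a fixed point of the local optimality conditions, which in general is only a local optimum of the variational objective; the fixed number of sweeps is an arbitrary choice that is more than enough for this small example.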

2.2 Game Theory

Game theory (von Neumann and Morgenstern, 1947) provides a mathematical model of the stable behavior (or outcome) that may result from the interaction of rational individuals. This paper concentrates on noncooperative settings: individuals maximize their own utility, act independently, and do not have (direct) control over the behavior of others. (Footnote 4: Individual rationality here means that each player seeks to maximize their own utility. Also note that, while many parlor “win-lose”/zero-sum games involve competition, in general, noncooperative does not mean competitive: each player just wants to do the best for himself, regardless of how useful or harmful his behavior is to others.)

The concept of equilibrium is central to game theory. Roughly, an equilibrium in a noncooperative game is a point of strategic stance, where no individual player can gain by unilaterally deviating from the equilibrium behavior.

2.2.1 Games and their Representation

Let $V = [n]$ denote a finite set of $n$ players in a game. For each player $i \in V$, let $A_i$ denote the set of actions or pure strategies that $i$ can play. Let $A \equiv \times_{i \in V} A_i$ denote the set of joint actions, $x \equiv (x_1, \ldots, x_n) \in A$ denote a joint action, and $x_i$ the individual action of player $i$ in $x$. Denote by $x_{-i}$ the joint action of all the players except $i$, such that $x \equiv (x_i, x_{-i})$. Let $M_i : A \to \mathbb{R}$ denote the payoff/utility function of player $i$. If the $A_i$’s are finite, then $M_i$ is called the payoff matrix of player $i$. Games represented this way are called normal- or strategic-form games.

There are a variety of compact representations for large games inspired by probabilistic graphical models in AI and machine learning (La Mura, 2000; Kearns et al., 2001; Koller and Milch, 2003; Leyton-Brown and Tennenholtz, 2003; Jiang and Leyton-Brown, 2008). The results of this paper are presented in the context of the following generalization of graphical games (Kearns et al., 2001), a simple but powerful model inspired by probabilistic graphical models such as MRFs, previously defined by Ortiz (2014). (Footnote 5: Connections have already been established between the different kinds of compact representations (Jiang and Leyton-Brown, 2008), which may facilitate extensions of ideas, frameworks, and results to those alternative models.)

Definition 1.

A graphical multi-hypermatrix game (GMhG) is defined by

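  • a directed graph $G = (V, E)$ in which there is a node $i \in V$ in $G$ for each of the $n$ players in the game (i.e., $|V| = n$), and the set of directed edges, or arcs, $E$ defines a set of neighbors $\mathcal{N}(i) \equiv \{ j \mid (j, i) \in E, j \neq i \}$ whose actions affect the payoff function of $i$ (i.e., $j$ is a neighbor of $i$ if and only if there is an arc from $j$ to $i$); and

  • for each player $i \in V$,

    • a set of actions $A_i$,

    • a hypergraph where the vertex set is its (inclusive) neighborhood $N(i) \equiv \mathcal{N}(i) \cup \{i\}$ and the hyperedge set is a set of cliques of players $\mathcal{C}_i \subset 2^{N(i)}$, and

    • a set $\{ M'_{i,C} : A_C \to \mathbb{R} \mid C \in \mathcal{C}_i \}$ of local-clique payoff (hyper)matrices.

The interpretation of a GMhG is that, for each player $i$, the local and global payoff (hyper)matrices $M'_i : A_{N(i)} \to \mathbb{R}$ and $M_i : A \to \mathbb{R}$ of $i$ are (implicitly) defined as $M'_i(x_{N(i)}) \equiv \sum_{C \in \mathcal{C}_i} M'_{i,C}(x_C)$ and $M_i(x) \equiv M'_i(x_{N(i)})$, respectively.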

Graphical potential games.

Graphical potential games are special instances of GMhGs. They play a key role in establishing a stronger connection between probabilistic inference in MRFs and equilibria in games than previously noted. Ortiz (2015) provides a characterization of graphical potential games, and discusses the implication of convergence of certain kinds of “playing” processes in games based on connections to the Gibbs sampler (Geman and Geman, 1984), via the Hammersley-Clifford Theorem (Hammersley and Clifford, 1971; Besag, 1974). Yu and Berthod (1995) (implicitly) used graphical potential games to establish an equivalence between local maximum-a-posteriori (MAP) inference in Markov random fields and Nash equilibria of the game, a topic revisited in Section 3.1. (Footnote 6: In the interest of brevity, please see Ortiz (2014) for a thorough discussion of GMhGs, including their compact representation size and connections to other classical classes of games in game theory.)

2.2.2 Equilibria as Solution Concepts

Equilibria are generally considered the solutions of games. Various notions of equilibria exist. A pure strategy (Nash) equilibrium (PSNE) of a game is a joint action $x^*$ such that for all players $i$, and for all actions $x_i \in A_i$, $M_i(x_i^*, x_{-i}^*) \geq M_i(x_i, x_{-i}^*)$. That is, no player can improve its payoff by unilaterally deviating from its prescribed equilibrium action $x_i^*$, assuming the others stick to their actions $x_{-i}^*$. Some games, such as the extensively-studied Prisoner’s Dilemma, have PSNE; many others, such as “playground” Rock-Paper-Scissors, do not. This is problematic because it will not be possible to “solve” some games using PSNE.
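The PSNE condition can be checked by exhaustive enumeration in small normal-form games. The Python sketch below does so for the two games mentioned above; the function name, the dictionary-based payoff representation, and the specific payoff numbers are hypothetical choices for illustration.

```python
from itertools import product

def pure_nash_equilibria(payoffs, action_sets):
    """Enumerate PSNE; payoffs[i][joint_action] is player i's payoff."""
    equilibria = []
    for joint in product(*action_sets):
        # Stable iff no player gains by a unilateral deviation.
        stable = all(
            payoffs[i][joint[:i] + (alt,) + joint[i + 1:]] <= payoffs[i][joint]
            for i, actions in enumerate(action_sets)
            for alt in actions
        )
        if stable:
            equilibria.append(joint)
    return equilibria

# Prisoner's Dilemma with hypothetical payoffs (C = cooperate, D = defect).
pd_actions = [("C", "D"), ("C", "D")]
pd_payoffs = [
    {("C", "C"): -1, ("C", "D"): -3, ("D", "C"): 0, ("D", "D"): -2},  # row
    {("C", "C"): -1, ("C", "D"): 0, ("D", "C"): -3, ("D", "D"): -2},  # column
]

# Rock-Paper-Scissors: +1 for a win, -1 for a loss, 0 for a tie.
rps_actions = [("R", "P", "S"), ("R", "P", "S")]
beats = {"R": "S", "P": "R", "S": "P"}

def rps_score(mine, theirs):
    return 0 if mine == theirs else (1 if beats[mine] == theirs else -1)

rps_payoffs = [
    {(a, b): rps_score(a, b) for a, b in product("RPS", repeat=2)},
    {(a, b): rps_score(b, a) for a, b in product("RPS", repeat=2)},
]

print(pure_nash_equilibria(pd_payoffs, pd_actions))    # [('D', 'D')]
print(pure_nash_equilibria(rps_payoffs, rps_actions))  # []
```

Enumeration is exponential in the number of players, so this is a definition check rather than a practical algorithm.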

A mixed-strategy $Q_i$ of player $i$ is a probability distribution over $A_i$ such that $Q_i(x_i)$ is the probability that $i$ chooses to play action $x_i$. (Footnote 7: Note that the sets of mixed strategies contain pure strategies, as we can always recover playing a pure strategy exclusively.) A joint mixed-strategy $Q$ is a joint probability distribution capturing the players’ behavior, such that $Q(x)$ is the probability that joint action $x$ is played, or in other words, each player $i$ plays action $x_i$ in component $i$ of $x$. Because we are assuming that the players play independently, $Q$ is a product distribution: $Q(x) = \prod_i Q_i(x_i)$. Denote by $Q_{-i}(x_{-i}) \equiv \prod_{j \neq i} Q_j(x_j)$ the joint mixed strategies of all the players except $i$. The expected payoff of a player $i$ when some joint mixed-strategy $Q$ is played is $\sum_x Q(x) M_i(x)$; abusing notation, denote it by $M_i(Q)$. The conditional expected payoff of a player $i$ given that he plays action $x_i$ is $\sum_{x_{-i}} Q_{-i}(x_{-i}) M_i(x_i, x_{-i})$; abusing notation again, denote it by $M_i(x_i, Q_{-i})$.

A mixed-strategy Nash equilibrium (MSNE) is a joint mixed-strategy $Q^*$ that is a product distribution formed by the individual players’ mixed strategies $Q_i^*$ such that, for all players $i$, and any other alternative mixed strategy $Q_i'$ for his play, $M_i(Q_i^*, Q_{-i}^*) \geq M_i(Q_i', Q_{-i}^*)$. Every game in normal-form has at least one such equilibrium (Nash, 1951). Thus, every game has an MSNE “solution.”
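Because a player's expected payoff is linear in its own mixed strategy, it suffices to test deviations to pure actions when checking the MSNE condition. The Python sketch below computes the largest such gain for a product distribution; a value of zero (or below) means the profile is an MSNE. All names and payoff numbers are hypothetical and serve only as an illustration.

```python
from itertools import product

def expected_payoff(payoffs, strategies, i, forced_action=None):
    """Expected payoff of player i under a product distribution.

    If forced_action is given, player i plays it with probability 1 while
    the other players keep their mixed strategies.
    """
    total = 0.0
    for joint in product(*(list(s) for s in strategies)):
        if forced_action is not None and joint[i] != forced_action:
            continue
        prob = 1.0
        for j, action in enumerate(joint):
            if j != i or forced_action is None:
                prob *= strategies[j][action]
        total += prob * payoffs[i][joint]
    return total

def max_deviation_gain(payoffs, strategies):
    """Largest gain any player obtains by deviating to a pure action."""
    return max(
        expected_payoff(payoffs, strategies, i, a)
        - expected_payoff(payoffs, strategies, i)
        for i in range(len(strategies))
        for a in strategies[i]
    )

# Rock-Paper-Scissors: +1 for a win, -1 for a loss, 0 for a tie.
beats = {"R": "S", "P": "R", "S": "P"}

def rps_score(mine, theirs):
    return 0 if mine == theirs else (1 if beats[mine] == theirs else -1)

rps_payoffs = [
    {(a, b): rps_score(a, b) for a, b in product("RPS", repeat=2)},
    {(a, b): rps_score(b, a) for a, b in product("RPS", repeat=2)},
]
uniform = {"R": 1 / 3, "P": 1 / 3, "S": 1 / 3}
rps_gain = max_deviation_gain(rps_payoffs, [uniform, uniform])

# Prisoner's Dilemma (hypothetical payoffs) with both players cooperating.
pd_payoffs = [
    {("C", "C"): -1, ("C", "D"): -3, ("D", "C"): 0, ("D", "D"): -2},
    {("C", "C"): -1, ("C", "D"): 0, ("D", "C"): -3, ("D", "D"): -2},
]
cooperate = {"C": 1.0, "D": 0.0}
pd_gain = max_deviation_gain(pd_payoffs, [cooperate, cooperate])

print(rps_gain, pd_gain)
```

Uniform play in Rock-Paper-Scissors leaves no profitable deviation, while mutual cooperation in the Prisoner's Dilemma leaves a gain of 1 from defecting, so it is an approximate equilibrium only for a tolerance of at least 1.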

One relaxation of MSNE considers the case where the amount of gain each player can obtain from unilateral deviation is very small. This concept is particularly useful to study approximation versions of the computational problem. Given $\epsilon \geq 0$, an (approximate) $\epsilon$-Nash equilibrium ($\epsilon$-MSNE) is defined as above, except that the expected gain condition becomes $M_i(Q_i^*, Q_{-i}^*) \geq M_i(Q_i', Q_{-i}^*) - \epsilon$.

Several refinements and generalizations of MSNE have been proposed. One of the most interesting generalizations is that of a correlated equilibrium (CE) (Aumann, 1974). In contrast to MSNE, a CE can be a full joint distribution, and thus characterize more complex joint-action behavior by players. Formally, a correlated equilibrium (CE) is a joint probability distribution $Q$ over $A$ such that, for all players $i$, $x_i, x_i' \in A_i$, $x_i \neq x_i'$, and $Q(x_i) > 0$,

$$\sum_{x_{-i}} Q(x_{-i} \mid x_i) M_i(x_i, x_{-i}) \geq \sum_{x_{-i}} Q(x_{-i} \mid x_i) M_i(x_i', x_{-i}),$$

where $Q(x_i) \equiv \sum_{x_{-i}} Q(x_i, x_{-i})$ is the (marginal) probability that player $i$ will play $x_i$ according to $Q$ and $Q(x_{-i} \mid x_i) \equiv Q(x_i, x_{-i}) / Q(x_i)$ is the conditional given $x_i$. An MSNE is a CE that is a product distribution. An equivalent expression of the CE condition above is $\sum_{x_{-i}} Q(x_i, x_{-i}) M_i(x_i, x_{-i}) \geq \sum_{x_{-i}} Q(x_i, x_{-i}) M_i(x_i', x_{-i})$. As was the case for MSNE, we can relax the condition of deviation to account for potential gains from small deviation. Given $\epsilon$, adding the term “$-\epsilon$” to the right-hand-side of the condition above defines an (approximate) $\epsilon$-CE. (Footnote 8: Note that approximate CE is usually defined based on this unconditional version of the CE conditions (Hart and Mas-Colell, 2000).)
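The unconditional form of the CE condition is a finite set of linear inequalities in the joint distribution, which makes it easy to verify for a given candidate. The Python sketch below evaluates the largest violation on a two-player game of Chicken; the game, its payoff numbers, and the function name are hypothetical and not taken from the paper.

```python
def max_ce_gain(payoffs, action_sets, dist):
    """Largest unconditional gain from swapping a recommended action.

    dist maps joint actions to probabilities (missing entries count as 0).
    dist is a CE iff the returned value is <= 0, and an approximate CE
    with tolerance eps iff it is <= eps.
    """
    worst = float("-inf")
    for i, actions in enumerate(action_sets):
        for recommended in actions:
            for alternative in actions:
                if alternative == recommended:
                    continue
                gain = 0.0
                for joint, p in dist.items():
                    if joint[i] != recommended:
                        continue
                    deviation = joint[:i] + (alternative,) + joint[i + 1:]
                    gain += p * (payoffs[i][deviation] - payoffs[i][joint])
                worst = max(worst, gain)
    return worst

# Game of Chicken with hypothetical payoffs (C = yield, D = dare).
chicken_actions = [("C", "D"), ("C", "D")]
chicken_payoffs = [
    {("C", "C"): 6, ("C", "D"): 2, ("D", "C"): 7, ("D", "D"): 0},  # row
    {("C", "C"): 6, ("C", "D"): 7, ("D", "C"): 2, ("D", "D"): 0},  # column
]

# A "traffic light" style distribution that never recommends (D, D).
correlated = {("C", "C"): 1 / 3, ("C", "D"): 1 / 3, ("D", "C"): 1 / 3}
# A point mass on mutual yielding, which is not an equilibrium.
both_yield = {("C", "C"): 1.0}

ce_gain = max_ce_gain(chicken_payoffs, chicken_actions, correlated)
bad_gain = max_ce_gain(chicken_payoffs, chicken_actions, both_yield)
print(ce_gain <= 0, bad_gain)
```

The first distribution is not a product distribution, so it is a CE that no MSNE can represent; the second fails because a player told to yield gains by daring.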

CE have several conceptual and computational advantages over MSNE. For instance, all players may achieve better expected payoffs in a CE than those achievable in any MSNE (footnote 9: the distinction between installing a traffic light at an intersection and leaving the intersection without one is a real-world example of this); some “natural” forms of play are guaranteed to converge to the (set of) CE (Foster and Vohra, 1997, 1999; Fudenberg and Levine, 1999; Hart and Mas-Colell, 2000, 2003, 2005); and CE is consistent with a Bayesian framework (Aumann, 1987), something not yet possible, and apparently unlikely for MSNE (Hart and Mansour, 2007).

2.2.3 Brief Overview of Results in Computational Game Theory

There has been an explosion of computational results on different equilibrium concepts on a variety of game representations and settings since the beginning of this century. The following is a brief summary. We refer the reader to a book by Nisan et al. (2007) for a (partial) introduction to this research area.

The problem for two-player zero-sum games, where the sum of the corresponding entries of both payoff matrices is zero, and therefore only one matrix is needed to represent the game, can be solved in polynomial time: It is equivalent to linear programming (von Neumann and Morgenstern, 1947; Szép and Forgoó, 1985; Karlin, 1959). After being open for over 50 years, the problem of the complexity of computing MSNE in games was finally settled recently, following a very rapid sequence of results in the last part of 2005 (Goldberg and Papadimitriou, 2005; Daskalakis et al., 2005; Daskalakis and Papadimitriou, 2005; Daskalakis et al., 2009b; Chen and Deng, 2005b): Computing MSNE is likely to be hard in the worst case, i.e., PPAD-complete (Papadimitriou, 1994), even in games with only two players (Chen and Deng, 2005a, 2006; Chen et al., 2009; Daskalakis et al., 2009a, b). The result of Fabrikant et al. (2004) suggests that computing PSNE in succinctly representable games is also likely to be intractable in the worst case, i.e., PLS-complete (Johnson et al., 1988). A common statement is that computing MSNE, and in some cases even PSNE, with “special properties” is hard in the worst case (Gilboa and Zemel, 1989; Gottlob et al., 2003; Conitzer and Sandholm, 2008). Computing approximate MSNE is also thought to be hard in the worst case (Chen et al., 2006, 2009). We refer the reader to Ortiz and Irfan (2017), and the references therein, for recent results along this line and a brief survey of the state-of-the-art for this problem.

Most current results for computing exact and approximate PSNE or MSNE in graphical games essentially mirror those for MRFs and constraint networks: polynomial time for bounded-treewidth graphs; intractable in general (Kearns et al., 2001; Gottlob et al., 2003; Daskalakis and Papadimitriou, 2006; Ortiz, 2014). This is unsurprising because they were mostly inspired by analogous versions in probabilistic graphical models and constraint networks in AI, and therefore share similar characteristics. Several heuristics exist for dealing with general graphs (Vickrey and Koller, 2002; Ortiz and Kearns, 2003; Daskalakis and Papadimitriou, 2006).

In contrast, there exist polynomial-time algorithms for computing CE, both for normal-form games (where the problem reduces to a simple linear feasibility problem) and even most succinctly-representable games known today (Papadimitriou, 2005; Jiang and Leyton-Brown, 2015a), including graphical games.

3 Equilibria and Inference

The line of work presented in this section is partly motivated by the following question: Can we leverage advances in computational game theory for problems in the probabilistic graphical models community? Establishing a strong bilateral connection between both problems may help us answer this question.

The literature on computing equilibria in games has skyrocketed since the beginning of this century. As we discover techniques developed early on within the game theory community, and as new results are generated from the extremely active computational game theory community, we may be able to adapt those techniques for solving games to the inference setting. If we can establish a strong bilateral connection between inference problems and the computation of equilibria, we may be able to relate algorithms in both areas and exchange previously unknown results in each.

3.1 Pure-Strategy Nash Equilibrium and Approximate MAP Inference

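Consider an MRF $P$ with respect to graph $G$ and Gibbs potential $\Psi$ defined by the set of potential functions $\{\phi_C\}$. For each node $i$, denote by $\mathcal{C}_i \equiv \{ C \in \mathcal{C} \mid i \in C \}$ the subset of cliques in $G$ that include $i$. Note that the (inclusive) neighborhood of player $i$ is given by $N(i) = \cup_{C \in \mathcal{C}_i} C$.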

Define an MRF-induced GMhG, and more specifically, a (hyperedge-symmetric) hypergraphical game (Papadimitriou, 2005; Ortiz, 2015), with the same graph $G$, and for each player $i$, hypergraph with hyperedges $\mathcal{C}_i$ and local-clique payoff hypermatrices $M'_{i,C}(x_C) \equiv \phi_C(x_C)$ for all $C \in \mathcal{C}_i$. A few observations about the game are in order.

Property 1.

The representation size of the MRF-induced game is the same as that of the MRF: exponential not in the largest neighborhood size, but in the size of the largest clique in $G$.

Property 2.

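The MRF-induced game is a graphical potential game (Ortiz, 2015) with graph $G$ and (Gibbs) potential function $\Psi$: i.e., for all $i$, $x$, and $x_i'$, $M_i(x_i, x_{-i}) - M_i(x_i', x_{-i}) = \Psi(x_i, x_{-i}) - \Psi(x_i', x_{-i})$.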
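The potential-game property can be checked numerically on a small example: a player's payoff sums only the clique potentials that contain it, and the remaining cliques cancel in any unilateral deviation. The Python sketch below verifies this on a hypothetical 3-node MRF; the clique potentials and all names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical log-potentials phi_C over binary variables {0, 1},
# keyed by the clique C (a tuple of node indices).
clique_potentials = {
    (0,): lambda a: 0.2 * a,
    (0, 1): lambda a, b: 1.0 if a == b else -0.5,
    (1, 2): lambda b, c: 0.75 * b * c - 0.25 * c,
}
num_nodes = 3

def gibbs_potential(x):
    """Psi(x): sum of phi_C(x_C) over all cliques C."""
    return sum(phi(*(x[k] for k in clique))
               for clique, phi in clique_potentials.items())

def payoff(i, x):
    """M_i(x): sum of phi_C(x_C) over the cliques C that contain player i."""
    return sum(phi(*(x[k] for k in clique))
               for clique, phi in clique_potentials.items() if i in clique)

def is_potential_game():
    """Check M_i(x) - M_i(x') == Psi(x) - Psi(x') for unilateral deviations."""
    for x in product([0, 1], repeat=num_nodes):
        for i in range(num_nodes):
            for alt in (0, 1):
                y = x[:i] + (alt,) + x[i + 1:]
                payoff_gain = payoff(i, x) - payoff(i, y)
                potential_gain = gibbs_potential(x) - gibbs_potential(y)
                if abs(payoff_gain - potential_gain) > 1e-12:
                    return False
    return True

print(is_potential_game())  # True
```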

Remark 1.

Through the connection established by the last property, it is easy to see that sequential best-response dynamics is guaranteed to converge to a PSNE of the game in finite time, regardless of the initial play. (Footnote 10: Recall that best-response dynamics refers to a process where, at each time step, each player $i$ observes the action $x_{-i}$ of others and takes an action $x_i$ that maximizes its payoff given that the others played $x_{-i}$.) In this case, those dynamics would essentially be implementing an axis-parallel coordinate maximization over the space of assignments for the MRF, which is guaranteed to converge to a local maximum (or critical point) of the MRF. In fact, we can conclude that a joint action $x^*$ is a PSNE of the game if and only if $x^*$ is a local maximum or a critical point of the MRF $P$. Thus, the MRF-induced game, like all potential games (Monderer and Shapley, 1996b), always has PSNE. (Footnote 11: This result should not be surprising given that other researchers have established a one-to-one relationship between the complexity class PLS (Johnson et al., 1988), which characterizes local search problems, of which finding local maxima of the MRF is an instance, and (ordinal) potential games (Fabrikant et al., 2004).)
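A minimal Python sketch of sequential best-response dynamics in the game induced by an Ising model follows. The model (a frustrated triangle with a pendant node), the starting point, and the function names are hypothetical. A player switches only when doing so strictly increases its payoff, so the Gibbs potential strictly increases with every switch and the loop must terminate at a joint action that no single flip can improve.

```python
def best_response_dynamics(x, weights, biases):
    """Sequential best response in the game induced by an Ising MRF.

    Player i's local payoff is x_i * (b_i + sum_j w_ij x_j); a player
    flips its action only when that strictly increases its payoff.
    """
    x = list(x)
    neighbors = {i: [] for i in range(len(x))}
    for (i, j), w in weights.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    changed = True
    while changed:
        changed = False
        for i in range(len(x)):
            field = biases[i] + sum(w * x[j] for j, w in neighbors[i])
            if x[i] * field < 0:  # flipping strictly improves the payoff
                x[i] = -x[i]
                changed = True
    return tuple(x)

def gibbs_potential(x, weights, biases):
    pair = sum(w * x[i] * x[j] for (i, j), w in weights.items())
    return pair + sum(b * xi for b, xi in zip(biases, x))

def flip(x, i):
    return x[:i] + (-x[i],) + x[i + 1:]

# Hypothetical Ising model with x_i in {-1, +1}.
weights = {(0, 1): -1.0, (1, 2): -1.0, (0, 2): -1.0, (2, 3): 0.5}
biases = [0.1, 0.0, 0.0, -0.2]
start = (1, 1, 1, 1)

final = best_response_dynamics(start, weights, biases)
print(final, gibbs_potential(final, weights, biases))
```

The returned assignment is a PSNE of the induced game and hence a local maximum of the potential with respect to single-variable changes; it need not be the global MAP assignment.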

Similarly, for any potential game, one can define a game-induced MRF using the potential function of the game whose set of local maxima (and critical points) corresponds exactly to the set of PSNE of the potential game. Through this connection we can show that solving the local-MAP problem in MRFs is PLS-complete in general (Fabrikant et al., 2004). (Footnote 12: A direct proof of this result follows from Papadimitriou et al. (1990), and in particular, the result for Hopfield neural networks (Hopfield, 1982). A Hopfield neural network can be seen as an MRF, and more specifically, an Ising model, when the weights on the edges are symmetric. Similarly, any Hopfield neural network can be seen as a polymatrix game (Miller and Zucker, 1992); when the weights are symmetric the network can be seen as a potential game (in particular, it is an instance of a party affiliation game (Fabrikant et al., 2004)). Indeed, a stable configuration in an arbitrary Hopfield neural network is equivalent to a PSNE of a corresponding polymatrix game. See Papadimitriou et al., 1990, and Miller and Zucker, 1992, for the relevant references.)

One question that comes to mind is whether one can say anything about the properties of the globally optimal assignment in the game-induced MRF and the payoff it supports for the players, or whether it can be characterized by stronger notions of equilibria. For example, are strong NE, in which no coalition of players could obtain a Pareto-dominating set of payoffs by jointly deviating, joint MAP assignments of the MRF? Or more generally, what characteristics can we assign to the MAP assignments of the game-induced MRF?

In short, we can use algorithms for PSNE as heuristics to compute locally optimal MAP assignments of $P$ and vice versa. (Footnote 13: Note that algorithms for PSNE can in principle find critical points of $P$.) In either case, algorithms such as the max-product version of belief propagation (BP) can only provide such local-optimum/critical-point convergence guarantees in general.

Remark 2.

Daskalakis et al. (2007) extended results in game theory characterizing the number of PSNE in normal-form games (see Stanford, 1995; Rinott and Scarsini, 2000, and the references therein) to graphical games, but now taking into consideration the network structure of the game. Information about the number of PSNE in games can provide additional insight on the structure of MRFs.

For example, one of the results of Daskalakis et al. (2007) states that for graphs respecting certain expansion properties as the number of nodes/players increases, the number of PSNE of the graphical game will have a limiting distribution that is a Poisson with expected value $1$. Also according to Daskalakis et al. (2007), a similar behavior occurs for games with graphs generated according to the Erdős-Rényi model with sufficiently high average-degree (i.e., reasonably high connectivity). Thus, either the set of MRF-induced games has significantly low measure relative to the set of all possible randomly generated games (something that seems likely), or the number of local maxima (and critical points) of the MRF will have a similar distribution, and thus that number is expected to be low. The latter would suggest that local algorithms such as the max-product algorithm may be less likely to get stuck in local maxima (or critical points) of the MRF.

In addition, there have been several results stating that PSNE are unlikely to exist in many graphs, and that, when they do exist, they are not that many (Daskalakis et al., 2007). (Footnote 14: In particular, the number of PSNE has a Poisson distribution with parameter 1.) MRF-induced games would in that sense represent a very rich class of non-randomly generated graphical games for which the results above do not hold.

3.2 Mixed-strategy Equilibria and Belief Inference

Going beyond PSNE and MAP estimation, this subsection begins to establish a stronger, and potentially more useful connection between probabilistic inference and more general concepts of equilibria in games.

Let be a subset of the players (i.e., nodes in the graph) and denote by the (marginal) probability distribution of over possible joint actions of players in . Consider the condition for correlated equilibria (CE), which for the MRF-induced game we can express as, for all ,

Commuting the sums and simplifying we get the following equivalent condition:

(1)

This simplification is important because it highlights that, modulo expected payoff equivalence, we only need distributions over the original cliques, not the induced neighborhoods/Markov blankets, to represent CE in this class of games, in contrast to Kakade et al. (2003); thus, we are able to keep the size of the representation of the CE the same as that of the game.
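For reference, the step from the standard CE condition to a condition over clique marginals can be sketched as follows. The notation here is an assumption introduced for illustration and may differ from the paper's own: P is the joint distribution of play, phi_C is the log-potential of clique C, the payoff of player i is taken to be the sum of phi_C over the cliques C that contain i, and P_C is the marginal of P over the variables in C.

```latex
% Standard CE condition: for every player i and every pair of actions x_i, x_i',
\sum_{x_{-i}} P(x_i, x_{-i})
  \left[ M_i(x_i, x_{-i}) - M_i(x_i', x_{-i}) \right] \ge 0 ,
\qquad \text{with (assumed)} \quad M_i(x) = \sum_{C \ni i} \phi_C(x_C) .

% Substituting M_i and commuting the sums, everything outside clique C
% is summed out of P, leaving only the clique marginal P_C:
\sum_{C \ni i} \; \sum_{x_{C \setminus \{i\}}}
  P_C(x_i, x_{C \setminus \{i\}})
  \left[ \phi_C(x_i, x_{C \setminus \{i\}}) - \phi_C(x_i', x_{C \setminus \{i\}}) \right] \ge 0 .
```

Under these assumptions the second inequality involves P only through its clique marginals, which is the sense in which the representation of a CE need not be larger than that of the game.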

As an alternative, we can use the fact that the MRF-induced game is a potential game and, via some definitions and algebraic manipulation, get the following sequence of equivalent conditions, which hold for all , and .

Rewriting the last expression, we get the following equivalent condition: for all , and ,

(2)

The following are some additional remarks on the implications of the last condition. (Footnote 15: In what follows, we refer to concepts from information theory in the discussion, such as (Shannon’s) entropy, cross entropy, and relative entropy (also known as Kullback-Leibler divergence). We refer the reader to Cover and Thomas (2006) for a textbook introduction to those concepts.)

Remark 3.

First, it is useful to introduce the following notation. For any distribution , let be the cross entropy between probability distributions and , with respect to . (Footnote 16: That is, (a lower bound on) the average number of bits required to transmit “messages/events” generated according to but encoded using a scheme based on .) Denote by the marginal distribution of play over the joint-actions of all players except player . Denote by the joint distribution defined as for all .

Then, condition 2 implies the following sequence of conditions, which hold for all .

As an anonymous reviewer pointed out, the condition is actually that of a coarse CE (CCE) (Hannan, 1957; Moulin and Vial, 1978), which is a superset of CE and allows us to apply several simple methods for computing such an equilibrium concept, as discussed later in this section. Hence, any CE of the MRF-induced game is a kind of approximate local optimum (or critical point) of an approximation of the MRF based on a special type of cross entropy minimization.

The following property summarizes this remark.

Property 3.

For any MRF , any correlated equilibrium of the game induced by satisfies .

Remark 4.

Let us introduce some additional notation. For any joint distribution of play , let be its entropy. Similarly, for any player , for any marginal/individual distribution of play , let be its (marginal) entropy. For any distribution and , let be the Kullback-Leibler divergence between and , with respect to . Denote by the conditional entropy of the individual play of player given the joint play of all the players except , with respect to .

Then, we can express the condition 2 as the following equivalent conditions, which hold for all .

Hence, any CE of a MRF-induced game is a kind of approximate local optimum (or critical point) of a special kind of variational approximation of the MRF. The following property summarizes this remark.

Property 4.

For any MRF , any correlated equilibrium of the game induced by satisfies .

Note that the last property implies that the approximation satisfies the local condition .

Before continuing exploring connections to CE, it is instructive to first consider MSNE.

3.2.1 Mixed-strategy Nash Equilibria and Mean-Field Approximations

In the special case of MSNE, the joint mixed strategy is a product distribution. Denote by the (marginal) joint action of play over all the players except , and denote by the probability distribution defined such that the probability of is .

In this special case, the equilibrium conditions imply the following conditions, which hold for all : for all such that ,

Denoting by , the last condition implies that

The last condition is equivalent to

which, in turn, we can express as The last expression is also equivalent to

Hence, a NE of the game is almost a locally optimal mean-field approximation, except for the extra entropic term. In summary, for MSNE we have the following tighter condition than for arbitrary CE.

Property 5.

For any MRF , any MSNE of the game induced by satisfies , for all .

Note that the last property implies that the mean-field approximation satisfies the local condition for all .

One possible way to address the issue of the extra entropic term is to consider instead the MRF-induced infinite game, where each player has the (continuous) utility function

and wants to maximize over its mixed-strategy given the other players’ mixed-strategies for all .

(Footnote 17: In an infinite game the sets of actions or pure strategies are uncountable. Existence of equilibria holds under reasonable conditions (i.e., each set of actions is a nonempty compact convex subset of Euclidean space, and each player utility is continuous and quasi-concave in the player’s action), all of which are satisfied by the MRF-induced infinite game considered here. See Fudenberg and Tirole, 1991, for more information.)

Property 6.

The MRF-induced infinite game defined above is an infinite Gibbs potential game with the same graph and the following potential over the set of individual (product) mixed strategies

where is the normalizing constant for . From this we can derive that the individual player mixed-strategies are a “pure strategy” equilibrium of the infinite game if and only if

Or, in other words, if is a PSNE of the infinite game, then is also a local optimum (or critical point) of the mean-field approximation of .

Remark 5.

The local payoff function defined above for the infinite game also has connections to the game theory literature on learning in games (Fudenberg and Levine, 1999). This area studies properties of processes by which players “learn” how to play in (usually repeated) games; especially properties related to the existence of convergence of the learning (or playing) dynamics to equilibria. In particular, the local payoff function is similar to that used by logistic fictitious play, a special version of a “learning” process called smooth fictitious play. The difference is that the last entropy term involving the individual player’s mixed strategy has a regularization-type factor such that players play strict best-response as . In addition, logistic fictitious play is an instance of a learning process that, if followed by a player, achieves so called approximate universal consistency (i.e., roughly, in the limit of infinite play, the average of the payoffs obtained by the player will be close to the best obtained overall during repeated play, regardless of how the other players behave), also known as Hannan consistency (Hannan, 1957), for appropriate values of depending on the desired approximation level.

Indeed, it is not hard to see that in fact the best-response mixed-strategy of player to the mixed strategies of their neighbors is

Hence, running sequential best-response dynamics in the MRF-induced infinite game is equivalent to finding a variational mean-field approximation via recursive updating of the first derivative conditions. (Footnote 18: In particular, the process is called a Cournot adjustment with lock-in in the literature on learning in games (Fudenberg and Levine, 1999).) The process will then be equivalent to minimizing the function by axis-parallel updates. The resulting sequence of distributions/mixed-strategies monotonically decreases the value of and is guaranteed to converge to a local optimum or a critical point of . Hence, the corresponding learning process is guaranteed to converge to a PSNE of the infinite game, which is in turn an approximate MSNE of the original game. But this is not surprising in retrospect, given the last property (Property 6). That property essentially states a broader property of all potential games: they are isomorphic to so-called games with identical interests (Monderer and Shapley, 1996b), which are games where every player has exactly the same payoff function.
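The equivalence between sequential best response and axis-parallel mean-field updates can be sketched for a small pairwise model. The sketch assumes variables in {-1, +1}, a log-potential made of bias terms and pairwise edge terms, and an entropy-regularized expected local payoff for each player; under those assumptions the best response of player i has mean tanh of its local field. The function names and the example parameters are hypothetical.

```python
import math


def field(i, means, bias, weights):
    """Expected local field on node i given the other players' mean actions."""
    h = bias[i]
    for (a, b), w in weights.items():
        if a == i:
            h += w * means[b]
        elif b == i:
            h += w * means[a]
    return h


def binary_entropy(mean):
    """Entropy (in nats) of a {-1, +1} variable with the given mean."""
    p = (1.0 + mean) / 2.0
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def free_energy(means, bias, weights):
    """Mean-field objective: minus (expected log-potential plus entropy)."""
    energy = sum(bias[i] * means[i] for i in range(len(means)))
    energy += sum(w * means[a] * means[b] for (a, b), w in weights.items())
    entropy = sum(binary_entropy(m) for m in means)
    return -(energy + entropy)


def sequential_best_response(means, bias, weights, sweeps=50):
    """Each player in turn best-responds; returns final means and objective trace."""
    means = list(means)
    trace = [free_energy(means, bias, weights)]
    for _ in range(sweeps):
        for i in range(len(means)):
            # Best response to the current mixed strategies of the neighbors.
            means[i] = math.tanh(field(i, means, bias, weights))
        trace.append(free_energy(means, bias, weights))
    return means, trace


# Hypothetical three-node model with mixed attractive/repulsive edges.
bias = [0.2, -0.1, 0.0]
weights = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.4}
means, trace = sequential_best_response([0.0, 0.0, 0.0], bias, weights, sweeps=200)
```

Each single-player update exactly minimizes the objective along one coordinate, so `trace` is non-increasing, and at convergence every mean satisfies its own fixed-point equation.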

Remark 6.

The previous discussion suggests that we could use appropriately-modified versions of algorithms for MSNE, such as NashProp (Ortiz and Kearns, 2003), as heuristics to obtain a mean-field approximation of the true marginals.

Going in the opposite direction, the discussion above also suggests that, by treating any (graphical) potential game as an MRF, for any fixed , logistic fictitious play in any potential game converges to an approximate -MSNE of the potential game. Indeed, there has been recent work in this direction, which explores the connection between learning in games and mean-field approximations in machine learning (Rezek et al., 2008). That work proposes new algorithms based on fictitious play for simple mean-field approximation applied to statistical (Bayesian) estimation.

The game-induced MRF is a -temperature Gibbs measure. As we take the temperature to zero, we get the limiting zero-temperature Gibbs measure, which is a probability distribution over the set of global maxima of the potential function of the game and assigns probability zero everywhere else (i.e., the support of the limiting distribution is the set of joint-actions that maximize the potential function). The support of the zero-temperature Gibbs measure is a subset of the “globally optimal” PSNE of the potential game. But there might be other equilibria corresponding to local optima (or critical points) of the potential function.
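The concentration of the Gibbs measure on the maximizers of the potential can be checked numerically on a toy example. The sketch below assumes a Gibbs measure proportional to exp(potential / temperature) over joint assignments in {-1, +1}^n, computed by brute-force enumeration; `chain_potential` is a hypothetical three-node potential used only for illustration.

```python
import itertools
import math


def chain_potential(x):
    """Hypothetical potential: small bias on node 0, attractive chain 0 - 1 - 2."""
    return 0.1 * x[0] + x[0] * x[1] + x[1] * x[2]


def gibbs_measure(potential, n, temperature):
    """Distribution proportional to exp(potential(x) / temperature), by enumeration."""
    states = list(itertools.product((-1, 1), repeat=n))
    scaled = [potential(x) / temperature for x in states]
    top = max(scaled)  # subtract the maximum before exponentiating, for stability
    unnormalized = [math.exp(s - top) for s in scaled]
    z = sum(unnormalized)
    return {x: u / z for x, u in zip(states, unnormalized)}


best = max(itertools.product((-1, 1), repeat=3), key=chain_potential)
mass_warm = gibbs_measure(chain_potential, 3, temperature=1.0)[best]
mass_cold = gibbs_measure(chain_potential, 3, temperature=0.01)[best]
```

At temperature 1 the maximizer `best` holds less than half of the mass, while at temperature 0.01 it holds essentially all of it, even though the all-minus assignment is also a PSENE-style local maximum of this potential under single-variable flips.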

Are there other connections between the Nash equilibria of the game and the support of the limiting distribution?

3.2.2 Correlated Equilibria and Higher-order Variational Approximations

Kakade et al. (2003) designed polynomial-time algorithms based on linear programming for computing CE in standard graphical games with tree graphs. The approach and polynomial-time results extend to graphical games with bounded-tree-width graphs and graphical polymatrix games with tree graphs. Ortiz et al. (2007) (see also Ortiz et al., 2006) proposed the principle of maximum entropy (MaxEnt) for equilibrium selection of CE in graphical games. They studied several properties of the MaxEnt CE, designed a monotonically increasing algorithm to compute it, and discussed a learning-dynamics view of the algorithm.

Kamisetty et al. (2011) employed advances in approximate inference methods to propose approximation algorithms to compute CE. In all of those cases, the general approach is to use ideas from probabilistic graphical models to design algorithms to compute CE. The focus of this paper is the opposite direction: employing ideas from game theory to design algorithms for belief inference in probabilistic graphical models.

Property 4 suggests that we can use the CE for the MRF-induced game as a heuristic approximation to higher-order variational approximations. In fact, one could argue that in the context of inference, doing so is more desirable because, in principle, it can lead to better approximations that capture more aspects of the joint distribution than a simple mean-field approximation would alone. For example, mean-field approximations are likely to be poor if the MRF is multi-modal. Motivated by this fact, Jaakkola and Jordan (1997) suggest using mixtures of product distributions to improve the simple variational mean-field approximation.

3.2.3 Some Computational Implications

However, consider the algorithms of Papadimitriou (2005) or Jiang and Leyton-Brown (2015a) (see also Papadimitriou and Roughgarden, 2008, and Jiang and Leyton-Brown, 2011), which we can use to compute a CE of the MRF-induced game in polynomial time. Such a CE will be, by construction, also a (polynomially-sized) mixture of product distributions. (In the case of Jiang and Leyton-Brown’s algorithm it will be a mixture of a subset of the joint-action space, which is equivalent to a probability mass function over a polynomially-sized subset of the joint-action space; said differently, a mixture of products of indicator functions, each product corresponding to a particular outcome of the joint-action space.) Hence, the algorithms of Papadimitriou and Jiang and Leyton-Brown both provide a means to obtain a heuristic estimate of a local optimum (or critical point) of such a mixture in polynomial time. The result would not be exactly the same as that obtained by Jaakkola and Jordan (1997) in general, because of the extra entropic term mentioned in the discussion earlier. Can we find alternative versions of the payoff matrices, and/or alter Papadimitriou’s algorithm, so that the resulting correlated equilibrium provides an exact answer to the approximate inference problem that uses mixtures of product distributions? Regardless, at the very least one could use the resulting CE to initialize the technique of Jaakkola and Jordan (1997) without specifying an a priori number of mixtures.

Having said that, both Papadimitriou’s and Jiang and Leyton-Brown’s algorithms make a polynomial number of calls to the ellipsoid algorithm, or more specifically, its “oracle,” to obtain each of the product distributions whose mixture will form the output CE. It is known that the ellipsoid algorithm is slow in practice. Papadimitriou (2005), Papadimitriou and Roughgarden (2008), and Jiang and Leyton-Brown (2015a) leave open the design of more practical algorithms based on interior-point methods.

Finally, this connection also suggests that we can (in principle) use any learning algorithm that guarantees convergence to the set of CE (as described in the section on preliminaries on game theory where the concept was introduced) as a heuristic for approximate inference. Several so-called “no-regret” learning algorithms satisfy those conditions. Indeed, we use two simple variants of such algorithms in our experiments. Viewed that way, such learning algorithms would be similar in spirit to stochastic simulation algorithms with a kind of “adaptivity” reminiscent of the work on adaptive importance sampling (see, e.g., Cheng and Druzdzel, 2000; Ortiz and Kaelbling, 2000; Ortiz, 2002, and the references therein). Establishing a possible stronger connection between learning in games, CE, and probabilistic inference seems like a promising direction for future research. In fact, as previously mentioned (at the end of Remark 5), there has already been some recent work in this direction, but specifically for MSNE and mean-field approximations (Rezek et al., 2008).

Later in this paper, we present the results of an experimental evaluation of the performance of a simple no-regret learning algorithm in computational game theory (Fudenberg and Levine, 1999; Blum and Mansour, 2007; Hart and Mas-Colell, 2000) in the context of probabilistic inference. Those are iterative algorithms like many other approximate inference methods such as mean field and other variational approximations, but closer in spirit to sampling/simulation-based methods such as the Gibbs sampler and other similar MCMC methods. Indeed, the running time per iteration of those algorithms is roughly the same as that of sampling-based methods. We delay the details until the Experiments section (Section 4).
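To make the idea concrete, here is a simplified sketch of conditional regret matching in the style of Hart and Mas-Colell (2000), used to produce marginal estimates from the empirical distribution of play. It is not claimed to be the variant evaluated in the experiments. The sketch assumes a pairwise model with variables in {-1, +1}, a local payoff equal to the log-potential terms that involve the player's own variable, and a normalizing constant `mu` chosen just large enough to keep the switching probabilities below one; all names are hypothetical.

```python
import random


def payoff(i, x, bias, weights):
    """Assumed payoff of player i: the log-potential terms that involve x[i]."""
    total = bias[i] * x[i]
    for (a, b), w in weights.items():
        if i in (a, b):
            total += w * x[a] * x[b]
    return total


def regret_matching_marginals(bias, weights, rounds=5000, seed=0):
    """Estimate P(X_i = +1) as the fraction of rounds in which player i plays +1."""
    rng = random.Random(seed)
    n = len(bias)
    # Payoff differences are bounded by 2 * (|b_i| + sum_j |w_ij|); mu exceeds that bound.
    mu = 1.0 + 2.0 * max(
        abs(bias[i]) + sum(abs(w) for (a, b), w in weights.items() if i in (a, b))
        for i in range(n)
    )
    x = [rng.choice((-1, 1)) for _ in range(n)]
    # regret[i][a]: cumulative gain player i would have obtained by playing -a
    # in every round in which it actually played a.
    regret = [{-1: 0.0, 1: 0.0} for _ in range(n)]
    plus_count = [0] * n
    for t in range(1, rounds + 1):
        for i in range(n):
            if x[i] == 1:
                plus_count[i] += 1
            flipped = x[:i] + [-x[i]] + x[i + 1:]
            regret[i][x[i]] += payoff(i, flipped, bias, weights) - payoff(i, x, bias, weights)
        # All players update simultaneously: switch with probability proportional
        # to the positive part of the average regret for the last action played.
        x = [
            -x[i] if rng.random() < max(regret[i][x[i]], 0.0) / (mu * t) else x[i]
            for i in range(n)
        ]
    return [count / rounds for count in plus_count]


# Hypothetical three-node model.
estimates = regret_matching_marginals([0.2, -0.1, 0.0], {(0, 1): 0.5, (1, 2): -0.3})
```

The cost per round is one pass over each player's incident edges, which is comparable to one sweep of a Gibbs sampler on the same model.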

3.3 Other Previous and Related Work

Earlier work on the so called “relaxation labeling” problem in AI and computer vision (Rosenfeld et al., 1976; Miller and Zucker, 1991) has established connections to polymatrix games (Janovskaja, 1968) (see also Hummel and Zucker, 1983, although the connection had yet to be recognized at that time). That work also establishes connections to inference in Hopfield networks, dynamical systems, and polymatrix games (Miller and Zucker, 1991; Zucker, 2001). A reduction of MAP to PSNE in what we call here a GMhG was introduced by Yu and Berthod (1995) in the same context (see also Berthod et al., 1996); although they concentrate on pairwise potentials, which reduce to polymatrix games in this context. Because, in addition, the ultimate goal in MAP inference is to obtain a global optimum configuration, Yu and Berthod (1995) proposed a Metropolis-Hastings-style algorithm in an attempt to avoid local minima. Their algorithm is similar to simulated annealing algorithms used for solving satisfiability problems, and other local methods such as WalkSAT (Selman et al., 1996) (see, e.g., Russell and Norvig, 2003 for more information). The algorithm can also be seen as a kind of learning-in-games scheme (Fudenberg and Levine, 1999) based on best-response with random exploration (or “trembling hand” best response). That is, at every round, some best-response is taken with some probability, otherwise the previous response is replayed. Zucker (2001) presents a modern account of that work. The connection to potential games, and all its well-known properties (e.g., convergence of best-response dynamics) does not seem to have been recognized within that literature. Also, none of the work makes connections to higher-order (i.e., beyond mean-field) inference approximation techniques or the game-theoretic notion of CE.

3.4 Approximate Fictitious Play in a Two-player Potential Game for Belief Inference in Ising Models

This section presents a game-theoretic fictitious-play approach to estimation of node-marginal probabilities in MRFs. The approach this time is more global in terms of how we use the whole joint-distribution for the estimation of individual marginal probabilities. The inspiration for the approach presented here follows from the work of Wainwright et al. (2005). The section concentrates on Ising models, an important, special MRF instance from statistical physics with its own interesting history.

Definition 2.

An Ising model wrt an undirected graph is an MRF wrt such that

where is the set of node biases ’s and edge-weights ’s, which are the parameters defining the joint distribution over .
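As a point of reference for small instances, the sketch below computes exact node marginals of an Ising model by brute-force enumeration. It assumes the common parameterization with variables in {-1, +1} and a log-potential equal to the sum of bias terms b_i x_i and edge terms w_ij x_i x_j; the encoding used in the definition above may differ (for example, {0, 1} variables), and the function names are hypothetical.

```python
import itertools
import math


def ising_log_potential(x, bias, weights):
    """Sum of node terms b_i * x_i and edge terms w_ij * x_i * x_j."""
    total = sum(bias[i] * x[i] for i in range(len(x)))
    total += sum(w * x[a] * x[b] for (a, b), w in weights.items())
    return total


def exact_marginals(bias, weights):
    """Exact P(X_i = +1) for every node; cost grows as 2^n, so small n only."""
    n = len(bias)
    states = list(itertools.product((-1, 1), repeat=n))
    scores = [ising_log_potential(x, bias, weights) for x in states]
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    unnormalized = [math.exp(s - top) for s in scores]
    z = sum(unnormalized)
    return [
        sum(u for x, u in zip(states, unnormalized) if x[i] == 1) / z
        for i in range(n)
    ]


# With zero biases the model is symmetric under a global sign flip,
# so every exact marginal equals 0.5 regardless of the edge-weights.
symmetric = exact_marginals([0.0, 0.0, 0.0], {(0, 1): 1.5, (1, 2): -0.7})
```

The symmetric case is a useful sanity check when comparing approximation algorithms against the simple baseline estimate of 0.5.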

It is fair to say that interest in more general classes of MRFs originates from the special class of Ising models. It is also fair to say that, because of the relative simplicity and importance of Ising models for problems in statistical physics, as well as for other ML and AI application areas such as computer vision and NLP, Ising models have become the most common platform on which to empirically study approximation algorithms for arbitrary MRFs. In short, simplicity of presentation and empirical evaluation motivate the focus on Ising models in this section: Generalizations to arbitrary MRFs are straightforward but cumbersome to present. Hence, in this manuscript, we omit the details of such generalizations.

As an outline, the current section begins with an algorithmic instantiation of the iterative approach. The exact instantiation depends on whether we are using CE or MSNE as the solution concept. The section then follows with an informal discussion of the game-theoretic foundations of the general framework behind the approach, and a discussion of immediate implications to computational properties and potential convergence.

Denote by the set of all spanning trees of connected (undirected) graph that are maximal with respect to (i.e., does not contain any spanning forests). If spanning tree , we denote by the set of edges of . To simplify the presentation of the algorithm, let

and

Initialize , and for each , . At each iteration

1:
2:
3:
4:
5:
6:for all  do
7:     
8:     
9:end for

For each Ising-model’s random-variable index , set

as the estimate of the exact Ising-model’s marginal probability .

The running time of the algorithm is dominated by the computation of the maximum spanning tree (Step 1) which is . All other steps take .
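The spanning-tree step can be sketched with Kruskal's algorithm and a union-find structure. This is one standard way to compute a maximum spanning tree, not necessarily the routine used here; the edge scores passed in are hypothetical placeholders for whatever per-edge quantities the algorithm maintains at a given iteration.

```python
def maximum_spanning_tree(n, edge_scores):
    """Kruskal's algorithm on nodes 0..n-1; edge_scores maps (a, b) to a score."""
    parent = list(range(n))

    def find(u):
        # Find the component representative, halving paths along the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    # Consider edges from highest to lowest score; keep those joining two components.
    for (a, b), _ in sorted(edge_scores.items(), key=lambda item: item[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Hypothetical triangle: the lowest-scoring edge (0, 2) is left out.
tree = maximum_spanning_tree(3, {(0, 1): 3.0, (1, 2): 2.0, (0, 2): 1.0})
```

In this sketch the cost is dominated by sorting the edges once per call; the remaining work is near-linear in the number of edges.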

Within the literature on probabilistic graphical models, Hamze and de Freitas (2004) propose an MCMC approach based on sampling non-overlapping trees. While our approach has a sampling flavor, its exact connection to MCMC is unclear at best. Also, the spanning trees that our algorithm generates may overlap.

The following discussion connects the algorithm above to an approximate version of fictitious play from the game-theoretic literature on learning in games. For the most part, we omit discussion of connections to approximate variational inference in this manuscript, except to say that TRW message-passing (Wainwright et al., 2005) is the inspiration behind our proposed algorithm above.

The game implicit in the heuristic algorithm above is a two-player potential game between a “joint-assignment” (JA) player and a “spanning-tree” (ST) player. The potential function is The payoff functions and of the JA player and the ST player, respectively, are identical and equal the potential function : formally, . Note that the payoff function of the ST player is strategically equivalent to the function

Technically, this is a game with identical payoffs, which are known to have what Monderer and Shapley (