# Exact MAP inference in general higher-order graphical models using linear programming

This paper is concerned with exact MAP inference in general higher-order graphical models by means of a traditional linear programming relaxation approach. The proof developed in this paper is a rather simple algebraic one, made straightforward, above all, by the introduction of two novel algebraic tools. On the one hand, we introduce the notion of delta-distribution, which simply stands for the difference of two arbitrary probability distributions, and which mainly serves to lift the sign constraint inherent to an ordinary probability distribution. On the other hand, we develop an approximation framework for general discrete functions by means of an orthogonal projection expressed in terms of linear combinations of function margins with respect to a given collection of point subsets; we exploit the latter approach, however, chiefly for the purpose of modeling locally consistent sets of discrete functions from a global perspective. After that, as a first step, we develop from scratch the expectation optimization framework, which is nothing else than a reformulation, on stochastic grounds, of the convex-hull approach; as a second step, we develop the traditional LP relaxation of this expectation optimization approach, and we show that it makes it possible to solve the MAP inference problem in graphical models under rather general assumptions. Last but not least, we describe an algorithm which computes an exact MAP solution from a possibly fractional optimal (probability) solution of the proposed LP relaxation.


## 1 Introduction

The MAP inference problem in general higher-order graphical models (Bayesian models, Markov random field (MRF) models, and beyond), also called the discrete higher-order multiple-partitioning (or multi-label) problem (HoMPP), can be stated as follows. Given:

1. a discrete domain of points (sites) assumed, without loss of generality, to be the integer set Ω = {1, …, n}, where n stands for the number of sites,

2. a discrete label-set assumed, without loss of generality, to be the integer set L = {1, …, L}, where L (by abuse of notation) also stands for the number of labels,

3. a hypersite-set S consisting of subsets of Ω with cardinality greater than or equal to 1,

4. a set of real-valued local functions {g_s : L^{|s|} → ℝ, s ∈ S},

then, the goal is to find a multi-label function of the form:

 x̃ : Ω ⟶ L, i ⟼ x̃(i)

in a way which either minimizes or maximizes a higher-order cost function g defined, for every multi-label function x̃, as:

 g(x̃) = ∑_{s∈S} g_s(x̃(s))  (1)

where it has been assumed that, for every s ∈ S, one has x̃(s) = (x̃(i))_{i∈s}.

For the sake of convenience in the remainder, we propose to encode the multi-label function x̃ by means of an n-dimensional integer vector x ∈ L^n, while simply bearing in mind that, for every i ∈ Ω, one has x_i = x̃(i), and we refer throughout to x as the multi-label vector (MLV). The problem then amounts to finding an integer vector solution in L^n which either globally solves the following minimization problem:

 inf_{x∈L^n} { g(x) = ∑_{s∈S} g_s(x_s) }  (2)

or globally solves the following maximization problem:

 sup_{x∈L^n} { g(x) = ∑_{s∈S} g_s(x_s) }  (3)

More generally, one might be interested in finding both modes (i.e., the minimum and the maximum solutions) of g, and we propose to denote such a problem by:

 Modes_{x∈L^n} { g(x) = ∑_{s∈S} g_s(x_s) }  (4)

Furthermore, in order to rule out any trivial instances of the HoMPP, we make throughout the following mild assumptions:

• S ≠ ∅,

• the local functions g_s, s ∈ S, and, hence, g are non-constant functions.

For the sake of clarity in the remainder, we shall be referring to minimization problem (2), maximization problem (3), and modes-finding problem (4) using the acronyms MinMPP, MaxMPP, and ModesMPP, respectively. Furthermore, we want to emphasize that, in practice, such a higher-order function g often arises as minus the log-likelihood of an instance of a graphical model (e.g., a Bayesian model, an MRF model, and so forth) given the observed data (up to the minus log of a normalization constant). Then, depending on the application, one may either be interested only in a MAP solution of the HoMPP (i.e., one which maximizes the likelihood or, equivalently, minimizes g), or in both modes of the likelihood. In fact, one of the contributions of this paper is to show that both modes of g are intimately related to each other (please refer to section 8, especially to the discussion which follows Theorem 11, for more details).
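To make the setup concrete, here is a small brute-force sketch of ModesMPP on a toy instance; the sites, labels, hypersites, and local costs below are hypothetical illustration data, not taken from the paper:

```python
import itertools

# Toy HoMPP instance (hypothetical data): n = 3 sites Ω = {0,1,2},
# labels L = {0,1}, hypersites S, and arbitrary local costs g_s.
labels = (0, 1)
S = [(0, 1), (1, 2), (0, 1, 2)]
g = {
    (0, 1): {xs: float(xs[0] != xs[1]) for xs in itertools.product(labels, repeat=2)},
    (1, 2): {xs: float(xs[0] == xs[1]) for xs in itertools.product(labels, repeat=2)},
    (0, 1, 2): {xs: 0.5 * sum(xs) for xs in itertools.product(labels, repeat=3)},
}

def cost(x):
    # g(x) = sum over hypersites s in S of g_s(x_s),
    # where x_s is the restriction of x to the sites of s
    return sum(g[s][tuple(x[i] for i in s)] for s in S)

grid = list(itertools.product(labels, repeat=3))
x_min = min(grid, key=cost)   # a solution of MinMPP (2)
x_max = max(grid, key=cost)   # a solution of MaxMPP (3)
print(x_min, cost(x_min), x_max, cost(x_max))
```

Exhaustive enumeration is, of course, exponential in n; the point of the paper is precisely to avoid it.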

The remainder of this paper is structured as follows. After reviewing some of the existing literature on the MAP inference problem in graphical models, we first reformulate, on stochastic grounds, both MinMPP (2) and MaxMPP (3) as expectation minimization and maximization linear programs (LPs), respectively. After that, we introduce the notion of delta-distribution, and we reformulate ModesMPP (4) as a delta-expectation minimization LP. Next, we introduce the ortho-marginal framework as a general discrete function approximation by means of an orthogonal projection in terms of linear combinations of function margins with respect to a given hypersite-set, though, as mentioned in the abstract, we rather use the latter for the purpose of modeling local consistency from a global perspective. Then, we proceed in a traditional way for obtaining useful LP relaxations of the HoMPP, by merely enforcing locally the probability and the delta-probability axioms, respectively. Having in mind the two mathematical tools above, namely the notion of delta-distribution and the ortho-marginal framework, we reformulate the proposed LP relaxations from a global viewpoint, and we show that their optimal solutions coincide with those of their original (hard) versions. Last but not least, since one is only guaranteed to recover a set of optimal marginal distributions (resp. a set of optimal marginal delta-distributions) of the HoMPP, we also develop an algorithm allowing one to compute modes of g from a possibly fractional solution of its LP relaxation. Before moving to the crux of the approach, we want to emphasize that the present paper is self-contained; moreover, all the presented results are shown using rather simple algebraic techniques, widely accessible to anyone familiar with the basic concepts of linear algebra, linear programming, and probability theory.

## 2 Related work

Maximum-a-posteriori (MAP) estimation in higher-order probabilistic graphical models [Lauritzen 1991, Bishop 2006, Wainwright & Jordan 2008], also referred to in the operations research and computer vision literatures as the higher-order multiple-partitioning (or multi-label) problem (HoMPP), has been, for many decades, a central topic in the literature of AI and related fields (statistics, machine learning, data mining, natural language processing, computer vision, coding theory, operations research, computational biology, to name a few). In the culture of mathematical programming, the HoMPP is nothing else than unconstrained integer programming [Nemhauser & Wolsey 1988, Grötschel, Lovász & Schrijver 1993], whereas in the culture of data science, the HoMPP often arises as an inference (or inverse) problem, in the sense that one is interested in finding the most likely configuration of model parameters which explains the observed data. The choice of a graphical model for a given practical situation may be motivated by the nature of the (random) process which generates the data, but may also be severely constrained by available computing resources and/or real-time considerations. Thus, factorable graphical models have arisen as an almost inescapable AI tool both for modeling and solving a variety of AI problems, above all due to their modularity and flexibility, as well as their ability to model a variety of real-world problems. In this regard, two popular classes of graphical models are the Bayesian graphs (or directed graphical models) [Pearl 1982, Pearl & Russell 2002], and the Markov random field (MRF) graphs (or undirected graphical models) [Hammersley & Clifford 1971, Kinderman & Snell 1980]. Historically, MRFs had long been known in the field of statistical physics [Ising 1925, Ashkin & Teller 1943, Potts 1952], before they were first introduced in computer science [Besag 1974] and later popularized by many other authors [Geman & Geman 1984, Besag 1986, Geman & Graffigne 1986, Li 1995]. Nowadays, graphical models are a branch in their own right of statistical and probability theories, and their use in AI is ubiquitous.

With that being said, exact MAP inference, or even approximate MAP inference, in general graphical models is a hard combinatorial problem [Karp 1972, Cooper 1990, Dagum & Luby 1993, Shimony 1994, Roth 1996, Chickering 1996, Cipra 2000, Megretski 1996, Boykov, Veksler & Zabih 2001, Park & Darwiche 2004, Cohen, Cooper, Jeavons & Krokhin 2006]. As a matter of fact, unless P = NP, one may not even hope to achieve an approximate polynomial-time algorithm for computing the modes of an arbitrary instance of a graphical model. Therefore, except in particular cases which are known to be solvable exactly and in polynomial time [Hammer 1965, Greig, Porteous & Seheult 1989, Boykov, Veksler & Zabih 1998, Ishikawa 2003, Schlesinger 2007, Osokin, Vetrov & Kolmogorov 2011], the MAP inference problem in graphical models has mostly been dealt with, so far, by using heuristic approaches, which may be ranked in three main categories. First, probability-sampling-based approaches, also called Markov chain Monte Carlo (MCMC) methods [Hastings 1970, Green 1995, Gelfand & Smith 1990], were among the first MAP estimation algorithms used in MRF models [Geman & Geman 1984, Besag 1986], and their good practical performance, both in terms of computational efficiency and accuracy, has been extensively reported in the literature [Baddeley & Van Lieshout 1993, Winkler 1995, Descombes 2011]. Second, graph-theory-based approaches, which are mostly variants of the graph-cut algorithm, have also been extensively used for optimizing a plethora of MRF instances mainly encountered in computer vision [Boykov, Veksler & Zabih 2001, Kolmogorov & Zabih 2004, Liu & Veksler 2010, Veksler 2012]. Third, and more recently, fostered by the important breakthroughs in linear programming [Chvàtal 1983, Dantzig 1990, Karmarkar 1984, Bertsimas & Tsitsiklis 1997] and, more generally, in convex programming [Ye 1989, Nesterov & Nemirovsky 1994, Nesterov 2004, Nesterov 2009, Lesaja 2009, Beck & Teboulle 2009], as well as by the important recent surge in high-performance computing, such as multi-processor and parallel (GPU) computing technologies [Bolz, Farmer, Grinspun & Schrooder 2003, Li, Lu, Hu & Jiang 2011], linear and convex programming relaxation approaches, including spatially-continuous approaches [Nikolova, Esedoglu & Chan 2006, Cremers, Pock, Kolev & Chambolle 2011, Lellmann & Schnorr 2011, Nieuwenhuis, Toeppe & Cremers 2013, Zach, Hane & Pollefeys 2014] and spatially-discrete ones [Schlesinger 1976, Hummel & Zucker 1983, Hammer, Hansen & Simeone 1984, Pearl 1988, Sherali & Adams 1990, Koster, Van Hoesel & Kolen 1998, Chekuri, Khanna, Naor & Zosin 2005, Kingsford, Chazelle & Singh 2005, Kolmogorov 2006a, Werner 2007, Cooper 2012], have arisen as a promising alternative to both graph-theory-based and MCMC-based MAP estimation approaches in graphical models. Generally speaking, the latter category of approaches may also be seen as an approximate marginal inference approach in graphical models [Wainwright, Jaakkola & Willsky 2005, Wainwright & Jordan 2008], in the sense that one generally attempts to optimize the objective over a relaxation of the marginal polytope constraints, in such a way that an approximate MAP solution may be found by a mere rounding procedure, or by means of a more sophisticated message-passing algorithm [Wainwright, Jaakkola & Willsky 2005, Kolmogorov 2006b, Komodakis, Paragios & Tziritas 2011, Sontag & Jaakkola 2008]. In fact, the approach described in this paper belongs to the latter category, yet it may solve the MAP inference problem in an arbitrary graphical model instance.

## 3 The HoMPP expectation optimization framework

The goal of this section is to transform both MinMPP (2) and MaxMPP (3) into equivalent continuous optimization problems, and, eventually, into linear programs by means of the expectation-optimization framework.

We first consider MinMPP (2) and, in order to fix ideas once and for all throughout, we propose to develop from scratch the expectation minimization (EM) approach, which allows recasting any instance of MinMPP (2) as a linear program (LP). In the introduction, we assumed that the labeling process is purely deterministic, but unknown. In this section, we instead advocate a random multi-label process, consisting in randomly drawing vector samples x ∈ L^n with a certain probability, then assigning to each site i the realization x_i of its random label. Let us stress that randomization serves here only temporarily for developing the EM approach, which is itself deterministic. Therefore, suppose a random multi-label vector (RMLV) X, with value domain L^n, and consider the stochastic (random) version of the objective function of MinMPP (2), expressed as:

 g(X) = ∑_{s∈S} g_s(X_s)

Then, one writes the expectation of g(X) as:

 E[g(X)] = ∑_{x∈L^n} g(x) P(X = x) = ∑_{x∈L^n} ( ∑_{s∈S} g_s(x_s) ) P(X = x) = ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) P(X_s = x_s)

Please observe that E[g(X)] expresses solely in terms of the marginal distributions of the random subvectors X_s, s ∈ S. Next, suppose that one is rather given a set (or a family) P of candidate probability distributions of RMLV X (more formally speaking, one would rather consider a set of independent copies of X, each of which is endowed with its own distribution in P), such that:

 ∀p ∈ P:  p : L^n ⟶ ℝ,  p(x) ≥ 0, ∀x ∈ L^n,  ∑_{x∈L^n} p(x) = 1.

and the goal is to choose among P the joint distribution of RMLV X which solves the following minimization problem:

 min_{p∈P} { E_p[g(X)] = ∑_{x∈L^n} g(x) p(x) }  (5)

We refer to minimization problem (5) as EMinMPP, standing for expectation minimization multiple-partitioning problem. Now, in order to see how EMinMPP (5) relates to MinMPP (2), one may write, for every x ∈ L^n: g(x) = E_{1_x}[g(X)], where 1_x stands for the indicator function of x, defined as:

 ∀y ∈ L^n,  1_x(y) = { 1, if y = x; 0, else. }

in such a way that, by denoting by 1_{L^n} the set of indicator functions of the integer vector set L^n, one may completely reformulate MinMPP (2) as an instance of EMinMPP (5), with P = 1_{L^n}. Furthermore, since any p ∈ P writes as some convex combination of the elements of the set 1_{L^n}, one derives immediately that:

 min_{p∈P} { E_p[g(X)] } ≥ inf_{x∈L^n} g(x)  (6)

which means that EMinMPP (5) is an upper-bound for MinMPP (2). Then, Theorem 1 below gives a sufficient condition under which EMinMPP (5) exactly solves MinMPP (2), and shows how one may obtain, accordingly, an optimal vector solution of MinMPP (2) from a possibly fractional optimal (probability) solution of EMinMPP (5).

###### Theorem 1

Suppose that 1_{L^n} ⊆ P. Then EMinMPP (5) achieves an optimal objective value equal to inf_{x∈L^n} g(x). Furthermore, if p* is an optimal (probability) solution of EMinMPP (5), then any x ∈ L^n, such that, p*(x) > 0, is an optimal solution of MinMPP (2).

###### Proof 1

The assumption that 1_{L^n} ⊆ P guarantees that equality is achieved in formula (6). Moreover, if a distribution p* of RMLV X is optimal for problem (5), then so must be any indicator function 1_x which is expressed with a strictly positive coefficient in the convex combination of p* in terms of indicator functions of the set 1_{L^n}; in other words, any integer vector sample x of p* must also be optimal for MinMPP (2).

Clearly, Theorem 1 is nothing else than the probabilistic counterpart of the well-known convex-hull reformulation in integer programming [Sherali & Adams 1990, Grötschel, Lovász & Schrijver 1993, Bertsimas & Tsitsiklis 1997, Wainwright & Jordan 2008]. Having said that, in the remainder, we will assume that P stands for the entire convex set of candidate joint distributions of X, which is given by:

 P = { p : L^n → [0,1], s.t., ∑_{x∈L^n} p(x) = 1 }  (7)

Clearly, one has 1_{L^n} ⊆ P, and P coincides with the convex hull of 1_{L^n}. One may reexpress EMinMPP (5), accordingly, as a linear program (LP) as follows:

 min { ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) p_s(x_s) }
 s.t. p_s(x_s) = ∑_{i∉s} ∑_{x_i∈L} p(x_1,…,x_i,…,x_n), ∀x_s ∈ L^{|s|}, ∀s ∈ S,
      ∑_{x∈L^n} p(x) = 1,
      p(x) ≥ 0, ∀x ∈ L^n.   (8)

Equally, one finds that the following LP:

 max { ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) p_s(x_s) }
 s.t. p_s(x_s) = ∑_{i∉s} ∑_{x_i∈L} p(x_1,…,x_i,…,x_n), ∀x_s ∈ L^{|s|}, ∀s ∈ S,
      ∑_{x∈L^n} p(x) = 1,
      p(x) ≥ 0, ∀x ∈ L^n.   (9)

completely solves MaxMPP (3). Throughout, we shall refer to LP (8) and LP (9) using the acronyms EMinMLP and EMaxMLP, respectively. We conclude this section by merely noting that both EMinMLP (8) and EMaxMLP (9) are intractable in their current form, and the goal in the remainder of this paper is to develop efficient LP relaxations of them.
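As a sanity check of Theorem 1, the following sketch solves EMinMLP over the full simplex of joint distributions for a tiny random objective (hypothetical data; `scipy` is assumed available) and confirms that the LP value equals the brute-force minimum, with the optimal mass concentrated on a minimizer:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# EMinMLP over the full simplex of joint distributions p on L^n
# (toy instance: n = 3 binary sites, random objective values g(x)).
labels, n = (0, 1), 3
grid = list(itertools.product(labels, repeat=n))
rng = np.random.default_rng(0)
g = rng.normal(size=len(grid))           # arbitrary objective g(x), x in L^n

# minimize sum_x g(x) p(x)  s.t.  sum_x p(x) = 1,  p(x) >= 0
res = linprog(c=g, A_eq=np.ones((1, len(grid))), b_eq=[1.0],
              bounds=[(0, None)] * len(grid), method="highs")
# Theorem 1: the optimal value is min_x g(x), and any x carrying positive
# probability in the optimal p* is an optimal solution of MinMPP (2).
x_star = grid[int(np.argmax(res.x))]
print(res.fun, g.min(), x_star)
```

Of course, this LP has L^n variables, which is exactly why the relaxations of the later sections are needed.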

## 4 The HoMPP delta-expectation minimization framework

In this section, we develop the delta-expectation minimization framework for addressing ModesMPP (4), in other words, both MinMPP (2) and MaxMPP (3) in a common minimization framework.

### 4.1 Joint delta-distribution

###### Definition 1 (Joint delta-distribution)

We call a joint delta-distribution of RMLV X any function q : L^n → ℝ which can be written as the difference of two (arbitrary) joint distributions of RMLV X as:

 q(x) = p(x) − p′(x), ∀x ∈ L^n

where both p and p′ stand for (ordinary) joint distributions of RMLV X.

Theorem 2 provides a useful alternative definition of a joint delta-distribution of RMLV X, without resorting to its ordinary joint distributions.

###### Theorem 2

A function q : L^n → ℝ defines a joint delta-distribution of RMLV X, if and only if, q satisfies the following two formulas:

1. ∑_{x∈L^n} q(x) = 0,

2. ∑_{x∈L^n} |q(x)| ≤ 2.

The proof of Theorem 2 is detailed in Appendix section A.1.

Interestingly, one has thus managed to get rid of the pointwise sign constraint of ordinary distributions of RMLV X by means of its joint delta-distributions. One then notes that the decomposition of a joint delta-distribution of RMLV X as the difference of two ordinary joint distributions is, in general, non-unique; hence Proposition 1, which fully characterizes the joint delta-distributions of RMLV X admitting a unique such decomposition.

###### Proposition 1

A joint delta-distribution q of RMLV X admits a unique decomposition of the form q = p − p′, where both p and p′ stand for joint distributions of RMLV X, if and only if, one has ∑_{x∈L^n} |q(x)| = 2, in which case, p and p′ are uniquely given by:

 p(x) = max(q(x), 0), p′(x) = max(−q(x), 0), ∀x ∈ L^n

The proof of Proposition 1 is sketched in Appendix section A.3.

Last but not least, Proposition 2 below establishes that any nonzero zero-mean function defines, up to a multiplicative scale, a joint delta-distribution of RMLV X.

###### Proposition 2

Suppose a nonzero function q : L^n → ℝ, such that, ∑_{x∈L^n} q(x) = 0. Then, there exists λ₀ > 0, such that, for every λ ∈ (0, λ₀], the normalized function q̃_λ defined as:

 q̃_λ(x) = λ q(x), ∀x ∈ L^n

defines a joint delta-distribution of RMLV X.

The proof of Proposition 2 is sketched in Appendix section A.2.

### 4.2 Reformulation of a HoMPP as a delta-expectation minimization problem

We begin by introducing the notion of delta-expectation of a real-valued random function of RMLV X.

###### Definition 2 (Delta-expectation)

Suppose q is a joint delta-distribution of RMLV X, and suppose a real-valued function f : L^n → ℝ. Then, one defines the delta-expectation of the random function f(X) as:

 ΔE_q[f(X)] = ∑_{x∈L^n} f(x) q(x)  (10)

Next, similarly to the EM framework, one rather assumes a set of candidate joint delta-distributions of RMLV X denoted by Q, and considers the delta-expectation minimization problem:

 min_{q∈Q} { ΔE_q[g(X)] }  (11)

In the remainder, we take Q as the entire (convex) set of joint delta-distributions of RMLV X which, according to Theorem 2, is defined as:

 Q = { q : L^n → ℝ, s.t., ∑_{x∈L^n} q(x) = 0, and, ∑_{x∈L^n} |q(x)| ≤ 2 }  (12)

thus enabling delta-expectation minimization problem (11) to be expressed as an LP as follows:

 min { ∑_{x∈L^n} g(x) q(x) }  s.t.  ∑_{x∈L^n} q(x) = 0,  ∑_{x∈L^n} |q(x)| ≤ 2   (13)

which may also expand, using the marginal delta-distributions of q, as:

 min { ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) q_s(x_s) }
 s.t. q_s(x_s) = ∑_{i∉s} ∑_{x_i∈L} q(x_1,…,x_i,…,x_n), ∀x_s ∈ L^{|s|}, ∀s ∈ S,
      ∑_{x∈L^n} q(x) = 0,
      ∑_{x∈L^n} |q(x)| ≤ 2.   (14)

In the remainder, we refer to problem (13) using the acronym DEMinMLP. Theorem 3 below may be seen as the delta-distribution analog of Theorem 1.

###### Theorem 3

Suppose q* is an optimal solution of DEMinMLP (13). It follows that:

1. q* achieves an optimal objective value which is equal to: inf_{x∈L^n} g(x) − sup_{x∈L^n} g(x),

2. ∀x ∈ L^n, q*(x) > 0 ⇒ g(x) = inf_{y∈L^n} g(y),

3. ∀x ∈ L^n, q*(x) < 0 ⇒ g(x) = sup_{y∈L^n} g(y).

Moreover, q* satisfies ∑_{x∈L^n} |q*(x)| = 2, thereby admitting a unique decomposition of the form q* = p* − p′*, where both p* and p′* stand for two joint distributions of RMLV X which are given by:

 p*(x) = max(q*(x), 0), p′*(x) = max(−q*(x), 0), ∀x ∈ L^n

and which are optimal for EMinMLP (8) and EMaxMLP (9), respectively.

The proof of Theorem 3 is easily established by using the definition of a delta-distribution followed by the use of the result of Theorem 1.
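The statement of Theorem 3 can be checked numerically on a toy objective: over Q, the delta-expectation of g is minimized by placing mass +1 on a minimizer and −1 on a maximizer, so the optimal value is the difference of the two modes (the objective values below are hypothetical random data):

```python
import numpy as np

# Toy check of Theorem 3 (hypothetical objective values): over
# Q = {q : sum_x q(x) = 0, sum_x |q(x)| <= 2}, the minimum of
# sum_x g(x) q(x) is min_x g(x) - max_x g(x), attained at
# q* = 1_{x_min} - 1_{x_max}.
rng = np.random.default_rng(1)
g = rng.normal(size=8)            # g(x) tabulated over L^n, with |L^n| = 8
q_star = np.zeros_like(g)
q_star[np.argmin(g)] += 1.0       # q* = p* - p'* with p* = 1_{x_min}
q_star[np.argmax(g)] -= 1.0       # and p'* = 1_{x_max}

value = float(g @ q_star)
print(value, g.min() - g.max())
```

Any other feasible q = p − p′ gives a delta-expectation at least as large, since E_p[g] ≥ min g and E_{p′}[g] ≤ max g.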

## 5 The ortho-marginal framework

In this section, we describe an algebraic approach (called the ortho-marginal framework) for general discrete function approximation via an orthogonal projection in terms of linear combinations of function margins with respect to a given hypersite-set. Nevertheless, its main usefulness in the present paper is that it enables one to model any set of locally consistent functions (see Definition 8) in terms of a global (yet non-unique) mother function u. Therefore, in order to fix ideas once and for all in the remainder, subsection 5.1 is devoted to introducing all the definitions useful to the development of the ortho-marginal framework, and subsection 5.2 is devoted to the description of its main results.

Beforehand, we want to note that, throughout this section, we assume that C is a hypersite-set with respect to Ω; moreover, we assume some order (e.g., a lexicographic order) on the elements of C, which means that, whatever c, c′ ∈ C, if c ≠ c′, then either one has c < c′, or one has c′ < c.

### 5.1 Definitions

###### Definition 3 (Maximal hypersite-set)

One says that C is maximal, if and only if:

 ∀c, c′ ∈ C, c′ ⊆ c ⇒ c′ = c

or, in plain words, if one may not find in C both a hypersite and any of its proper subsets.

###### Definition 4 (Frontier hypersite-set)

One defines the frontier of C, denoted by Front(C), as the smallest maximal hypersite-set which is contained in C. In plain words, Front(C) is the hypersite-set which contains all the hypersites of C which are not included in any other hypersite of C.

###### Definition 5 (Ancestor hypersite)

Suppose a hypersite c ∈ C/Front(C) (if any). Then, we call an ancestor hypersite of c any hypersite t ∈ Front(C), such that, c ⊂ t.

###### Definition 6 (Ancestry function)

We call an ancestry function with respect to C any function:

 anc : C/Front(C) ⟶ Front(C), c ⟼ anc(c)

such that, ∀c ∈ C/Front(C), anc(c) is an ancestor of c in Front(C).

Please note that the ancestor of some hypersite may not be unique; hence, the function anc may not be unique either.

###### Remark 1

It does not take much effort to see that higher-order function (1) may be rewritten solely in terms of local functions with respect to Front(S) as:

 g(x) = ∑_{s∈Front(S)} g′_s(x_s), ∀x ∈ L^n

by simply merging each term of g with respect to any s ∈ S/Front(S) with the term corresponding to any of its ancestors in Front(S).

###### Definition 7 (Margin)

Suppose a function u : L^n → ℝ, and a hypersite c ⊆ Ω. Then, one defines the margin of u with respect to c as the function u_c defined as:

 u_c(x_c) = ∑_{i∈Ω/c} ∑_{x_i∈L} u(x_1,…,x_i,…,x_n), ∀x_c ∈ L^{|c|}  (15)
###### Definition 8 (Pseudo-marginal)

One says that a set of local functions of the form {u_c : L^{|c|} → ℝ, c ∈ C} is a pseudo-marginals-set (or a set of locally consistent functions) with respect to C, if and only if, it satisfies the following identities:

 ∀c, t ∈ Front(C), c∩t ≠ ∅ ⇒ ∑_{i∈c/t} ∑_{x_i∈L} u_c(x_c) = ∑_{i∈t/c} ∑_{x_i∈L} u_t(x_t),
 ∀c ∈ C/Front(C), u_c(x_c) = ∑_{i∈anc(c)/c} ∑_{x_i∈L} u_{anc(c)}(x_{anc(c)}), ∀x_c ∈ L^{|c|}.   (16)

where c/t stands for the hypersite whose sites belong to c but do not belong to t, and anc stands for an arbitrary ancestry function with respect to C (see Definition 6).

Clearly, any set of actual margins with respect to C of an arbitrary function u : L^n → ℝ also defines a pseudo-marginals-set with respect to C.
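The claim above, that actual margins are automatically locally consistent, can be verified numerically. The sketch below (a hypothetical three-site, three-label example) computes margins per Definition 7 for two overlapping hypersites and checks the first family of identities (16) on their overlap:

```python
import itertools
import numpy as np

# Margins (Definition 7) of a random u : L^n -> R, and a check of the first
# family of local-consistency identities (16) on two overlapping hypersites
# (toy setup; the hypersites below are an assumption for illustration).
labels, n = (0, 1, 2), 3
grid = list(itertools.product(labels, repeat=n))
rng = np.random.default_rng(2)
u = {x: rng.normal() for x in grid}

def margin(c):
    # u_c(x_c) = sum over x_i, i not in c, of u(x)
    uc = {}
    for x in grid:
        xc = tuple(x[i] for i in c)
        uc[xc] = uc.get(xc, 0.0) + u[x]
    return uc

c, t = (0, 1), (1, 2)            # two hypersites overlapping on site 1
uc, ut = margin(c), margin(t)
# Marginalizing u_c over c/t and u_t over t/c must give the same function
# on the overlap c∩t = {1}
left = {k: sum(v for xc, v in uc.items() if xc[1] == k) for k in labels}
right = {k: sum(v for xt, v in ut.items() if xt[0] == k) for k in labels}
print(left, right)
```

Both sides reduce to the margin of u on the single shared site, which is why they must agree.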

###### Convention 1

We abuse notation by denoting by ∅ the empty hypersite (i.e., one which does not contain any site), and we convene, henceforth, that whatever a function u : L^n → ℝ, the margin of u with respect to ∅, simply denoted by u_∅, is the real quantity ∑_{x∈L^n} u(x).

###### Definition 9 (Frontier-closure of a hypersite-set)

One defines the frontier-closure of C as the hypersite-set with respect to Ω, denoted by Fclos∩(C), such that:

1. Front(C) ⊆ Fclos∩(C),

2. ∀c, t ∈ Fclos∩(C), c∩t ≠ ∅ ⇒ c∩t ∈ Fclos∩(C).

Algorithm 1 in Appendix section B.16 then shows how one may iteratively construct the frontier-closure of an arbitrary hypersite-set.

### 5.2 Main results of the ortho-marginal framework

First of all, Theorem 4 below establishes that the marginalization of any function with respect to C is intimately related to an orthogonal projection of that function.

###### Theorem 4

Let f : L^n → ℝ stand for an arbitrary real-valued function. Then, f may be written as a direct sum of two functions u and v as: f = u + v, such that:

1. the margins-set with respect to C of u coincides with the one of f,

2. all the margins of v with respect to C are identically equal to zero,

3. the closed-form expression of function u is given by:

 u(x) = ∑_{c∈Fclos∩(C)} ρ_c f_c(x_c) / L^{n−|c|}, ∀x ∈ L^n

where, ∀c ∈ Fclos∩(C), f_c stands for the margin of f with respect to c, and the integer coefficients ρ_c are iteratively given by:

 ρ_c = { 1, if c ∈ Front(C);  1 − ∑_{t∈Fclos∩(C) s.t. c⊂t} ρ_t, if c ∈ Fclos∩(C)/Front(C) }  (17)

Furthermore, introduce the operator denoted by O_C and defined as:

 (O_C f)(x) = ∑_{c∈Fclos∩(C)} ρ_c f_c(x_c) / L^{n−|c|}, ∀x ∈ L^n  (18)

Then, O_C is an orthogonal projection.

The proof of Theorem 4 is sketched in Appendix section A.4.

###### Notation 1

We refer in the remainder to the operator O_C as the ortho-marginal operator with respect to hypersite-set C.

Theorem 5 below builds on the result of Theorem 4 to establish that any pseudo-marginals-set with respect to C may be viewed as the actual margins-set with respect to C of a global, yet non-unique, function u : L^n → ℝ.

###### Theorem 5

Suppose {u_c : L^{|c|} → ℝ, c ∈ C} is a pseudo-marginals-set, thus verifying identities (16). Then, whatever a function v : L^n → ℝ, the function u defined as:

 u(x) = (v(x) − (O_C v)(x)) + ∑_{c∈Fclos∩(C)} ρ_c u_c(x_c) / L^{n−|c|}, ∀x ∈ L^n  (19)

verifies that its margins-set with respect to C coincides with the set {u_c, c ∈ C}; said otherwise, one has:

 ∑_{i∈Ω/c} ∑_{x_i∈L} u(x_1,…,x_i,…,x_n) = u_c(x_c), ∀x_c ∈ L^{|c|}, ∀c ∈ C

where the linear coefficients ρ_c are defined according to formula (17) above.

The proof of Theorem 5 is sketched in Appendix section A.5.

###### Definition 10 (Ortho-marginal space)

The ortho-marginal space with respect to C, denoted by M_C, is defined as the linear function space which is given by:

 M_C = { u : L^n → ℝ, s.t., O_C u ≡ u }

We also denote by M̄_C the complement space of M_C, defined as:

 M̄_C = { v : L^n → ℝ, s.t., O_C v ≡ 0 }
###### Remark 2

One notes that any function u ∈ M_C, reflexively, writes in terms of its margins with respect to C as:

 u(x) = ∑_{c∈Fclos∩(C)} ρ_c u_c(x_c) / L^{n−|c|}, ∀x ∈ L^n

where, ∀c ∈ Fclos∩(C), u_c stands for the margin of u with respect to c.

###### Proposition 3

Suppose a real-valued function h : L^n → ℝ. Then, one has h ∈ M_C, if and only if, there exists a set of local functions {h_c : L^{|c|} → ℝ, c ∈ C} (not to be confused here with the margins of h with respect to C), such that:

 h(x) = ∑_{c∈C} h_c(x_c), ∀x ∈ L^n

The proof of Proposition 3 is sketched in Appendix section A.6.

###### Proposition 4

One has:

 ∀h, h′ ∈ M_C, h ≡ h′ ⇔ h_c(x_c) = h′_c(x_c), ∀x_c ∈ L^{|c|}, ∀c ∈ Front(C)

where, ∀c ∈ Front(C), h_c and h′_c stand for the margins with respect to c of h and h′, respectively.

###### Proof 2

The proof of Proposition 4 follows immediately from the definition of M_C: if h, h′ ∈ M_C, then both h and h′ write as a linear combination of their respective margins with respect to C (Remark 2), and the latter coincide as soon as the margins with respect to Front(C) do, and vice-versa.

## 6 LP relaxation of the HoMPP over the local marginal-polytope

In order to fix ideas throughout, this section consists of subsection 6.1, in which we introduce (or, better said, recall) some useful definitions, and subsection 6.2, where we develop the LP relaxation approach to the HoMPP.

### 6.1 Definitions

###### Definition 11 (Pseudo-marginal probability set)

Suppose {p_s : L^{|s|} → ℝ, s ∈ S} is a pseudo-marginals-set which, thus, satisfies identities (16). If, moreover, it verifies the following identities:

 ∀s ∈ Front(S):  ∑_{x_s∈L^{|s|}} p_s(x_s) = 1,  and  p_s(x_s) ≥ 0, ∀x_s ∈ L^{|s|}   (20)

then it is called a pseudo-marginal probability set with respect to S.

###### Definition 12 (Pseudo-marginal polytope)

The pseudo- (or local-) marginal polytope with respect to S, denoted by P̃_S, is defined as the space of all the pseudo-marginal probability sets with respect to S.

###### Definition 13 (Pseudo-marginal delta-probability set)

Suppose {q_s : L^{|s|} → ℝ, s ∈ S} is a pseudo-marginals-set which, thus, satisfies identities (16). If, moreover, it verifies the identities:

 ∀s ∈ Front(S):  ∑_{x_s∈L^{|s|}} q_s(x_s) = 0,  and  ∑_{x_s∈L^{|s|}} |q_s(x_s)| ≤ 2   (21)

then it is called a pseudo-marginal delta-probability set with respect to S.

###### Definition 14 (Pseudo-marginal delta-polytope)

The pseudo-marginal delta-polytope with respect to S, denoted by Q̃_S, is defined as the space of all the pseudo-marginal delta-probability sets with respect to S.

###### Remark 3

Let us note that the system of identities defining either a pseudo-marginal probability set or a pseudo-marginal delta-probability set necessarily presents many redundancies, thus making it prone to further simplifications. For instance, by taking into account the arguments developed in section 5, one may see immediately that the identities of the form ∑_{x_s∈L^{|s|}} p_s(x_s) = 1, ∀s ∈ Front(S), may be reduced to a single identity of the form ∑_{x_t∈L^{|t|}} p_t(x_t) = 1; equally, the identities of the form ∑_{x_s∈L^{|s|}} q_s(x_s) = 0, ∀s ∈ Front(S), may be reduced to a single identity of the form ∑_{x_t∈L^{|t|}} q_t(x_t) = 0, with t standing for an arbitrary hypersite in Front(S), and so on. Nevertheless, for the sake of simplicity, we will not proceed to such simplifications in this paper, though the latter may turn out to be desirable in practice, above all for larger values of n.

### 6.2 Relaxation

One proceeds in a traditional way for obtaining LP relaxations of EMinMLP (8) and EMaxMLP (9), hence of MinMPP (2) and MaxMPP (3), respectively, by just enforcing the probability axioms locally, as follows:

 min { ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) p_s(x_s) }  s.t.  {p_s : L^{|s|} → ℝ, s ∈ S} ∈ P̃_S   (22)

and

 max { ∑_{s∈S} ∑_{x_s∈L^{|s|}} g_s(x_s) p_s(x_s) }  s.t.  {p_s : L^{|s|} → ℝ, s ∈ S} ∈ P̃_S   (23)

where P̃_S stands for the pseudo-marginal polytope (see Definition 12).

Equally, one may obtain a useful LP relaxation of DEMinMLP (13), hence of ModesMPP (4), by just enforcing the delta-probability axioms locally, as follows:

$$\min\Big\{\sum_{s\in S}\sum_{x_s\in L^{|s|}} g_s(x_s)\,q_s(x_s)\Big\}\quad\text{s.t.}\quad \{q_s: L^{|s|}\to \mathbb{R},\ \forall s\in S\}\in \tilde{Q}_S \tag{24}$$

where $\tilde{Q}_S$ stands for the pseudo-marginal delta-polytope (see Definition 14).

In the remainder, we refer to LP (22), LP (23), and LP (24) using the acronyms PseudoEMinMLP, PseudoEMaxMLP, and PseudoDEMinMLP, respectively. One then easily checks that all of PseudoEMinMLP (22), PseudoEMaxMLP (23), and PseudoDEMinMLP (24) are bounded; moreover, they constitute a lower bound for EMinMLP (8), an upper bound for EMaxMLP (9), and a lower bound for DEMinMLP (13), respectively.
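As a concrete toy illustration (not from the paper), the local relaxation (22) can be assembled and handed to an off-the-shelf LP solver. The sketch below, which assumes `scipy.optimize.linprog` is available and uses an invented instance, encodes a 3-variable binary chain with hypersites {0,1} and {1,2}; since the instance is tree-structured, the local polytope is known to be tight there, so the LP value should match brute-force minimization.

```python
# Local relaxation (22) on a 3-variable binary chain (invented instance).
import numpy as np
from scipy.optimize import linprog

g01 = np.array([[1., 3.], [0., 2.]])   # g_{01}(x0, x1)
g12 = np.array([[2., 0.], [1., 3.]])   # g_{12}(x1, x2)
c = np.concatenate([g01.ravel(), g12.ravel()])  # 8 LP variables: p01 then p12

A_eq, b_eq = [], []
# normalization, identities (20): each local table sums to 1
A_eq.append([1, 1, 1, 1, 0, 0, 0, 0]); b_eq.append(1.0)
A_eq.append([0, 0, 0, 0, 1, 1, 1, 1]); b_eq.append(1.0)
# consistency on the shared variable x1, identities (16):
# sum_{x0} p01(x0, l) = sum_{x2} p12(l, x2) for each label l
for l in (0, 1):
    row = np.zeros(8)
    row[[l, 2 + l]] = 1.0                 # p01(0, l) + p01(1, l)
    row[[4 + 2*l, 4 + 2*l + 1]] -= 1.0    # - p12(l, 0) - p12(l, 1)
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 8)
brute = min(g01[x0, x1] + g12[x1, x2]
            for x0 in (0, 1) for x1 in (0, 1) for x2 in (0, 1))
```

On this chain `res.fun` coincides with `brute`; on instances with cycles, the local polytope may contain fractional vertices and the LP value can be strictly smaller.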

## 7 Optimality study of the LP relaxations

This section is divided into two main subsections. First, subsection 7.1 develops equivalent global reformulations of the LP relaxations described in section 6.2, thereby setting the stage for their optimality study in subsection 7.2.

### 7.1 Global reformulation of the LP relaxations

The main result in this section regarding the equivalent global reformulation of PseudoEMinMLP (22), PseudoEMaxMLP (23), and PseudoDEMinMLP (24) is highlighted in Theorem 6 below.

###### Theorem 6
1. PseudoEMinMLP (22) is equivalent to the following LP:

$$\min\Big\{\sum_{x\in L^n} g(x)\,p(x)\Big\}\quad\text{s.t.}\quad\begin{cases}\displaystyle\sum_{i\notin s}\sum_{x_i\in L} p(x_1,\ldots,x_i,\ldots,x_n)\ \ge\ 0,\quad \forall x_s\in L^{|s|},\ \forall s\in\mathrm{Front}(S)\\[4pt] \displaystyle\sum_{x\in L^n} p(x)=1\end{cases}\tag{25}$$

2. PseudoEMaxMLP (23) is equivalent to the following LP:

$$\max\Big\{\sum_{x\in L^n} g(x)\,p(x)\Big\}\quad\text{s.t.}\quad\begin{cases}\displaystyle\sum_{i\notin s}\sum_{x_i\in L} p(x_1,\ldots,x_i,\ldots,x_n)\ \ge\ 0,\quad \forall x_s\in L^{|s|},\ \forall s\in\mathrm{Front}(S)\\[4pt] \displaystyle\sum_{x\in L^n} p(x)=1\end{cases}\tag{26}$$

3. PseudoDEMinMLP (24) is equivalent to the following LP:

$$\min\Big\{\sum_{x\in L^n} g(x)\,q(x)\Big\}\quad\text{s.t.}\quad\begin{cases}\displaystyle\sum_{x_s\in L^{|s|}}\Big|\sum_{i\notin s}\sum_{x_i\in L} q(x_1,\ldots,x_i,\ldots,x_n)\Big|\ \le\ 2,\quad \forall s\in\mathrm{Front}(S)\\[4pt] \displaystyle\sum_{x\in L^n} q(x)=0\end{cases}\tag{27}$$

in the sense that any of the global LP reformulations above:

1. achieves the same optimal objective value as its local reformulation counterpart,

2. the margins set with respect to $\mathrm{Front}(S)$ of any of its feasible solutions is a feasible solution of its local reformulation counterpart,

3. conversely, for any feasible solution of its local counterpart, any function whose margins set with respect to $\mathrm{Front}(S)$ coincides with it is a feasible solution of the global reformulation, and achieves an objective value equal to the one achieved by the former in its local counterpart.

The proof of Theorem 6 is sketched in Appendix section A.7.

Throughout, we refer to LP (25), LP (26), and LP (27) using the acronyms GlbPseudoEMinMLP, GlbPseudoEMaxMLP, and GlbPseudoDEMinMLP, respectively.
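Point 2 of Theorem 6 can be illustrated numerically: the margins of any feasible global distribution form a locally feasible solution achieving the same objective. Below is a minimal sketch on an invented 3-variable instance with hypersites {0,1} and {1,2} (the names and values are illustrative, not the paper's):

```python
# Margins of a global distribution are locally feasible and objective-preserving.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)); p /= p.sum()          # a global distribution on L^3
g01 = rng.random((2, 2)); g12 = rng.random((2, 2))

p01 = p.sum(axis=2)      # margin on hypersite {0, 1}
p12 = p.sum(axis=0)      # margin on hypersite {1, 2}

# local feasibility: normalization, non-negativity, consistency on x1
assert np.isclose(p01.sum(), 1) and np.isclose(p12.sum(), 1)
assert (p01 >= 0).all() and (p12 >= 0).all()
assert np.allclose(p01.sum(axis=0), p12.sum(axis=1))

# equal objectives: <g, p> globally vs the local form over the margins
g = g01[:, :, None] + g12[None, :, :]            # g(x) = g01(x0,x1) + g12(x1,x2)
assert np.isclose((g * p).sum(), (g01 * p01).sum() + (g12 * p12).sum())
```

The last assertion is just linearity of expectation: each term of the global objective depends on only one hypersite, so the global inner product decomposes over the margins.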

### 7.2 Main optimality results

One begins by observing an interesting phenomenon, as follows. First of all, consider the LP which stands for the difference of GlbPseudoEMinMLP (25) and GlbPseudoEMaxMLP (26), in that order:

$$\min\Big\{\sum_{x\in L^n} g(x)\,\big(p(x)-p'(x)\big)\Big\}\quad\text{s.t.}\quad\begin{cases}\displaystyle\sum_{i\notin s}\sum_{x_i\in L} p(x)\ \ge\ 0,\quad\forall x_s\in L^{|s|},\ \forall s\in\mathrm{Front}(S)\\[4pt]\displaystyle\sum_{i\notin s}\sum_{x_i\in L} p'(x)\ \ge\ 0,\quad\forall x_s\in L^{|s|},\ \forall s\in\mathrm{Front}(S)\\[4pt]\displaystyle\sum_{x\in L^n} p(x)=1\\[4pt]\displaystyle\sum_{x\in L^n} p'(x)=1\end{cases}\tag{28}$$

Clearly, solving both GlbPseudoEMinMLP (25) and GlbPseudoEMaxMLP (26) amounts to solving LP (28) once, and vice versa; this is on the one hand. On the other hand, suppose $q$ is a feasible solution of GlbPseudoDEMinMLP (27), and $p$ and $p'$ are feasible solutions of GlbPseudoEMinMLP (25) and GlbPseudoEMaxMLP (26), respectively, hence of LP (28) too. It follows, by Proposition 2, that both $q$ and $p - p'$ are, at worst up to a multiplicative scale (greater than, or equal to, 1), feasible solutions of DEMinMLP (13). But only their respective orthogonal-projection parts, namely $O_S\,q$ and $O_S(p - p')$, are in fact effective in GlbPseudoDEMinMLP (27) and LP (28), respectively, as one may write:

$$\langle g,\, q\rangle = \langle g,\, O_S\, q\rangle,\qquad \langle g,\, p - p'\rangle = \langle g,\, O_S (p - p')\rangle$$

plus, by Theorem 5, the margins with respect to $\mathrm{Front}(S)$ of $O_S\,q$ coincide with the ones of $q$, and the margins with respect to $\mathrm{Front}(S)$ of $O_S(p - p')$ coincide with the ones of $p - p'$. Then, in the light of the result of Theorem 3, one would want to know to which extent at least one of the following two max-min problems:

 (29)

and

 (30)

achieves the optimal objective value required by Theorem 3, as this would immediately imply that one may efficiently solve any HoMPP instance by means of its LP relaxation. Furthermore, it is easy to check that max-min problem (29) is an upper bound for max-min problem (30), implying that, if the latter achieves the required optimal objective value, then so does the former. Nevertheless, we will establish, hereafter, a separate result for each of max-min problems (29) and (30), in order to stress the fact that, for finding the modes of $g$, one actually has the choice between solving two LP instances, namely PseudoEMinMLP (22) and PseudoEMaxMLP (23), or solving a single LP instance, namely PseudoDEMinMLP (24), as both choices turn out to be equivalent.
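The role of the delta-distribution objective behind this equivalence can be seen on a toy example (invented values, not from the paper): over differences $q = p - p'$ of two distributions, the linear objective $\langle g, q\rangle$ is minimized at a vertex pair $\delta_i - \delta_{j}$ of the product of simplices, giving $\min g - \max g$, i.e., solving the min and max problems at once.

```python
# Brute-force check: over q = delta_i - delta_j, min <g, q> = min(g) - max(g).
import itertools
import numpy as np

g = np.array([3., 1., 4., 1., 5.])    # g over 5 joint configurations (invented)
best = min(g[i] - g[j] for i, j in itertools.product(range(5), repeat=2))
```

Here `best` equals `g.min() - g.max()` (that is, 1 − 5 = −4), attained by putting the positive unit mass on an argmin of g and the negative unit mass on an argmax.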

###### Theorem 7

The optimal objective value of max-min problem (29) is equal to .

###### Theorem 8

The optimal objective value of max-min problem (30) is equal to .

The proofs of Theorem 8 and Theorem 7 are described in Appendix sections A.8 and A.9, respectively.

In short, Theorem 7 and Theorem 8 establish exactly the claim we have just made above, namely that the feasible sets of both PseudoDEMinMLP (24) and LP (28) are within the “tolerance interval” allowed by Theorem 3 in order to hope to solve the HoMPP by means of its LP relaxation. Said otherwise, by taking into account the arguments developed above, either the result of Theorem 7 or that of Theorem 8 is enough to guarantee that one may completely solve the HoMPP by means of PseudoEMinMLP (22) (equivalently, by means of PseudoEMaxMLP (23)) or by means of PseudoDEMinMLP (24). We therefore summarize these findings in Theorem 9 and Theorem 10 below.

###### Theorem 9

PseudoEMinMLP (22) completely solves EMinMLP (8), in the sense that:

1. they both achieve the same optimal objective value, equal to $\min_{x\in L^n} g(x)$,

2. any optimal solution of PseudoEMinMLP (22) defines an actual marginals-set with respect to $\mathrm{Front}(S)$ originating from a joint distribution of the RMLV which is optimal for EMinMLP (8).

Similar conclusions are obviously drawn regarding PseudoEMaxMLP (23), on the one hand, and EMaxMLP (9), on the other hand, which then achieve an optimal objective value equal to $\max_{x\in L^n} g(x)$.

The proof of Theorem 9 is described in Appendix section A.10.

###### Theorem 10

PseudoDEMinMLP (24) exactly solves DEMinMLP (13), in the sense that:

1. PseudoDEMinMLP (24) achieves the same optimal objective value as DEMinMLP (13), which is equal to $\min_{x\in L^n} g(x) - \max_{x\in L^n} g(x)$,

2. any optimal solution of PseudoDEMinMLP (24) defines an actual delta-marginals-set with respect to $\mathrm{Front}(S)$ originating from a joint delta-distribution of the RMLV which is optimal for DEMinMLP (13).

The proof of Theorem 10 is described in Appendix section A.11.

## 8 Computation of a full integral MAP solution of the HoMPP

It might be the case that a HoMPP instance has multiple MAP solutions (i.e., $g$ might have multiple minima and/or multiple maxima); thus, the resolution of either PseudoEMinMLP (22) or PseudoEMaxMLP (23) (resp. of PseudoDEMinMLP (24)) might only yield the marginals with respect to $\mathrm{Front}(S)$ of a fractional (i.e., non-binary) optimal distribution (resp. of a fractional (i.e., non-signed-binary) optimal delta-distribution) happening to be some convex combination of optimal binary distributions (resp. delta-distributions). In such a case, one moreover needs to join the pieces in order to obtain a full MAP solution of the HoMPP instance. Therefore, the goal in the remainder of this section is to address the latter problem under general assumptions about a HoMPP instance.

### 8.1 Theory

For the sake of example, assume that the resolution of PseudoEMinMLP (22) has yielded an optimal solution. Then, by Theorem 9, this solution stands for a set of marginal distributions with respect to $\mathrm{Front}(S)$ originating from a joint distribution of the RMLV which is, thus, optimal for EMinMLP (8). Moreover, by Theorem 1, obtaining a full optimal solution of MinMPP (2) amounts to obtaining a sample from that optimal distribution; however, for the sake of computational efficiency, one wants to avoid accessing the joint distribution itself (which is hard). Therefore, in the remainder of this section, we describe an approach for computing such a sample directly from the optimal marginals.

A first naive (yet polynomial-time) algorithm for achieving the aforementioned goal is based on the result of Proposition 5 below.

###### Proposition 5

Suppose $s\in\mathrm{Front}(S)$ and $x_s\in L^{|s|}$ are such that $P(X_s = x_s) > 0$. Then, there exists $x\in L^n$ whose restriction to $s$ equals $x_s$ and such that $P(X = x) > 0$.

###### Proof 3

The result of Proposition 5 follows immediately from the identity:

$$P(X_s = x_s)\ =\ \sum_{i\notin s}\sum_{x_i\in L} P(X = x),\qquad \forall x_s\in L^{|s|}$$

as otherwise, i.e., if for every $x\in L^n$ whose restriction to $s$ equals $x_s$ one had $P(X = x) = 0$, then one would have $P(X_s = x_s) = 0$, which contradicts the assumption that $P(X_s = x_s) > 0$.

Based on such a result of Proposition 5, one may proceed as follows. Suppose $s\in\mathrm{Front}(S)$ and $x_s\in L^{|s|}$ are such that the optimal marginal value of $x_s$ is positive. Thus, if one fixed the value of $X_s$ to $x_s$, solved a new instance of PseudoEMinMLP (22) accordingly, and repeated this procedure with respect to some further hypersite, then another, and so on, until all the variables of $X$ are exhausted, one would be guaranteed to ultimately obtain, in polynomial time, a full mode of $g$. Obviously, such an algorithm is utterly slow, as it requires solving multiple instances of PseudoEMinMLP (22) successively (yet with fewer variables each time). On the other hand, sampling from a general probability distribution by sole access to its marginals is not a straightforward procedure. Fortunately, as will be shown hereafter, the distributions of the RMLV which are candidates for optimality in EMinMLP (8) (equivalently, in EMaxMLP (9)) are not arbitrary (see Proposition 6 below), thereby making it possible to efficiently compute their samples by sole access to their marginals-sets with respect to $\mathrm{Front}(S)$.
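The naive clamp-and-resolve procedure just described can be sketched as follows. This is not the paper's implementation: the instance, the helper names, and above all the replacement of the repeated LP resolutions by brute-force enumeration are invented, purely to illustrate the control flow on a tiny 3-variable example.

```python
# Naive rounding sketch: pick a hypersite, read off a label tuple with positive
# optimal marginal mass (Proposition 5), clamp those variables, and re-solve.
import itertools

def g(x):                       # invented 3-variable objective, labels {0, 1}
    g01 = [[1, 3], [0, 2]]; g12 = [[2, 0], [1, 3]]
    return g01[x[0]][x[1]] + g12[x[1]][x[2]]

def optimal_marginal(s, fixed):
    """Marginal over hypersite s of the uniform distribution on the minimizers
    of g, restricted to assignments agreeing with `fixed` (var -> label).
    A stand-in for re-solving the LP after clamping."""
    cands = [x for x in itertools.product((0, 1), repeat=3)
             if all(x[i] == v for i, v in fixed.items())]
    m = min(map(g, cands))
    mins = [x for x in cands if g(x) == m]
    marg = {}
    for x in mins:
        key = tuple(x[i] for i in s)
        marg[key] = marg.get(key, 0) + 1 / len(mins)
    return marg

fixed = {}
for s in [(0, 1), (1, 2)]:      # exhaust the variables hypersite by hypersite
    marg = optimal_marginal(s, fixed)
    xs = max(marg, key=marg.get)          # any tuple with positive mass works
    fixed.update(dict(zip(s, xs)))

x_star = tuple(fixed[i] for i in range(3))   # a full mode of g
```

On this instance the unique minimizer is recovered; with multiple minima, Proposition 5 guarantees that any positive-mass tuple chosen at each step still extends to a full mode.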

Let us then begin by introducing the sign function defined as:

 ∀a∈