# PLUME: Polyhedral Learning Using Mixture of Experts

In this paper, we propose a novel mixture of experts architecture for learning polyhedral classifiers. We learn the parameters of the classifier using an expectation maximization algorithm. We derive generalization bounds for the proposed approach. Through an extensive simulation study, we show that the proposed method performs comparably to other state-of-the-art approaches.


## 1 Introduction

In a binary classification problem, if all the positive class examples are concentrated in a single convex region with the negative class examples lying all around that region, then the region of the positive class can be well captured by a polyhedral set. A polyhedral set is a convex set formed by an intersection of a finite number of closed halfspaces (Rockafellar, 1997). An essential property of polyhedral sets is that they can be used to approximate any convex connected subset of ℝ^d. This property makes the learning of polyhedral regions an interesting problem in pattern recognition. Polyhedral classifiers are useful in many real-world applications, e.g., text classification (Sati & Ordin, 2018), cancer detection (Dundar et al., 2008), and visual object detection and classification (Cevikalp & Triggs, 2017).

To learn a classifier in this case, we need to find a closed connected set (e.g., an enclosing ball) which contains all positive examples while leaving all the negative examples outside the set. The support vector data description method (Tax & Duin, 2004) does this by fitting a minimum enclosing hypersphere in the feature space that includes most of the positive class examples while treating all the negative class examples as outliers. In such techniques, the nonlinearity in the data is captured by choosing an appropriate kernel function. With a non-linear kernel function, however, the final classifier may not provide good geometric insight into the class boundaries in the original feature space. Such insights are useful for understanding the local behavior of the classifier in different regions of the feature space.

A well-known approach to learning polyhedral sets is the top-down decision tree method. In a binary classification problem, a top-down decision tree represents each class region as a union of polyhedral sets (Breiman et al., 1984; Duda et al., 2000; Manwani & Sastry, 2011). In our setting, however, all positive examples belong to a single polyhedral set, and top-down decision tree algorithms, owing to their greedy nature, may not learn a single polyhedral set well.

Neither SVM nor top-down decision trees (e.g., CART) can learn a classifier in the form of a single polyhedral set. Even for general data, neither CART nor SVM can learn a classifier representable as compactly as the one learned by PLUME. In the context of explainable AI, a classifier whose decision is based on a few hyperplanes in the original feature space is certainly more understandable than a large decision tree or a classifier that is a linear combination of many kernel functions.

Unlike such general-purpose approaches, there are many specialized approaches for learning polyhedral classifiers. In such methods, we first fix the structure of the polyhedral classifier and then determine the optimal parameters of this fixed structure. In the case of polyhedral classifiers, we can adjust the structure by choosing the number of hyperplanes.

One can formulate a constrained optimization problem to learn polyhedral classifiers (Astorino & Gaudioso, 2002; Dundar et al., 2008; Orsenigo & Vercellis, 2007; Sati & Ordin, 2018). This optimization problem minimizes the sum of classification errors over the training set subject to the separability conditions. Conceptually, learning a polyhedral set requires learning each of the hyperplanes constituting it. But we cannot solve these linear problems (of learning individual hyperplanes) separately because the available training set cannot be easily transformed into training sets for learning individual hyperplanes. While every positive example is a positive example for each of the linear problems, it is not known apriori which negative examples should be used for learning which hyperplane. In (Astorino & Gaudioso, 2002), this problem is solved by first enumerating all possibilities for misclassified negative examples (i.e., which hyperplane is responsible for a negative example getting misclassified; for each negative example there could be many such hyperplanes) and then solving a linear program for each possibility to find a descent direction. This approach becomes computationally very expensive.

(Dundar et al., 2008) assume that, for every linear subproblem, a small subset of negative examples is known and propose a cyclic optimization algorithm. Their assumption of knowing a subset of negative examples corresponding to every linear subproblem is not realistic in many practical applications. Zhou et al. (2016) propose a method in which the positive class is enclosed using the intersection of non-linear surfaces obtained via kernel methods. With linear kernels, this becomes the same as polyhedral learning. However, they use the same objective function as the other constrained optimization formulations.

Manwani & Sastry (2010) propose a logistic function based posterior probability model for polyhedral learning. They learn the parameters using alternating minimization. The method is shown to perform well experimentally, though there are no theoretical guarantees about convergence or generalization errors. A large margin framework for polyhedral learning is discussed in (Kantchelian et al., 2014).

In this paper, we propose a mixture of experts model for learning polyhedral classifiers. The mixture of experts (Jacobs et al., 1991) model contains several linear experts (classifiers), and each expert champions one region of the feature space. It also includes a gating function which decides which expert to use for a particular example. Even though mixture of experts is a generic approach, in the context of learning polyhedral classifiers it carries a unique structure which requires fewer parameters. In particular, it does not need separate parameters to model the gating function; it uses the experts' parameters themselves for the gating function. To the best of our knowledge, this is the first attempt in this direction. We make the following contributions in this paper.

1. We propose a novel mixture of experts architecture to model polyhedral classifiers. We propose an expectation maximization (EM) algorithm to learn the parameters of the polyhedral classifier under this model.

2. We derive data dependent generalization error bounds for the proposed model under the specific constraint that the gating function uses the same parameters as the experts.

3. We perform extensive simulations on various datasets and compare with state-of-the-art approaches to show that our approach learns polyhedral classifiers efficiently.

The rest of the paper is organized as follows. In Section 2, we state the definitions of polyhedral separability and polyhedral classifiers. We describe the mixture of experts model in Section 3 and the corresponding EM algorithm in Section 4. We derive the generalization error bounds for the proposed model in Section 5. We describe the experimental results in Section 6. We conclude the paper with some remarks in Section 7.

## 2 Polyhedral Classification

Let S = {(x_1, y_1), …, (x_N, y_N)} be the training dataset, where x_n ∈ ℝ^d and y_n ∈ {+1, −1}. Let X+ be the set of points for which y_n = +1 and let X− be the set of points for which y_n = −1.

### 2.1 Polyhedral Separability

Two sets X+ and X− in ℝ^d are said to be K-polyhedral separable if there exists a set of K hyperplanes with parameters (w_k, b_k), k = 1, …, K, such that

 w_k^T x + b_k ≥ 0, ∀x ∈ X+, ∀k ∈ {1, …, K}
 min_{k∈{1,…,K}} (w_k^T x + b_k) < 0, ∀x ∈ X−

This means that two sets X+ and X− are K-polyhedral separable if X+ is contained in a convex polyhedral set formed by the intersection of K halfspaces and the points of X− lie outside this polyhedral set. Here all the positive examples satisfy each of a given set of K linear inequalities (that define the halfspaces whose intersection is the polyhedral set). However, each of the negative examples fails to satisfy one (or more) of these inequalities, and we do not know apriori which inequality each negative example fails to satisfy. Thus, the constraint on each of the negative examples is a logical 'OR' of K linear constraints, which makes the optimization problem non-convex.

### 2.2 Polyhedral Classifier

Let Θ = {(w_k, b_k), k = 1, …, K} be the parameters of the K hyperplanes which form the polyhedral set. Here w_k ∈ ℝ^d and b_k ∈ ℝ. Let h(x) be defined as:

 h(x) = min_{k∈{1,…,K}} (w_k^T x + b_k) (1)

Clearly, if h(x) > 0, then the condition w_k^T x + b_k > 0 is satisfied for all k ∈ {1, …, K} and the point x can be assigned to the set X+. Similarly, if h(x) < 0, there exists at least one k for which w_k^T x + b_k < 0 and the point x can be assigned to the set X−. Thus, the polyhedral classifier can be expressed as

 f(x) = sign(h(x)) = sign[min_{k∈{1,…,K}} (w_k^T x + b_k)] (2)

Let w̃_k = (w_k^T, b_k)^T and x̃ = (x^T, 1)^T. From now on, we will express w_k^T x + b_k as w̃_k^T x̃.
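As a concrete illustration, the decision rule in eq.(2) can be sketched in a few lines of NumPy. This is a minimal example of ours, not code from the paper; the function name and the toy square are our own:

```python
import numpy as np

def polyhedral_predict(X, W, b):
    """Evaluate the polyhedral classifier f(x) = sign(min_k (w_k^T x + b_k)).

    X : (N, d) array of points.
    W : (K, d) array of hyperplane normals w_k.
    b : (K,) array of offsets b_k.
    Returns +1 for points inside the polyhedral set, -1 otherwise.
    """
    h = (X @ W.T + b).min(axis=1)   # h(x) of eq. (1), evaluated row-wise
    return np.where(h > 0, 1, -1)

# The open unit square |x_1| < 1, |x_2| < 1 as an intersection of 4 halfspaces
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, -0.5], [-3.0, 3.0]])
print(polyhedral_predict(X, W, b))  # -> [ 1 -1  1 -1]
```

Each negative point is rejected by whichever halfspace it violates, which is exactly the unknown assignment that makes learning the hyperplanes jointly hard.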

## 3 Mixture of Experts Model for Polyhedral Classifier

We propose a new mixture of experts architecture for learning polyhedral classifiers. We model the posterior probability p_Θ(y|x) as a mixture of K logistic functions, where K is the number of hyperplanes associated with the polyhedral classifier. We write the posterior probability of the class labels as

 p_Θ(y|x) = Σ_{k=1}^K p(k|x) p(y|x,k) = Σ_{k=1}^K g_k(x,Θ) σ(y w̃_k^T x̃) (3)

where y ∈ {+1, −1} and σ(z) = 1/(1 + e^{−z}) is the logistic (sigmoid) function. Each expert models the posterior probability using logistic regression. The parameter vector associated with the k-th expert is w̃_k. g_k(x,Θ) is the gating function, which decides how much weightage should be given to expert k for an example x. To ensure that eq.(3) describes a valid posterior probability model, we choose g_k such that g_k(x,Θ) ≥ 0 and Σ_{k=1}^K g_k(x,Θ) = 1. For learning polyhedral classifiers, we construct the gating function using the softmax function as follows:

 g_k(x,Θ) = e^{−γ w̃_k^T x̃} / Σ_{j=1}^K e^{−γ w̃_j^T x̃} (4)

where γ > 0 is a user defined parameter which decides how fast g_k(x,Θ) goes to 0 or 1. Note that the proposed gating function depends on the experts' parameters only. Moreover,

 lim_{γ→∞} g_k(x,Θ) = I{k = argmin_{j∈{1,…,K}} w̃_j^T x̃}

and hence

 lim_{γ→∞} p_Θ(y|x) = 1 / (1 + e^{−y min_{j∈{1,…,K}} w̃_j^T x̃}).

For polyhedrally separable data, we know that f(x) = sign(min_{j∈{1,…,K}} w̃_j^T x̃). Thus, p_Θ(y = +1|x) will be close to 1 if min_j w̃_j^T x̃ > 0. Similarly, p_Θ(y = −1|x) will be close to 1 if min_j w̃_j^T x̃ < 0. Thus, p_Θ(y|x) described in eq.(3) is a valid probability model for polyhedral learning. Note that the model proposed in (Manwani & Sastry, 2010) is a limiting case of the model proposed in eq.(3). The advantages of the proposed model are twofold. First, as the proposed posterior probability function is a smooth function, the resulting EM formulation satisfies the smoothness conditions required for convergence. This is in contrast to the hard partitioning model in Manwani & Sastry (2010). Second, as we will see, the proposed model is also better suited for capturing smooth convex boundaries.
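To make the model concrete, the gating function of eq.(4) and the posterior of eq.(3) can be computed as follows. This is an illustrative sketch of ours with our own function names, assuming augmented inputs x̃ = (x, 1):

```python
import numpy as np

def gating(Xt, Wt, gamma):
    """Softmax gating of eq. (4): g_k(x) proportional to exp(-gamma * w_k^T x)."""
    S = -gamma * (Xt @ Wt.T)             # (N, K) scores
    S -= S.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def posterior(Xt, y, Wt, gamma):
    """Mixture posterior of eq. (3): p(y|x) = sum_k g_k(x) * sigmoid(y w_k^T x)."""
    sigma = 1.0 / (1.0 + np.exp(-y[:, None] * (Xt @ Wt.T)))
    return (gating(Xt, Wt, gamma) * sigma).sum(axis=1)

# A point at the centre of the unit square: every margin w_k^T x equals 1,
# so the gate is uniform and p(y=+1|x) = sigmoid(1), about 0.731
Wt = np.array([[1.0, 0.0, 1.0], [-1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0], [0.0, -1.0, 1.0]])
Xt = np.array([[0.0, 0.0, 1.0]])
print(posterior(Xt, np.array([1.0]), Wt, gamma=10.0))
```

As γ grows, the gate concentrates on the hyperplane attaining min_j w̃_j^T x̃, recovering the limiting posterior above.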

## 4 EM Algorithm for Learning Polyhedral Classifier

As the posterior probability model is a mixture model, we propose an EM approach for learning the parameters. Recall that S = {(x_n, y_n)}_{n=1}^N is the training dataset. In the EM framework, we think of S as incomplete data and take p_Θ(y|x) (given by equation (3)) as the model for the incomplete data. In this problem, we do not know which expert should be used to classify example x_n; this is the missing information. We represent the missing information corresponding to example x_n as z_n = (z_{n1}, …, z_{nK}), where each z_{nk} ∈ {0, 1} and Σ_{k=1}^K z_{nk} = 1. Moreover,

 P(z_{nk} = 1 | x_n, Θ) = g_k(x_n, Θ) = e^{−γ w̃_k^T x̃_n} / Σ_{j=1}^K e^{−γ w̃_j^T x̃_n}. (5)

Let S̄ = {(x_n, y_n, z_n)}_{n=1}^N be the complete data. The complete-data log likelihood is given as follows.

 l_complete(Θ; S̄) = ln[ Π_{n=1}^N Π_{k=1}^K [P(y_n, z_{nk} | x_n, Θ)]^{z_{nk}} ]
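Since P(y_n, z_{nk} = 1 | x_n, Θ) = g_k(x_n, Θ) σ(y_n w̃_k^T x̃_n), the complete-data log likelihood expands into a sum that is linear in the indicators z_{nk}, which is what makes the E-step below tractable:

```latex
l_{\text{complete}}(\Theta;\bar{S})
  = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}
    \Big[\ln g_k(x_n,\Theta) + \ln \sigma\!\big(y_n \tilde{w}_k^{T}\tilde{x}_n\big)\Big]
```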

E-Step: In the E-step, we find Q_N(Θ, Θ^c), which is the expectation of the complete-data log likelihood.

 Q_N(Θ, Θ^c) = E_{z_1,…,z_N}[ l_complete(Θ; S̄) | S, Θ^c ]

where π_k(x_n, Θ^c) = E[z_{nk} | x_n, y_n, Θ^c] is found as follows.

 π_k(x_n, Θ^c) = g_k(x_n, Θ^c) σ(y_n x̃_n^T w̃_k^c) / Σ_{j=1}^K g_j(x_n, Θ^c) σ(y_n x̃_n^T w̃_j^c)
        = e^{−γ x̃_n^T w̃_k^c} σ(y_n x̃_n^T w̃_k^c) / Σ_{j=1}^K e^{−γ x̃_n^T w̃_j^c} σ(y_n x̃_n^T w̃_j^c) (6)

It is easy to see that Q_N(Θ, Θ^c) is a concave function of Θ.

M-Step: In the M-step, we maximize Q_N(Θ, Θ^c) with respect to Θ to find the new parameter set Θ^{c+1}. Since Q_N(Θ, Θ^c) is a concave function of Θ, there exists a unique maximum. However, we do not get a closed form solution for the maximization with respect to Θ. Thus, we find Θ^{c+1} by moving in an ascent direction of Q_N(Θ, Θ^c) starting from Θ^c. We can use one of the following approaches.

• Gradient Ascent: The parameters are updated as

 Θ^{c+1} = Θ^c + α^c g^c (7)

where α^c is the step size at iteration c and g^c is the K(d+1)-dimensional gradient vector at iteration c.

 g = ( ∂Q_N(Θ,Θ^c)/∂w̃_1^T  ∂Q_N(Θ,Θ^c)/∂w̃_2^T  …  ∂Q_N(Θ,Θ^c)/∂w̃_K^T )^T

∂Q_N(Θ,Θ^c)/∂w̃_k can be found as follows.

 ∂Q_N(Θ,Θ^c)/∂w̃_k = −Σ_{n=1}^N [ γ{π_k(x_n,Θ^c) − g_k(x_n,Θ)} − y_n π_k(x_n,Θ^c)(1 − σ(y_n x̃_n^T w̃_k)) ] x̃_n (8)
• Newton Method: The Newton method updates the parameters as follows.

 Θ^{c+1} = Θ^c + α^c (H^c)^{−1} g^c (9)

where H^c is the Hessian matrix at iteration c. H is defined as follows.

 H = [ ∂²Q_N(Θ,Θ^c)/∂w̃_1²    ∂²Q_N(Θ,Θ^c)/∂w̃_1∂w̃_2  …  ∂²Q_N(Θ,Θ^c)/∂w̃_1∂w̃_K
    ∂²Q_N(Θ,Θ^c)/∂w̃_2∂w̃_1  ∂²Q_N(Θ,Θ^c)/∂w̃_2²    …  ∂²Q_N(Θ,Θ^c)/∂w̃_2∂w̃_K
    ⋮              ⋮              ⋱  ⋮
    ∂²Q_N(Θ,Θ^c)/∂w̃_K∂w̃_1  ∂²Q_N(Θ,Θ^c)/∂w̃_K∂w̃_2  …  ∂²Q_N(Θ,Θ^c)/∂w̃_K² ]

g is as given earlier. Also,

 H_kk = ∂²Q_N(Θ,Θ^c)/∂w̃_k² = −Σ_{n=1}^N γ² g_k(x_n,Θ)(1 − g_k(x_n,Θ)) x̃_n x̃_n^T − Σ_{n=1}^N π_k(x_n,Θ^c) σ(w̃_k^T x̃_n)(1 − σ(w̃_k^T x̃_n)) x̃_n x̃_n^T

 H_kr = ∂²Q_N(Θ,Θ^c)/∂w̃_k∂w̃_r = γ² Σ_{n=1}^N g_k(x_n,Θ) g_r(x_n,Θ) x̃_n x̃_n^T, ∀r ≠ k
• The BFGS Algorithm: In the BFGS method (Chong & Zak, 2013), the Hessian matrix is not evaluated directly; its inverse is approximated using rank-two updates specified by gradient evaluations. Let B^c denote the approximation of the inverse Hessian at iteration c. Then

 B^{c+1} = B^c + (1 + (Δg^{cT} B^c Δg^c)/(Δg^{cT} ΔΘ^c)) (ΔΘ^c ΔΘ^{cT})/(ΔΘ^{cT} Δg^c) − (B^c Δg^c ΔΘ^{cT} + ΔΘ^c Δg^{cT} B^{cT})/(Δg^{cT} ΔΘ^c)

where ΔΘ^c = Θ^{c+1} − Θ^c and Δg^c = g^{c+1} − g^c. Then the update equation for the parameter vector is as follows:

 Θ^{c+1} = Θ^c + α^c B^c g^c (10)

where α^c is the step size.

In the above-mentioned iterative optimization approaches, we need to choose the step size α^c appropriately to ensure convergence. There are many ways to find the step size; in this work, we use backtracking line search.
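Putting the E-step of eq.(6) and a gradient-ascent M-step based on eqs.(7)-(8) together, one EM iteration can be sketched as follows. This is our own minimal implementation, with a small fixed step size standing in for backtracking line search; all function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_step(Xt, y, Wt, gamma, lr=0.01, inner_steps=25):
    """One EM iteration: E-step responsibilities, then gradient ascent on Q_N.

    Xt : (N, d+1) augmented inputs; Wt : (K, d+1) current parameters (Theta^c);
    y : (N,) labels in {-1, +1}. Returns the updated parameters Theta^{c+1}.
    """
    # E-step: responsibilities pi_k(x_n, Theta^c) of eq. (6), parameters fixed
    margins = Xt @ Wt.T                          # w_k^T x_n, shape (N, K)
    g = np.exp(-gamma * margins)
    g /= g.sum(axis=1, keepdims=True)            # gating g_k(x_n, Theta^c), eq. (5)
    pi = g * sigmoid(y[:, None] * margins)
    pi /= pi.sum(axis=1, keepdims=True)          # responsibilities pi_k

    # M-step: ascend Q_N(Theta, Theta^c) starting from Theta^c, eqs. (7)-(8)
    W = Wt.copy()
    for _ in range(inner_steps):
        m = Xt @ W.T
        gk = np.exp(-gamma * m)
        gk /= gk.sum(axis=1, keepdims=True)      # g_k(x_n, Theta)
        s = sigmoid(y[:, None] * m)
        # eq. (8): per-example coefficient of x_n in dQ_N/dw_k
        coef = gamma * (gk - pi) + y[:, None] * pi * (1.0 - s)
        W += lr * (coef.T @ Xt) / len(y)         # averaged gradient-ascent step
    return W
```

By the standard EM argument, any Θ^{c+1} with Q_N(Θ^{c+1}, Θ^c) ≥ Q_N(Θ^c, Θ^c) does not decrease the incomplete-data likelihood, which is why a few ascent steps suffice in the M-step.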

## 5 Data Dependent Generalization Error Bounds

In this section, we prove data dependent generalization error bounds. These bounds show the utility of modelling the posterior probability as a smooth function and establish that the method has proper asymptotic properties.

In this problem, we find the parameters by maximizing the likelihood, which is equivalent to minimizing the empirical risk under the cross-entropy loss specified below.

 φ(y, f(x)) = −y ln p_Θ(y|x) − (1 − y) ln(1 − p_Θ(y|x))

where p_Θ(y|x) is as specified in eq.(3). Define the φ-risk as

 R_φ(f) = E[φ(y, f(x))] (11)

Define the empirical risk as

 R̂_φ(f) = (1/N) Σ_{n=1}^N φ(y_n, f(x_n))

Let F be a class of real-valued functions with domain X. Here, we derive data dependent error bounds using Rademacher complexity. The empirical Rademacher complexity is defined as

 R̂_N(F) = E_ε{ sup_{f∈F} (1/N) Σ_{n=1}^N ε_n f(x_n) }

where ε = (ε_1, …, ε_N) is a vector of independent Rademacher random variables (each taking values ±1 with probability 1/2). The Rademacher complexity is the expected value of the empirical Rademacher complexity over all training sets of size N, i.e., R_N(F) = E[R̂_N(F)].
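For intuition, the empirical Rademacher complexity of a norm-bounded linear class (the building block of the bounds below) can be estimated by Monte Carlo and compared with the W_k^max R/√N rate that appears in the lemmas. This is an illustrative sketch of ours; for this class the supremum over w has a closed form via Cauchy-Schwartz:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher_linear(X, w_max, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w^T x : ||w|| <= w_max}, using
    sup_{||w|| <= w_max} (1/N) w^T sum_n eps_n x_n = (w_max/N) ||sum_n eps_n x_n||.
    """
    N = len(X)
    eps = rng.choice([-1.0, 1.0], size=(n_draws, N))  # Rademacher signs
    sums = eps @ X                                    # each row: sum_n eps_n x_n
    return w_max * np.linalg.norm(sums, axis=1).mean() / N

N, d = 50, 5
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_n|| = R = 1
print(empirical_rademacher_linear(X, w_max=1.0))  # below the bound R/sqrt(N) ~ 0.141
```

By Jensen's inequality the estimate stays below W_max R/√N, the rate used in eqs.(14) and (15).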
We formalize our assumptions below to prove the data dependent error bound.

###### Assumption 1

Let ‖x_n‖ ≤ R for all n = 1, …, N, and let ‖w̃_k‖ ≤ W_k^max for all k = 1, …, K.

To simplify the notation, we denote by F the class of functions x ↦ p_Θ(y|x) realizable by the model in eq.(3) under Assumption 1.

We start our discussion regarding data dependent error bounds with the following result from (Bartlett & Mendelson, 2003).

###### Result 2

For every δ ∈ (0, 1) and positive integer N, every f ∈ F satisfies

 R_φ(f) ≤ R̂_φ(f) + 2L_φ R̂_N(F) + 3φ_max √(ln(2/δ)/(2N)) (12)

with probability at least 1 − δ, where L_φ is the Lipschitz constant of the loss φ and φ_max is its maximum value.

To use Result 2, we first upper bound R̂_N(F), where F = {x ↦ p_Θ(y|x)}.

###### Lemma 3

Let F_k = {x ↦ g_k(x,Θ) σ(y w̃_k^T x̃)}. Then, R̂_N(F) ≤ Σ_{k=1}^K R̂_N(F_k).

Proof:

 R̂_N(F) = E_ε{ sup_Θ (1/N) Σ_{n=1}^N ε_n Σ_{k=1}^K g_k(x_n,Θ) σ(y_n w̃_k^T x̃_n) }
   ≤ E_ε{ Σ_{k=1}^K sup_Θ (1/N) Σ_{n=1}^N ε_n g_k(x_n,Θ) σ(y_n w̃_k^T x̃_n) }
   = Σ_{k=1}^K E_ε{ sup_Θ (1/N) Σ_{n=1}^N ε_n g_k(x_n,Θ) σ(y_n w̃_k^T x̃_n) }
   = Σ_{k=1}^K R̂_N(F_k)

Now, in order to bound R̂_N(F) using Lemma 3, we bound each R̂_N(F_k) individually.

###### Lemma 4

Suppose Assumption 1 holds. Then

 R̂_N(F_k) ≤ (3√(K−1)/(√2 K)) Σ_{j=1}^K W_j^max R/√N + W_k^max R/(2√N).

Proof: We decompose R̂_N(F_k) as follows.

 R̂_N(F_k) = E_ε{ sup_Θ (1/N) Σ_{n=1}^N ε_n g_k(x_n,Θ) σ(y_n w̃_k^T x̃_n) }
  = E_ε{ sup_Θ (1/N) Σ_{n=1}^N ε_n g_k(x_n,Θ)(σ(y_n w̃_k^T x̃_n) − 0.5) + (1/N) Σ_{n=1}^N 0.5 ε_n g_k(x_n,Θ) }
  ≤ E_ε{ sup_Θ (1/N) Σ_{n=1}^N ε_n g_k(x_n,Θ)(σ(y_n w̃_k^T x̃_n) − 0.5) } + E_ε{ sup_Θ (1/N) Σ_{n=1}^N 0.5 ε_n g_k(x_n,Θ) }

We define G_{1k} = {x ↦ g_k(x,Θ)} and G_{2k} = {x ↦ σ(y w̃_k^T x̃)}. We further define the product class G_{3k} = {x ↦ g_k(x,Θ)(σ(y w̃_k^T x̃) − 0.5)}.

We can easily check that G_{2k} is closed under negation. Define the class G̃_{2k} = {g − 0.5 : g ∈ G_{2k}}. Thus,

 R̂_N(F_k) ≤ R̂_N(G_{3k}) + 0.5 R̂_N(G_{1k})

Using Lemma 2 from (Azran & Meir, 2004), we observe that

 R̂_N(G_{3k}) ≤ M_1 R̂_N(G_{1k}) + M_2 R̂_N(G̃_{2k})

where M_1 = 1 and M_2 = 0.5 bound the suprema of the two factor classes. Thus,

 R̂_N(F_k) ≤ R̂_N(G_{1k}) + 0.5 R̂_N(G̃_{2k}) + 0.5 R̂_N(G_{1k}) = 1.5 R̂_N(G_{1k}) + 0.5 R̂_N(G̃_{2k})

As R̂_N(G̃_{2k}) is the same as R̂_N(G_{2k}) (the empirical Rademacher complexity is unchanged by a constant shift), we can rewrite the above as follows.

 R̂_N(F_k) ≤ 1.5 R̂_N(G_{1k}) + 0.5 R̂_N(G_{2k}) (13)

To bound R̂_N(F_k), we first bound R̂_N(G_{1k}) using the vector contraction inequality from (Maurer, 2016). One can verify that the Lipschitz constant of g_k with respect to the vector (w̃_1^T x̃_n, …, w̃_K^T x̃_n) is √(K−1)/K. Using the vector contraction inequality from (Maurer, 2016), we can write

 R̂_N(G_{1k}) ≤ √2 (√(K−1)/K) E_ε{ sup_Θ (1/N) Σ_{n=1}^N Σ_{k=1}^K ε_{nk}(w_k^T x_n + b_k) }
  ≤ (√(2(K−1))/K) E_ε{ Σ_{k=1}^K sup_{w_k,b_k} (1/N) Σ_{n=1}^N ε_{nk}(w_k^T x_n + b_k) }
  = (√(2(K−1))/K) E_ε{ Σ_{k=1}^K sup_{w_k} (1/N) w_k^T Σ_{n=1}^N ε_{nk} x_n }

where {ε_{nk}} is an independent Rademacher sequence (the b_k terms vanish in expectation). Using the Cauchy-Schwartz and Jensen inequalities,

 R̂_N(G_{1k}) ≤ (√(2(K−1))/K) Σ_{k=1}^K E_ε{ (1/N) W_k^max ‖Σ_{n=1}^N ε_{nk} x_n‖ } ≤ (√(2(K−1))/K) Σ_{k=1}^K W_k^max R/√N (14)

Now, we bound the second term of eq.(13).

 R̂_N(G_{2k}) = E_ε{ sup_{w̃_k} (1/N) Σ_{n=1}^N ε_n σ(y_n w̃_k^T x̃_n) }

The Lipschitz constant of the sigmoid function is at most 1. Using Theorem 12 of (Bartlett & Mendelson, 2003), we can write the above as

 R̂_N(G_{2k}) ≤ E_ε{ sup_{w_k,b_k} (1/N) Σ_{n=1}^N ε_n y_n (w_k^T x_n + b_k) } = E_ε{ sup_{w_k} (1/N) Σ_{n=1}^N ε_n y_n w_k^T x_n }

As y_n ∈ {+1, −1}, ε_n y_n has the same distribution as ε_n, so we can redefine ε_n y_n as ε_n. With this new definition of ε_n, we can rewrite the above as

 R̂_N(G_{2k}) ≤ E_ε{ sup_{w_k} (1/N) Σ_{n=1}^N ε_n w_k^T x_n } = E_ε{ sup_{w_k} (1/N) w_k^T Σ_{n=1}^N ε_n x_n }

Using the Cauchy-Schwartz and Jensen inequalities,

 R̂_N(G_{2k}) ≤ (1/N) E_ε{ W_k^max ‖Σ_{n=1}^N ε_n x_n‖ } ≤ W_k^max R/√N (15)

Substituting the bounds on R̂_N(G_{1k}) and R̂_N(G_{2k}) from eqs.(14) and (15) into eq.(13), we get the desired result of Lemma 4.

Now, we present the main theorem containing the data dependent generalization error bounds for our approach.

###### Theorem 5

Suppose Assumption 1 holds. Then, for any function f ∈ F, we have

 R_φ(f) ≤ R̂_φ(f, S) + c_1 Σ_{k=1}^K W_k^max R/√N + c_2 √(ln(2/δ)/(2N))

with probability at least 1 − δ, where c_1 = 2L_φ(3√(K−1)/√2 + 1) and c_2 = 3φ_max.

Proof: Using the results of Lemma 3 and Lemma 4, we can bound R̂_N(F) as follows.

 R̂_N(F) ≤ Σ_{k=1}^K R̂_N(F_k) ≤ (3√(K−1)/√2) Σ_{k=1}^K W_k^max R/√N + Σ_{k=1}^K W_k^max R/√N ≤ (3√(K−1)/√2 + 1) Σ_{k=1}^K W_k^max R/√N

If Assumption 1 holds, then the Lipschitz constant of the cross-entropy loss is bounded by some constant L_φ and the maximum value of the cross-entropy loss is bounded by some constant φ_max. Substituting the bound on R̂_N(F) into eq.(12), we get the following generalization error bound for our approach.

 R_φ(f) ≤ R̂_φ(f, S) + c_1 Σ_{k=1}^K W_k^max R/√N + c_2 √(ln(2/δ)/(2N))

where c_1 = 2L_φ(3√(K−1)/√2 + 1) and c_2 = 3φ_max.