# Confidence-Constrained Maximum Entropy Framework for Learning from Multi-Instance Data

Multi-instance data, in which each object (bag) contains a collection of instances, are widespread in machine learning, computer vision, bioinformatics, signal processing, and social sciences. We present a maximum entropy (ME) framework for learning from multi-instance data. In this approach each bag is represented as a distribution using the principle of ME. We introduce the concept of confidence-constrained ME (CME) to simultaneously learn the structure of distribution space and infer each distribution. The shared structure underlying each density is used to learn from instances inside each bag. The proposed CME is free of tuning parameters. We devise a fast optimization algorithm capable of handling large scale multi-instance data. In the experimental section, we evaluate the performance of the proposed approach in terms of exact rank recovery in the space of distributions and compare it with the regularized ME approach. Moreover, we compare the performance of CME with Multi-Instance Learning (MIL) state-of-the-art algorithms and show a comparable performance in terms of accuracy with reduced computational complexity.


## I Introduction

In the multi-instance data representation, objects are viewed as bags of instances. For example, a document can be viewed as a bag of words, an image can be viewed as a bag of segments, and a webpage can be viewed as a bag of links (see Fig. 1).

Multi-instance data representation has been used in many areas of machine learning and signal processing, e.g., drug activity detection [1], multi-task learning [2], text classification [3], music analysis [4], object detection in images [5], and content-based image categorization [6]. Machine learning algorithms are described as either supervised or unsupervised. Multi-Instance Learning (MIL) refers to the prediction or supervised learning problem [7, 3] in which the main goal is to predict the label of an unseen bag, given the label information of the training bags. On the other hand, unsupervised learning (also referred to as grouped data modeling [8]) can be applied to unlabeled multi-instance data with the goal of uncovering an underlying structure and a representation for each bag in a collection of multiple instance bags. In supervised MIL, each bag is associated with a class label and the goal is to predict the label for an unseen bag given all the instances inside the bag. Due to the ambiguity of the label information related to instances, supervised MIL is a challenging task. Since the introduction of the MIL approach in machine learning and signal processing, numerous algorithms have been proposed either by extending traditional algorithms to MIL, e.g., citation kNN [7], MI-SVM and mi-SVM [3], and neural network MIL [9], or by devising a new algorithm specifically for MIL, e.g., axis-parallel rectangles (APR) [1], diverse density (DD) [10], EM-DD [11], and MIBoosting [12]. MIL has been studied in an unsupervised setting in [13].

Many of the aforementioned algorithms may compute bag-level similarity metrics (e.g., Hausdorff distance or Mahalanobis distance) based on instance-level similarity [14]. Instance-level metrics can become computationally expensive: computing pairwise similarities between the instances of two given bags has a computational complexity that increases quadratically in the number of instances in each bag. Moreover, instance-level metrics may not reflect the structural similarity defined at the bag level, and it is difficult to identify the characteristics of each bag using instance-level similarities [15]. Some kernel approaches have been proposed to measure the similarity at the bag level [15]. This approach enables kernel-based methods for single-instance representation to be extended to the bag level. However, the kernel computation is quadratic in the number of instances per bag. The problem of computational complexity associated with instance-level metrics has been alleviated by representing each bag with a few samples in a very high dimension, e.g., the single-blob-with-neighbors (SBN) representation for each image [16].

In this paper, we consider the problem of associating each bag with a probability distribution obtained by the principle of maximum entropy. Assuming that each instance in a bag is generated from an unknown distribution, we fit to each bag a distribution while maintaining a common shared structure among bags (see Fig. 2).

This approach has several advantages over existing approaches. First, the problem can be solved in a convex framework. Second, it maps each bag of instances into a point in the probability distribution space, providing a summarized representation of the data. In this framework, each bag is parametrized by a vector that carries all the information about the instances inside the bag. Third, a meaningful metric can be defined over the space of distributions to measure the similarity among bags. Moreover, the computational complexity drops from quadratic to linear in the number of instances inside each bag. The joint density estimation framework with regularization facilitates dimension reduction in the distribution space and introduces a sparse representation of bases which span the space of distributions. Using sparse representations of basis functions to learn the space of distributions in a non-parametric framework has been studied in [17].

Our contributions in this paper are as follows: (i) we introduce a new framework for learning from multi-instance data using the principle of maximum entropy; (ii) we introduce a metric over the space of distributions to measure the similarities among bags in multi-instance data; (iii) we propose the confidence-constrained maximum entropy (CME) method to learn the space of distributions jointly; (iv) we propose an accelerated proximal gradient approach to solve the resulting convex optimization problem; (v) we evaluate the performance of the proposed approach in terms of exact rank recovery in the space of distributions and compare it with regularized ME; and (vi) we examine the classification accuracy of CME on four real-world datasets and compare the results with state-of-the-art MIL algorithms.

## II Problem statement

We are given $N$ bags. The $i$th bag is a set of $n_i$ feature vectors, each of which belongs to the instance space $\mathcal{X}$. In addition, we assume that the bags are statistically independent and that instances within the same bag are independent and identically distributed (i.i.d.). More specifically, for the $i$th bag we assume that the $n_i$ instances are drawn i.i.d. from an unknown distribution specific to that bag. We consider the problem of unsupervised learning of a distribution for each bag using the maximum entropy framework. The goals are to provide a latent representation for each bag using a generative model obtained by maximum entropy, and to provide a joint probability framework with a regularization that takes into account the model complexity and the limited number of samples.

## III Maximum entropy framework for learning from multi-instance data

We consider the maximum entropy framework for modeling multi-instance data by associating multi-instance bags with probability distributions. We are interested in a framework that will allow convenient incorporation of structure (e.g., geometric, low-dimensional) in the distribution space. The problem of density estimation can be defined as follows. Given a set of samples $x_1, \dots, x_n$ from an unknown density function $p$, find an estimator for $p$. We use the framework of maximum entropy to estimate $p$ [18, 19]. In the maximum entropy framework one is interested in identifying a unique distribution given a set of constraints on generalized moments of the distribution:

$$
E_p[\phi] = \alpha,
$$

where $\phi$ is a vector of basis functions defined over the instance space $\mathcal{X}$ and $\alpha$ is the vector of the corresponding generalized moments. Note that each basis function can be any real-valued function such as polynomials, splines, or trigonometric series. For example, for a one-dimensional Gaussian distribution, $\phi_1(x) = x$ and $\phi_2(x) = x^2$. Additionally, $\alpha_j$ is the expected value of the $j$th feature function. For example, for a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, $\alpha_1 = \mu$ and $\alpha_2 = \mu^2 + \sigma^2$. This framework has the advantage of not restricting the class of the distribution to a specific density; it considers a wide range of density functions in the exponential family and hence has a good approximation capability. In fact, it is shown that with a rich set of basis functions $\phi$, the approximation error decreases as the number of basis functions grows [20]. We explain the maximum entropy approach below.

### III-A Single density estimation (SDE)

The ME framework for density estimation was first proposed by Jaynes [21] and has been applied in many areas of computer science and signal processing, including natural language processing [22], species distribution modeling [23], text classification [24], and image processing [25]. The ME framework finds a unique probability density function over $\mathcal{X}$ that satisfies the constraints $E_p[\phi] = \alpha$. In principle, many density functions can satisfy the constraints. The maximum entropy approach selects the unique distribution among them that has the maximum entropy. The problem of single density estimation in the maximum entropy framework can be formulated as:

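$$
\begin{aligned}
\underset{p}{\text{maximize}} \quad & H(p) \\
\text{subject to} \quad & E_p[\phi] = \alpha, \quad \int_{\mathcal{X}} p(x)\,dx = 1,
\end{aligned} \tag{1}
$$

where $H(p) = -\int_{\mathcal{X}} p(x)\log p(x)\,dx$ is the entropy of $p$ and $E_p[\phi] = \int_{\mathcal{X}} \phi(x)p(x)\,dx$. For now, we ignore the non-negativity constraints and later show that the optimal solution, despite the exclusion of these constraints, is non-negative. Note that the objective function in (1) is strictly concave (its second derivative with respect to $p(x)$ is $-1/p(x) < 0$); therefore, it has a unique global optimum. To find the optimum solution to (1), we first construct the Lagrangian $L(p, \lambda, \gamma) = H(p) + \lambda^T (E_p[\phi] - \alpha) - \gamma \left( \int_{\mathcal{X}} p(x)\,dx - 1 \right)$, where $\lambda$ and $\gamma$ are dual variables corresponding to the constraints imposed on the solution. Then, we hold $\lambda$ and $\gamma$ constant and maximize the Lagrangian w.r.t. $p$. This yields an expression for $p$ in terms of the dual variables. Finally, we substitute $p$ back into the Lagrangian and solve for the optimal values of $\lambda$ and $\gamma$ ($\lambda^*$ and $\gamma^*$, respectively).

The derivative of the Lagrangian w.r.t. $p(x)$, with $\lambda$ and $\gamma$ held fixed, is given by

$$
\frac{\partial L}{\partial p(x)} = -(\log p(x) + 1) + \lambda^T \phi(x) - \gamma. \tag{2}
$$

Equating (2) to zero and solving for $p$, we obtain

$$
p^*(x; \lambda, \gamma) = \exp(\lambda^T \phi(x)) \exp(-\gamma - 1). \tag{3}
$$

Equation (3) represents $p^*$ in terms of the dual variables $\lambda$ and $\gamma$. We can integrate $p^*$ and set the result to one to obtain $\gamma^*$. Hence, $\gamma^* = Z(\lambda) - 1$, where $Z(\lambda) = \log \int_{\mathcal{X}} \exp(\lambda^T \phi(x))\,dx$. Therefore, $p^*$ can be written as

$$
p^*_\lambda(x) = \exp(\lambda^T \phi(x) - Z(\lambda)). \tag{4}
$$

Note that the maximum entropy distribution given in (4) satisfies the non-negativity constraint, since the exponential function is always non-negative. If we substitute $p^*_\lambda$ and $\gamma^*$ back into the Lagrangian, we obtain the dual function as

$$
g(\lambda) = L(p^*_\lambda(x), \lambda, \gamma^*) = Z(\lambda) - \lambda^T \alpha. \tag{5}
$$

Therefore,

$$
\lambda^* = \arg\min_\lambda g(\lambda). \tag{6}
$$

Since $g(\lambda)$ is a smooth convex function, setting $\nabla g(\lambda) = 0$ yields the optimum solution $\lambda^*$. Note that $\nabla g(\lambda) = E_{p_\lambda}[\phi] - \alpha$, so the optimality condition recovers the moment constraint $E_{p_{\lambda^*}}[\phi] = \alpha$. By the Karush-Kuhn-Tucker theorem, a fundamental result in the theory of Lagrange multipliers, when the problem is convex and all the equality constraints are affine (both hold for (1)), the optimal solution of the dual coincides with the primal optimal solution [26]. In other words, $p^*$ of (3) with $\lambda^*$ given by (6) is in fact the optimum solution to (1). There is a one-to-one mapping between $\alpha$ in (1) and $\lambda^*$ in (6).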
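To make the mapping between the moment $\alpha$ and the dual optimum $\lambda^*$ concrete, the sketch below works through a simple case chosen here for illustration (it is an assumption and not necessarily the authors' example): a single basis function $\phi(x) = x$ on $\mathcal{X} = [0, \infty)$. In that case $Z(\lambda) = -\log(-\lambda)$ for $\lambda < 0$, $p_\lambda$ is an exponential density, and minimizing the dual $g(\lambda) = Z(\lambda) - \lambda\alpha$ should give $\lambda^* = -1/\alpha$. The function names are hypothetical.

```python
import math

def dual(lam, alpha):
    """Dual g(lam) = Z(lam) - lam * alpha for phi(x) = x on [0, inf), lam < 0."""
    return -math.log(-lam) - lam * alpha

def solve_dual(alpha, iters=60):
    """Minimize the dual with Newton's method (assumed solver for this toy case)."""
    lam = -0.5 / alpha                 # any start in (-2/alpha, 0) converges here
    for _ in range(iters):
        grad = -1.0 / lam - alpha      # g'(lam) = E_{p_lam}[x] - alpha
        hess = 1.0 / (lam * lam)       # g''(lam) = Var_{p_lam}[x] > 0
        lam -= grad / hess
    return lam

alpha = 2.0
lam_star = solve_dual(alpha)
print(lam_star, -1.0 / alpha)          # numerical optimum vs. closed form
```

The closed form and the numerical minimizer agree, which illustrates why specifying $\alpha$ is equivalent to specifying $\lambda^*$ in this family.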

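We will now derive the maximum-likelihood (ML) estimator for the parameter $\lambda$ in $p_\lambda$ given observations $x_1, \dots, x_n$. The log-likelihood for (4) can be written as

$$
L(\lambda) = \log p(x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} \left( \lambda^T \phi(x_i) - Z(\lambda) \right) = n \left( \lambda^T E_{\hat{p}}[\phi(x)] - Z(\lambda) \right), \tag{7}
$$

where $E_{\hat{p}}[\phi(x)]$ denotes the empirical average of $\phi$, given by $\frac{1}{n} \sum_{i=1}^{n} \phi(x_i)$. Thus, we can write the negative log-likelihood function as follows:

$$
\begin{aligned}
-L(\lambda) &= -n E_{\hat{p}}[\lambda^T \phi(x) - Z(\lambda)] \\
&= n E_{\hat{p}}[\log \hat{p}] - n E_{\hat{p}}[\lambda^T \phi(x) - Z(\lambda)] - n E_{\hat{p}}[\log \hat{p}] \\
&= n D(\hat{p} \,\|\, p_\lambda) + \Upsilon,
\end{aligned} \tag{8}
$$

where $\Upsilon = -n E_{\hat{p}}[\log \hat{p}]$ is a constant w.r.t. $\lambda$ and $D(\hat{p} \,\|\, p_\lambda) = E_{\hat{p}}[\log(\hat{p}/p_\lambda)]$ is the Kullback-Leibler (KL) divergence. Therefore, maximizing the log-likelihood in (7) w.r.t. $\lambda$ is equivalent to minimizing the KL divergence in (8) w.r.t. $\lambda$. Thus, $\hat{\lambda}$ can be obtained as the solution of the following optimization problem:

$$
\hat{\lambda} = \arg\min_\lambda \; n D(\hat{p} \,\|\, p_\lambda) = \arg\min_\lambda \; n \left( Z(\lambda) - \lambda^T E_{\hat{p}}[\phi(x)] \right). \tag{9}
$$

There are several algorithms for solving ME, e.g., iterative scaling [27] and its variants [23], gradient descent, Newton, and quasi-Newton approaches [28]. The ML optimization problem is convex in $\lambda$ and can be solved efficiently using Newton's method. Newton's method requires the first and second derivatives of the objective function w.r.t. $\lambda$. The derivatives of the objective in (9) are:

$$
\nabla_\lambda = n \left( E_{p_\lambda}[\phi] - E_{\hat{p}}[\phi] \right), \qquad
\nabla^2_\lambda = n \left( E_{p_\lambda}[\phi \phi^T] - E_{p_\lambda}[\phi] E_{p_\lambda}[\phi]^T \right).
$$

The Hessian is $n$ times the covariance of $\phi$ under $p_\lambda$, which is positive semidefinite and consistent with the convexity of (9). Algorithm 1 provides the details for Newton's method implementation.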
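As a rough illustration of a Newton iteration for the ML problem (9), the following sketch fits a one-dimensional maximum entropy density on a grid. It is not the authors' Algorithm 1: the basis $[x, x^2]$, the Riemann-sum approximation of $Z(\lambda)$, the truncated domain, and the backtracking line search are all assumptions made here so the example is self-contained, and the function names are hypothetical.

```python
import numpy as np

def basis(x):
    """Illustrative basis phi(x) = [x, x^2]."""
    return np.stack([x, x ** 2], axis=1)

def log_partition(lam, Phi, dx):
    """Z(lam) = log integral of exp(lam^T phi(x)), approximated by a Riemann sum."""
    s = Phi @ lam
    m = s.max()
    return m + np.log(np.exp(s - m).sum() * dx)

def fit_me_newton(samples, grid, iters=100, tol=1e-10):
    """Minimize Z(lam) - lam^T E_phat[phi] with damped Newton steps."""
    Phi = basis(grid)
    dx = grid[1] - grid[0]
    target = basis(samples).mean(axis=0)               # empirical moments
    lam = np.zeros(Phi.shape[1])
    obj = lambda l: log_partition(l, Phi, dx) - l @ target
    for _ in range(iters):
        p = np.exp(Phi @ lam - log_partition(lam, Phi, dx))  # density on the grid
        mean = (Phi * p[:, None]).sum(axis=0) * dx            # E_p[phi]
        second = (Phi * p[:, None]).T @ Phi * dx              # E_p[phi phi^T]
        grad = mean - target
        if np.linalg.norm(grad) < tol:
            break
        hess = second - np.outer(mean, mean)                  # covariance of phi
        step = np.linalg.solve(hess, grad)
        t = 1.0                                               # backtracking line search
        while t > 1e-8 and obj(lam - t * step) > obj(lam) - 0.25 * t * (grad @ step):
            t *= 0.5
        lam = lam - t * step
    return lam, target

rng = np.random.default_rng(0)
samples = rng.normal(0.5, 1.0, size=500)
grid = np.linspace(-6.0, 6.0, 2001)
lam_hat, target = fit_me_newton(samples, grid)
```

At convergence the moments of the fitted density match the empirical moments of the samples, which is exactly the stationarity condition of (9).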

ME can overfit the data when the number of samples is low or the number of basis functions is large [23]. Regularized ME (RME) has been proposed to overcome the issue of overfitting in ME [29, 30]. RME can be formulated either as relaxing the equality constraint in (1) or as putting a prior on the p.d.f. in (1) [31] (a Laplace prior yields $\ell_1$ regularization and a Gaussian prior yields squared $\ell_2$ regularization). Algorithms for solving RME are proposed in [31, 23]. Convergence analysis for RME is provided in [20, 23]. The problem of single density estimation is presented here to introduce the maximum entropy framework for density estimation. In the next section, the principle of maximum entropy is applied to multiple density estimation.

### III-B Multiple density estimation (MDE)

Multiple density estimation (MDE) for multi-instance data can be done following the same principle as explained for single density estimation in the previous section. In MDE each bag is represented by one distribution, i.e., the $i$th bag is represented by $p_{\lambda_i}$, and the cost function for MDE, due to bag independence, is the sum of the negative log-likelihoods of the individual bags. MDE can be solved using the following minimization:

$$
\hat{\Lambda} = \arg\min_\Lambda \sum_{i=1}^{N} n_i D(\hat{p}_i \,\|\, p_{\lambda_i}) = \arg\min_\Lambda \sum_{i=1}^{N} n_i \left( Z(\lambda_i) - \lambda_i^T E_{\hat{p}}[\phi_i] \right), \tag{10}
$$

where $\Lambda = [\lambda_1, \dots, \lambda_N]$, $E_{\hat{p}}[\phi_i]$ is the empirical average of $\phi$ over the instances of the $i$th bag, $N$ is the total number of bags, and $n_i$ is the total number of instances in the $i$th bag. The objective function in (10) is a sum of functions of individual variables, each of which depends only on the parameters of a single density. Hence, the MDE formulation in (10) considers the density estimation for each bag individually. This individual estimate addresses the nature of each dataset separately and ignores the fact that the underlying structure of the data can be shared among all datasets. This might cause poor generalization performance due to the low number of samples for some bags [32]. To address this, we use a joint regularization on the parameter space to simultaneously learn the structure of the distribution space and infer each distribution while keeping the origin of each dataset uninfluenced. Hierarchical density estimation [30] formulates the problem of MDE using regularization. The regularization is defined on each dataset separately and on the groups of datasets defined in the hierarchy. Note that the hierarchical structure of the data is prior information. However, in most real-world applications the relations among the datasets are unknown beforehand, e.g., in text or image datasets. In the following, we propose a framework for learning jointly in the space of distributions using the principle of maximum entropy.
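The objective in (10) touches the data only through each bag's size and its empirical feature averages, so every bag can be summarized in a single pass whose cost is linear in the number of instances. A minimal sketch, assuming scalar instances and the illustrative basis $[x, x^2]$ (both assumptions; the function names are hypothetical):

```python
def basis(x):
    """Illustrative basis phi(x) = [x, x^2] for a scalar instance x."""
    return [x, x * x]

def bag_moments(bag):
    """Empirical average of phi over one bag, computed in one pass over the bag."""
    total = [0.0] * len(basis(bag[0]))
    for x in bag:
        for j, value in enumerate(basis(x)):
            total[j] += value
    return [t / len(bag) for t in total]

bags = [[0.0, 1.0, 2.0], [1.0, 3.0]]
sizes = [len(b) for b in bags]              # the n_i in (10)
moments = [bag_moments(b) for b in bags]    # the empirical averages in (10)
```

Once `sizes` and `moments` are stored, the raw instances are no longer needed to evaluate or optimize (10).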

## IV Structured multiple density estimation

To improve the power of estimation in the maximum entropy framework, we choose a large number of basis functions $\phi$. This results in a high-dimensional space for the parameter $\lambda$, which causes overfitting and poor generalization performance. An efficient way to reduce the effect of overfitting in density estimation is to reduce the dimensionality of the parameter space. This low-dimensional space corresponds to the hidden structure of the data. Rank minimization is an approach to dimension reduction which finds a linear subspace of the observed data by constraining the dimension of the given matrix. Rank minimization introduces structure in the parameter space. In the following we first define rank recovery in the space of distributions and then show how it can be formulated in the maximum entropy framework.

### IV-A Rank recovery in the space of distributions

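The dimension of the space of distributions is controlled by the size of the basis $\phi$. Often the size of $\phi$ is large to allow accurate approximation of the distribution space. Hence, we are interested in finding a smaller basis that provides a fairly accurate replacement for the original basis $\phi$. We consider the problem of finding a new basis $\psi$ in the span of $\phi$. Suppose a smaller basis can be obtained by $\psi = A\phi$, where $\psi$ contains $k$ functions and $A$ is a matrix with $k$ rows, with $k$ smaller than the size of $\phi$. Instead of using $\lambda_i^T \phi$, which involves as many terms as there are functions in $\phi$, one can use $\beta_i^T \psi$, which involves only $k$ terms. In this case, $\lambda_i^T \phi = \beta_i^T \psi = \beta_i^T A \phi$, so that $\lambda_i = A^T \beta_i$, which results in $\Lambda = A^T B$ with $B = [\beta_1, \dots, \beta_N]$ and $\mathrm{rank}(\Lambda) \le k$. Hence $\Lambda$ is a low-rank matrix.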

### IV-B Regularized MDE (RMDE) using maximum entropy

To obtain a low-rank solution for $\Lambda$, we can solve a nuclear norm regularized MDE. The nuclear norm of a matrix $\Lambda$, denoted $\|\Lambda\|_*$, is defined as the sum of the singular values of $\Lambda$. The nuclear norm is a special case of the Schatten $p$-norm, which is defined as $\|\Lambda\|_p = \left( \sum_j \sigma_j^p \right)^{1/p}$, where $\sigma_j$ are the singular values of $\Lambda$. When $p = 1$, the Schatten norm is equal to the nuclear norm. The nuclear norm enforces sparsity on the singular values of $\Lambda$, which results in a low-rank structure. The heuristic replacement of rank with the nuclear norm has been proposed for various applications such as matrix completion [33, 34], collaborative filtering [35], and multi-task learning [36].

In RMDE, a nuclear norm regularization term is added to the objective function in (10), yielding:

$$
\underset{\Lambda}{\text{minimize}} \quad \sum_{i=1}^{N} n_i \left( Z(\lambda_i) - \lambda_i^T E_{\hat{p}}[\phi_i] \right) + \eta \|\Lambda\|_*, \tag{11}
$$

where $\eta$ is the regularization parameter. RMDE can be viewed as a maximum a posteriori (MAP) criterion using a prior distribution over the matrix $\Lambda$ of the form $p(\Lambda) \propto \exp(-\eta \|\Lambda\|_*)$. This is similar to the interpretation of $\ell_1$-regularization for sparse recovery as MAP with a Laplacian prior. Recently, we proposed a quasi-Newton approach to solve RMDE [37]. RMDE can also be formulated as a constrained MDE as follows:

$$
\underset{\Lambda}{\text{minimize}} \quad \sum_{i=1}^{N} n_i \left( Z(\lambda_i) - \lambda_i^T E_{\hat{p}}[\phi_i] \right) \quad \text{subject to} \quad \|\Lambda\|_* \le \nu, \tag{12}
$$

where $\nu$ is a tuning parameter. For each value of $\eta$ in (11) there is a value of $\nu$ in (12) which produces the same solution [38]. One of the main challenges in regularized and constrained MDE is the choice of the regularization parameters $\eta$ and $\nu$. Often, the regularization parameter is chosen based on cross-validation, which is computationally demanding and is always biased toward the noise in the validation set. There is an extensive discussion in [39] of model selection in topic models. We propose the concept of confidence-constrained rank minimization for jointly learning the space of distributions, which overcomes the issues of parameter tuning in regularized and constrained MDE.

## V Confidence-constrained maximum entropy (CME)

We propose the framework of confidence-constrained maximum entropy (CME) for learning from multi-instance data. The difficulties in tuning the regularization parameters in regularized MDE are addressed in CME by solving a constrained optimization problem where the constraint depends only on the dimensions of the data (i.e., number of instances and number of bags). Using the properties of the maximum entropy framework, an in-probability bound on the objective function in (10) can be obtained. The probability bound on the log-likelihood function allows us to define a confidence set. A confidence set is a high-dimensional generalization of the confidence interval that we use to restrict the search space of the problem. Searching for a low-rank $\Lambda$ inside the confidence set guarantees a low-rank solution with high probability. Hence, in this approach the roles of the ML objective and the rank constraint are reversed: we consider rank minimization subject to an ML objective constraint. The CME is given by:

$$
\underset{\Lambda}{\text{minimize}} \quad \mathrm{Rank}(\Lambda) \quad \text{subject to} \quad \sum_{i=1}^{N} n_i D(p_{\hat{\lambda}_i} \,\|\, p_{\lambda_i}) \le \epsilon(\omega_a), \tag{13}
$$

where $\epsilon(\omega_a)$ is an in-probability bound for the estimation error that depends on $N$, the total number of bags, and on the total number of feature functions in $\phi$. Note that in this formulation the tuning parameter $\epsilon$ can be obtained by bounding the constraint function in (13) using the following theorem.

###### Theorem 1

Let $\hat{\lambda}_i$ be defined as in (9). Then

$$
P\left( \sum_{i=1}^{N} n_i D(p_{\hat{\lambda}_i} \,\|\, p_{\lambda_i}) \ge \epsilon(\omega_a) \right) \le \frac{1}{a},
$$

i.e., the constraint in (13) holds with probability at least $1 - 1/a$.

(For proof, see Appendix A.) This theorem suggests that the original low-rank distributions associated with the bags, $p_{\lambda_1}, \dots, p_{\lambda_N}$, can be found within an $\epsilon$-ball (as in (13)) around the rank-unrestricted ML estimates $p_{\hat{\lambda}_1}, \dots, p_{\hat{\lambda}_N}$ with high probability. Additionally, $\epsilon$ is free of any tuning parameters. It depends only on the dimensions of the dataset, which are available prior to observing the data. Since (13) involves rank minimization, which is non-convex, we provide an alternative convex relaxation to (13) in the following.

### V-A Confidence-constrained maximum entropy nuclear norm minimization (CMEN)

Constrained rank recovery of an unknown matrix has been studied extensively in the signal processing, control systems, and machine learning communities in problems such as matrix completion and matrix decomposition [40]. In general, rank minimization problems are NP-hard [41]. Various algorithms have been proposed to solve the general rank minimization problem locally (e.g., see [42]). To solve the rank minimization problem in (13), we apply the widely adopted approach of replacing rank minimization with the tractable convex optimization problem of nuclear norm minimization. In the following, CME nuclear norm minimization is proposed as a convex alternative to (13):

$$
\underset{\Lambda}{\text{minimize}} \quad \|\Lambda\|_* \quad \text{subject to} \quad \sum_{i=1}^{N} n_i D(p_{\hat{\lambda}_i} \,\|\, p_{\lambda_i}) \le \epsilon. \tag{14}
$$

Since the nuclear norm is a convex function and the constraint defines a convex set, (14) is a convex optimization problem. This nuclear norm regularization encourages a low-rank representation of the feature space, i.e., all features can be represented as a linear combination of a few alternative features. Consider the singular value decomposition $\Lambda = U S V^T$, with singular values $s_j$, left singular vectors $u_j$, and right singular vectors $v_j$; then

$$
\lambda_i^T \phi(x) = \sum_{j=1}^{k} s_j (e_i^T v_j)(u_j^T \phi(x)) = \sum_{j=1}^{k} s_j (e_i^T v_j)\, \psi_j(x) = \beta_i^T \psi(x),
$$

where $k$ is the rank of matrix $\Lambda$, $e_i$ is the $i$th standard basis vector, and $\psi_j(x) = u_j^T \phi(x)$. Similar to principal component analysis, where each data point can be approximated as a linear combination of a few principal components, each bag can be represented as a distribution using a linear combination of only a few basis functions $\psi$. This method facilitates dimension reduction in the space of distributions by representing each distribution with a smaller number of basis functions ($k$ instead of the size of $\phi$).
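The change of basis in the SVD identity above can be checked numerically. The sketch below builds a synthetic rank-2 parameter matrix (sizes and random data are purely illustrative assumptions), forms the reduced basis values $\psi(x) = U^T\phi(x)$ and coefficients $\beta_i$, and compares $\lambda_i^T\phi(x)$ with $\beta_i^T\psi(x)$ for every bag.

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, k = 12, 30, 2                       # illustrative sizes: basis, bags, rank
Lam = rng.standard_normal((m, k)) @ rng.standard_normal((k, N))  # column i is lambda_i

U, s, Vt = np.linalg.svd(Lam, full_matrices=False)
r = int((s > 1e-10 * s[0]).sum())         # numerical rank of Lam
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

phi_x = rng.standard_normal(m)            # stand-in for phi(x) at one point x
psi_x = U_r.T @ phi_x                     # psi_j(x) = u_j^T phi(x)
Beta = s_r[:, None] * Vt_r                # column i is beta_i, entries s_j (e_i^T v_j)

full = Lam.T @ phi_x                      # lambda_i^T phi(x) for every bag i
reduced = Beta.T @ psi_x                  # beta_i^T psi(x) for every bag i
```

Each bag is then described by `r` coefficients instead of `m`, which is the dimension reduction the nuclear norm is meant to encourage.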

## VI Confidence-constrained maximum entropy nuclear norm minimization algorithm (CMENA)

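The optimization problem in (14) can be written as follows:

$$
\underset{\Lambda}{\text{minimize}} \quad f(\Lambda) \quad \text{subject to} \quad g(\Lambda) \le \epsilon, \tag{15}
$$

where $f(\Lambda) = \|\Lambda\|_*$ and $g(\Lambda) = \sum_{i=1}^{N} n_i D(p_{\hat{\lambda}_i} \,\|\, p_{\lambda_i})$. The Lagrangian of (15) is

$$
L(\Lambda, z) = f(\Lambda) + z \left( g(\Lambda) - \epsilon \right), \tag{16}
$$

where $z$ is the Lagrange multiplier. The next step is to minimize the Lagrangian (16) with respect to the primal variable $\Lambda$. Define $\Lambda^*(z)$ as:

$$
\Lambda^*(z) = \arg\min_\Lambda L(\Lambda, z). \tag{17}
$$

By substituting $\Lambda^*(z)$ into the Lagrangian (16), we obtain the dual:

$$
y(z) = L(\Lambda^*(z), z).
$$

The dual formulation is given by the following optimization:

$$
\underset{z}{\text{maximize}} \quad y(z) \quad \text{subject to} \quad z \ge 0.
$$

To optimize the Lagrangian with respect to the primal variable $\Lambda$, we propose to use the proximal gradient approach. In the following, we introduce the proximal gradient algorithm and then show how it can be applied to solve (17).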

Consider a general unconstrained nonsmooth convex optimization problem of the following form:

$$
\text{minimize} \quad P(X) \coloneqq f(X) + g(X), \tag{18}
$$

where $f$ is a convex, lower semicontinuous (lsc) [43] function and $g$ is a smooth convex function (i.e., continuously differentiable). Assume $\nabla g$ is Lipschitz continuous on the domain of $g$, i.e.,

$$
\|\nabla g(X) - \nabla g(Y)\|_F \le \tau_g \|X - Y\|_F, \quad \forall X, Y \in \mathbb{R}^{m \times n},
$$

where $\tau_g$ is some positive scalar. Therefore, a quadratic upper bound on $g$ at the point $X_0$ can be provided as follows:

$$
g(X) \le g(X_0) + \langle X - X_0, \nabla g(X_0) \rangle + \frac{\tau_g}{2} \|X - X_0\|_F^2.
$$

Instead of minimizing $P(X)$ in (18), we minimize an upper bound on $P(X)$, i.e.,

$$
P(X) \le f(X) + g(X_0) + \langle X - X_0, \nabla g(X_0) \rangle + \frac{\tau_g}{2} \|X - X_0\|_F^2 = Q(X, X_0),
$$

where $Q(X, X_0)$ is $f(X)$ plus a simple quadratic local model of $g(X)$ around $X_0$.

To proceed further, we need to define the proximal mapping (operator). A proximal mapping is an operator defined for a convex function $f$ as $\mathrm{prox}_f(Y) = \arg\min_X f(X) + \frac{1}{2}\|X - Y\|_F^2$. For example, if $f$ is the indicator function of a set $C$, the proximal mapping is the projection onto $C$, and if $f$ is the $\ell_1$ norm, the proximal mapping is the soft-thresholding operator [43].

Since $Q(X, X_0)$ can be reformulated as

$$
Q(X, X_0) = f(X) + \frac{\tau_g}{2} \left\| X - \left( X_0 - \frac{1}{\tau_g} \nabla g(X_0) \right) \right\|_F^2 + g(X_0) - \frac{1}{2\tau_g} \|\nabla g(X_0)\|_F^2,
$$

the minimum of $Q(X, X_0)$ is

$$
X^* = \arg\min_X Q(X, X_0) = \arg\min_X f(X) + \frac{\tau_g}{2} \left\| X - \left( X_0 - \frac{1}{\tau_g} \nabla g(X_0) \right) \right\|_F^2 = \Pi(X'),
$$

where $X' = X_0 - \frac{1}{\tau_g} \nabla g(X_0)$. The proximal operator is given by $\Pi(X') = \arg\min_X f(X) + \frac{\tau_g}{2} \|X - X'\|_F^2$. Moreover, it can be found in closed form for some nonsmooth convex functions (e.g., the nuclear norm), which is an advantage when solving large-scale optimization problems [44]. Note that if $f(X) = 0$ then $X^* = X'$, i.e., the proximal gradient algorithm reduces to the standard gradient algorithm. The convergence rate for the proximal gradient algorithm is $O(1/k)$, where $k$ is the number of iterations (see [44], Theorem 2.1).
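The proximal gradient iteration can be sketched on a small vector problem in which $f$ is a scaled $\ell_1$ norm, so that the proximal mapping is elementwise soft thresholding. The separable quadratic $g$, the weight `eta`, and the function names below are illustrative assumptions rather than part of CMENA.

```python
def soft_threshold(v, t):
    """Proximal mapping of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient(grad_g, tau, eta, x0, iters=200):
    """Minimize eta * ||x||_1 + g(x), where grad g is tau-Lipschitz."""
    x = list(x0)
    for _ in range(iters):
        grad = grad_g(x)
        # Gradient step on g with step size 1/tau, then the prox of (eta/tau) * ||.||_1.
        x = [soft_threshold(xi - gi / tau, eta / tau) for xi, gi in zip(x, grad)]
    return x

# g(x) = 0.5 * sum_i c_i (x_i - b_i)^2, whose gradient is Lipschitz with constant max(c).
c = [1.0, 4.0]
b = [3.0, 0.1]
grad_g = lambda x: [ci * (xi - bi) for ci, xi, bi in zip(c, x, b)]
x_star = proximal_gradient(grad_g, tau=max(c), eta=1.0, x0=[0.0, 0.0])
```

For this separable problem the minimizer is known in closed form, `soft_threshold(b_i, eta / c_i)` per coordinate, so the second coordinate is driven exactly to zero while the first is only shrunk.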

### VI-B Proximal gradient algorithm to solve CMEN

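Given that $\nabla g$ is Lipschitz continuous with a parameter $\tau_g$ that depends on the total number of bags and the total number of feature functions (see Appendix B), a quadratic upper bound for (16) can be written as:

$$
L(\Lambda, z) \le \|\Lambda\|_* + z \left( \frac{\tau_g}{2} \|\Lambda - \Lambda'\|_F^2 + g(\Lambda_0) - \frac{1}{2\tau_g} \|\nabla g(\Lambda_0)\|_F^2 - \epsilon \right) = Q(\Lambda, \Lambda_0),
$$

where $\Lambda' = \Lambda_0 - \frac{1}{\tau_g} \nabla g(\Lambda_0)$. The solution to the minimization of $Q(\Lambda, \Lambda_0)$ w.r.t. $\Lambda$ is

$$
\hat{\Lambda}^*(z) = \arg\min_\Lambda Q(\Lambda, \Lambda_0) = D_{\frac{1}{\tau_g z}}(\Lambda'),
$$

where $D_t$ is the soft-thresholding operator on the singular values of a matrix (for proof see [45]), defined by $D_t(\Lambda') = U \max(S - tI, 0) V^T$, where $\Lambda' = U S V^T$ is the SVD of $\Lambda'$ and the maximum is taken elementwise. To find $z^*$ we have to maximize $y(z)$ w.r.t. $z$. Since the parameter $z$ is a scalar, we propose a greedy search approach to find the optimum (see Algorithm 2).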
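The singular value soft-thresholding operator used in this update can be sketched in a few lines of NumPy. The function name `svt` and the test matrix are illustrative; in the update above the threshold would be $1/(\tau_g z)$.

```python
import numpy as np

def svt(X, t):
    """Soft-threshold the singular values of X by t (prox of t * nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt   # scale columns of U, then recombine

X = np.diag([3.0, 1.0, 0.2])
Y = svt(X, 0.5)
singular_values = np.linalg.svd(Y, compute_uv=False)
```

Singular values below the threshold are set to zero, so a larger threshold (a smaller $\tau_g z$) produces a lower-rank iterate.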

### VI-C Step size

In the proximal gradient approach, $\Lambda$ is updated in each iteration based on $\tau_g$. In fact, $1/\tau_g$ plays the role of the step size. However, in practice it is usually very conservative to set a constant step size [44]. As long as the inequality $L(\Lambda, z) \le Q(\Lambda, \Lambda_0)$ holds, the step size can be increased. Therefore, a line-search-like algorithm is proposed to find a smaller value for $\tau_g$ which satisfies the inequality (see Algorithm 2). The pseudo code for CMENA is given in Algorithm 2.

### VI-D Acceleration

The convergence rate for the proximal gradient approach is $O(1/k)$, where $k$ is the number of iterations [44, 43]. The convergence rate can be sped up to $O(1/k^2)$ using the extrapolation technique proposed in [46], given the fact that $\nabla g$ is Lipschitz continuous with parameter $\tau_g$ (see Appendix B). We define the extrapolated solution as follows:

$$
\bar{\Lambda}_k = \Lambda_k + \frac{a_{k-1} - 1}{a_k} \left( \Lambda_k - \Lambda_{k-1} \right),
$$

where $a_k$ is the extrapolation sequence of [46]. The only costly part of the proximal algorithm is the evaluation of the singular values in each iteration. Note that in each application of the soft-thresholding operator we need to know the number of singular values greater than a threshold. As in [44, 45, 47, 39], we use the PROPACK package to compute a partial SVD. Because PROPACK cannot automatically calculate only the singular values that are greater than a specific value, we use the following procedure. We compute a fixed number $b_l$ of singular values at a time, starting from an initial value and updating $b_l$ as follows:

 bl+1={Rank(Λk+1)if Rank(Λk+1)

This procedure stops when . Partial SVD calculation reduces the cost of the computation significantly, especially in the low-rank setting. The pseudo code for calculating SVD is in Algorithm 3.
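The extrapolation step of Section VI-D can be sketched as follows. The update rule for $a_k$ used here, $a_k = (1 + \sqrt{1 + 4a_{k-1}^2})/2$ with $a_0 = 1$, is the rule commonly used in accelerated gradient methods; it is an assumption and may differ from the sequence used by the authors. Function names are hypothetical, and matrices are flattened to lists for simplicity.

```python
import math

def extrapolation_weights(n):
    """Weights (a_{k-1} - 1) / a_k under the assumed rule a_k = (1 + sqrt(1 + 4 a_{k-1}^2)) / 2."""
    a_prev, weights = 1.0, []
    for _ in range(n):
        a = (1.0 + math.sqrt(1.0 + 4.0 * a_prev ** 2)) / 2.0
        weights.append((a_prev - 1.0) / a)
        a_prev = a
    return weights

def extrapolate(current, previous, weight):
    """Extrapolated point: current + weight * (current - previous), elementwise."""
    return [c + weight * (c - p) for c, p in zip(current, previous)]

weights = extrapolation_weights(100)
point = extrapolate([1.0, 2.0], [0.0, 2.0], weights[10])
```

The first weight is zero, so the first iteration is a plain proximal gradient step; later weights grow toward one, which adds more momentum as the iterations proceed.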

## VII Experiments

In this section, we evaluate both theoretical and computational aspects of CMEN and compare it to RMDE for rank recovery in the space of distributions. For the theoretical part, we provide a phase diagram analysis to evaluate the performance of both CMEN and RMDE in exact rank recovery. We then provide an illustration of distribution space dimension reduction using CMEN. Moreover, we show that CMEN introduces a metric which can be used for object similarity recognition in image processing.

### VII-A Phase diagram analysis

We use the notion of phase diagram [48] to evaluate probability of exact rank recovery using CMEN and RMDE for a wide range of matrices of different dimensions (i.e., features size number of bags) and different values for the rank of matrix . We construct distributions using low-rank matrix and draw samples using rejection sampling (data are generated in space). For the basis functions used in constructing the maximum entropy distribution space, we propose and , where i.i.d for . In [49], a similar transformation is used to approximate Gaussian kernels. Figure 3 shows the contour plot of the first distributions used in our experiments. For the random samples drawn from the constructed distributions, we obtain by maximum likelihood estimation (10). Note that is a noisy version of matrix and is full rank.

We consider two different setups for the number of bags: and . We would like to illustrate the performance of CMEN and RMDE in small () and large () scale problems in terms of exact rank recovery. For bags, we vary the number of features and the rank of matrix $\Lambda$ over a grid of with (number of features) ranging through equispaced points in the interval and (rank of matrix $\Lambda$) ranging through equispaced points in the interval (see Fig. 4). Each pixel intensity in the phase diagram corresponds to the empirical evaluation of the probability of exact rank recovery. For each pixel in the phase diagram we produce realizations of . We run CMEN and RMDE for each realization of and compare the rank of the obtained matrix with the rank of the true . The rank evaluation is done by counting the number of singular values of the matrix exceeding a threshold. The threshold is defined based on the empirical distribution of the smallest nonzero singular values of the true matrix (i.e., mean minus three times the standard deviation). To find the regularization parameter $\eta$ in RMDE (11), we consider both a cross-validation approach and a continuation technique [44, 50]. The continuation technique in nuclear norm minimization is similar to the path-following algorithm for solving regularized regression (LASSO) proposed in [51]. Convergence analysis of the continuation technique is shown in [52]. For cross-validation, we consider a range of regularization parameters . For each value of $\eta$, we separate the data into training and test sets ( training and test), evaluate the test error using the objective function in (11), and then select the value of $\eta$ corresponding to the lowest test error. For the continuation technique, we set $\eta$ to a large value and repeatedly solve the optimization problem (11) with a decreasing sequence of $\eta$ until we reach the target value () where . Due to the large value of $\eta$ at the beginning of the algorithm, matrix $\Lambda$ is low-rank, and in each iteration we increase the rank of $\Lambda$. Note that the value of constant in and in is set manually based on preliminary experiments. The stopping criterion for CMEN is the combination of MaxIter , objTol , and consTol , where MaxIter is the maximum number of iterations of the main algorithm, objTol is the tolerance of the objective function, and consTol is the tolerance for violating the confidence constraint. The stopping criterion for RMDE is the same as for CMEN except that consTol is not used. The three panels of Figure 4 show the phase diagram results for exact rank recovery with CMEN, RMDE (cross-validation), and RMDE (continuation technique) for .
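The rank evaluation described above, counting singular values that exceed a threshold set to the mean minus three standard deviations of the smallest nonzero singular values of the true matrices, can be sketched as follows. The use of the population standard deviation, the cutoff `eps` for treating a singular value as nonzero, and the function names are assumptions made here for illustration.

```python
import numpy as np

def smallest_nonzero_sv(M, eps=1e-10):
    """Smallest singular value of M that is larger than eps."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[s > eps].min()

def rank_threshold(true_matrices):
    """Mean minus three standard deviations of the smallest nonzero singular values."""
    values = np.array([smallest_nonzero_sv(M) for M in true_matrices])
    return values.mean() - 3.0 * values.std()

def numerical_rank(M, threshold):
    """Number of singular values of M exceeding the threshold."""
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > threshold).sum())

# Toy realizations of a rank-2 "true" matrix and one noisy full-rank estimate.
truths = [np.diag([3.0, s, 0.0]) for s in (1.0, 1.2, 0.8)]
threshold = rank_threshold(truths)
estimate = np.diag([3.0, 1.0, 0.01])
recovered_rank = numerical_rank(estimate, threshold)
```

Exact rank recovery for one realization then amounts to checking that `recovered_rank` equals the rank of the true matrix; averaging that indicator over realizations gives one pixel of the phase diagram.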

The white regions in Fig. 4 correspond to the probability of exact rank recovery obtained by CMEN and RMDE, respectively. The white area for CMEN is wider than the white areas for both RMDE variants, which means that CMEN is more successful in exact rank recovery compared to RMDE. This is due to the fact that in RMDE the regularization parameter is obtained based on the generalization performance (minimum test error), which does not necessarily guarantee exact rank recovery. Moreover, in CMENA we use a quadratic bound on the main objective function, which results in a closed-form expression for the proximal operator. Based on the Eckart-Young theorem [53], a low-rank matrix attains the lower error in terms of a quadratic cost function. Another observation is that the white area for RMDE with the continuation technique is slightly wider than for RMDE with cross-validation. This could be due to the fact that in the continuation technique we start with a very low-rank matrix and increase the rank gradually until we reach a targeted value, whereas in the cross-validation technique we keep the regularization parameter constant throughout the optimization.
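To make the continuation idea concrete, the sketch below applies it to a simpler problem than RMDE: a masked quadratic loss with a nuclear norm penalty, solved by proximal gradient with singular value thresholding. The quadratic loss stands in for the objective in (11), and the geometric decay factor `eta`, the starting value of `mu`, and all function names are assumptions made for illustration.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def solve_fixed_mu(Y, mask, mu, M0, n_iter=100):
    """Proximal gradient on 0.5*||mask*(M - Y)||_F^2 + mu*||M||_* with unit step."""
    M = M0.copy()
    for _ in range(n_iter):
        grad = mask * (M - Y)          # gradient of the quadratic term
        M = svt(M - grad, mu)          # step size 1 is valid for a 0/1 mask
    return M

def continuation(Y, mask, mu_target, eta=0.5, n_iter=100):
    """Solve for a decreasing sequence of mu, warm-starting each solve."""
    mu = np.linalg.norm(mask * Y, 2)   # large start: the solution there is zero
    M = np.zeros_like(Y)
    ranks = []
    while True:
        mu = max(eta * mu, mu_target)  # shrink mu, but never below the target
        M = solve_fixed_mu(Y, mask, mu, M, n_iter)
        ranks.append(int(np.linalg.matrix_rank(M, tol=1e-8)))
        if mu <= mu_target:
            break
    return M, ranks

# Demo on hypothetical data: a noisy rank-2 matrix with about 70% observed entries.
rng = np.random.default_rng(1)
truth = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
observed = (rng.random(truth.shape) < 0.7).astype(float)
M_hat, rank_path = continuation(truth + 0.01 * rng.standard_normal(truth.shape),
                                observed, mu_target=0.5)
```

The list `rank_path` records the rank after each stage, which is one way to watch the estimate start low-rank and grow as the penalty is relaxed.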

For , we scan the number of features and rank of matrix over a grid of with ranging through equispaced points in the interval and ranging through equispaced points in the interval . Due to the high computational complexity of scanning through different values of in RMDE with cross-validation, and the better exact rank recovery of RMDE with the continuation technique on small scale data (), we compare rank recovery between CMEN and RMDE with the continuation technique in this case. Figure 5 shows the phase diagram results for exact rank recovery with CMEN and RMDE (continuation technique) for . We observe that the white area for CMEN is wider than the white area for RMDE, i.e., CMEN performs better than RMDE in terms of exact rank recovery.

### VII-B Parameter estimation error

We compare the test error vs. runtime for both CMEN and RMDE on a synthetic dataset. We construct a low-rank matrix , generate samples from the low-rank distribution, and estimate matrix using maximum likelihood estimation. Then we obtain matrix using CMEN and RMDE. We consider , , , and . We randomly choose of the data as a training set and test on the rest of the data over different realizations. The test error is evaluated as , where indexes all bags in the test set. Figure 6 shows the results of test error vs. runtime. (All algorithms are implemented in MATLAB and run on a standard desktop computer with GHz CPU (dual core) and GB of memory.) Figure 6 shows the result for . Since initially finding the true model with the correct rank in CMEN is computationally expensive (due to the dual variable update), we observe that RMDE has lower generalization test error than CMEN in the beginning. Overall, however, the generalization test error of CMEN decreases faster than that of RMDE. In Fig. 6, the result is also shown for . We see that as the complexity of the model increases, it takes longer for CMEN and RMDE to find the correct model.

### VII-C Dimension reduction

The purpose of this section is to illustrate how dimension reduction can be achieved using the obtained by CME. Since all the datasets are high dimensional, we use PCA as a preprocessing step. Figure 8 depicts the whole process of implementing our approach for one image in the Corel1000 dataset [54]. We use the block representation of the image followed by PCA to reduce the dimension. The image is represented as a bag of instances where each instance corresponds to a small rectangular patch of pixels. The feature vector describing each patch is the raw pixel intensities (RGB) with PCA applied to reduce the dimension. We perform the CMEN approach to learn a p.d.f. over the block representation of the image.
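The preprocessing described above (cut an image into rectangular blocks, use the raw RGB intensities of each block as an instance, then reduce dimension with PCA) can be sketched as below. The block size, the number of retained components, the choice to fit PCA on the instances passed in, and the function names are assumptions for illustration; the source does not state them here.

```python
import numpy as np

def image_to_bag(image, block):
    """Split an H x W x C image into non-overlapping block x block patches.

    Returns an array with one row per patch holding its raw pixel intensities.
    Pixels that do not fill a complete block at the border are dropped.
    """
    H, W, C = image.shape
    rows, cols = H // block, W // block
    img = image[:rows * block, :cols * block]
    patches = img.reshape(rows, block, cols, block, C).swapaxes(1, 2)
    return patches.reshape(rows * cols, block * block * C).astype(float)

def pca_fit(X, k):
    """Return the mean and the top-k principal directions of the rows of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, components):
    """Project centered rows of X onto the principal directions."""
    return (X - mean) @ components.T

# Demo on a random stand-in image (hypothetical sizes).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 96, 3))
bag = image_to_bag(image, block=8)             # one instance per 8 x 8 patch
mean, components = pca_fit(bag, k=5)
reduced_bag = pca_transform(bag, mean, components)
```

In a multi-bag setting one would typically fit the PCA once on instances pooled from the training bags and reuse `mean` and `components` for every bag, so that all bags live in the same reduced feature space.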

After performing the nuclear norm minimization in (V-A) on the Corel1000 dataset, we select one image as an example. Then, we choose the first few bases of matrix obtained by (V-A) to represent the image as a linear combination of these basis functions. Figure 7 shows the contour plots of these basis functions. To provide intuitive understanding, we name each basis following the content of the image corresponding to instances near the peaks of (concentration of data points). The first column of Fig. 7 is an image and its corresponding estimated density. The other columns show each and the part of the image that corresponds to that .

### VII-D KL-divergence similarity

For classification and retrieval, it is useful to have a similarity measure between bags. The Kullback-Leibler (KL) divergence between two estimated distributions provides such a similarity measure [5]. The KL divergence between two distributions obtained by the maximum entropy approach has a closed form:

$$D(p_{\lambda_i} \,\|\, p_{\lambda_j}) = (\lambda_i - \lambda_j)^T E_{p_{\lambda_i}}[\Phi] - \big(Z(\lambda_i) - Z(\lambda_j)\big).$$

We symmetrize the divergence by adding the reverse divergence $D(p_{\lambda_j} \,\|\, p_{\lambda_i})$:

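$$D(p_{\lambda_i} \,\|\, p_{\lambda_j}) + D(p_{\lambda_j} \,\|\, p_{\lambda_i}) = (\lambda_i - \lambda_j)^T \big(E_{p_{\lambda_i}}[\Phi] - E_{p_{\lambda_j}}[\Phi]\big).$$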
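The two closed forms above can be checked numerically. The sketch below assumes densities of the form $p_\lambda(x) = \exp(\lambda^T \Phi(x) - Z(\lambda))$ with $Z$ the log-partition function, which is the reading consistent with the formulas, and it replaces the continuous domain by a finite set of points so that expectations become sums. The feature matrix `Phi` and all function names are hypothetical.

```python
import numpy as np

def log_partition(lam, Phi):
    """Z(lam) = log sum_x exp(lam^T Phi(x)), computed stably on a finite domain."""
    a = Phi @ lam
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def density(lam, Phi):
    """Probabilities p_lam(x) over the rows of Phi (one row per domain point)."""
    return np.exp(Phi @ lam - log_partition(lam, Phi))

def kl_closed_form(lam_i, lam_j, Phi):
    """D(p_i || p_j) = (lam_i - lam_j)^T E_i[Phi] - (Z(lam_i) - Z(lam_j))."""
    mean_i = density(lam_i, Phi) @ Phi
    return (lam_i - lam_j) @ mean_i - (log_partition(lam_i, Phi) - log_partition(lam_j, Phi))

def symmetric_kl(lam_i, lam_j, Phi):
    """D(p_i || p_j) + D(p_j || p_i) = (lam_i - lam_j)^T (E_i[Phi] - E_j[Phi])."""
    mean_i = density(lam_i, Phi) @ Phi
    mean_j = density(lam_j, Phi) @ Phi
    return (lam_i - lam_j) @ (mean_i - mean_j)

def kl_direct(p, q):
    """Reference value: sum_x p(x) log(p(x) / q(x))."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Demo with random features on a 200-point domain (hypothetical sizes).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 4))
lam_i, lam_j = rng.standard_normal(4), rng.standard_normal(4)
p_i, p_j = density(lam_i, Phi), density(lam_j, Phi)
gap = abs(kl_closed_form(lam_i, lam_j, Phi) - kl_direct(p_i, p_j))
```

Note that the log-partition terms cancel in the symmetrized form, so the bag-to-bag similarity only needs the parameter vectors and the feature expectations.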

Figure 9 shows a set of images and their nearest images identified by KL-divergence similarity. We observe that with the KL-divergence similarity the nearest neighbor images resemble the main images, which validates the efficacy of the proposed similarity measure. Figure 10 shows failure examples in which the nearest neighbor image comes from a different class than the original image. We hypothesize that this is due to the dominance of the color features.

### VII-E Datasets

We also evaluate the classification accuracy of the proposed KL-divergence based similarity measure when used in distance-based multi-instance algorithms such as Citation-kNN [7] and bag-level kernel SVM [15]. We compare KL-divergence to bag-level distance measures that rely on pairwise instance-level comparisons, namely the average Hausdorff distance [7] and the RBF set kernel [15], both in terms of accuracy and runtime. The comparison is conducted over four datasets: the Corel1000 image dataset [54], Musk1, Musk2 [1], and Flowcytometry [55]. The Corel1000 [54] image dataset consists of different classes, each containing images. We use randomly subsampled images from classes: ‘buildings’, ‘buses’, ‘flowers’, and ‘elephants’. We represent each image (bag) as a collection of instances, each of which corresponds to a pixel block and is described by a feature vector of all pixel intensities in color channels (RGB). The Musk1 dataset [1] describes a set of 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 molecules are judged to be non-musks. The Musk2 dataset [1] is a set of 102 molecules, of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The Flowcytometry dataset consists of vector readings of multiple blood cell samples for each one of patients. Each patient has one of two conditions with similar cell characteristics with respect to surface antigens: chronic lymphocytic leukemia (CLL) or mantle cell lymphoma (MCL). Each patient is associated with one bag of multiple cells (instances). Table I summarizes the properties of each dataset.
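For contrast with the KL-based similarity, the sketch below shows one common definition of the average Hausdorff distance between two bags, which requires all pairwise instance-to-instance distances. The exact variant used in the experiments is not spelled out in this section, so the formula (sum of nearest-neighbor distances in both directions divided by the total number of instances) and the function name are assumptions.

```python
import numpy as np

def average_hausdorff(A, B):
    """Average Hausdorff distance between bags A (n_a x d) and B (n_b x d).

    Every instance is matched to its nearest instance in the other bag and
    the matching distances are averaged over all n_a + n_b instances.
    """
    diff = A[:, None, :] - B[None, :, :]          # all pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=2))          # n_a x n_b Euclidean distances
    return (D.min(axis=1).sum() + D.min(axis=0).sum()) / (len(A) + len(B))

# Demo on two random bags with different numbers of instances (hypothetical sizes).
rng = np.random.default_rng(0)
bag_a = rng.standard_normal((50, 10))
bag_b = rng.standard_normal((80, 10))
dist_ab = average_hausdorff(bag_a, bag_b)
```

The pairwise distance matrix costs on the order of `n_a * n_b * d` operations per bag pair, whereas a similarity computed from per-bag parameter vectors does not grow with the number of instances once each bag's distribution has been estimated.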

### VII-F Experimental setup

We use classification accuracy as an evaluation metric. In all experiments, we use the preprocessed datasets obtained by PCA. We perform -fold cross-validation over all datasets. As baselines, we implement a modified version of Citation-kNN [7] replacing the Hausdorff distance with KL-divergence, and a bag-level SVM with the kernel for two bags and defined as , , and the RBF set kernel used by [15]. Below we state the ranges of all tuning parameters for these algorithms used in our experiments. We compared CMEN with RMDE with cross-validation and RMDE with continuation technique. We use a grid of