Diversity in Machine Learning

07/04/2018 ∙ by Zhiqiang Gong, et al.

Machine learning methods have achieved good performance and have been widely applied in various real-world applications. They can learn models adaptively and thus fit the specific requirements of different tasks. Many factors affect the performance of the machine learning process, among which the diversity of the machine learning process is an important one. Generally, a good machine learning system is composed of plentiful training data, a good model training process, and an accurate inference. Diversity helps each of these procedures guarantee a well-performing machine learning system overall: diversity of the training data ensures that the data contain enough discriminative information, diversity of the learned model (diversity in the parameters of each model or diversity among models) makes each parameter/model capture unique or complementary information, and diversity in inference provides multiple choices, each of which corresponds to a plausible result. However, there has been no systematic analysis of diversification in machine learning systems. In this paper, we systematically summarize the methods for data diversification, model diversification, and inference diversification in the machine learning process. In addition, we survey typical applications where diversity technology improves machine learning performance, including remote sensing imaging tasks, machine translation, camera relocalization, image segmentation, object detection, topic modeling, and others. Finally, we discuss some challenges of diversity technology in machine learning and point out some directions for future work. Our analysis provides a deeper understanding of diversity technology in machine learning tasks and can therefore help in designing and learning more effective models for specific tasks.


I Introduction

Traditionally, machine learning methods can learn a model's parameters automatically from the training samples and can thus provide models with good performance that satisfy the specific requirements of various applications. Indeed, machine learning has achieved great success in tackling many real-world artificial intelligence and data mining problems [1], such as object detection [2, 3], natural image processing [5], autonomous car driving [6], urban scene understanding [7], machine translation [4], web search/information retrieval [8], and others. A successful machine learning system often requires plentiful training data which provides enough information to train the model, a good model learning process which can better model the data, and an accurate inference to discriminate different objects. However, in real-world applications, only a limited number of labelled training samples is available, while machine learning models usually contain large numbers of parameters. This easily leads to the "over-fitting" phenomenon in the machine learning process. Therefore, obtaining an accurate inference from the machine learning model tends to be a difficult task. Many factors can help to improve the performance of the machine learning process, among which the diversity in machine learning plays an important role.

Diversity takes on different meanings depending on context and application [39]. Generally, a diversified system contains more information and can better adapt to various environments. Diversity has already become an important property in many fields, such as biological systems, culture, products, and so on. In particular, the diversity property also has significant effects on the learning process of a machine learning system. We wrote this survey mainly for two reasons. First, while the topic of diversity in machine learning methods has received attention for many years, there is no general framework of diversity technology for machine learning models. Kulesza et al. [39] discussed determinantal point processes (DPPs) in machine learning, but the DPP is only one of the measurements for diversity. [11] mainly summarized the diversity-promoting methods for obtaining multiple diversified search results in the inference phase. Besides, [48, 68, 65] analyzed several methods for classifier ensembles, which represent only a specific form of ensemble learning. None of these works provides a full survey of the topic, nor do they focus on machine learning in its general form. Our main aim is to provide such a survey, hoping to encourage diversity in the general machine learning process. As a second motivation, this survey is also useful to researchers working on designing effective learning processes.

Here, diversity in machine learning mainly works by decreasing the redundancy between the data or within the model and by providing informative data or representative models in the machine learning process. This work discusses the diversity property for the different components of the machine learning process: the training data, the learned model, and the inference. Diversity in machine learning tries to decrease the redundancy in the training data, the learned model, and the inference, and to provide more information for the machine learning process. It can improve the performance of the model and has played an important role in machine learning. In this work, we divide the diversification of machine learning into three categories: diversity in the training data (data diversification), diversity of the model or models (model diversification), and diversity of the inference (inference diversification).

Data diversification can provide samples with enough information to train the machine learning model. Diversity in the training data aims to maximize the information contained in the data. Therefore, the model can learn more information from the data via the learning process, and the learned model fits the data better. Many prior works have imposed diversity on the construction of each training batch so that the model is trained more effectively [9]. In addition, diversity in active learning can make the labelled training data contain the most information [13, 14], and thus the learned model can achieve good performance with limited training samples. Moreover, in the unsupervised learning method of [38], diversity of the pseudo classes encourages the classes to repulse each other, and thus the learned model can extract more discriminative features from the objects.

Model diversification is inspired by the diversity in the human visual system. [16, 17, 18] have shown that the human visual system exhibits decorrelation and sparseness, namely diversity. This makes different neurons respond to different stimuli and generates little redundancy in the learning process, which ensures the high effectiveness of human learning. However, general machine learning methods usually exhibit redundancy in the learned model, where different factors model similar features [19]. Therefore, diversity between the parameters of a model (D-model) can significantly improve the performance of machine learning systems. The D-model tries to encourage different parameters within a model to be diversified so that each parameter models unique information [20, 21]. As a result, the performance of each model can be significantly improved [22]. However, with limited training data, a single machine learning model usually provides only a locally optimal representation of the data. Therefore, ensemble learning, which learns multiple models simultaneously, has become another popular approach that provides multiple choices and has been widely applied in many real-world applications, such as speech recognition [24, 25] and image segmentation [27]. However, general ensemble learning usually makes the learned base models converge to the same or similar local optima. Thus, diversity among the multiple base models of an ensemble (D-models), which tries to repulse the base models from one another and encourages each base model to provide a choice reflecting multi-modal belief [23, 27, 26], can provide multiple diversified choices and significantly improve the performance.

Instead of learning multiple models with D-models, one can also obtain multiple choices in the inference phase, which is generally called multiple choice learning (MCL). However, the choices obtained from usual machine learning systems are similar to each other; for example, the next choice may simply be a one-pixel shifted version of another [28]. To overcome this problem, a diversity-promoting prior can be imposed over the multiple choices obtained from the inference. Under inference diversification, the model can provide choices/representations with more complementary information [29, 31, 30, 32]. This can further improve the performance of the machine learning process and provide multiple discriminative choices for the objects.

This work systematically covers the literature on diversity-promoting methods for data diversification, model diversification, and inference diversification in machine learning tasks. In particular, three main questions arise from the analysis of diversity technology in machine learning.

  • How can we measure the diversity of the training data, the learned model/models, and the inference, and how can we enhance this diversity in a machine learning system? How do these methods affect the diversification of the machine learning system?

  • Is there any difference between the diversification of a single model and of multiple models? Furthermore, is there any similarity between the diversity in the training data, the learned model/models, and the inference?

  • In which real-world applications can diversity be applied to improve the performance of machine learning models? How do the diversification methods work in these applications?

Although all three questions are important, none of them has been thoroughly answered. Diversity in machine learning can balance the training data, encourage the learned parameters to be diversified, and diversify the multiple choices obtained from the inference. By enforcing diversity in the machine learning system, the machine learning model can achieve better performance. Following this framework, the three questions above are answered with both theoretical analysis and real-world applications.

The remainder of this paper is organized as Fig. 1 shows. Section II discusses the general forms of supervised learning and active learning as well as a special form of unsupervised learning. As Fig. 1 shows, Sections III, IV and V introduce the diversity methods in machine learning models. Section III outlines prior works on diversification of the training data. Section IV reviews the strategies for model diversification, including the D-model and the D-models. The prior works on inference diversification are summarized in Section V. Finally, Section VI introduces applications of the diversity-promoting methods in prior works, after which we provide some discussion, conclude the paper, and point out some future directions.

Fig. 1: The basic framework of this paper. The main body of this paper consists of three parts: General Machine Learning Models in Section II, Diversity in Machine Learning in Sections III-V, and Extensive Applications in Section VI.
Fig. 2: Flowchart of the training process of general machine learning (including active learning, supervised learning, and unsupervised learning). When the training data is labelled, the training process is supervised; otherwise, it is unsupervised. Besides, it should be noted that when both labelled and unlabelled data are used for training, the training process is semi-supervised.

II General Machine Learning Models

Traditionally, machine learning consists of supervised learning, active learning, unsupervised learning, and reinforcement learning. In reinforcement learning, training data is given only as feedback to the program's actions in a dynamic environment; it does not require accurate input/output pairs, and sub-optimal actions need not be explicitly corrected. However, diversity technologies mainly work on the model itself to improve the model's performance. Therefore, this work does not cover reinforcement learning and mainly discusses the machine learning models shown in Fig. 2. In the following, we introduce the general form of supervised learning, a representative form of active learning, and a special form of unsupervised learning.

II-A Supervised Learning

We consider general supervised machine learning models, which are commonly used in real-world machine learning tasks. Fig. 2 shows the flowchart of the general machine learning methods considered in this work. As Fig. 2 shows, the supervised machine learning model consists of data pre-processing, training (modeling), and inference. Each of these steps can affect the performance of the machine learning process.

Let $X = \{x_1, x_2, \dots, x_N\}$ denote the set of training samples and $y_i$ the corresponding label of $x_i$, where $y_i \in \mathcal{Y} = \{1, 2, \dots, C\}$ ($\mathcal{Y}$ is the set of class labels, $C$ is the number of classes, and $N$ is the number of labelled training samples). Traditionally, the machine learning task can be formulated as the following optimization problem [33, 34]:

$$\min_{\mathbf{w}} \; \sum_{i=1}^{N} L\big(f(x_i; \mathbf{w}), y_i\big) \quad \text{s.t.} \quad \mathbf{w} \in \mathcal{C}, \qquad (1)$$

where $L(\cdot,\cdot)$ represents the loss function, $\mathbf{w}$ denotes the parameters of the machine learning model, and $\mathcal{C}$ is the constraint set of the parameters of the model. The Lagrangian form of the optimization can then be reformulated as

$$\mathcal{L}(\mathbf{w}) = \sum_{i=1}^{N} L\big(f(x_i; \mathbf{w}), y_i\big) + \gamma \, \Omega(\mathbf{w}), \qquad (2)$$

where $\gamma$ is a positive value and $\Omega(\mathbf{w})$ expresses the constraint as a penalty. Therefore, the machine learning problem can be seen as the minimization of $\mathcal{L}(\mathbf{w})$.
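To make this notation concrete, the following minimal sketch (Python/NumPy) evaluates an objective of the form of Eq. 2. The squared loss and the $\ell_2$ penalty are placeholders of our own choosing for the generic loss $L$ and constraint term $\Omega$; they are not the specific choices of any cited work.

```python
import numpy as np

def lagrangian_objective(w, X, y, gamma):
    """Evaluate a regularized objective in the spirit of Eq. (2).

    A squared loss and an l2 penalty are used purely as placeholders
    for the generic loss L and constraint term Omega of the survey.
    """
    predictions = X @ w                      # linear model f(x; w) = w^T x
    loss = np.sum((predictions - y) ** 2)    # empirical loss term over the samples
    omega = np.sum(w ** 2)                   # constraint expressed as a penalty
    return loss + gamma * omega

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = rng.normal(size=5)
print(lagrangian_objective(w, X, y, gamma=0.1))
```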

Figs. 3 and 4 show the flowcharts of two special forms of supervised learning models which are generally used in real-world applications. Fig. 3 shows the flowchart of a special form of supervised machine learning with a single model. Generally, in the data pre-processing stage, the more diversified and balanced each training batch is, the more effective the training process is. In addition, it should be noted that the factors in the same layer of the model can be diversified to improve the representational ability of the model (which is called the D-model in this paper). Moreover, when we obtain multiple choices from the model in the inference, the obtained choices are desired to provide more complementary information. Therefore, some works focus on the diversification of multiple choices (which we call inference diversification). Fig. 4 shows the flowchart of supervised machine learning with multiple parallel base models. A good strategy to diversify the training sets of the different base models can improve the performance of the whole ensemble (which is called D-models). Furthermore, we can diversify the base models directly to enforce each base model to provide more complementary information for further analysis.

Fig. 3: Flowchart of a special form of supervised machine learning with a single model. Since diversity mainly occurs in the training batch during data pre-processing, this work mainly discusses the diversity of samples in the training batch for data diversification. Generally, the more diversified and balanced each training batch is, the more effective the training process is. In addition, it should be noted that the factors in the same layer of the model can be diversified to improve the representational ability of the model (which is called the D-model in this paper). Moreover, when we obtain multiple choices from the model, the obtained choices are desired to provide more complementary information. Therefore, some works focus on the diversification of multiple choices (which we call inference diversification).
Fig. 4: Flowchart of supervised machine learning with multiple parallel models. A good strategy to diversify the training sets of the different models can improve the performance of the ensemble. Furthermore, we can diversify the models directly to enforce each model to provide more complementary information for further analysis.

II-B Active Learning

Since labelling is costly and time-consuming, enough labelled samples for training usually cannot be provided in real-world applications. Therefore, active learning, which can reduce the labelling cost and keep the training set at a moderate size, plays an important role in the machine learning model [35]. It makes use of the most informative samples and provides higher performance with fewer labelled training samples.

Through active learning, we can choose the most informative samples for labelling to train the model. This paper will take the Convex Transductive Experimental Design (CTED) as a representative of the active learning methods [36, 37].

Denote $Z = \{z_1, z_2, \dots, z_M\}$ as the candidate unlabelled samples for active learning, where $M$ represents the number of candidate unlabelled samples. Then, the active learning problem can be formulated as the following optimization problem [37]:

(3)

where the matrix of reconstruction coefficients and the sample selection vector are the optimization variables, a positive tradeoff parameter balances the two terms, and $\|\cdot\|_F$ represents the Frobenius norm (F-norm), which is the square root of the sum of the squared entries of a matrix. As shown, CTED utilizes a data reconstruction framework to select the most informative samples for labelling: the reconstruction matrix contains the reconstruction coefficients, and the selection vector indicates which samples are chosen. A sparsity-inducing norm makes the learned selection vector sparse. The obtained selection vector is then used to select samples for labelling, and finally the training set is constructed from the selected samples. However, the samples selected by CTED are often similar to each other, which leads to redundancy among the training samples. Therefore, the diversity property is also required in the active learning process.

II-C Unsupervised Learning

As discussed in the former subsection, a limited number of training samples limits the performance of the machine learning process. Besides active learning, unsupervised learning methods provide another way to address this problem by training the machine learning model without labelled training samples. This work mainly discusses a special unsupervised learning process developed by [38], which is an end-to-end self-supervised method.

Denote $C = \{c_1, c_2, \dots, c_K\}$ as the center points used to formulate the pseudo classes in the training process, where $K$ represents the number of pseudo classes. Just as in subsection II-B, $Z = \{z_1, \dots, z_M\}$ represents the unlabelled training samples and $M$ denotes the number of unsupervised samples. Besides, denote $f(z_i; \mathbf{w})$ as the features of $z_i$ extracted from the machine learning model. Then, the pseudo label of the data can be defined as

$$\hat{y}_i = \arg\min_{k \in \{1, \dots, K\}} \; \big\| f(z_i; \mathbf{w}) - c_k \big\|_2. \qquad (4)$$

Then, the problem can be transformed into a supervised one with the pseudo classes. As shown in subsection II-A, the machine learning task can be formulated as the following optimization [38]:

(5)

where the optimization term is used to minimize the intra-class variance of the constructed pseudo classes, subject to the constraints in Eq. 2. By iteratively applying Eq. 4 and Eq. 5, the machine learning model can be trained in an unsupervised manner.

Since the center points play an important role in the construction of the pseudo classes, diversifying the center points and repulsing them from each other can better discriminate the pseudo classes. This has positive effects on the effectiveness of the unsupervised learning process.

II-D Analysis

As the former subsections show, diversity can improve the performance of the machine learning process. In the following, this work summarizes the diversification in machine learning from three aspects: data diversification, model diversification, and inference diversification.

In conclusion, diversification can be used in supervised learning, active learning, and unsupervised learning to improve the model's performance. According to the models in subsections II-A to II-C, the diversification technology in machine learning models is divided into three parts: data diversification (Section III), model diversification (Section IV), and inference diversification (Section V). Since the diversification of the training batch (Fig. 3) and the diversification in active learning and unsupervised learning mainly concern the training data, we summarize the prior works on these topics as data diversification in Section III. Besides, the diversification of the model in Fig. 3 and of the multiple base models in Fig. 4 focuses on the machine learning model directly, so we summarize these works as model diversification in Section IV. Finally, inference diversification (Fig. 3) is summarized in Section V. In the following section, we first introduce data diversification in machine learning models.

III Data Diversification

Obviously, the training data plays an important role in the training process of machine learning models. For supervised learning (subsection II-A), diversified training data provides more plentiful information for learning the parameters. For active learning (subsection II-B), the learning process selects the most informative and least redundant samples for labelling to obtain better performance. Besides, for unsupervised learning (subsection II-C), the pseudo classes can be encouraged to repulse each other so that the model provides more discriminative features without supervision. The following introduces the methods for these forms of data diversification in detail.

III-A Diversification in Supervised Learning

A general supervised learning model is usually trained with mini-batches to estimate the model accurately. Most former works generate the mini-batches randomly. However, due to the imbalance of the training samples under random selection, redundancy may occur in the generated mini-batches, which has negative effects on the effectiveness of the machine learning process. Different from the classical stochastic gradient descent (SGD) method, which relies on uniformly sampling data points to form a mini-batch, [9, 10] propose a non-uniform sampling scheme based on the determinantal point process (DPP) measurement.

A DPP is a distribution over subsets of a fixed ground set which prefers a diverse set of data over a redundant one [39]. Let $\Omega$ denote a continuous space and $X = \{x_1, \dots, x_N\} \subset \Omega$ the data. Then, the DPP is defined through a positive semi-definite kernel on $\Omega$, and the probability of a subset $A \subseteq X$ is

$$P(A) = \frac{\det(L_A)}{\det(L + I)}, \qquad (6)$$

where $L$ denotes the kernel matrix whose entry $L_{ij}$ is the pairwise correlation between the data points $x_i$ and $x_j$, $L_A$ is the submatrix of $L$ indexed by the elements of $A$, $\det(\cdot)$ denotes the matrix determinant, and $I$ is an identity matrix. Since the space $\Omega$ is fixed, $\det(L + I)$ is a constant value. Therefore, the corresponding diversity prior over a subset of data modeled by the DPP can be formulated as

$$P(A) \propto \det(L_A). \qquad (7)$$

In general, the kernel can be divided into a correlation part and a prior part. Therefore, the kernel can be reformulated as

$$L_{ij} = q(x_i)\, S(x_i, x_j)\, q(x_j), \qquad (8)$$

where $q(x_i)$ is the prior (quality) of the data point $x_i$ and $S(x_i, x_j)$ denotes the correlation between the data points. Such kernels induce repulsion between different points, and thus a diverse set of points tends to have higher probability. Generally, the data points are supposed to be uniformly distributed. Therefore, the prior $q(\cdot)$ is a constant value, and the kernel reduces, up to a constant factor, to

$$L_{ij} = S(x_i, x_j). \qquad (9)$$

DPPs provide a probability measure over every configuration of subsets of the data points. Based on a similarity matrix over the data and a determinant operator, the DPP assigns higher probabilities to subsets with dissimilar items. Therefore, it gives lower probabilities to mini-batches which contain redundant data and higher probabilities to mini-batches with more diverse data [9]. This simultaneously balances the data and generates stochastic gradients with lower variance. Moreover, [10] further regularizes the DPP (R-DPP) with an arbitrary fixed positive semi-definite matrix inside the determinant to accelerate the training process.
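As a rough illustration of how a DPP-style preference for diverse mini-batches can be realized, the sketch below greedily builds a batch that maximizes the log-determinant of an RBF kernel submatrix. This greedy log-determinant heuristic only approximates sampling from a k-DPP (exact sampling follows the eigendecomposition construction in [39]); the function and parameter names are our own.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # pairwise similarity matrix; similar points get values close to 1
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def greedy_diverse_batch(X, batch_size, sigma=1.0, eps=1e-6):
    """Greedily pick points whose kernel submatrix has a large determinant,
    i.e. a batch that a DPP would assign high probability to."""
    K = rbf_kernel(X, sigma) + eps * np.eye(len(X))
    selected = [int(np.argmax(np.diag(K)))]          # start from any point
    while len(selected) < batch_size:
        best_gain, best_idx = -np.inf, None
        for i in range(len(X)):
            if i in selected:
                continue
            sub = K[np.ix_(selected + [i], selected + [i])]
            gain = np.linalg.slogdet(sub)[1]         # log det of the candidate batch
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
    return selected

X = np.random.default_rng(0).normal(size=(200, 10))
print(greedy_diverse_batch(X, batch_size=8))
```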

Besides, [12] generalizes the diversification of mini-batch sampling to arbitrary repulsive point processes, such as stationary Poisson disk sampling (PDS). The PDS is one type of repulsive point process. It can provide point arrangements similar to a DPP but with much higher efficiency. The PDS requires that the smallest distance between each pair of sample points be at least $r$ with respect to some distance measurement [12], such as the Euclidean distance or the heat kernel. The measurements can be formulated as

Euclidean distance:

$$d(x_i, x_j) = \| x_i - x_j \|_2, \qquad (10)$$

Heat kernel:

$$d(x_i, x_j) = \exp\!\left( -\frac{\| x_i - x_j \|_2^2}{\sigma^2} \right), \qquad (11)$$

where $\sigma$ is a positive value. Given a new mini-batch $B$, the PDS algorithm works as follows in each iteration.

  • Randomly select a data point $x$ from the dataset.

  • If $\min_{x_j \in B} d(x, x_j) < r$, throw out the point; otherwise, add $x$ to the batch $B$.

The computational complexity of PDS is much lower than that of the DPP.
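A minimal sketch of the Poisson-disk style rejection rule described above, assuming the Euclidean distance and a user-chosen radius r (variable names are ours):

```python
import numpy as np

def pds_minibatch(X, batch_size, r, rng=None):
    """Build a mini-batch in which every pair of points is at least r apart.

    May return fewer than batch_size indices if r is too large for the data.
    """
    rng = rng or np.random.default_rng()
    batch = []
    order = rng.permutation(len(X))          # randomly propose candidate points
    for i in order:
        x = X[i]
        # rejection test: discard x if it is too close to an already accepted point
        if all(np.linalg.norm(x - X[j]) >= r for j in batch):
            batch.append(i)
        if len(batch) == batch_size:
            break
    return batch

X = np.random.default_rng(1).normal(size=(500, 3))
print(pds_minibatch(X, batch_size=16, r=1.5))
```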

Under these diversification priors, such as the DPP and the PDS, each mini-batch consists of training samples with more diversity and information, which trains the model more effectively, and thus the learned model can extract more discriminative features from the objects.

III-B Diversification in Active Learning

As section II-B shows, active learning can obtain good performance with fewer labelled training samples. However, some samples selected by CTED are similar to each other and contain overlapping and redundant information. The highly similar samples introduce redundancy into the training set, which decreases the training efficiency and requires more training samples for comparable performance.

To select more informative and complementary samples with the active learning method, some prior works introduce diversity among the samples selected by CTED (Eq. 3) [14, 13]. To promote diversity between the selected samples, [14] enhances CTED with a diversity regularizer

(12)

where $\|\cdot\|_F$ represents the F-norm and the similarity matrix $S$ is used to model the pairwise similarities among all the samples, such that a larger value of $S_{ij}$ indicates a higher similarity between the $i$-th sample and the $j$-th one. In particular, [14] chooses the cosine similarity measurement to formulate the diversity term, which can be written as

(13)

As [22] notes, the cosine similarity $S_{ij}$ tends to be zero when the $i$-th and $j$-th samples tend to be uncorrelated.

Similarly, [13] defines the diversity term in active learning with the angle of the cosine similarity to obtain a diverse set of training samples. The diversity term can be formulated as

(14)

When two vectors become orthogonal, they tend to be uncorrelated. Therefore, under this diversification, the selected samples are more informative.

Besides, [15] takes advantage of the well-known RBF kernel to measure the diversity of the selected samples; the diversity term can be calculated by

(15)

where $\sigma$ is a positive value. Different from Eqs. 13 and 14, which measure the diversity from the angular view, Eq. 15 calculates the diversity from the distance view. Generally, if two samples are similar to each other, the corresponding term has a large value.

By adding a diversity regularization over the samples selected by active learning, samples with more information and less redundancy are chosen for labelling and then used for training. Therefore, the machine learning process can obtain performance with limited training samples that is comparable to, or even better than, that obtained with plentiful training samples.
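The sketch below illustrates the general idea of this subsection: candidate samples are scored by an informativeness proxy and then re-ranked so that samples too similar (by cosine similarity) to already selected ones are penalized. The informativeness scores are a stand-in for the CTED criterion, and the trade-off weight beta is hypothetical.

```python
import numpy as np

def select_diverse_samples(X, informativeness, n_select, beta=1.0):
    """Greedy selection: informativeness minus a cosine-similarity redundancy penalty."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    selected = []
    for _ in range(n_select):
        best_score, best_idx = -np.inf, None
        for i in range(len(X)):
            if i in selected:
                continue
            # redundancy: largest cosine similarity to anything already chosen
            redundancy = max((abs(Xn[i] @ Xn[j]) for j in selected), default=0.0)
            score = informativeness[i] - beta * redundancy
            if score > best_score:
                best_score, best_idx = score, i
        selected.append(best_idx)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
info = rng.uniform(size=100)          # placeholder for a CTED-style informativeness score
print(select_diverse_samples(X, info, n_select=5))
```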

III-C Diversification in Unsupervised Learning

As subsection II-C shows, the unsupervised learning in [38] is based on the construction of pseudo classes with center points. By repulsing the center points from each other, the pseudo classes are further enforced to be away from one another. If we encourage the center points to be diversified and to repulse each other, the learned features from different classes become more discriminative. Generally, the Euclidean distance can be used to calculate the diversification of the center points. The pseudo label of each sample is still calculated by Eq. 4. Then, the unsupervised learning method with the diversity-promoting prior can be formulated as

(16)

where a positive tradeoff parameter balances the optimization term and the diversity term. Under the diversification term, the center points are encouraged to repulse each other during training. This makes the unsupervised learning process more effective at obtaining discriminative features from samples of different classes.
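A toy sketch of this diversified unsupervised objective: pseudo-labels are assigned by the nearest center (Eq. 4), and the objective combines the intra-class variance with a repulsion term that rewards large pairwise distances between the centers. The exact combination used in [38] may differ; this is only an illustration with names of our own.

```python
import numpy as np

def diversified_pseudo_class_objective(features, centers, lam=0.1):
    """Intra-class variance of the pseudo classes minus a center-repulsion bonus."""
    # pseudo-label each feature by its nearest center (Eq. 4)
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    intra = d2[np.arange(len(features)), labels].mean()

    # diversity term: mean pairwise squared distance between the centers
    cd2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    K = len(centers)
    repulsion = cd2.sum() / (K * (K - 1))

    return intra - lam * repulsion, labels

rng = np.random.default_rng(3)
feats = rng.normal(size=(300, 8))
cents = rng.normal(size=(10, 8))
obj, pseudo = diversified_pseudo_class_objective(feats, cents)
print(obj, np.bincount(pseudo))
```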

IV Model Diversification

In addition to data diversification, which improves performance with more informative and less redundant samples, we can also diversify the model to improve its representational ability directly. As the introduction shows, machine learning methods aim to learn parameters automatically from the training samples. However, due to limited and imbalanced training samples, highly similar parameters are often learned by the general machine learning process. This leads to redundancy in the learned model and negatively affects the model's representational ability.

Therefore, in addition to data diversification, one can also diversify the learned parameters in the training process and thereby improve the representational ability of the model (D-model). Under the diversification prior, each parameter factor models unique information and the whole set of factors models a larger proportion of information [22]. Another method is to obtain diversified multiple models (D-models) through ensemble learning. Traditionally, if we train multiple models separately, the representations obtained from different models tend to be similar, which leads to redundancy between the representations. By regularizing the multiple base models with a diversification prior, different models are enforced to repulse each other and each base model can provide choices reflecting multi-modal belief [27]. In the following subsections, we introduce the diversity methods for the D-model and the D-models in detail.

IV-A D-Model

Fig. 5: Effects of the D-model on improving the performance of the machine learning model. Under model diversification, each parameter factor of the machine learning model tends to model unique information and the whole machine learning model can model more useful information from the objects; thus, the representational ability is improved. The figure shows results from the image segmentation task in [31]. As shown in the figure, the features extracted by the model can better discriminate different objects.

The first method tries to diversify the parameters of a model in the training process to directly improve its representational ability. Fig. 5 shows the effects of the D-model on improving the performance of the machine learning model. As Fig. 5 shows, under the D-model, each factor models unique information, the whole set of factors models a larger proportion of information, and the representational ability is further improved. Traditionally, Bayesian methods and posterior regularization methods can be used to impose diversity over the parameters of the model. Different diversity-promoting priors have been developed in prior works to measure the diversity between the learned parameter factors according to the special requirements of different tasks. This subsection mainly introduces the methods which enforce the diversity of the model and summarizes the methods that have appeared in prior works.

IV-A1 Bayesian Method

Traditionally, diversity-promoting priors can be used to measure the diversification of the model. The parameters of the model can be estimated by the Bayesian method as

$$p(\mathbf{w} \mid X) = \frac{p(X \mid \mathbf{w})\, p(\mathbf{w})}{p(X)} \propto p(X \mid \mathbf{w})\, p(\mathbf{w}), \qquad (17)$$

where $\mathbf{w} = \{\mathbf{w}_1, \dots, \mathbf{w}_K\}$ denotes the parameter factors of the machine learning model, $K$ is the number of factors, $p(X \mid \mathbf{w})$ represents the likelihood of the training set under the constructed model, and $p(\mathbf{w})$ stands for the prior knowledge of the learned model. For the machine learning task at hand, $p(\mathbf{w})$ describes the diversity-promoting prior. Then, the machine learning task can be written as

$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \; p(X \mid \mathbf{w})\, p(\mathbf{w}). \qquad (18)$$

The log-likelihood form of the optimization can be formulated as

$$\mathbf{w}^{*} = \arg\max_{\mathbf{w}} \; \log p(X \mid \mathbf{w}) + \log p(\mathbf{w}). \qquad (19)$$

Then, Eq. 19 can be written as the following optimization:

$$\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) - \log p(\mathbf{w}), \qquad (20)$$

where $\mathcal{L}(\mathbf{w}) = -\log p(X \mid \mathbf{w})$ represents the optimization objective of the model, which can be formulated as in subsection II-A, and the diversity-promoting prior $p(\mathbf{w})$ aims to encourage the learned factors to be diversified. With Eq. 20, the diversity prior can be imposed over the parameters of the learned model.

IV-A2 Posterior Regularization Method

In addition to the Bayesian method, posterior regularization methods can also be used to impose the diversity property over the learned model [138]. Generally, the regularization method can add side information into the parameter estimation and thus encourage the learned factors to possess a specific property. We can therefore use posterior regularization to enforce the learned model to be diversified. The diversity-regularized optimization problem can be formulated as

$$\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) - \lambda\, \Phi(\mathbf{w}), \qquad (21)$$

where $\Phi(\mathbf{w})$ stands for the diversity regularization, which measures the diversity of the factors in the learned model, $\mathcal{L}(\mathbf{w})$ represents the optimization term of the model as in subsection II-A, and $\lambda$ controls the tradeoff between the optimization and the diversification term.

From Eqs. 20 and 21, we can find that posterior regularization has a similar form to the Bayesian method; in general, the optimization in Eq. 20 can be transformed into the form of Eq. 21. Many methods can be applied to measure the diversity property of the learned parameters. In the following, we introduce the different diversity priors used to realize the D-model in detail.
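To make Eqs. 20 and 21 concrete, the sketch below evaluates a regularized objective in which a task loss is augmented with a generic diversity term over the rows of a weight matrix; here the diversity measure is the negative mean absolute pairwise cosine similarity, one of the measurements discussed next. The names and the specific loss are our placeholders, not the formulation of any particular cited paper.

```python
import numpy as np

def diversity(W):
    """Negative mean absolute pairwise cosine similarity between the rows of W
    (larger value = more diverse factors)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = np.abs(Wn @ Wn.T)
    K = len(W)
    off_diag = C.sum() - np.trace(C)
    return -off_diag / (K * (K - 1))

def regularized_objective(W, X, y, lam):
    """In the spirit of Eq. (21): task loss minus a weighted diversity measure."""
    scores = X @ W.T                              # one score per factor/class
    loss = np.mean((scores - y) ** 2)             # placeholder task loss
    return loss - lam * diversity(W)

rng = np.random.default_rng(4)
X, y = rng.normal(size=(64, 16)), rng.normal(size=(64, 4))
W = rng.normal(size=(4, 16))
print(regularized_objective(W, X, y, lam=0.5))
```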

Measurements Papers
Cosine Similarity [22, 20, 44, 19, 43, 47, 134, 135, 136, 137, 125, 46]
Determinantal Point Process [121, 130, 131, 132, 114, 39, 129, 87, 92, 122, 123, 145, 146, 147, 148, 127, 149]
Submodular Spectral Diversity [54]
Inner Product [43, 51]
Euclidean Distance [40, 41, 42]
Heat Kernel [2, 3, 143]
Divergence [40]
Uncorrelation and Evenness [55]
$\ell_{2,1}$ Norm [56, 57, 58, 133, 128, 139, 140, 141, 142]
TABLE I: Overview of the most frequently used diversification methods in the D-model and the papers in which example measurements can be found.

IV-A3 Diversity Regularization

As Fig. 5 shows, the diversity regularization encourages the factors to repulse each other or to be uncorrelated. The key problem in diversity regularization is how to calculate the diversification of the factors in the model. Prior works mainly impose the diversity property on the machine learning process from six aspects, namely the distance, the angle, the eigenvalues, the divergence, the $\ell_{2,1}$ norm, and the DPP. The following introduces these measurements and further discusses their advantages and disadvantages.

Distance-based measurements. The simplest way to formulate the diversity between different factors is the Euclidean distance. Generally, enlarging the distances between different factors decreases the similarity between these factors. Therefore, the redundancy between the factors can be decreased and the factors can be diversified. [40, 41, 42] have applied the Euclidean distance as the measurement to encourage the latent factors in machine learning to be diversified.

In general, the larger the Euclidean distance between two vectors, the more different the vectors are. Therefore, we can diversify different vectors by enlarging the pairwise Euclidean distances between them. The diversity regularization by Euclidean distance from Eq. 21 can then be formulated as

(22)

where the pairwise distances are taken over the $K$ factors which we intend to diversify in the machine learning model. Since the Euclidean distance uses the distance between different factors to measure their similarity, the regularizer in Eq. 22 is generally variant to scale due to the characteristics of the distance. This may decrease the effectiveness of the diversity measurement and does not fit some models with a large scale range.

Another commonly used distance-based method to encourage diversity in machine learning is the heat kernel [2, 3, 143]. The correlation between different factors is formulated through a Gaussian function and can be calculated as

$$k(\mathbf{w}_i, \mathbf{w}_j) = \exp\!\left( -\frac{\| \mathbf{w}_i - \mathbf{w}_j \|_2^2}{\sigma^2} \right), \qquad (23)$$

where $\sigma$ is a positive value. The term $k(\mathbf{w}_i, \mathbf{w}_j)$ measures the correlation between different factors, and we can find that when $\mathbf{w}_i$ and $\mathbf{w}_j$ are dissimilar, $k(\mathbf{w}_i, \mathbf{w}_j)$ tends to zero. Then, the diversity-promoting prior by the heat kernel from Eq. 20 can be formulated as

(24)

The corresponding diversity regularization form can be formulated as

(25)

where the weight of the penalization is a positive value. The heat kernel takes advantage of the distance between the factors to encourage the diversity of the model. It can be noted that the heat kernel has the form of a Gaussian function and the weight of the diversity penalization is affected by the distance. Thus, the heat kernel varies the penalization more adaptively and shows better performance than the plain Euclidean distance.

All the former distance-based methods encourage the diversity of the model by enforcing the factors away from each other so that the factors show more difference. However, it should be noted that the distance-based measurements can be significantly affected by scaling, which can limit the performance of the diversity prior over the machine learning model.
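A small sketch of the two distance-based measurements, assuming the factors are stored as the rows of a matrix: the Euclidean measure rewards large pairwise distances (to be maximized), while the heat-kernel measure penalizes pairs that are close (to be minimized). The bandwidth sigma is chosen by us for illustration.

```python
import numpy as np

def pairwise_sq_dists(W):
    return ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)

def euclidean_diversity(W):
    """Mean pairwise squared Euclidean distance between factors (maximize)."""
    D2, K = pairwise_sq_dists(W), len(W)
    return (D2.sum() - np.trace(D2)) / (K * (K - 1))   # trace is zero; kept for clarity

def heat_kernel_redundancy(W, sigma=1.0):
    """Sum of heat-kernel similarities between distinct factors (minimize)."""
    K_mat = np.exp(-pairwise_sq_dists(W) / sigma ** 2)
    return K_mat.sum() - np.trace(K_mat)               # drop the diagonal (self-similarity)

W = np.random.default_rng(5).normal(size=(6, 32))
print(euclidean_diversity(W), heat_kernel_redundancy(W))
```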

Angular-based measurements. To make the diversity measurement invariant to scale, some works take advantage of the angle between factors to encourage the diversity of the model. Among these works, the cosine similarity measurement is the most commonly used [22, 20]. The cosine similarity measures the similarity between different vectors; in machine learning tasks, it can be used to measure the redundancy between different latent parameter factors [22, 20, 44, 19, 43]. The aim of the cosine similarity prior is to encourage different latent factors to be uncorrelated, such that each factor in the learned model can model unique features of the samples.

The cosine similarity between different factors $\mathbf{w}_i$ and $\mathbf{w}_j$ can be calculated as [45, 46]

$$\cos(\mathbf{w}_i, \mathbf{w}_j) = \frac{\mathbf{w}_i^{T} \mathbf{w}_j}{\| \mathbf{w}_i \|_2 \, \| \mathbf{w}_j \|_2}. \qquad (26)$$

Then, the diversity-promoting prior of the generalized cosine similarity measurement from Eq. 20 can be written as

(27)

It should be noted that when the generalization parameter is set to 1, the diversity-promoting prior over different vectors by cosine similarity from Eq. 20 can be formulated as

(28)

where the weight of the prior is a positive value. It can be noted that under the diversity-promoting prior in Eq. 28, the cosine similarity is encouraged to be 0. Then, $\mathbf{w}_i$ and $\mathbf{w}_j$ tend to be orthogonal, and different factors are encouraged to be uncorrelated and diversified. Besides, the diversity regularization form by the cosine similarity measurement from Eq. 21 can be formulated as

(29)

However, this measurement has the defect that it is variant to orientation. To overcome this problem, many works use the angle of the cosine similarity to measure the diversity between different factors [21, 19, 47].

Since the angle between different factors is invariant to translation, rotation, orientation, and scale, [21, 19, 47] develop angular-based diversifying methods for the Restricted Boltzmann Machine (RBM). These works use the variance and mean value of the angles between different factors to formulate the diversity of the model, thereby overcoming the problem of the cosine similarity. The angle between different factors can be formulated as

$$\theta_{ij} = \arccos\!\left( \frac{\left| \mathbf{w}_i^{T} \mathbf{w}_j \right|}{\| \mathbf{w}_i \|_2 \, \| \mathbf{w}_j \|_2} \right). \qquad (30)$$

Since we do not care about the orientation of the vectors, just as in [21], we prefer the angles to be acute or right. From the mathematical view, two factors tend to be uncorrelated when the angle between them enlarges. Then, the diversity function can be defined as [48, 21, 50, 49]

$$\Phi(\mathbf{w}) = \bar{\theta} - \mathrm{var}(\theta), \qquad (31)$$

where

$$\bar{\theta} = \frac{2}{K(K-1)} \sum_{i < j} \theta_{ij}, \qquad \mathrm{var}(\theta) = \frac{2}{K(K-1)} \sum_{i < j} \big( \theta_{ij} - \bar{\theta} \big)^2.$$

In other words, $\bar{\theta}$ denotes the mean of the angles between different factors and $\mathrm{var}(\theta)$ represents the variance of the angles. Generally, a larger $\Phi(\mathbf{w})$ indicates that the weight vectors are more diverse. Then, the diversity-promoting prior by the angle of the cosine similarity measurement can be formulated as

(32)

The prior in Eq. 32 encourages the angles between different factors to approach $\pi/2$ with small variance, and thus these factors are enforced to be diversified under the diversification prior. Moreover, the measurement is invariant to scale, translation, rotation, and orientation.
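The following sketch computes the mean-minus-variance angular diversity of Eq. 31 for a set of factors. It uses the absolute cosine so that the angles lie in [0, pi/2], matching the preference for acute or right angles; the variable names are ours.

```python
import numpy as np

def angular_diversity(W, eps=1e-12):
    """Mean minus variance of the pairwise angles between factors.

    Larger values indicate factors that are closer to mutually orthogonal
    and whose angles are evenly spread.
    """
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    cos = np.clip(np.abs(Wn @ Wn.T), 0.0, 1.0)       # ignore the orientation of the vectors
    iu = np.triu_indices(len(W), k=1)                # each unordered pair once
    angles = np.arccos(cos[iu])                      # angles in [0, pi/2]
    return angles.mean() - angles.var()

W = np.random.default_rng(6).normal(size=(8, 64))
print(angular_diversity(W))
```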

Another form of the angular-based measurements calculates the diversity with the inner product [43, 51]. Different vectors present more diversity when they are closer to orthogonal. The inner product can measure the orthogonality between different vectors, and therefore it can be applied in machine learning models to obtain more diversity. The general form of the diversity-promoting prior by the inner product measurement can be written as [43, 51]

(33)

Besides, [63] uses a special form of the inner product measurement, which is called exclusivity. The exclusivity between two vectors $\mathbf{w}_i$ and $\mathbf{w}_j$ is defined as

$$\mathcal{X}(\mathbf{w}_i, \mathbf{w}_j) = \| \mathbf{w}_i \odot \mathbf{w}_j \|_0, \qquad (34)$$

where $\odot$ denotes the Hadamard (element-wise) product and $\| \cdot \|_0$ denotes the $\ell_0$ norm. Therefore, the diversity-promoting prior can be written as

(35)

Due to the non-convexity and discontinuity of the $\ell_0$ norm, the relaxed exclusivity is calculated as [63]

$$\mathcal{X}_r(\mathbf{w}_i, \mathbf{w}_j) = \| \mathbf{w}_i \odot \mathbf{w}_j \|_1, \qquad (36)$$

where $\| \cdot \|_1$ denotes the $\ell_1$ norm. Then, the diversity-promoting prior based on the relaxed exclusivity can be calculated as

(37)

The inner product measurement takes advantage of the relationships among the vectors and tries to encourage different factors to be orthogonal so that the learned factors are diversified. It should be noted that this measurement can be seen as a special form of the cosine similarity measurement. Even though the inner product measurement is variant to scale and orientation, in many real-world applications it is usually considered first to diversify the model since it is easier to implement than other measurements.
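A sketch of the inner-product and (relaxed) exclusivity measurements for factors stored as rows of a matrix, following the description above; the relaxed version replaces the non-convex l0 count of the Hadamard product with its l1 norm [63]. How these pairwise quantities are aggregated and weighted differs across the cited works.

```python
import numpy as np

def inner_product_redundancy(W):
    """Sum of squared pairwise inner products (smaller = closer to orthogonal)."""
    G = W @ W.T
    return (G ** 2).sum() - (np.diag(G) ** 2).sum()

def exclusivity(w_i, w_j):
    """l0 norm of the Hadamard product: number of positions both vectors use."""
    return np.count_nonzero(w_i * w_j)

def relaxed_exclusivity(w_i, w_j):
    """l1 norm of the Hadamard product, a convex surrogate of exclusivity."""
    return np.abs(w_i * w_j).sum()

rng = np.random.default_rng(7)
W = rng.normal(size=(5, 20))
print(inner_product_redundancy(W), exclusivity(W[0], W[1]), relaxed_exclusivity(W[0], W[1]))
```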

Besides the distance-based and angular-based measurements, the eigenvalues of the kernel matrix can also be used to encourage different factors to be orthogonal and diversified. Recall that, for an orthogonal matrix, all the eigenvalues of the kernel matrix are equal to 1. Here, we denote $G = \mathbf{W}\mathbf{W}^{T}$ as the kernel matrix of the factor matrix $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]^{T}$. Therefore, when we constrain the eigenvalues to 1, the obtained vectors tend to be orthogonal [52, 53]. Three approaches are generally used to encourage the eigenvalues to approach 1: the submodular spectral diversity (SSD) measurement, the uncorrelation and evenness measurement, and the log-determinant divergence (LDD). In the following, these eigenvalue-based measurements are introduced in detail.

Eigenvalue-based measurements. As denoted above, $G$ stands for the kernel matrix of the latent factors. Two commonly used methods to promote diversity in the machine learning process based on the kernel matrix are introduced here. The first is submodular spectral diversity (SSD), which is based on the eigenvalues of the kernel matrix. [54] introduces the SSD measurement in the process of feature selection, which aims to select a diverse set of features. Feature selection is a key component in many machine learning settings; the process involves choosing a small subset of features in order to build a model that approximates the target concept well.

The SSD measurement uses the squared distance to encourage the eigenvalues to approach 1 directly. Define $\lambda_1, \dots, \lambda_K$ as the eigenvalues of the kernel matrix. Then, the diversity-promoting prior by SSD from Eq. 20 can be formulated as [54]

(38)

where the weight of the prior is also a positive value. From Eq. 21, the diversity regularization can be formulated as

(39)

This measurement regularizes the variance of the eigenvalues of the matrix. Since all the eigenvalues are enforced to approach 1, the obtained factors tend to be more orthogonal and thus the model can present more diversity.

Another diversity measurement based on the kernel matrix is uncorrelation and evenness [55]. This measurement encourages the learned factors to be uncorrelated and to play equally important roles in modeling the data. Formally, this amounts to encouraging the kernel matrix of the vectors to have more uniform eigenvalues. The basic idea is to normalize the eigenvalues into a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have a small Kullback-Leibler (KL) divergence with the uniform distribution [55]. Then, the diversity-promoting prior by uniform eigenvalues from Eq. 20 is formulated as

(40)

subject to the kernel matrix $G$ being positive definite and appropriately normalized. Besides, the diversity-promoting uniform eigenvalue regularizer (UER) from Eq. 21 is formulated as

(41)

where the normalizing constant is the dimension of each factor.

Besides, [53] takes advantage of the log-determinant divergence (LDD) to measure the similarity between different factors. The diversity-promoting prior in [53] combines the orthogonality-promoting LDD regularizer with a sparsity-promoting $\ell_1$ regularizer. Then, the diversity-promoting prior from Eq. 20 can be formulated as

(42)

where $\mathrm{tr}(\cdot)$ denotes the matrix trace. Then, the corresponding regularizer from Eq. 21 is formulated as

(43)

The LDD-based regularizer can effectively promote non-overlap [53]. Under this regularizer, the factors are encouraged to be sparse and orthogonal simultaneously.

These eigenvalue-based measurements calculate the diversity of the factors from the kernel-matrix view. They not only consider the pairwise correlation between the factors but also take the multiple correlation into consideration. Therefore, they generally present better performance than the distance-based and angular-based methods, which only consider the pairwise correlation. However, the eigenvalue-based measurements cost more computational resources in the implementation. Moreover, the gradient of the diversity term used for back propagation is complex to compute and usually requires special processing methods, such as the projected gradient descent algorithm [55] for uncorrelation and evenness.
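As an illustration of the eigenvalue-based view, the sketch below computes (i) an SSD-style penalty measuring how far the eigenvalues of the factors' Gram matrix are from 1, and (ii) a uniform-eigenvalue score based on the KL divergence between the normalized eigenvalue distribution and the uniform distribution. These are simplified forms for illustration only, not the exact regularizers of [54] or [55].

```python
import numpy as np

def gram_eigenvalues(W, eps=1e-12):
    G = W @ W.T                                   # kernel (Gram) matrix of the factors
    return np.clip(np.linalg.eigvalsh(G), eps, None)

def ssd_penalty(W):
    """Squared deviation of the Gram eigenvalues from 1 (smaller = more orthogonal)."""
    lam = gram_eigenvalues(W)
    return ((lam - 1.0) ** 2).sum()

def uniform_eigenvalue_penalty(W):
    """KL divergence between the normalized eigenvalues and the uniform distribution."""
    lam = gram_eigenvalues(W)
    p = lam / lam.sum()
    u = np.full_like(p, 1.0 / len(p))
    return np.sum(p * np.log(p / u))

W = np.random.default_rng(8).normal(size=(6, 40))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # work with normalized factors
print(ssd_penalty(W), uniform_eigenvalue_penalty(W))
```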

DPP measurement. Besides the eigenvalue-based measurements, another measurement which takes the multiple correlation into consideration is the determinantal point process (DPP) measurement. As subsection III-A shows, the DPP prior on the parameter factors has the form

$$p(\mathbf{w}_1, \dots, \mathbf{w}_K) \propto \det(L), \qquad (44)$$

where $L$ is the kernel matrix over the factors. Generally, it encourages the learned factors to repulse each other. Therefore, the DPP-based diversifying prior yields machine learning models with a diverse set of learned factors rather than a redundant one. Some works have shown that the DPP prior is usually not arbitrarily strong for some special cases when applied to machine learning models [60]. To make the DPP prior strong enough for all the training data, it is augmented by an additional positive parameter. Therefore, just as in subsection III-A, the DPP prior can be reformulated as

(45)

where $L$ denotes the kernel matrix and $L_{ij}$ gives the pairwise correlation between $\mathbf{w}_i$ and $\mathbf{w}_j$. The learned factors are usually normalized, and thus the optimization for machine learning can be written as

(46)

where the DPP-based term serves as the diversity term for machine learning. It should be noted that different kernels can be selected according to the special requirements of different machine learning tasks [61, 62]. For example, in [62], a similarity kernel is adopted for the DPP prior, which can be formulated as

(47)

When we set the cosine similarity as the correlation kernel, from the geometric interpretation the DPP prior can be seen as the volume of the parallelepiped spanned by the normalized factors [39]. Therefore, diverse sets are more probable because their feature vectors are more orthogonal and hence span larger volumes. It should be noted that most of the diversity measurements consider the pairwise correlation between the factors and ignore the multiple correlation among three or more factors, while the DPP measurement makes use of the multiple correlation by calculating the similarity among multiple factors simultaneously.
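A sketch of a DPP-style diversity score for the learned factors: the log-determinant of a cosine-similarity kernel built from the normalized factors, which is larger when the factors span a larger volume. The small ridge term and the variable names are our additions for numerical stability, not part of any cited formulation.

```python
import numpy as np

def dpp_diversity(W, eps=1e-6):
    """log det of the cosine-similarity kernel of the factors (larger = more diverse)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    L = Wn @ Wn.T + eps * np.eye(len(W))       # kernel matrix of pairwise correlations
    sign, logdet = np.linalg.slogdet(L)
    return logdet                              # near 0 for orthogonal factors, very negative for redundant ones

W = np.random.default_rng(9).normal(size=(5, 30))
print(dpp_diversity(W))
# near-duplicate rows make the kernel almost singular, so the score drops sharply
print(dpp_diversity(np.vstack([W[0], W[0] + 1e-3, W[2], W[3], W[4]])))
```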

$\ell_{2,1}$ measurement. While all the former measurements promote the diversity of the model from the pairwise or multiple correlation view, many prior works prefer to use the $\ell_{2,1}$ norm for diversity, since it can take advantage of the group-wise correlation and obtain a group-wise sparse representation of the latent factors [56, 57, 58, 59].

It is well known that the $\ell_{2,1}$-norm leads to a group-wise sparse representation of the factors. It can also be used to measure the correlation between different parameter factors and diversify the learned factors to improve the representational ability of the model. Then, the prior from Eq. 20 can be calculated as

(48)

where $w_{i,k}$ denotes the $k$-th entry of the factor $\mathbf{w}_i$. The internal norm encourages each factor to be sparse, while the external norm is used to control the complexity of the entire model. Besides, the diversity term based on the $\ell_{2,1}$ norm from Eq. 21 can be formulated as

(49)

where the normalizing constant is the dimension of each factor. Again, the internal norm encourages each factor to be sparse, while the external norm is used to control the complexity of the entire model.
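A small sketch of group-norm computations of this kind, assuming the factors are the rows of a matrix. The standard $\ell_{2,1}$ norm (sum of row $\ell_2$ norms) and an "exclusive" $\ell_{1,2}$ variant (sum of squared row $\ell_1$ norms) are both shown, since the cited works differ in which norm is taken inside and outside the group.

```python
import numpy as np

def l21_norm(W):
    """Standard l2,1 norm: sum over factors of the l2 norm of each factor."""
    return np.linalg.norm(W, axis=1).sum()

def exclusive_l12(W):
    """'Exclusive sparsity' variant: sum over factors of the squared l1 norm,
    which pushes each individual factor toward a sparse support."""
    return (np.abs(W).sum(axis=1) ** 2).sum()

W = np.random.default_rng(10).normal(size=(4, 12))
print(l21_norm(W), exclusive_l12(W))
```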

In most machine learning models, the parameters of the model can be viewed as vectors, and the diversity of these factors can be calculated mathematically just as in the former measurements. When the norms of the vectors are constrained to 1, we can also treat these factors as probability distributions. Then, the diversity between the factors can also be measured from the Bayesian view.

Divergence measurement. Traditionally, divergence, a Bayesian tool generally used to measure the difference between distributions, can also be used to promote diversity of the learned model [40].

Each factor is first treated as a probability distribution. Then, the divergence between factors $\mathbf{w}_i$ and $\mathbf{w}_j$ can be calculated as

$$KL(\mathbf{w}_i \,\|\, \mathbf{w}_j) = \sum_{k} w_{i,k} \log \frac{w_{i,k}}{w_{j,k}}, \qquad (50)$$

subject to $\sum_{k} w_{i,k} = 1$ and $w_{i,k} \ge 0$ for each factor.

The divergence can measure the dissimilarity between the learned factors, such that the diversity-promoting regularization by divergence from Eq. 21 can be formulated as [40]

(51)

This measurement takes advantage of the characteristics of the divergence to measure the dissimilarity between different distributions. However, the learned factors need to satisfy the normalization constraint above, which limits the application field of this diversity measurement.
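A sketch of the divergence measurement: each factor is first mapped to a probability distribution (a softmax is one simple choice, made by us) and the diversity is the sum of pairwise symmetric KL divergences.

```python
import numpy as np

def to_distribution(w):
    """Map a factor to a probability distribution (softmax is one simple choice)."""
    e = np.exp(w - w.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def divergence_diversity(W):
    """Sum of symmetric pairwise KL divergences between the factor distributions."""
    P = np.array([to_distribution(w) for w in W])
    total = 0.0
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            total += kl(P[i], P[j]) + kl(P[j], P[i])
    return total

W = np.random.default_rng(11).normal(size=(5, 16))
print(divergence_diversity(W))
```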

In conclusion, there are numerous approaches to diversify the learned factors in machine learning models. A summary of the most frequently encountered diversity methods is shown in Table I. Although most papers use slightly different specifications for the diversification of the learned model, the fundamental representation of the diversification is similar. It should also be noted that a common thread among the studied diversity methods is that enforcing diversity in a pairwise form between members strikes a good balance between complexity and effectiveness [63]. In addition, different applications should choose the proper diversity measurements according to the specific requirements of different machine learning tasks.

IV-A4 Analysis

These diversity measurements can calculate the similarity between different vectors and thus encourage the diversity of the machine learning model. However, there exist differences between these measurements, detailed in Table II. It can be noted from the table that all these methods take advantage of the pairwise correlation except the $\ell_{2,1}$ measurement, which uses the group-wise correlation between different factors. Moreover, the determinantal point process, submodular spectral diversity, and uncorrelation and evenness measurements can also take advantage of the correlation among three or more factors.

Another property of these diversity measurements is scale invariance, which makes the diversity of the model invariant w.r.t. the norms of the factors. The cosine similarity measurement calculates the diversity via the angle between different vectors and is therefore scale invariant. As a special case, the cosine similarity can be used as the correlation term in the DPP, and thus the DPP measurement is scale invariant. Besides, for the divergence measurement, since the factors are constrained to be probability distributions, the measurement is scale invariant.

Measurements | Pairwise Correlation | Multiple Correlation | Group-wise Correlation | Scale Invariant
Cosine Similarity | ✓ | ✗ | ✗ | ✓
Determinantal Point Process | ✓ | ✓ | ✗ | ✓
Submodular Spectral Diversity | ✓ | ✓ | ✗ | ✗
Euclidean Distance | ✓ | ✗ | ✗ | ✗
Heat Kernel | ✓ | ✗ | ✗ | ✗
Divergence | ✓ | ✗ | ✗ | ✓
Uncorrelation and Evenness | ✓ | ✓ | ✗ | ✓
Inner Product | ✓ | ✗ | ✗ | ✗
TABLE II: Comparisons of different measurements. ✓ indicates that the measurement possesses the property, while ✗ indicates that it does not.

These measurements can encourage diversity among different vectors. Generally, a machine learning model can be viewed as a set of latent parameter factors, which can be represented as vectors; these factors are learned and used to represent the objects. In the following, we mainly summarize the methods that diversify ensemble learning (D-models) for better performance on machine learning tasks.

IV-B D-Models

The former subsection introduced ways to diversify the parameters within a single model and improve its representational ability directly. Much effort has been devoted in prior works to obtaining the highest-probability configuration of machine learning models. However, even when the training samples are sufficient, the maximum a posteriori (MAP) solution can be sub-optimal. In many situations, one can benefit from additional representations produced by multiple models. As Fig. 4 shows, ensemble learning (the approach of training multiple models) has appeared in many prior works. However, traditional ensemble learning methods may provide representations that tend to be similar, while the representations obtained from different models are desired to provide complementary information. Recently, many diversifying methods have been proposed to overcome this problem. As Fig. 6 shows, under model diversification, each base model of the ensemble can produce different outputs reflecting multi-modal belief. Therefore, the whole performance of the machine learning model can be improved. In particular, the D-models play an important role in structured prediction problems with multiple reasonable interpretations, of which only one is the ground truth [27].

Fig. 6: Effects of D-models on improving the performance of the machine learning model. The figure shows the image segmentation task from the prior work [27]. A single model often produces solutions with low expected loss and steps into sub-optimal results. Besides, general ensemble learning usually provides multiple choices with great similarity. Therefore, this work summarizes the methods which can diversify the ensemble learning (D-models). As the figure shows, under the model diversification, each model of the ensemble can produce different outputs reflecting multi-modal belief [27].

Denote $\mathbf{w}^{m}$ and $\hat{y}^{m}$ as the parameters and the inference of the $m$-th model ($m = 1, \dots, M$), where $M$ is the number of parallel base models. Then, the optimization of machine learning to obtain multiple models can be written as

$$\min_{\{\mathbf{w}^{m}\}_{m=1}^{M}} \; \sum_{m=1}^{M} \mathcal{L}^{m}(\mathbf{w}^{m}; X^{m}), \qquad (52)$$

where $\mathcal{L}^{m}$ represents the optimization term of the $m$-th model and $X^{m}$ denotes the training samples of the $m$-th model. Traditionally, the training samples are randomly divided into multiple subsets and each subset trains a corresponding model. However, selecting the subsets randomly may lead to redundancy between the different representations. Therefore, the first way to obtain multiple diversified models is to diversify the training samples over the different base models, which we call sample-based methods.

Another way to encourage diversification between different models is to measure the similarity between different base models with a suitable similarity measurement and to encourage the base models to be diversified in the training process; we summarize these as the optimization-based methods. The optimization of these methods can be written as

$$\min_{\{\mathbf{w}^{m}\}_{m=1}^{M}} \; \sum_{m=1}^{M} \mathcal{L}^{m}(\mathbf{w}^{m}; X^{m}) - \lambda\, \Phi(\mathbf{w}^{1}, \dots, \mathbf{w}^{M}), \qquad (53)$$

where $\Phi(\mathbf{w}^{1}, \dots, \mathbf{w}^{M})$ measures the diversification between the different base models and $\lambda$ is a tradeoff parameter. These methods are similar to the methods for the D-model in the former subsection.

Finally, some other methods train a large number of models and select the top-ranked models as the final ensemble, which we call the ranking-based methods. In the following, we summarize the different methods for diversifying multiple models from these three aspects in detail.

Methods              Measurements        Papers
Optimization-based   Divergence          [69, 26]
                     Renyi entropy       [76]
                     Cross entropy       [77, 78]
                     Cosine similarity   [66, 63]
                     Norm-based          [58]
                     NCL                 [82, 73, 74, 83]
                     Others              [64, 69, 70, 65, 66, 71, 72]
Sample-based         -                   [88, 89, 90, 27, 87]
Ranking-based        -                   [91, 23, 92]
TABLE III: Overview of the most frequently used diversification methods in D-models and the papers in which example measurements can be found.

IV-B1 Optimization-Based Methods

Optimization-based methods are among the most commonly used methods to diversify multiple models. They obtain multiple diversified models by optimizing a given objective function, as Eq. (53) shows, which includes a diversity measurement. Just as with the diversity of the D-model in the prior subsection, the main problem of these methods is to define diversity measurements that quantify the difference between models.

Many prior works [65, 64, 66, 67, 68, 48] have summarized pairwise diversity measurements, such as the Q-statistics measure [69, 48], the correlation coefficient measure [69, 48], the disagreement measure [70, 64, 127], the double-fault measure [71, 64, 127], the $\kappa$ statistic measure [72], the Kohavi-Wolpert variance [65, 127], the inter-rater agreement [65, 127], the generalized diversity [65], and the measure of "difficulty" [65, 127]. Recently, more measurements have been developed, including not only pairwise diversity measurements [26, 69, 66] but also measurements that account for the correlation among more than two models [58, 75, 74, 73]. This subsection summarizes these methods systematically.
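For readers who want to compute such statistics directly, the sketch below implements three of the classic pairwise measures (disagreement, double-fault, and the Q-statistic) from the 0/1 "oracle" correctness outputs of two classifiers; the toy arrays at the end are made up for illustration.

```python
import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """Classic pairwise ensemble-diversity statistics computed from the
    0/1 'oracle' outputs of two classifiers (1 = sample classified
    correctly), following the usual contingency-table definitions."""
    a = np.mean((correct_i == 1) & (correct_j == 1))  # both correct
    b = np.mean((correct_i == 1) & (correct_j == 0))
    c = np.mean((correct_i == 0) & (correct_j == 1))
    d = np.mean((correct_i == 0) & (correct_j == 0))  # both wrong
    disagreement = b + c
    double_fault = d
    q_stat = (a * d - b * c) / (a * d + b * c + 1e-12)
    return disagreement, double_fault, q_stat

# toy example: oracle outputs of two base classifiers on 8 samples
ci = np.array([1, 1, 0, 1, 0, 1, 1, 0])
cj = np.array([1, 0, 0, 1, 1, 1, 0, 0])
print(pairwise_diversity(ci, cj))
```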

Bayesian-based measurements. Similar to the D-model case, Bayesian methods can also be applied to D-models. Among these, divergence is the most commonly used measurement. As the former subsection shows, divergence measures the difference between distributions. To formulate a diversity-promoting term over the ensemble, the divergence is computed between the distributions obtained from the inference of the different models [26, 69]. The diversity-promoting term in Eq. (53) can then be formulated as

$\Gamma = -\sum_{m=1}^{M}\sum_{l \neq m} \sum_{i} p_{mi} \log \frac{p_{mi}}{p_{li}}$    (54)

where $p_{mi}$ represents the $i$-th entry in $P_m$, and $P_m$ denotes the distribution of the inference from the $m$-th model. This diversity term increases the difference between the inferences obtained from different models and thus encourages the learned models to be diversified.
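A minimal sketch of such a divergence-based term is given below, assuming the KL divergence as the concrete divergence and class-probability vectors as the model distributions; the example distributions are illustrative only.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def divergence_diversity(dists):
    """Diversity term in the spirit of Eq. (54): the negative sum of
    pairwise divergences between the predictive distributions P_m, so
    that minimising it pushes the distributions apart."""
    total = 0.0
    for m in range(len(dists)):
        for l in range(len(dists)):
            if l != m:
                total += kl(dists[m], dists[l])
    return -total

# class-probability outputs of three base models for one sample
P = [np.array([0.7, 0.2, 0.1]),
     np.array([0.6, 0.3, 0.1]),
     np.array([0.2, 0.2, 0.6])]
print(divergence_diversity(P))
```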

In addition to divergence measurements, the Renyi entropy, which measures the kernelized distances between the images of the samples and the center of the ensemble in a high-dimensional feature space, can also be used to encourage diversity among the learned models [76]. The Renyi entropy is calculated with a Gaussian kernel, and the diversity-promoting term in Eq. (53) can be formulated as

$\Gamma = \log \frac{1}{M^2} \sum_{m=1}^{M} \sum_{l=1}^{M} G_{\sigma}(Y_m, Y_l)$    (55)

where $\sigma$ is a positive value and $G_{\sigma}(\cdot,\cdot)$ represents the Gaussian kernel function, which can be calculated as

$G_{\sigma}(x, y) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big)$    (56)

where $d$ denotes the dimension of $x$. Compared with the divergence measurement, the Renyi-entropy measurement can be better fit to the machine learning model, since the difference can be adapted to different models through different values of $\sigma$. However, the Renyi entropy costs more computational resources, and the update of the ensemble becomes more complex.
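The following sketch illustrates a kernel-based term of this kind, assuming the quadratic Renyi "information potential" (the log of the mean pairwise Gaussian-kernel value) as a stand-in for Eq. (55); the exact form used in [76] may differ.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Eq. (56) between two output vectors."""
    d = x.shape[0]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)) / norm

def renyi_diversity(outputs, sigma=1.0):
    """Kernel-based term in the spirit of Eq. (55): the log of the mean
    pairwise kernel value; minimising it spreads the model outputs
    apart in the kernel-induced feature space."""
    M = len(outputs)
    total = sum(gaussian_kernel(outputs[m], outputs[l], sigma)
                for m in range(M) for l in range(M))
    return float(np.log(total / (M * M)))

Y = [np.array([0.7, 0.2, 0.1]),
     np.array([0.6, 0.3, 0.1]),
     np.array([0.2, 0.2, 0.6])]
print(renyi_diversity(Y, sigma=0.5))
```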

Another Bayesian-based measurement is the cross entropy [77, 78, 124]. The cross-entropy measurement uses the cross entropy between pairwise distributions to encourage two distributions to be dissimilar, so that different base models can provide more complementary information. The cross entropy between two base models can be calculated as

$H(P_m, P_l) = -\sum_{i} p_{mi} \log p_{li}$    (57)

where $P_m$ is the distribution of the inference of the $m$-th model and $p_{mi}$ is the probability of the sample belonging to the $i$-th class. According to the characteristics of the cross entropy and the requirement of the diversity regularization, the diversity-promoting regularization based on the cross entropy in Eq. (53) can be formulated as

$\Gamma = -\sum_{m=1}^{M}\sum_{l \neq m} H(P_m, P_l)$    (58)

The larger the cross entropy, the more different the two distributions are. Therefore, under the cross-entropy measurement, different models are diversified and provide more complementary information. Most of the former Bayesian methods promote diversity among the learned base models by calculating pairwise differences between them. However, these methods ignore the correlation among three or more base models.

To overcome this problem, [75] proposes a hierarchical fair competition-based parallel genetic algorithm (HFC-PGA) to increase the diversity among the component neural networks. The HFC-PGA uses the average of all the distributions of the ensemble to calculate the difference of each base model from the ensemble. The diversity term of HFC-PGA in Eq. (53) can be formulated as

(59)

It should be noted that the HFC-PGA takes advantage of the multiple correlation among the models. However, the HFC-PGA method uses fixed weights to calculate the mean of the distributions, and further to calculate the covariance of the multiple models, which usually cannot adapt to different tasks. This limits the performance of the diversity-promoting prior.

To address the shortcomings of the HFC-PGA, negative correlation learning (NCL) tries to reduce the covariance among all the models while keeping the variance and bias terms from increasing [82, 73, 74, 83]. NCL trains the base models simultaneously in a cooperative manner that decorrelates the individual errors. The penalty term can be designed in different ways depending on whether the models are trained sequentially or in parallel. [82] uses the following penalty to decorrelate the current learning model from all previously learned models:

$\Gamma_m = (f_m - d) \sum_{l=1}^{m-1} (f_l - d)$    (60)

where $d$ represents the target function, i.e. the desired output, and $f_m$ is the actual output of the $m$-th model. Besides, define $\bar{f} = \frac{1}{M}\sum_{l=1}^{M} f_l$. Then, the penalty term can also be defined to reduce the correlation mutually among all the learned models by using the actual output obtained from each model instead of the target function [73, 74, 68]:

$\Gamma_m = (f_m - \bar{f}) \sum_{l \neq m} (f_l - \bar{f})$    (61)

This measurement uses the covariance of the inference results obtained from the multiple models to reduce the correlation mutually among the learned models. Therefore, the learned models can be diversified. In addition, [84] further combines NCL with sparsity, where the sparsity is pursued purely by a norm regularization without considering the complementary characteristics of the available base models.
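A minimal sketch of the NCL penalty in the form of Eq. (61) for scalar-output models is shown below; the toy output matrix is illustrative only.

```python
import numpy as np

def ncl_penalty(outputs):
    """NCL penalty in the form of Eq. (61): for each model m,
    (f_m - f_bar) * sum_{l != m} (f_l - f_bar), summed over models and
    training samples (a sketch for scalar outputs)."""
    F = np.asarray(outputs, dtype=float)  # shape (M, N): M models, N samples
    f_bar = F.mean(axis=0)                # ensemble mean output per sample
    dev = F - f_bar                       # deviations from the ensemble mean
    penalty = 0.0
    for m in range(F.shape[0]):
        others = dev.sum(axis=0) - dev[m]     # sum_{l != m} (f_l - f_bar)
        penalty += np.sum(dev[m] * others)
    return float(penalty)

# outputs of three regressors on four samples
F = [[0.9, 0.1, 0.4, 0.7],
     [0.8, 0.2, 0.5, 0.6],
     [0.1, 0.9, 0.3, 0.2]]
print(ncl_penalty(F))
```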

Most of the Bayesian methods promote diversity in ensemble learning by increasing the difference between the probability distributions of the inferences of the different base models. Other methods promote diversity directly over the parameters of each base model.

Cosine similarity measurement. Different from the Bayesian methods, which promote diversity from the distribution view, [66] introduces cosine similarity measurements to calculate the difference between models from a geometric view. Generally, the diversity-promoting term in Eq. (53) can be written as

$\Gamma = \sum_{m=1}^{M}\sum_{l \neq m} \frac{|\langle w_m, w_l \rangle|}{\|w_m\|_2 \, \|w_l\|_2}$    (62)

where $w_m$ denotes the parameter vector of the $m$-th base model. In addition, as a special form of angular-based measurement, a special form of the inner product measurement, termed exclusivity, has been proposed by [63] to obtain diversified models. It can jointly suppress the training error of the ensemble and enhance the diversity between the base models. The diversity-promoting term based on exclusivity (see Eq. 37 for details) in Eq. (53) can be written as

$\Gamma = \sum_{m=1}^{M}\sum_{l \neq m} \mathcal{X}(w_m, w_l)$    (63)

where $\mathcal{X}(\cdot,\cdot)$ is the exclusivity defined in Eq. 37. These measurements encourage the pairwise models to be uncorrelated, so that each base model can provide more complementary information.
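The cosine-similarity penalty of Eq. (62) is straightforward to compute from the model parameter vectors, as the following sketch shows; the example vectors are made up for illustration.

```python
import numpy as np

def cosine_diversity_penalty(weights, eps=1e-12):
    """Pairwise cosine-similarity penalty in the spirit of Eq. (62):
    the sum of |cos(w_m, w_l)| over all pairs of base-model parameter
    vectors; minimising it pushes the models towards orthogonality."""
    W = [np.asarray(w, dtype=float) for w in weights]
    penalty = 0.0
    for m in range(len(W)):
        for l in range(m + 1, len(W)):
            cos = np.dot(W[m], W[l]) / (np.linalg.norm(W[m]) * np.linalg.norm(W[l]) + eps)
            penalty += abs(cos)
    return penalty

# parameter vectors of three base models
print(cosine_diversity_penalty([[1.0, 0.0, 1.0],
                                [0.0, 1.0, 1.0],
                                [1.0, 1.0, 0.0]]))
```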

Norm-based measurement. Just as in the former subsection, a norm regularization can also be used to diversify multiple models [58]. The diversity-promoting regularization based on this norm from Eq. (53) can be formulated as

(64)

This measurement uses the group-wise correlation between different base models and favors selecting diverse models residing in more groups.

Some other diversity measurements have been proposed for deep ensembles. [85] reveals that it may be better to ensemble many instead of all of the neural networks at hand. The paper develops an approach named Genetic Algorithm based Selective Ensemble (GASEN) to obtain a weight for each neural network; the deep ensemble is then formed based on the obtained weights. Moreover, [86] also encourages the diversity of a deep ensemble by defining a pairwise similarity between different terms.

These optimization-based methods utilize the correlation between different models and try to repulse the models from one another. The aim is to enforce the representations obtained from different models to be diversified, so that the base models can provide outputs reflecting multi-modal belief.

IV-B2 Sample-Based Methods

In addition to diversifying the ensemble from the optimization view, we can also diversify the models from the sample view. Generally, the training set is randomly divided into multiple subsets, and each base model is trained on one specific subset. However, there can be overlap between the representations of different base models. This may cause redundancy and even decrease the performance of ensemble learning, because dividing the whole training set reduces the number of training samples available to each model.

To overcome this problem and provide more complementary information from different models, [27] develops a method that divides the training samples into multiple subsets by assigning each training sample to the subset whose corresponding learned model shows the lowest prediction error. Therefore, each base model focuses on modeling the features of specific classes. Besides, clustering is another popular method to divide the training samples for different models [87]. Although diversifying the obtained subsets can make the multiple models provide more complementary information, the reduced number of training samples caused by dividing the whole training set has a negative effect on performance.

To overcome this problem, another way to enforce different models to be diversified is to assign each sample a specific weight [88]. By training different base models with different sample weights, each base model can focus on complementary information from the samples. The detailed steps in [88] are as follows (a code sketch is given after the list):

  • Initialize the weights of the training samples randomly, and train the first model with the given weights;

  • Revise the weight of each training sample based on the loss of the obtained model, and train the second model with the updated weights;

  • Repeat the weight update to train the remaining models with the same strategy.
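A boosting-style sketch of this reweighting scheme is given below; the `fit` and `predict` callables and the exponential weight update are illustrative assumptions, not the exact procedure of [88].

```python
import numpy as np

def train_reweighted_ensemble(fit, predict, X, y, n_models=3):
    """Boosting-style sketch of the reweighting scheme described above
    (hypothetical `fit`/`predict` callables): each base model is trained
    with per-sample weights that are increased on the samples the
    previous model got wrong, so later models focus on complementary samples."""
    n = len(y)
    weights = np.full(n, 1.0 / n)           # step 1: uniform initial weights
    models = []
    for _ in range(n_models):
        model = fit(X, y, sample_weight=weights)
        models.append(model)
        errors = (predict(model, X) != y).astype(float)
        weights = weights * np.exp(errors)   # step 2: up-weight misclassified samples
        weights = weights / weights.sum()    # renormalise before the next round
    return models
```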

The former methods take advantage of the labelled training samples to enforce the diversity of multiple models. Another method, namely Unlabeled Data to Enhance Ensemble (UDEED) [89], instead exploits unlabelled samples to promote the diversity of the ensemble. Unlike existing semi-supervised ensemble methods, where error-prone pseudo-labels are estimated for the unlabelled data to enlarge the labelled set and improve accuracy, UDEED works by maximizing the accuracy of the base models on the labelled data while maximizing the diversity among them on the unlabelled data. Besides, [90] combines different initializations, different training sets, and different feature subsets to encourage the diversity of the multiple models.

The methods in this subsection operate on the training sets to diversify the models. By training different models with different training samples, or with differently weighted samples, the models provide different information, and thus the ensemble as a whole covers a larger proportion of the information.

IV-B3 Ranking-Based Methods

Another kind of method to promote diversity among the obtained models is ranking-based. All the models are first ranked according to some criterion, and then the top-ranked ones are selected to form the final ensemble. Here, [91] focuses on pruning techniques based on forward/backward selection, since they allow a direct comparison with the simple estimation of accuracy of the different models.
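The following sketch illustrates greedy forward selection for ensemble pruning, using majority-vote accuracy on a validation set as the ranking criterion; the criterion and the toy predictions are assumptions for illustration rather than the exact procedure of [91].

```python
import numpy as np

def forward_select_ensemble(pred_matrix, y, k=5):
    """Greedy forward-selection sketch: `pred_matrix[m]` holds the class
    predictions of candidate model m on a validation set. At each step
    the model whose addition most improves the accuracy of the
    majority-vote ensemble is kept, until k models are selected."""
    preds = np.asarray(pred_matrix)
    selected = []
    remaining = list(range(len(pred_matrix)))
    while remaining and len(selected) < k:
        def vote_acc(subset):
            votes = preds[subset]                        # (|subset|, N)
            maj = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
            return np.mean(maj == y)
        best = max(remaining, key=lambda m: vote_acc(selected + [m]))
        selected.append(best)
        remaining.remove(best)
    return selected

# three candidate models, five validation samples with labels y
preds = [[0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 1, 1, 0, 0]]
y = np.array([0, 1, 1, 0, 1])
print(forward_select_ensemble(preds, y, k=2))
```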

Clustering can also be used as a ranking-based method to enforce the diversity of the multiple models [92]. In [92], the models are first clustered based on the similarity of their predictions, each cluster is then pruned to remove redundant models, and the remaining models of each cluster are finally combined as the base models.

In addition to the aforementioned methods, [23] provides multiple diversified models by selecting different sets of features. Through multi-scale processing or other tricks, each sample provides a large number of features, from which the top-ranked ones are chosen as the base features (see [23] for details). Each base feature is then used to train a specific model, and the final inference is obtained by combining these models.

In summary, this paper summarizes the diversification methods for D-models from three aspects: optimization-based methods, sample-based methods, and ranking-based methods. The most frequently encountered diversity methods are shown in Table III. Optimization-based methods encourage the multiple models to be diversified by imposing a diversity regularization between the base models while optimizing them. In contrast, sample-based methods obtain diversified models by training different models with specific training sets. Most prior works focus on diversifying ensemble learning from these two aspects, while ranking-based methods obtain multiple diversified models by choosing the top-ranked ones. Researchers can choose the specific method for D-models based on the requirements of the machine learning task.

V Inference Diversification

The former section summarized the methods to diversify different parameters within a model or across multiple base models. The D-model focuses on the diversification of the parameters within a model and improves the representational ability of the model itself, while D-models try to obtain multiple diversified base models, each of which focuses on modeling different features of the samples. These works improve the performance of the machine learning process in the modeling stage (see Fig. 2 for details). In addition to these methods, other works focus on obtaining multiple choices in the inference of the machine learning model. This section summarizes these diversification methods in the inference stage. To introduce the methods for inference diversification in detail, we choose the graph model as the representation of the machine learning models.

Fig. 7: Effects of inference diversification on the performance of the machine learning model. The results come from the prior work [31]. Through inference diversification, multiple diversified choices can be obtained. Then, with the help of other methods, such as the re-ranking in [31], the final solution can be obtained.

We consider a set of discrete random variables $y = (y_1, y_2, \dots, y_n)$, each taking values in a finite label set $\mathcal{L}$. Let $G = (V, E)$ describe a graph defined over these variables. For a subset of variables $A \subseteq V$, the set $\mathcal{L}_A$ denotes the Cartesian product of the label sets of the variables in $A$. Besides, denote $\theta_v(\cdot)$, $v \in V$, and $\theta_{uv}(\cdot, \cdot)$, $(u, v) \in E$, as the functions which define the energy at each node and edge for the labelling of the variables in their scope. The goal of MAP inference is to find the labelling of the variables that minimizes this real-valued energy function:

$y^{(1)} = \arg\min_{y} E(y) = \arg\min_{y} \sum_{v \in V} \theta_v(y_v) + \sum_{(u,v) \in E} \theta_{uv}(y_u, y_v)$    (65)

However, $y^{(1)}$ usually turns out to be sub-optimal due to the limited representational ability of the model and the limited training samples. Therefore, multiple choices, which can provide complementary information, are desired from the model for the specialist. Traditional methods to obtain multiple choices try to solve the following optimization:

$y^{(m)} = \arg\min_{y \notin \{y^{(1)}, \dots, y^{(m-1)}\}} E(y)$    (66)

However, the obtained second-best choice will typically be a one-pixel shifted version of the best [28]. In other words, the next best choices will almost certainly be located on the upper slope of the peak corresponding to the most confident detection, while other peaks may be ignored entirely.

To overcome this problem, many methods, such as diversified multiple choice learning (D-MCL), submodular diversification, M-modes, and M-NMS, have been developed for inference diversification in prior works. These methods try to diversify the obtained choices (so that they do not overlap under a user-defined criterion) while keeping a high score on the optimization term. Fig. 7 shows some image segmentation results from [31]. With inference diversification, we can obtain multiple diversified choices, which represent different optima of the data. Many other methods also focus on providing multiple diversified choices in the inference phase; in this work, we summarize the diversification in these works as inference diversification. The following subsections introduce these works in detail.

Measurements                     Papers
D-MCL                            [93, 29, 94, 31, 95, 96, 144]
Submodular for diversification   [97, 106, 107, 98, 99, 100, 101]
M-modes                          [30]
M-NMS                            [109, 32, 108, 120, 110, 111]
DPP                              [5, 151, 154]
TABLE IV: Overview of the most frequently used inference diversification methods and the papers in which example measurements can be found.

V-A Diversity-Promoting Multiple Choice Learning (D-MCL)

D-MCL tries to find a diverse set of highly probable solutions under a discrete probabilistic model. Given a dissimilarity function measuring the difference between pairwise choices, the formulation maximizes a linear combination of the probability of a solution and its dissimilarity to the previously chosen ones. Even if the MAP solution alone is of poor quality, a diverse set of highly probable hypotheses might still enable accurate predictions. The goal of D-MCL is to produce a diverse set of low-energy solutions.

The first method approaches the problem with a greedy algorithm, where the next choice is defined as the lowest-energy state with at least some minimum dissimilarity from the previously chosen choices. To do so, a dissimilarity function $\Delta(\cdot, \cdot)$ is defined first. In order to find the diverse, low-energy labellings $y^{(1)}, \dots, y^{(M)}$, the method proceeds by solving a sequence of problems of the form [29, 31, 95, 96, 144]

$y^{(m)} = \arg\min_{y} E(y) - \lambda \sum_{l=1}^{m-1} \Delta(y, y^{(l)})$    (67)

for $m = 2, \dots, M$, where $\lambda$ determines the trade-off between diversity and energy, $y^{(1)}$ is the MAP solution, and the function $\Delta(y, y')$ defines the diversity of two labellings. In other words, $\Delta(y, y')$ takes a large value if $y$ and $y'$ are diverse, and a small value otherwise. As a special case, the M-best MAP is obtained when $\Delta$ is a 0-1 dissimilarity (i.e. $\Delta(y, y') = \mathbb{1}[y \neq y']$). The method considers the pairwise dissimilarity between the obtained choices. More importantly, it is easy to understand and implement. However, under the greedy strategy, each new labelling is obtained based on the previously found solutions, ignoring the upcoming labellings [94].
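A minimal sketch of the greedy scheme in Eq. (67) is shown below for the special case of a model with unary (node) energies only and the Hamming dissimilarity; in that case the diversity term decomposes over nodes and can simply be folded into the unaries, so each step remains an exact per-node argmin. This is an illustrative simplification of the general formulation, which would call a MAP solver over the full graph.

```python
import numpy as np

def div_mbest_unary(unaries, M=3, lam=1.0):
    """Greedy diverse M-best (Eq. 67) for a model with only node
    energies `unaries` of shape (n_nodes, n_labels). Hamming
    dissimilarity to each previous solution is equivalent (up to a
    constant) to adding +lam to the energy of the previously chosen
    label at each node, so the MAP step stays a per-node argmin."""
    unaries = np.asarray(unaries, dtype=float).copy()
    n_nodes, _ = unaries.shape
    solutions = []
    for _ in range(M):
        y = unaries.argmin(axis=1)               # exact MAP for unary-only energy
        solutions.append(y)
        unaries[np.arange(n_nodes), y] += lam    # penalise reusing these labels
    return solutions

energy = [[0.1, 0.9, 0.8],
          [0.4, 0.3, 0.9],
          [0.2, 0.25, 0.9]]
for s in div_mbest_unary(energy, M=3, lam=0.5):
    print(s)
```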

In contrast to the former form, the second method formulates the $M$-best diverse problem as a single energy minimization problem [94]. Instead of the greedy sequential procedure in (67), this method infers all labellings jointly by minimizing

$\min_{y^{(1)}, \dots, y^{(M)}} \sum_{m=1}^{M} E(y^{(m)}) - \lambda \, \Delta^{M}(y^{(1)}, \dots, y^{(M)})$    (68)

where $\Delta^{M}$ defines the total diversity of any $M$ labellings. To achieve this, one first creates $M$ copies of the initial model. Three specific diversity measures are introduced. The split-diversity measure is written as the sum of pairwise diversities, i.e. those penalizing pairs of labellings [94]

$\Delta^{M}(y^{(1)}, \dots, y^{(M)}) = \sum_{m=1}^{M} \sum_{l=m+1}^{M} \Delta(y^{(m)}, y^{(l)})$    (69)

The node-diversity measure is defined as [94]

$\Delta^{M}(y^{(1)}, \dots, y^{(M)}) = \sum_{v \in V} \Delta_v(y^{(1)}_v, \dots, y^{(M)}_v)$    (70)

Finally, the special case combining the split-diversity and node-diversity measures is the node-split-diversity measure [94]

$\Delta^{M}(y^{(1)}, \dots, y^{(M)}) = \sum_{v \in V} \sum_{m=1}^{M} \sum_{l=m+1}^{M} \Delta_v(y^{(m)}_v, y^{(l)}_v)$    (71)

The D-MCL methods try to find multiple choices using a dissimilarity function. This helps the machine learning model provide choices that differ more from each other and hence show more diversity. However, the obtained choices may not be local extrema, and there may exist other choices that represent the objects better than the obtained ones.

V-B Submodular for Diversification

The problem of searching for a diverse but high-quality subset of items in a ground set has been studied in information retrieval [99], web search [98], social networks [103], sensor placement [104], the observation selection problem [102], the set cover problem [105], document summarization [100, 101], and others. In many of these works, an effective, theoretically grounded and practical tool for measuring the diversity of a set is the submodular set function. Submodularity is a property of diminishing marginal gains: a set function $F$ is submodular when $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$ for all $A \subseteq B$ and $s \notin B$. In addition, if $F$ is monotone, i.e. $F(A) \le F(B)$ whenever $A \subseteq B$, then a simple greedy algorithm that iteratively picks the element with the largest marginal gain and adds it to the current set $S$ achieves the best possible approximation bound of $(1 - 1/e)$ [107]. This result has had significant practical impact. Unfortunately, if the number of items is exponentially large, then even a single linear scan for greedy augmentation is infeasible.

Denote $\mathcal{Y}$ as the set of candidate choices. The diversification of a subset $S \subseteq \mathcal{Y}$ is measured by a monotone, nondecreasing and normalized submodular function $D(S)$. Then, the problem can be transformed into finding a subset of configurations that maximizes the combined score [97, 98, 99, 100, 101]

$S^* = \arg\max_{S \subseteq \mathcal{Y},\, |S| \le M} F(S) = \arg\max_{S} \; Q(S) + \lambda D(S)$    (72)

where $Q(S)$ measures the quality of the choices in $S$. The optimization can be solved by a greedy algorithm that starts out with $S^{(0)} = \emptyset$ and iteratively adds the best next choice:

$y^{(m)} = \arg\max_{y} F(S^{(m-1)} \cup \{y\}) - F(S^{(m-1)})$    (73)

where $S^{(m)} = S^{(m-1)} \cup \{y^{(m)}\}$. The selected set of choices $S^{(M)}$ is within a factor of $(1 - 1/e)$ of the optimal solution $S^*$:

$F(S^{(M)}) \ge (1 - 1/e)\, F(S^*)$    (74)

The submodular approach takes advantage of maximizing marginal gains to find multiple choices that provide the maximum amount of complementary information.
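The sketch below illustrates this greedy procedure, assuming a modular quality term plus a facility-location coverage term as the submodular diversity function $D$; both choices and the toy scores/similarities are illustrative assumptions.

```python
import numpy as np

def greedy_submodular(scores, sim, M=3, lam=1.0):
    """Greedy maximisation of a combined score as in Eqs. (72)-(73):
    F(S) = sum of quality scores in S + lam * facility-location coverage
    D(S) = sum_i max_{s in S} sim[i][s]. F is monotone submodular, so the
    greedy set enjoys the (1 - 1/e) guarantee of Eq. (74)."""
    sim = np.asarray(sim, dtype=float)
    n = len(scores)
    selected = []
    cover = np.zeros(n)                       # current coverage of every item
    for _ in range(min(M, n)):
        best, best_gain = None, -np.inf
        for y in range(n):
            if y in selected:
                continue
            div_gain = np.sum(np.maximum(sim[:, y] - cover, 0.0))
            gain = scores[y] + lam * div_gain  # marginal gain of Eq. (73)
            if gain > best_gain:
                best, best_gain = y, gain
        selected.append(best)
        cover = np.maximum(cover, sim[:, best])
    return selected

quality = [0.9, 0.85, 0.8, 0.3]
similarity = [[1.0, 0.95, 0.2, 0.1],
              [0.95, 1.0, 0.25, 0.1],
              [0.2, 0.25, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]]
print(greedy_submodular(quality, similarity, M=3, lam=1.0))
```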

V-C M-NMS

Another way to obtain multiple diversified choices is non-maximum suppression (M-NMS) [150, 111]. M-NMS is typically defined in an algorithmic way: starting from the MAP prediction, one goes through all labellings in order of increasing energy. A labelling becomes part of the predicted set if and only if it is more than $\epsilon$ away from the ones chosen before, where $\epsilon$ is a user-defined threshold that judges whether two labellings are similar. M-NMS guarantees that the choices are apart from each other, and it is typically implemented with a greedy algorithm [32, 120, 110, 111].

A simple greedy algorithm is used to instantiate multiple choices: search over the exponentially large space of choices for the maximally scoring choice, instantiate it, remove all overlapping choices, and repeat. The process is repeated until the score of the next-best choice falls below a threshold or M choices have been instantiated. However, a naive implementation of such an algorithm would take exponential time.
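Over an explicit (non-exponential) candidate pool, the greedy procedure is simple to write down, as the following sketch shows; the 1-D candidates, scores, and distance function are illustrative assumptions.

```python
import numpy as np

def greedy_nms(candidates, scores, distance, eps=0.5, M=5):
    """Greedy M-NMS sketch over an explicit candidate pool: visit the
    candidates in order of decreasing score and keep one only if its
    distance to every previously kept candidate exceeds the user
    threshold eps, stopping after M choices."""
    order = np.argsort(scores)[::-1]          # best-scoring candidate first
    kept = []
    for idx in order:
        if len(kept) >= M:
            break
        if all(distance(candidates[idx], candidates[j]) > eps for j in kept):
            kept.append(idx)
    return kept

# toy example: 1-D detections with confidence scores
cands = np.array([0.0, 0.1, 2.0, 2.05, 5.0])
confs = np.array([0.9, 0.8, 0.95, 0.6, 0.7])
dist = lambda a, b: abs(a - b)
print(greedy_nms(cands, confs, dist, eps=0.5, M=3))
```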

The M-NMS method finds the M best choices by discarding similar choices from the candidate set. In conclusion, D-MCL, the submodular approach, and M-NMS share a similar idea: all of them try to find the M best choices under a dissimilarity function, or the choices that provide the most complementary information.

V-D M-modes

Even though the former three methods guarantee that the obtained choices are apart from each other, the choices are typically not local extrema of the probability distribution. To guarantee both the local optimality and the diversification of the obtained choices simultaneously, the problem can be transformed into finding the M modes. The M-modes have multiple possible applications because they are intrinsically diverse.

For a non-negative integer $\delta$, define the $\delta$-neighborhood of a labelling $y$ as the set of labellings whose distance from $y$ is no more than $\delta$, where