The study of finite mixture models was initiated in the 1890s by Karl Pearson when he wanted to model multimodal densities. Research on finite mixture models has continued ever since, but its focus has changed over time as further areas of application were identified and available computational power increased. More recently, the natural connection between finite mixture models and classification methods, with their applications in fields like machine learning or credit scoring, has begun to be investigated in more detail. In these applications, it can often be assumed that the mixture models are simple in the sense that the component densities are known (ie there is no dependence on unknown parameters) but their weights are unknown.
In this note, we explore a specific property of simple finite mixture models, namely that their maximum likelihood (ML) estimates provide an exact fit if they exist, and some consequences of this property. In doing so, we extend the discussion of the case ’no independent estimate of unconditional default probability given’ from tasche2013art to the multi-class case and general probability spaces.
In Section 2, we present the result on the exact fit property in a general simple finite mixture model context. In Section 3, we discuss the consequences of this result for classification and quantification problems and compare the ML estimator with other estimators that were proposed in the literature. In Section 4, we revisit the cost quantification problem as introduced in forman2008quantifying as an application. In Section 5, we illustrate by a stylised example from mortgage risk management how the estimators discussed before can be deployed for the forecast of expected loss rates. Section 6 concludes the note.
2 The exact fit property
We discuss the properties of the ML estimator of the weights in a simple finite mixture model in a general setting which may formally be described as follows:
is a measure on . is a probability density with respect to . Write for the probability measure with . Write for the expectation with regard to .
In applications, the measure will often be a multi-dimensional Lebesgue measure or a counting measure. We study this problem:
Problem. Approximate by a mixture of probability -densities , ie for suitable with .
In the literature, most of the time a sample version of the problem (ie with as an empirical measure) is discussed. Often the component densities depend on parameters that have to be estimated in addition to the weights (see redner1984mixture , or fruhwirth2006finite and for more recent surveys schlattmann2009medical ). In this note, we consider the simple case where the component densities are assumed to be known and fixed. This is a standard assumption for classification (see MorenoTorres2012521 ) and quantification (see forman2008quantifying ) problems.
Common approaches to the approximation problem are
Least Squares. Determine or its weighted version (see hofer2013drift and hopkins2010method for recent applications to credit risk and text categorisation respectively). The main advantage of the least squares approach compared with other approaches comes from the fact that closed-form solutions are available.
Kullback-Leibler (KL) distance. Determine
(see duPlessis2014110 for a recent discussion).
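The two approaches can be compared on a finite sample space with the counting measure as reference measure. The following is a minimal sketch under that assumption, not the procedure of any of the cited papers. The names `g` (target density), `F` (matrix of known component densities, one per row) and `p` (weights) are illustrative choices. The least squares weights are obtained in closed form from the Lagrange conditions for the sum-to-one constraint; non-negativity of the weights is not enforced.

```python
import numpy as np


def kl_distance(g, F, p):
    """KL distance of the density g from the mixture sum_i p[i] * F[i]."""
    mix = p @ F
    return float(np.sum(g * np.log(g / mix)))


def least_squares_weights(g, F):
    """Minimise sum_x (g(x) - sum_i p_i f_i(x))^2 subject to sum_i p_i = 1.

    Closed form via the Lagrange (KKT) linear system. The constraint
    p_i >= 0 is NOT enforced in this sketch.
    """
    k = F.shape[0]
    A = F @ F.T                      # Gram matrix of the component densities
    b = F @ g
    kkt = np.block([[2 * A, np.ones((k, 1))],
                    [np.ones((1, k)), np.zeros((1, 1))]])
    rhs = np.concatenate([2 * b, [1.0]])
    return np.linalg.solve(kkt, rhs)[:k]


# Hypothetical example on a three-point space (counting measure).
F = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.3, 0.6]])      # known component densities, one per row
g = np.array([0.22, 0.30, 0.48])     # here g equals 0.3 * F[0] + 0.7 * F[1]
p_ls = least_squares_weights(g, F)
print(p_ls, kl_distance(g, F, p_ls))
```

When the target density is itself a mixture of the components, as in this example, both criteria are minimised by the same weights. The approaches differ when no exact mixture representation exists.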
In the following we examine a special property of the KL approximation which we call the exact fit property. First we note two alternative representations of the KL distance (assuming all integrals are well-defined and finite):
with , if .
The problem in (1) to maximise was studied in saerens2002adjusting (with which gives ML estimators). The authors suggested the Expectation-Maximisation (EM) algorithm involving conditional class probabilities for determining the maximum. This works well in general but sometimes suffers from very slow convergence.
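The EM iteration for the weights can be sketched as follows on a finite sample space. This is an illustration under stated assumptions rather than the exact formulation of saerens2002adjusting: `g` holds the target probabilities (for ML estimation, the empirical frequencies), `F` holds the known component densities row by row, and the function name and stopping rule are hypothetical.

```python
import numpy as np


def em_mixture_weights(g, F, p0, n_iter=5000, tol=1e-12):
    """EM iteration for the weights of a mixture with known components.

    g  : target probabilities on a finite space, shape (n,)
    F  : component densities, shape (k, n)
    p0 : strictly positive starting weights, shape (k,)
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        mix = p @ F                       # current mixture density
        post = p[:, None] * F / mix       # conditional class probabilities
        p_new = post @ g                  # average them under g
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p


# Hypothetical example: g is an exact mixture with weights (0.3, 0.7).
F = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.3, 0.6]])
g = np.array([0.22, 0.30, 0.48])
p_hat = em_mixture_weights(g, F, [0.5, 0.5])
print(p_hat)
```

Each step only averages conditional class probabilities, so the iterates stay non-negative and sum to one. The price is linear convergence, which can be very slow when the component densities overlap strongly.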
The ML version of (1) had before been studied in peters1976numerical . There the authors analysed the same iteration procedure which they stated, however, in terms of densities instead of conditional probabilities. In peters1976numerical , the iteration was derived differently to saerens2002adjusting , namely by studying the gradient of the likelihood function. We revisit the approach from peters1976numerical from a different angle by starting from (2).
There is, however, the complication that is not necessarily integrable. But this observation does not apply to the gradient with respect to . We therefore focus on investigating the gradient. With as in (2), let
From this we obtain for the gradient of
is well-defined and finite for :
We are now in a position to state the main result of this note.
for as defined in (4). Define . Then the following two statements hold:
, , and are probability densities with respect to such that and , .
Let , , additionally be linearly independent, ie
Assume that , are -densities of probability measures on and with is such that and , . Then it follows that and for as defined in a), for .
Hence for all .
With regard to b) observe that
Like in (5), for it can easily be shown that is well-defined and finite. Denote by the Jacobi matrix of the and . Let and be the transpose of . Then it holds that
In addition, by assumption on the linear independence of the , implies . Hence is negative definite. From this it follows by the mean value theorem that the solution of (6) is unique in .
If the KL distance on the lhs of (1) is well-defined and finite for all then under the assumptions of Theorem 2.1 b) there is a unique such that the KL distance of and is minimal. In addition, by Theorem 2.1 a), there are densities such that the KL distance of and is zero – this is the exact fit property of simple finite mixture models alluded to in the title of this note. ∎
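The exact fit property can be checked numerically on a finite sample space. In the sketch below the target `g` is deliberately not a mixture of the known components `F`. With ML weights `p`, the modified components `g * F[i] / (p @ F)` are again probability densities, they reproduce `g` exactly when mixed with the weights `p`, and their ratios equal the ratios of the original components. All names and numbers are illustrative assumptions, and the weights are computed with a plain EM loop for simplicity.

```python
import numpy as np


def ml_weights(g, F, n_iter=10000, tol=1e-13):
    """ML / KL-optimal weights by EM, started from equal weights."""
    p = np.full(F.shape[0], 1.0 / F.shape[0])
    for _ in range(n_iter):
        p_new = (p[:, None] * F / (p @ F)) @ g
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return p_new


def exact_fit_components(g, F, p):
    """Rows g_i = g * f_i / (sum_j p_j f_j), one per component."""
    return g * F / (p @ F)


# Hypothetical example: g is NOT a mixture of the rows of F.
F = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.3, 0.6]])
g = np.array([0.25, 0.40, 0.35])

p = ml_weights(g, F)
G = exact_fit_components(g, F, p)

print(p)                     # interior solution, both weights positive
print(G.sum(axis=1))         # each modified component sums to one
print(p @ G - g)             # mixture of modified components reproduces g
print(G[0] / G[1], F[0] / F[1])   # density ratios are unchanged
```

If the ML weights lie on the boundary, the stationarity condition fails for the components with zero weight, and the corresponding rows of `G` no longer sum to one.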
As mentioned in a), an interesting question for the application of Theorem 2.1 is how to find out whether or not there is a solution to (6) in for . The iteration suggested in peters1976numerical and saerens2002adjusting will correctly converge to a point on the boundary of if there is no solution in the interior (Theorem 2 of peters1976numerical). But convergence may be so slow that it may remain unclear whether a component of the limit is zero (and therefore the solution is on the boundary) or genuinely very small but positive. The straightforward Newton-Raphson approach for determining the maximum of defined by (3) may converge faster but may also become unstable for solutions close to or on the boundary of .
However, in case the observation made in b) suggests that the following Gauss-Seidel-type iteration works if the initial value with is sufficiently close to the solution (if any) of (6):
Assume that for some an approximate solution , , has been found.
For try successively to update by solving (9) with component playing the role of component in b) and as well as . If for all the sufficient and necessary condition for the updated to be in is not satisfied then stop – it then is likely that there is no solution to (6) in . Otherwise update where possible with the solution of (9), resulting in , and set
After step set if the algorithm has not been stopped by violation of the resolvability condition for (6).
Terminate the calculation when a suitable distance measure between successive , , is sufficiently small. ∎
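In the two-class case the search for an interior solution reduces to a one-dimensional root search, which may also serve as a building block for a component-wise iteration of the kind described above. The sketch below is an assumption-laden illustration, not the algorithm of this note: it works on a finite sample space, `g` holds the target probabilities, `X` holds the strictly positive ratio of the two component densities, and the gradient is taken to be the sum of `g * (X - 1) / (1 + p * (X - 1))`. Unless `X` equals one wherever `g` is positive, this function is strictly decreasing in `p`, so an interior root exists exactly when it is positive at 0 and negative at 1. The function returns `None` otherwise.

```python
import numpy as np


def two_class_weight(g, X, tol=1e-12):
    """Weight p in (0, 1) of the first component, or None if there is none.

    g : target probabilities on a finite space, shape (n,)
    X : ratio f_1 / f_2 of the component densities, strictly positive
    """
    # Sign check at the end points p = 0 and p = 1.
    if not (np.sum(g * X) > 1.0 and np.sum(g / X) > 1.0):
        return None

    def grad(p):
        return np.sum(g * (X - 1.0) / (1.0 + p * (X - 1.0)))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:          # bisection on a decreasing function
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical example with component densities (0.5, 0.3, 0.2) and (0.1, 0.3, 0.6).
X = np.array([0.5, 0.3, 0.2]) / np.array([0.1, 0.3, 0.6])
p_interior = two_class_weight(np.array([0.25, 0.40, 0.35]), X)
p_none = two_class_weight(np.array([0.05, 0.15, 0.80]), X)
print(p_interior, p_none)
```

Bisection is slower than Newton-Raphson but cannot leave the unit interval, which is the main reason for choosing it in this sketch.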
3 Application to quantification problems
Finite mixture models occur naturally in machine learning contexts. Specifically, in this note we consider the following context:
is a measurable space. For some , is a partition of . is the -field generated by and the , ie
is a probability measure on with for . is a probability measure on . Write for the expectation with respect to .
There is a measure on and -densities , such that
describes the training set of a classifier. On the training set, for each example both the features (expressed by) and the class (described by one of the ) are known. Note that implies .
describes the test set on which the classifier is deployed. On the test set only the features of the examples are known.
In mathematical terms, quantification might be described as the task to extend onto , based on properties observed on the training set, ie of . Basically, this means to estimate prior class probabilities (or prevalences) on the test dataset. In this note, the assumption is that . In the machine learning literature, this situation is called dataset shift (see MorenoTorres2012521 and the references therein).
Specifically, we consider the following two dataset shift types (according to MorenoTorres2012521 ):
Covariate shift. but for . In practice, this implies for most if not all of the .
Prior probability shift. for at least one but for , . This implies if , , are linearly independent.
In practice, it is likely to have for some both in case of covariate and prior probability shift. Hence, quantification in the sense of estimation of the is important both for covariate and prior probability shifts.
Under a covariate shift assumption, a natural estimator of is given by
Under prior probability shift, the choice of suitable estimators of is less obvious.
The following result generalises the Scaled Probability Average method of bella2010quantification to the multi-class case. It allows prior probability shift estimates of the prior class probabilities to be derived from covariate shift estimates as given by (10).
Under Assumption 2, suppose that there are , , with such that can be represented as a simple finite mixture as follows:
Then it follows that
where the matrix is given by
Immediate from (11a) and the definition of conditional expectation.∎
For practical purposes, the representation of in the first row of (11c) is more useful because most of the time no exact estimate of will be available. As a consequence there might be a non-zero difference between the values of the expectations in the first and second row of (11c) respectively. In contrast to the second row, for the derivation of the rhs of the first row of (11c), however, no use of the specific properties of conditional expectations has been made.
We start from the first element of the vector-equation (11b) and apply some algebraic manipulations:
From now the result follows. ∎
By Corollary 1, for binary classification problems the covariate shift estimator (10) underestimates the change in the class probabilities if the dataset shift is not a covariate shift but a prior probability shift. See Section 2.1 of forman2008quantifying for a similar result for the Classify & Count estimator. However, according to (12) the difference between the covariate shift estimator and the true prior probability is the smaller the greater the discriminatory power (as measured by the generalised ) of the classifier is. Moreover, both (12) and (11b) provide closed-form solutions for , , that transform the covariate shift estimates into correct estimates under the prior probability shift assumption. In the following the estimators defined this way are called Scaled Probability Average estimators.
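The relationship between the covariate shift and Scaled Probability Average estimators can be illustrated with sample averages. The sketch below assumes that `post_train` and `post_test` contain the conditional class probabilities produced by a classifier fitted on the training set (one row per example, one column per class) and that `y_train` contains the training class labels. These names, and the construction of the matrix `M` from class-wise averages, are illustrative assumptions rather than notation from this note.

```python
import numpy as np


def covariate_shift_estimate(post_test):
    """Average the training-set posteriors over the test examples."""
    return post_test.mean(axis=0)


def scaled_probability_average(post_train, y_train, post_test):
    """Solve M q = covariate shift estimate for the test priors q.

    M[i, j] is the mean posterior of class i among training examples
    whose true class is j.
    """
    k = post_train.shape[1]
    M = np.column_stack(
        [post_train[y_train == j].mean(axis=0) for j in range(k)]
    )
    return np.linalg.solve(M, covariate_shift_estimate(post_test))


# Hypothetical data: three feature values, two classes, training priors (0.5, 0.5).
F = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.3, 0.6]])                   # class-conditional feature densities
joint = 0.5 * F
post_by_x = (joint / joint.sum(axis=0)).T        # posterior per feature value

x_train = np.repeat([0, 1, 2, 0, 1, 2], [50, 30, 20, 10, 30, 60])
y_train = np.repeat([0, 1], [100, 100])
x_test = np.repeat([0, 1, 2], [22, 30, 48])      # generated with priors (0.3, 0.7)

post_train, post_test = post_by_x[x_train], post_by_x[x_test]
q_cs = covariate_shift_estimate(post_test)
q_spa = scaled_probability_average(post_train, y_train, post_test)
print(q_cs, q_spa)
```

In this constructed example the test set differs from the training set by a pure prior probability shift. The covariate shift estimate then lies between the training and the test prior, whereas solving the linear system recovers the test prior.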
Corollary 1 on the relationship between covariate shift and Scaled Probability Average estimates in the binary classification case can be generalised to the relationship between covariate shift and KL distance estimates.
Suppose that is a density of with respect to some measure on . need not equal from Assumption 2, and we can choose and if there is no other candidate. By Theorem 2.1 a) then there are -densities , such that and .
How is the KL distance estimator (or ML estimator in case of being the empirical measure) of the prior class probabilities, defined by the solution of (6), in general related to the covariate shift and Scaled Probability Average estimators?
Suppose the test dataset differs from the training dataset by a prior probability shift with positive class probabilities, ie (11a) applies with . Under Assumption 2 and a mild linear independence condition on the ratios of the densities , then Theorem 2.1 implies that the KL distance and Scaled Probability Average estimators give the same results. Observe that in the context given by Assumption 2 the variables from Theorem 2.1 can be directly defined as , or, equivalently by
) of the density ratios might be preferable in particular if the classifier involved has been built by binary or multinomial logistic regression.
In general, by Theorem 2.1 the result of applying the KL distance estimator to the test feature distribution , in the quantification problem context described by Assumption 2, is a representation of as a mixture of distributions whose density ratios are the same as the density ratios of the class feature distributions , .
Hence the KL distance estimator makes sense under an assumption of identical density ratios in the training and test datasets. On the one hand this assumption is similar to the assumption of identical conditional class probabilities in the covariate shift assumption but does not depend in any way on the training set prior class probabilities. This is in contrast to the covariate shift assumption where implicitly a ’memory effect’ with regard to the training set prior class probabilities is accepted.
On the other hand the ’identical density ratios’ assumption is weaker than the ’identical densities’ assumption (the former is implied by the latter) which is part of the prior probability assumption.
One possible description of ’identical density ratios’ and the related KL distance estimator is that ’identical density ratios’ generalises ’identical densities’ in such a way that exact fit of the test set feature distribution is achieved (which by Theorem 2.1 is not always possible). It therefore is fair to say that ’identical density ratios’ is closer to ’identical densities’ than to ’identical conditional class probabilities’.
Given training data with full information (indicated by the -field in Assumption 2) and test data with information only on the features but not on the classes (-field in Assumption 2), it is not possible to decide whether the covariate shift or the identical density ratios assumption is more appropriate for the data, because both assumptions result in exact fit of the test set feature distribution but in general give quite different estimates of the test set prior class probabilities (see Corollary 2 and Section 5). Only if Eq. (6) has no solution with positive components can it be said that ’identical density ratios’ does not properly describe the test data because then there is no exact fit of the test set feature distribution. In that case ’covariate shift’ might not be appropriate either but at least it delivers a mathematically consistent model of the data.
If both ’covariate shift’ and ’identical density ratios’ provide consistent models (ie exact fit of the test set feature distribution), non-mathematical considerations of causality (are features caused by class or is class caused by features?) may help in choosing the more suitable assumption. See fawcett2005response for a detailed discussion of this issue.
4 Cost quantification
’Cost quantification’ is explained in forman2008quantifying as follows: “The second form of the quantification task is for a common situation in business where a cost or value attribute is associated with each case. For example, a customer support log has a database field to record the amount of time spent to resolve each individual issue, or the total monetary cost of parts and labor used to fix the customer’s problem. The cost quantification task for machine learning: given a limited training set with class labels, induce a cost quantifier that takes an unlabeled test set as input and returns its best estimate of the total cost associated with each class. In other words, return the subtotal of cost values for each class.”
Careful reading of Section 4.2 of forman2008quantifying reveals that the favourite solutions for cost quantification presented by the author essentially apply only to the case where the cost attributes are constant on the classes (see footnote 1).
Footnote 1: Only then the as used in Equations (4) and (5) of forman2008quantifying stand for the same conditional expectations. The same observation applies to .
Cost quantification can be more generally treated under Assumption 2 of this note. Denote by the (random) cost associated with an example. According to the description of cost quantification quoted above then is actually a feature of the example and, therefore, may be considered an -measurable random variable under Assumption 2.
In mathematical terms, the objective of cost quantification is the estimation of the total expected cost per class (see footnote 2) , .
Footnote 2: For a set, its indicator function takes the value 1 on the elements of the set and the value 0 on all other elements.
Covariate-shift assumption. Under this assumption we obtain
This gives a probability-weighted version of the ’Classify & Total’ estimator of forman2008quantifying .
’Constant density ratios’ assumption. Let , . If (6) (with and ) has a solution , , , then we can estimate the conditional class probabilities by
From this, it follows that
Obviously, the accuracy of the estimates on the rhs of both (14) and (15) strongly depends on the accuracy of the estimates of and the density ratios on the training set. Accurate estimates of these quantities, in general, will make full use of the information in the -field (ie the information available at the time of estimation) and, because of the -measurability of , of the cost feature . In order to achieve this, must be used as an explanatory variable when the relationship between the classes and the features as reflected in is estimated (eg by a regression approach). As one-dimensional densities are relatively easy to estimate it might make sense to deploy (14) and (15) with the choice .
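Both cost estimators can be sketched as sample averages over the test set. The code below is an illustration under assumptions, not a transcription of (14) and (15): `post` holds the conditional class probabilities from the training set evaluated on the test examples, `cost` holds the cost feature, and `train_priors` and `test_priors` are the prior class probabilities on the training and test sets. Under ’constant density ratios’ the training posteriors are rescaled by the ratio of test to training priors and renormalised, which is the adjustment known from saerens2002adjusting. All names are hypothetical.

```python
import numpy as np


def adjust_posteriors(post, train_priors, test_priors):
    """Re-weight training posteriors for new priors (identical density ratios)."""
    w = post * (np.asarray(test_priors) / np.asarray(train_priors))
    return w / w.sum(axis=1, keepdims=True)


def expected_cost_per_class(cost, post):
    """Sample mean of cost times class probability, one entry per class."""
    return (cost[:, None] * post).mean(axis=0)


# Hypothetical test set with three examples and two classes.
post = np.array([[0.80, 0.20],
                 [0.50, 0.50],
                 [0.25, 0.75]])
cost = np.array([10.0, 20.0, 40.0])

# Covariate shift: use the training posteriors unchanged.
cost_cs = expected_cost_per_class(cost, post)

# Constant density ratios: adjust the posteriors to the estimated test priors.
post_adj = adjust_posteriors(post, train_priors=[0.5, 0.5], test_priors=[0.3, 0.7])
cost_dr = expected_cost_per_class(cost, post_adj)
print(cost_cs, cost_dr)
```

In both cases the class-wise figures add up to the average cost on the test set, so the estimators only differ in how the observed total is attributed to the classes. Multiplying by the number of test examples gives subtotals instead of per-example averages.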
Note that this conclusion, at first glance, seems to contradict Section 5.3.1 of forman2008quantifying. There it is recommended that “the cost attribute almost never be given as a predictive input feature to the classifier”. Actually, with regard to the cost quantifiers suggested in forman2008quantifying, this recommendation is reasonable because the main component of the quantifiers as stated in (6) of forman2008quantifying is correctly specified only if there is no dependence of the cost attribute and the classifier. Not using as an explanatory variable, however, does not necessarily imply that the dependence between and the classifier is weak. Indeed, if the classifier has any predictive power and is on average different on the different classes of examples then there must be a non-zero correlation between the cost attribute and the output of the classifier.
5 Loss rates estimation with mixture model methods
Theorem 2.1 and the results of Section 3 have obvious applications to the problem of forecasting portfolio-wide default rates in portfolios of rated or scored borrowers. The forecast portfolio-wide default rate may be interpreted in an individual sense as a single borrower’s unconditional probability of default. But there is also an interpretation in a collective sense as the forecast total proportion of defaulting borrowers.
The statements of Theorem 2.1 and Assumption 2 are agnostic in the sense of not suggesting an individual or collective interpretation of the models under inspection. But by explaining Assumption 2 in terms of a classifier and the examples to which it is applied we have suggested an individual interpretation of the assumption.
However, there is no need to adopt this perspective on Assumption 2 and the results of Section 3. Instead of interpreting as an individual example’s probability of belonging to class 1 we could as well describe as the proportion of a mass or substance that has property 1. If we do so we switch from an interpretation of probability spaces in terms of likelihoods associated with individuals to an interpretation in terms of proportions of parts of masses or substances.
|LTV band||Last year||This year|
|% of exposure||of this % lost||% of exposure||of this % lost|
|More than 100%||?|
|Between 90% and 100%||?|
|Between 70% and 90%||?|
|Between 50% and 70%||?|
|Less than 50%||?|
Let us look at a retail mortgage portfolio as an illustrative example. Suppose that each mortgage has a loan-to-value (LTV) associated with it which indicates how well the mortgage loan is secured by the pledged property. Mortgage providers typically report their exposures and losses in tables that provide this information per LTV band without specifying numbers or percentages of borrowers involved. Table 1 shows a stylised example of what such a report might look like.
This portfolio description fits well into the framework described by Assumption 2. Choose events ’More than 100% LTV’, ’Between 90% and 100% LTV’ and so on. Then the -field is generated by the finite partition . Similarly, choose ’lost’ and ’not lost’. The measure describes last year’s observations, specifies the distribution of the exposure over the LTV bands as observed at the beginning of this year – which is the forecast period. We can then try and replace the question marks in Table 1 by deploying the estimators discussed in Section 3. Table 2 shows the results.
|LTV band||Covariate shift (% lost)||Scaled prob. av. (% lost)||KL distance (% lost)|
|More than 100%||15.0||34.6||35.4|
|Between 90% and 100%||2.2||6.3||6.5|
|Between 70% and 90%||1.1||3.2||3.3|
|Between 50% and 70%||0.5||1.5||1.5|
|Less than 50%||0.2||0.6||0.6|
Clearly, the estimates under the prior probability shift assumptions are much more sensitive to changes of the features (ie LTV bands) distribution than the estimate under the covariate shift assumption. Thus the theoretical results of Corollaries 1 and 2 are confirmed. But recall that there is no right or wrong here as all the numbers in Table 1 are purely fictitious. Nonetheless, we could conclude that in applications with unclear causalities (like for credit risk measurement) it might make sense to compute both covariate shift estimates and ML estimates (more suitable under a prior probability shift assumption) in order to gauge the possible range of outcomes.
6 Conclusions
We have revisited the maximum likelihood estimator (or more generally Kullback-Leibler (KL) distance estimator) of the component weights in simple finite mixture models. We have found that (if all weights of the estimate are positive) it enjoys an exact fit property which makes it even more attractive with regard to mathematical consistency. We have suggested a Gauss-Seidel-type approach to the calculation of the KL distance estimator that triggers an alarm if there is no solution with all components positive (which would indicate that the number of modelled classes may be reduced).
In the context of two-class quantification problems, as a consequence of the exact fit property we have shown theoretically and by example that the straightforward ’covariate shift’ estimator of the prior class probabilities may seriously underestimate the change of the prior probabilities if the covariate shift assumption is wrong and instead a prior probability shift has occurred. This underestimation can be corrected by the Scaled Probability Average approach, which we have generalised to the multi-class case, or by the KL distance estimator.
As an application example, we then have discussed cost quantification, ie the attribution of total cost to classes on the basis of characterising features when class membership is unknown. In addition, we have illustrated by example that the mixture model approach to quantification is not restricted to the forecast of prior probabilities but can also be deployed for forecasting loss rates.
- (1) Bella, A., Ferri, C., Hernandez-Orallo, J., Ramírez-Quintana, M.: Quantification via probability estimators. In: Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 737–742. IEEE (2010)
- (2) Du Plessis, M., Sugiyama, M.: Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks 50, 110–119 (2014)
- (3) Fawcett, T., Flach, P.: A response to Webb and Ting’s On the Application of ROC Analysis to Predict Classification Performance under Varying Class Distributions. Machine Learning 58(1), 33–38 (2005)
- (4) Forman, G.: Quantifying counts and costs via classification. Data Mining and Knowledge Discovery 17(2), 164–206 (2008)
- (5) Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. Springer (2006)
- (6) Hofer, V., Krempl, G.: Drift mining in data: A framework for addressing drift in classification. Computational Statistics & Data Analysis 57(1), 377–391 (2013)
- (7) Hopkins, D., King, G.: A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science 54(1), 229–247 (2010)
- (8) Moreno-Torres, J., Raeder, T., Alaiz-Rodriguez, R., Chawla, N., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recognition 45(1), 521–530 (2012)
- (9) Peters, C., Coberly, W.: The numerical evaluation of the maximum-likelihood estimate of mixture proportions. Communications in Statistics – Theory and Methods 5(12), 1127–1135 (1976)
- (10) Redner, R., Walker, H.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2), 195–239 (1984)
- (11) Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1), 21–41 (2002)
- (12) Schlattmann, P.: Medical applications of finite mixture models. Springer (2009)
- (13) Tasche, D.: The art of probability-of-default curve calibration. Journal of Credit Risk 9(4), 63–103 (2013)
- (14) Titterington, D., Smith, A., Makov, U.: Statistical analysis of finite mixture distributions. Wiley New York (1985)