Mixture Density Network Estimation of Continuous Variable Maximum Likelihood Using Discrete Training Samples

03/24/2021 ∙ by Charles Burton, et al. ∙ The University of Texas at Austin 0

Mixture Density Networks (MDNs) can be used to generate probability density functions of model parameters θ given a set of observables 𝐱. In some applications, training data are available only for discrete values of a continuous parameter θ. In such situations a number of performance-limiting issues arise which can result in biased estimates. We demonstrate the usage of MDNs for parameter estimation, discuss the origins of the biases, and propose a corrective method for each issue.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

A frequent goal in the analysis of particle physics data is the estimation of an underlying physical parameter from observed data, in situations where the parameter is not itself observable, but alters the distribution of observables. A typical approach is to use maximum likelihood estimation (MLE) to extract values of the underlying parameters and their statistical uncertainties from the kinematic distributions in the observed data. In order to do this, the probability density function (PDF) of the observable(s), as a function of the underlying parameters, must be available. These are frequently available only from Monte Carlo simulation, not from analytic predictions. In these simulations, the value of the parameter is varied across a range of possible values; the derived PDFs (determined from histograms or other kernel density estimators, or approximated with analytic functions) are then compared to the distributions in the observed data to estimate the parameter. A number of machine learning-based approaches to parameter estimation trained using simulated data have been discussed in the literature, which can utilize multidimensional data to improve the uncertainty of estimates (see for example Refs. 

[3, 4, 1]).

If one has the PDF for the observables available for any given value of the parameter, the MLE can be computed. However the PDF may be complex and difficult to approximate, especially if there are multiple observables with correlations. An alternative procedure is to directly approximate the likelihood function. Solving a closely-related task is the goal of Mixture Density Networks [2], which approximate a conditional probability density of the parameters from the input features as a sum of basis PDFs, whose parameters are predicted by the network. The training of an MDN predicts a PDF for the underlying parameter given an observation, which is related to the likelihood by an irrelevant constant multiplier. A trained MDN is therefore able to build up the total likelihood function for a dataset on an event-by-event basis. MDNs used in this mode permit the straightforward use of multidimensional observables when performing parameter estimation.

Training of MDNs typically assumes that all input parameter values are equally likely to be sampled in the training dataset. In particular if the parameter is continuous, random values of the parameter will be thrown for each event. In essence, this is equivalent to specifying a flat prior for the application of Bayes’ theorem that translates the PDF that the MDN learns into the likelihood function. However, such datasets may not be readily available, for various reasons: for example, sharing Monte Carlo samples with analyses using other techniques that use histograms at specific parameter values to build up templates, or computational difficulties with changing the parameter values in the Monte Carlo generator for every event. In this work, we discuss issues which arise when using samples with discrete parameter values, rather than a continuous parameter distribution, to estimate parameters with MDNs.

A related set of issues arises when a restricted range of parameter values is used in the training, in which the trained network is reluctant to predict values near the boundaries of the range due to the relative lack of training events in that local vicinity. This occurs even when training with a continuous parameter distribution and affects tasks other than likelihood estimation, such as simple regression. As it will appear in any practical application of an MDN to estimate a physical parameter, we also discuss it here.

Figure 1: For some applications, training data is only available at discrete values of (e.g. at each of the red markers). For every example discussed here, the training samples consist of 10 template data sets with parameter equally spaced between 0 and 1.

The aim of the paper is to demonstrate the construction of MDNs for continuous parameters from datasets populated only with discrete parameter values and to discuss pitfalls that can occur in the training. This paper is structured as follows. The basic concept of Mixture Density Networks is introduced and the application to likelihood estimation discussed. Potential biases in the network training are discussed and procedures to mitigate them are proposed, in the context of specific examples. The performance of trained MDNs is demonstrated. Finally, limitations of the technique and avenues for future improvement are discussed.

Ii Mixture Density Networks as Likelihood Approximators

A Mixture Density Network is a neural network where the output is the parameters of a


, defined to be a properly normalized PDF. The parameters can, for example, be the mean, width, and relative contributions of a (fixed) number of Gaussian distributions. In general, the goal is to obtain an estimate of a PDF.

We will specifically use MDNs here with a training set that includes parameters and observables , where we are interested in estimating the conditional probability in this set. By using Bayes’ theorem we can convert this probability to an estimate of the likelihood function .

In general, the MDN represents the target PDF as a weighted sum of generic basis PDF functions ,


where and are the -dependent coefficients and parameters of the basis functions, which are predicted by the network. Typically the conditions and are imposed, for example through the softmax function. In principle, the basis functions can be any set of PDFs. For many processes, a useful choice is a mixture of Gaussian functions. Using that choice, the neural network’s output is a set of coefficients .

To train the network, the PDF is evaluated for each point in the training data set . The product of these values is a likelihood function (unrelated to the likelihood we want the MDN to predict)



as the loss function, the training of the network results in parameters which maximize the likelihood in Eq. (



Let us briefly mention two features which will be discussed in further detail in the rest of the paper. First, the basis functions may not be constructed with any knowledge of the range of parameters in the training set (for example, Gaussian basis functions have infinite support and therefore will always predict a non-zero probability outside the range of the training data). Proper training of an MDN requires that the PDFs be properly normalized over the range of the training data; biases are introduced if this is not the case. Second, the translation of the learned conditional probability to a likelihood involves Bayes’ theorem:


The PDF (which is determined by the entire training set) is actually not relevant when computing the MLE for a fixed set of observations, because it just contributes a constant multiplicative factor to the likelihood that does not depend on . However it is necessary to consider during the training in a fashion which will be discussed. The PDF depends on the distribution of training data and does matter during the MLE computation.

Iii Density of Parameter Points

We are interested in using a training set with discrete values of to learn . We will refer to the subset of training data produced with a specific parameter choice as a template. As long as the observed values can reasonably be produced by multiple values of

, and the number of basis functions in the MDN is kept reasonably low, the MDN will naturally be forced to interpolate between parameter points. It is important that there not be too much freedom in the MDN functions; if there are equal numbers of training parameter values and Gaussian functions in the MDN, for example, the network can essentially collapse the Gaussian functions to delta functions at each parameter value and reduce the loss function without limit.

The application of Bayes’ theorem in Eq. (4) involves , and clearly simplifies if can be assumed to be flat. To achieve this, the locations of the templates in parameter space should be chosen to ensure an equal density of training points across the entire range. This implies equal spacing of the parameter values and that the parameter range needs to be considered to extend somewhat beyond the extremal values of the parameters actually used to create templates. For example, suppose we have one parameter and the parameter range is selected as , and 10 templates are to be used. To ensure equal and unbiased coverage of the parameter range, they should be placed at , as shown in Fig. 1. If the extra gap is not left at the edges (for example, if eleven templates are used, at ), then the training data are effectively overrepresented at the extremal values of and there will be a bias in the training.

Essentially, the chosen parameter values should lie at the centers of equal-width histogram bins that extend from the lowest to highest values of . This can be generalized to multidimensional parameter spaces.

Iv PDF Normalization

For a given data point , the PDF of Eq. (1) must be properly normalized across the selected range of for MDN training to work properly. If this is not done, parameter values near the edges of the range will be penalized because the PDF predicts parameter values to occur outside of the range, and these are never encountered in the training data. In the case of a single real-valued parameter , one must require that

This can be achieved by dividing the PDF predicted by the MDN by the integral of the PDF over the parameter range. Since, during training, the PDF is never evaluated outside of this range, this results in the proper normalization.

Practically speaking, if one uses a predefined PDF from a numerical computing package in order to build the basis functions , it is easier to apply this renormalization procedure for each function than to do it on the sum. This has the benefit that the condition

is still valid. In the one-dimensional parameter case, if the cumulative distribution function (CDF) is available,

This effect is not specific to training with discrete parameter choices, and will generally occur in regions where observed data could be compatible with parameters outside the training range.

V Edge Bias from Training on Templates

We now discuss a bias which arises from the discreteness of the input parameter values. To achieve a flat , we need to consider the input range of parameters to be broader than just the range where training data are to be found. However the loss function is only evaluated on the training data, and so the optimizer can “cheat” by overpredicting values of that are compatible with a broad range of observations and underpredicting extremal values of that are not represented in the training data. The symptom of this is that the conditional probability density implied by the MDN output ,

does not integrate to one for all values of , but rather is smaller than one at extremal values of and greater than one in the interior of the range.

The size of this effect depends on how much the true distributions overlap, and how distinguishable the distributions of for the endpoint values of differ from those for the interior values. Fig. 2 shows examples of distributions which will demonstrate minor and extreme bias.

Figure 2: Examples of templates (training data for discrete parameter values) of an observable for different values of a parameter . When the observable templates are distinguishable (left), the edge bias (see text) is minor. When the templates are harder to distinguish over the range of (right), the correction for the edge bias is critical.

v.1 Demonstration With Functional Fit

It should be emphasized that the observed bias is not unique to the Mixture Density Network method. Rather, it is simply a result of maximizing the likelihood Eq. (2) composed of the conditional probability for a finite number of templates. To illustrate this we will show the existence and correction of the bias in a simple functional fit.

Consider a conditional PDF which, given some parameter value , produces a Gaussian distribution of , a univariate variable: this could correspond to a “true” value and an “observed” value which is subject to resolution effects. For this example we choose the distribution


with . We choose 10 equally-spaced values of between and at which to generate training points. Since the width of each template is much broader than the total range of , the 10 templates are not very distinguishable from one another. A few of the templates are shown on the right-hand side of Fig. 2.

Next, we attempt to reconstruct the joint probability distribution

by performing an unbinned maximum likelihood fit of a three-parameter function to the generated data,

where is a normalization factor. The best-fit values of the function parameters are not the ones used to generate the data.

True values 1 0 4
0.17 0.41 1.65
Unbiased 1.03 -0.02 3.99
Table 1: The maximum likelihood estimators for the parameters of the function , when determined using discrete parameter inputs, do not match the intended model parameters due to edge effects. Adding a correction term during likelihood optimization resolves this, and the results match the desired values.

To understand the source of the problem, it is helpful to consider the joint PDF, which is visualized in Fig. 3. We know the “true” value we want to reproduce, which is the product of the distribution in Eq. (5) with a flat prior on the range . We can then overlay given by the best-fit values of , , and . We see that the overall shapes of the two are similar, but is more concentrated at intermediate values of . Integrating along vertical lines, the value is clearly less than , while the values are constant for all . While maintaining the total probability in the joint PDF constant, the fit to has recognized that a better value of the likelihood can be achieved by removing probability from the regions near and (where there are no data points to enter the likelihood computation) and placing that likelihood near , a parameter value that is consistent with almost all values of . In other words, the function is not consistent with the presumption of a flat prior .

Figure 3: Joint PDFs for the example described in Section V.1. The “true” joint PDF is shown in black. Fitting to the training dataset with discrete parameter values results in the blue PDF, which is different from the desired form. Adding additional constraints to the fit requiring to be 1 at and yields the orange dashed line PDF instead, which is a good approximation of the true distribution.

Having seen this, we can now propose a solution: we require the fit to maximize the likelihood in Eq. (2) while also satisfying the following constraint:


This constraint prevents the optimizer from improving the loss function by implicitly modifying . In practice, the constraint is applied for a few values of , for example and in this case, with the hope that the smoothness of the functional forms will force the constraint to apply elsewhere. Applying this constraint, we redo the fit. The best-fit parameters obtained — the bottom row of Table 1 — are now consistent with the true values used to generate the data. Accordingly, the joint PDF of this fit (in Fig. 3) shows no bias towards . Any remaining disagreement is attributable to statistical fluctuations in the estimation of .

v.2 MDN With One Observable

Now we will demonstrate how this correction is applied with a Mixture Density Network. We construct a neural network using PyTorch 

[7] with a single input , and outputs which are parameters of a function which models the conditional likelihood

. There is a single hidden layer with an adjustable number of nodes. Because of the intentional simplicity of this data,the number of nodes in this hidden layer was generally kept between 2 and 5, and we model the likelihood with just a single Gaussian function. Therefore, the two output nodes of the network are just the mean and standard deviation of the likelihood function. The Adam minimizer 

[5] is used to adjust the network parameters to minimize the loss function.

Figure 4: Four of the ten template histograms used as training data for the single-input MDN of Section V.2.

A large number of values were generated at each of 10 equally-spaced discrete values of from Gaussian distributions with and . The network is trained by minimizing the loss function described in Section II. We find that this produces biased results. Due to the construction of the training data, we expect that the MDN should predict that is a truncated Gaussian (nonzero only on the range ) with and . In Fig. 5, the MDN output from the naive training for input values of are shown; we see that the network underestimates the width of the likelihood function and biases the mean towards .

Figure 5: Parameters of the likelihood curves predicted by the MDNs of Section V.2 for various values of the input . Without correction, the network will underestimate the width of the likelihood curve, and predict a maximum-likelihood value biased towards the middle of the sampled range.

v.2.1 1D Network Correction and Validation

The loss function of the network is then modified by adding an extra term,




for some set of parameter values . With respect to Fig. 3, this extra term consists the standard deviation of integrals of the joint PDF along vertical slices in . It ensures that remains constant, discouraging a bias towards intermediate values of . The joint probability is estimated from the MDN output and the probability distribution

over the training sample, which is estimated using histograms. The new hyperparameter

is tuned to balance the effectiveness of the additional term while avoiding the introduction of numerical instability into the minimization. With this new loss function, the outputs of the neural network — the coefficients of the Gaussian likelihood function — correctly match expectations, as seen in Fig. 5.

Next, we demonstrate the ability of the network to reconstruct the underlying parameter given a set of observations. Each measurement of the quantity will be denoted . For each , the trained network outputs a likelihood function . To calculate the combined likelihood of an entire set of measurements, , we take the product of the individual likelihoods,


The maximum likelihood estimator is defined as


A likelihood-ratio test can be used to determine the statistical uncertainty of the estimator . An example of the likelihood curves generated for this trained network is given in Fig. 6.

Figure 6: For a fixed value of the parameter , three values are generated. The likelihoods are obtained from the output of the trained MDN of Section V.2. The maximum likelihood estimator is extracted from the product of these likelihoods according to Eq. (10) and shows a result statistically compatible with the expected value of 0.2. (For visualization, all likelihoods are offset to have a minimum of 0.)

To test the accuracy and precision of the network, we choose 25 equally-spaced test values in the range . For each , we generate a set of 10,000 measurements . The pseudodata is passed though the network for each , the likelihoods are calculated, and a is found. The uncertainty in is estimated using Wilks’ theorem [8] (searching for values of that raise by 0.5). If the network is accurate and unbiased, we should find statistical agreement between and . Indeed, we found this to be true with . The results for the unbiased network are shown in Fig. 7.

Figure 7: Comparison of the true value of the parameter and the MLE estimate determined for a dataset generated at that over the range . The errors shown are those obtained by applying Wilks’ theorem to the likelihood curve. No bias is observed and the error estimates are .

v.3 Demonstration With 2D Network

Next, we consider a more advanced usage of MDNs. We consider a system with two inputs, . The value of is sensitive to an underlying parameter , but only for small values of ; when is large, there is no sensitivity. This is chosen to demonstrate how this neural network technique can be an advantageous approach to parameter estimation: the MDN will automatically learn when observables offer meaningful information on parameters and when they do not.

v.3.1 2D Toy Model

Two components comprise the model used to generate 2D pseudodata. The first is events with a small value. These events have an value which depends on the unknown parameter . In the toy model, this component is modeled by a Gaussian in . The mean of this Gaussian varies with . In , it is modeled by a Gaussian with a low mean.

In the other component, the value of is large, and the

value is not related to the unknown parameter. This component is modeled with a Gamma distribution in

, and a Gaussian with a high mean in . The relative fraction of the two components is the final model parameter.

Written out, the 2D PDF takes the form


This PDF is shown for the two extremal values of in Fig. 8.

Figure 8: Examples of PDFs and generated samples for different values of for the model of Section V.3.

v.3.2 2D Network Structure, Correction, and Validation

The network is built in the following manner. First, we note that there are two inputs. Accordingly, there will be two input nodes. Next, one should note from the contours of in Fig. 8, for a sample with large , the value of has no predictive power of . Restated simply, the red and blue contours overlap entirely on the upper half of the plot. However, for small values of , the variable can distinguish different values of ; the red and blue contours are separated on the lower half of the plot. Two hidden layers are used. The output of the network is the coefficients of a mixture of Gaussian distributions. The number of Gaussians in the mixture is a hyperparameter which may be tuned to best match the particular structure of some data. In this example, the data are bimodal, so it is reasonable to expect two Gaussians to be expected. It was confirmed by trial-and-error that a mixture of two Gaussians was adequate to model this data, since it produced the correct linear response in the validation of the network. Therefore, for this network, there are six output nodes (with one normalization constraint) for a total of five independent values. The correction term presented in Eq. (7) is applied as well; the term is estimated using a 2D histogram. The training data consist of 100,000 samples generated according to at each of 10 equally-spaced discrete values of .

Figure 9: Likelihood curves predicted by the trained MDN for three possible observations in the model of Section V.3. When is small (blue and green curves), the value of gives sensitivity to the model parameter . When is large (orange curve), there is very little sensitivity to and the likelihood curve is flat. (For visualization, all likelihoods are offset to have a minimum of 0.)

Fig. 9 demonstrates that the trained network is able to learn that the sensitivity of individual observations to the value of the model parameter varies with ; the predicted likelihood curve is much flatter in regions where all parameter values give similar distributions.

To validate the network’s performance, we generate test data sets for various test parameter values . As in the one observable case, each point is passed through the network and a likelihood function is calculated. We then find the maximum likelihood estimator of . An accurate network should, within statistical uncertainty, reproduce the line . Indeed, Fig. 10 shows that the network is accurate and unbiased, and that it provides a reasonable uncertainty estimate.

Figure 10: Maximum likelihood estimator for parameter value , and associated statistical uncertainties estimated using Wilks’ theorem, for samples generated at various values of true . No bias is observed and the error estimation works well.

Vi Method Limitations

In the previous section, we have demonstrated necessary corrections for networks taking one or two values for each observation. In principle, this method could work with an arbitrarily high number of input observables. However, the current correction technique has a curse-of-dimensionality problem in that computing the correction term

requires an estimate of over the full training data set. When determined using histograms with bins per axis, observables will require an -bin histogram to be reasonably well populated. Other techniques such as generic kernel density estimation [6] or -nearest neighbors run into the same issues eventually, although at differing rates, and the best option should be explored in each specific situation. The dimensionality issue is alleviated somewhat by the fact that only one function ( over the entire training dataset) needs to be estimated, unlike methods which need to generate templates separately at each generated parameter point.

Another limitation concerns the spacing of training points. While we have demonstrated that MDNs can be trained with discrete templates of data and still interpolate properly to the full continuous parameter space, it has been necessary to use templates from parameters which are uniformly distributed, to enforce that

is flat. If the parameter values are not equally spaced, this would correspond to a non-flat . In principle weights could be used during training to correct for this effect, but reconstructing from the distribution of is a density estimation problem and it is not clear if there is an optimal estimation method for the required if there are only a limited number of available.

Vii Conclusions

Mixture Density Networks can robustly approximate likelihood functions and enable the estimation of parameters from data, even if the training data is only available at discrete values. This method permits one to proceed directly from (possibly multiple) observables to a likelihood function without having to perform an intermediate density estimation step for various parameter values as would be required by many parameter estimation techniques. The MDN technique can be applied even with training data that provide a discrete set of parameter points, provided that the points are spaced evenly and certain corrections and constraints are applied turing the training of the network.


  • [1] A. Andreassen, S. Hsu, B. Nachman, N. Suaysom, and A. Suresh (2021) Parameter estimation using neural networks in the presence of detector effects. Phys. Rev. D 103 (3), pp. 036001. External Links: 2010.03569, Document Cited by: §I.
  • [2] C. M. Bishop (1994) Mixture density networks. Note: Aston University Neural Computing Research Group Report NCRG/94/004 Cited by: §I.
  • [3] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez (2018) A Guide to Constraining Effective Field Theories with Machine Learning. Phys. Rev. D 98 (5), pp. 052004. External Links: 1805.00020, Document Cited by: §I.
  • [4] F. Flesher, K. Fraser, C. Hutchison, B. Ostdiek, and M. D. Schwartz (2020-11) Parameter Inference from Event Ensembles and the Top-Quark Mass. External Links: 2011.04666 Cited by: §I.
  • [5] D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §V.2.
  • [6] E. Parzen (1962-09) On estimation of a probability density function and mode. Ann. Math. Statist. 33 (3), pp. 1065–1076. External Links: Document, Link Cited by: §VI.
  • [7] A. Paszke et al. (2019)

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §V.2.
  • [8] S. S. Wilks (1938) The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. The Annals of Mathematical Statistics 9 (1), pp. 60 – 62. External Links: Document, Link Cited by: §V.2.1.