Introduction
Probability density estimation for high dimensional data is difficult. Kernel density estimation (KDE) for approximating probability density function (PDF) is a commonly used nonparametric technique which has been studied both empirically and theoretically. However, this technique usually has poor performance when estimating probability density for data over dimensions, and suffers for finding optimal kernel functions and the corresponding bandwidths (i.e. smoothing parameters). In addition, data distribution tends to be sparser and sparser as the number of data dimensions increases, causing KDE, as well as other local methods, to give zero probability estimates for most of data space. Therefore, the need of data amount for the KDE method to give accurate density estimation grows exponentially with data dimension. Also, it does not scale well for ”Big Data” due to its data storage requirement and computational burden for probability density inference. Similar issues go to other nonparametric PDF techniques such as Knearestneighbor.
PDF estimation based on Gaussian mixtures is one of many parametric techniques nicely studied. Its strong assumption (mixture of Gaussians) makes it difficult to estimate probability density for data sampled from highly nonGaussian distribution (e.g. uniform distributions, highly skewed distributions) well. Similarly, all other parametric techniques also assume PDF of data has a certain functional form, and the functional form of the true PDF is usually unknown, causing most of the techniques rarely produce good estimated density for complex high dimensional data.
Estimating probability density using neural networks (NN) was proposed a long time ago. These methods usually give satisfying performance for low dimensional data, and become ineffective or computationallyinfeasible for very high dimensional data. [1] and [2] propose a few methods to address this issue. [1] proposes two methods called Stochastic Learning of the Cumulative (SLC) and Smooth Interpolation of the Cumulative (SIC). Both of these methods make use of Cumulative Distribution Function (CDF). SLC only works for 1dimensional data and SLC is able to deal with multidimensional data. But in the paper the authors only mention inferring PDF by differentiating the approximated CDF and no solution or algorithms for the computation of higher order derivatives provided. Such computation usually has no explicit formulas and hard to approximate numerically. [2] propose a method using a 3layer neural net and training it with a modified loss function which includes a term approximating integration numerically over the region of interest, making the output of model valid probability estimates. However, the integration part of this method suffers from the difficulty of numerical integration for high dimensional data.
Recent development of deep learning provides new algorithms for learning PDF by applying deep network structure. [3] discusses NADE and its deep extension DeepNADE, which make use of onedimensional conditionals, to estimate probability density for multidimensional binary data . For realvalued data, its real value version RNADE requires a parametric functional form be selected for PDF estimation. [4] propose a PDF estimation algorithm called MADE, which combines conditional masking trick inspired from the chain rule of probability and deep neural networks to enable a neural net model to learn valid probability estimates. But MADE is designed for binary data only and does not work for real valued data.
CDF2PDF is a method of PDF estimation by approximating CDF. The original idea of it was previously proposed in [1] called SIC. However, SIC requires additional hyperparameter tunning, and no algorithms for computing higher order derivative from a trained NN are provided in [1]. CDF2PDF improves SIC by avoiding the timeconsuming hyperparameter tuning part and enabling higher order derivative computation to be done in polynomial time. Experiments of this method for onedimensional data shows promising results.
Smooth Interpolation of the Cumulative (SIC)
SIC[1] uses a multilayer neural network which is trained to output the estimates of CDF. Given , let be the true PDF, and be the corresponding CDF defined as
.
Given data , to training a NN, needs to be estimated from finite data samples. [1] provides two estimation methods for high dimensional data. One is to sample input space uniformly and determine targets by the formula below:
where if for all , and otherwise. Another option is to use the data points to generate targets by the formula similar to the previous one:
After generating training data by either one of these estimation methods, a multilayer neural network , where stands for model parameters, is trained with a loss function for regression. Since there are many choices for the loss function, we chose the square loss function in this work.
Once training is done and approximates well, the PDF can be directly inferred from by the formula:
Issues with SIC
In [1], the following loss function is used for training.
where is a positive hyperparameter that controls the penalty of nonmonotonicity of the NN via the second term of the loss function, is a small positive number, and
is a ddimensional column vector of
’s.Since needs to be tuned, extra work such as crossvalidation or gridsearch are needed. Too large could cause the trained NN to produce oversmoothed estimates and too small cannot guarantee the trained is nondecreasing over input space where contains very few or even no training data points, causing invalid probability density estimates later (i.e. negative values).
In addition, as stated before, no algorithm for computation of was proposed in [1]. The authors did mention to estimate by numerical differentiation. However, numerally approximating higher order derivatives is inefficient and numerically unstable, thus not feasible in practice.
Advantages of CDF2PDF Method
In [1], the authors stated a few advantages of SIC:

Approximating the CDF is less sensitive to statistical fluctuations than approximating the PDF directly from data, for the integral operator, or summation operation in practice, plays the role of a regularizer.

Theoretically, the convergence rate of the PDF estimates inferred from the approximated CDF to a smooth density function which has bounded higher derivatives is faster than the convergence rate of KDE methods.

Unlike KDE methods, there are no smoothing parameters that need to be determined for density estimation.
These advantages are also for CDF2PDF, and additionally it has more advantages below:

CDF2PDF removes the nonmonotonicity penalty term from the loss function and meanwhile guarantees the estimated CDF function is monotonically increasing and close to be monotonically nondecreasing. So the only hyperparameter in this method is the number of hidden nodes.

After the neural network learns the CDF from data, the corresponding PDF can be efficiently computed from model parameters in polynomial time.
Cdf2pdf
Neural Networks as Increasing Function Approximators
In CDF2PDF method, only a single hidden layer neural net with sigmoid activation functions in the hidden layer and one linear output node is used for CDF estimation. A traditional single hidden layer neural net of this architecture has the following functional form:
where is the input vector, is the th row of , H is the number of hidden nodes, are the weight vector and bias of the hidden layer, and
are the weight matrix and the bias vector of the input layer.
When the model parameters travel in the parameter space during training, values of and can be any real numbers. Previously, the second penalty term in the loss function for SIC is designed to regularize these values for making the trained neural net nondecreasing. In CDF2PDF, instead of using this penalty term, we directly enforce the neural net to be an increasing function. The functional form can be written as follows:
We called this type of neural networks Monotonic Increasing Neural Networks (MINN). The rational of this modification is that any composition function of increasing functions is again increasing. Knowing that any sigmoid functions are monotonic increasing, as long as parameter of
and are positive, must be an increasing function. Therefore, we replace and with natural exponential functions of them. This modification removes the need of nonmonotonicity penalty term in the loss function, and serve as a kind of regularization. Experiments below for dimensional data will show effectiveness of MINN’s approximation ability of nondecreasing functions.Inference of PDF
After training is done, the corresponding estimated PDF can be inferred by:
where is the th derivative of the sigmoid function. It indicates that to estimate a PDF of dimensional data from the MINN, the th derivative of the sigmoid function must be computed.
Computation of Higher Order Derivative
Closed form computation for higher order derivative of sigmoid functions exists for its special differential structure. For example, the second and third derivatives of tanh can be written as , . Thus, the th derivative of the tanh function is a th degree polynomial of . All we need is to determine the coefficients of the polynomial. An efficient algorithm is being developed and the mathematical details for it is included in [5].
FineTuning for CDF of a NonSmooth PDF
In practice, not all CDF can be approximated well by smoothed functions such as sigmoid. It is hard for sigmoid functions to approximate strong nonsmoothness, e.g. uniform distributions. To deal with this, after training a MINN, we replace all tanh activation functions with an modified version of tanh functions as follows:
where and
For edgecutting learning of , we define , which is a model parameter to be optimized during finetuning phrase.
Noticing that all higher order derivatives (second and any larger order) of is , such modification on the activation functions will not cause any issue for inferring high dimensional PDF.
Experiments
Density Estimation for Multimodal Distributions
We first test the proposed method for estimating the following density function called BartSimpson distribution.
where denotes a Normal density function with mean
. data points are sampled from . The following plots are histogram and empirical CDF of the sampled data.A MINN with hidden nodes is selected for training. Adadelta[6] is chosen as the stochastic optimization algorithm for training with batch size of and epochs. The training data are generated by (1). Figure 1 shows approximated CDF vs emprical CDF, and Figure 2 compares inferred PDFs from MINN and KDE with true PDF.
The PDF from KDE is estimated with Gaussian kernels and bandwidth selected by crossvalidation. As we can see, CDF2PDF method is able to give better PDF estimates without any effort to hyperparameter tuning, compared to the probability density estimates given by KDE.
Codes of this experiment correspond to StochasticCDFLearning_demo.m in CDF2PDF_1d_codes folder.
Density Estimation for Mixture of Distributions
Next, we estimate the following density function of mixed distributions.
where denotes a uniform distribution defined on interval . contains two uniform distributions, which make not differentiable, thus not smooth, at boundaries of the uniforms. data points are sampled from , and training data are generated by (1).
In this experiment, a MINN with hidden nodes is used. The optimization algorithm is again Adaddelta with batch size of and epochs same as the previous experiment. Below are two plots showing approximated CDF vs emprical CDF and inferred PDF vs true PDF.
To make the MINN adaptive to the nonsmoothness of data, all activation functions in the trained MINN are replaced with from , and retrain the MINN along with ’s in . The second training phase aims to fine tuning the nonlinear property of activation functions by ’s. epochs of training is performed for the fine tuning phrase.
Figure 5 shows the effectiveness of the finetuning phrase. In Figure 5, the PDF estimates given by KDE uses Gaussian kernels with bandwidth determined by crossvalidation. The PDF estimates of MINN after finetuning is much closer to the true PDF and less fluctuate over the intervals of the two uniforms compared to the KDE estimates.
Codes of this experiment correspond to FineTuning4CDFLearning_demo.m in CDF2PDF_1d_codes folder.
Discussion
Probability density estimation is fundamental and essential in most fields of science and engineering. Directly estimating the density function from data is an illposed problem and naturally hard for high dimensional data due to ”the curse of dimensionality”. Based on the experimental results for
dimensional data, it seems promising to extend the CDF2PDF method to high dimensional data. Potential solutions to the major issues for computing the estimated high dimensional PDF from a trained MINN are given in this article. For theoretical discussion about convergence of estimated PDF from CDF to the true PDF, please refer to [1].Current & Future Work
Currently, I am working on developing efficient algorithms for fast generation of training data and higher order differentiation. The CDF2PDF method proposed in this article is only a prototype. It is definitely possible to build a deep MINN while keeping higher order derivative computation feasible, which belongs to future work.
References

MagdonIsmail, M., Atiya, A. (2002). Density Estimation and Random Variate Generation using Multilayer Networks. IEEE Transactions on Neural Networks, 13(3), 497520.

Likas, A. (2001). Probability Density Estimation using Artificial Neural Networks. Computer physics communications, 135(2), 167175.

Uria, B., Côté, M. A., Gregor, K., Murray, I., Larochelle, H. (2016). Neural Autoregressive Distribution Estimation.
Journal of Machine Learning Research
, 17(205), 137. 
Germain, M., Gregor, K., Murray, I., Larochelle, H. (2015, February). MADE: Masked Autoencoder for Distribution Estimation. In
ICML (pp. 881889). 
Boyadzhiev, K. N. (2009). Derivative Polynomials for Tanh, Tan, Sech and Sec in Explicit form. arXiv preprint arXiv:0903.0117.