# Deep PDF: Probabilistic Surface Optimization and Density Estimation

A probability density function (pdf) encodes the entire stochastic knowledge about data distribution, where data may represent stochastic observations in robotics, transition state pairs in reinforcement learning or any other empirically acquired modality. Inferring data pdf is of prime importance, allowing to analyze various model hypotheses and perform smart decision making. However, most density estimation techniques are limited in their representation expressiveness to specific kernel type or predetermined distribution family, and have other restrictions. For example, kernel density estimation (KDE) methods require meticulous parameter search and are extremely slow at querying new points. In this paper we present a novel non-parametric density estimation approach, DeepPDF, that uses a neural network to approximate a target pdf given samples from thereof. Such a representation provides high inference accuracy for a wide range of target pdfs using a relatively simple network structure, making our method highly statistically robust. This is done via a new stochastic optimization algorithm, Probabilistic Surface Optimization (PSO), that turns to advantage the stochastic nature of sample points in order to force network output to be identical to the output of a target pdf. Once trained, query point evaluation can be efficiently done in DeepPDF by a simple network forward pass, with linear complexity in the number of query points. Moreover, the PSO algorithm is capable of inferring the frequency of data samples and may also be used in other statistical tasks such as conditional estimation and distribution transformation. We compare the derived approach with KDE methods showing its superior performance and accuracy.

## Authors

• 4 publications
• 14 publications
• ### Deconvolutional Density Network: Free-Form Conditional Density Estimation

Conditional density estimation is the task of estimating the probability...
05/29/2021 ∙ by Bing Chen, et al. ∙ 0

• ### General Probabilistic Surface Optimization and Log Density Estimation

In this paper we contribute a novel algorithm family, which generalizes ...
03/25/2019 ∙ by Dmitry Kopitkov, et al. ∙ 0

• ### Universal Approximation of Edge Density in Large Graphs

In this paper, we present a novel way to summarize the structure of larg...
08/06/2015 ∙ by Marc Boullé, et al. ∙ 0

• ### Kernel density estimation-based sampling for neural network classification

Imbalanced data occurs in a wide range of scenarios. The skewed distribu...
10/25/2021 ∙ by Firuz Kamalov, et al. ∙ 0

• ### Robust Stochastic Configuration Networks with Kernel Density Estimation

Neural networks have been widely used as predictive models to fit data d...
02/15/2017 ∙ by Dianhui Wang, et al. ∙ 0

• ### Design optimization of stochastic complex systems via iterative density estimation

Reliability-based design optimization (RBDO) provides a rational and sou...
02/29/2020 ∙ by Wang-Sheng Liu, et al. ∙ 0

• ### Temporal Density Extrapolation using a Dynamic Basis Approach

Density estimation is a versatile technique underlying many data mining ...
06/03/2019 ∙ by Georg Krempl, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Density estimation is a fundamental statistical problem which is essential for many scientific fields and application domains such as robotics, AI, economics, probabilistic planning and inference. Today’s most statistically robust techniques that work on a wide range of probabilistic distributions are non-parametric where arguably the leading is kernel density estimation (KDE). Though a large amount of research was done on these techniques, they still are computationally expensive at both estimation and query stages. They require searching for hyper-parameters like bandwidth and kernel type, and their expressiveness in representing the estimated probability density function (pdf) is high but still limited to the selected kernel. On the other hand, neural networks (NN) are known for their high flexibility and ability to approximate any function. Moreover, recently strong frameworks (e.g. Tensorflow_url ; Pytorch_url ; Caffe_url ) were developed that allow fast and sophisticated training of NNs using GPUs.

With this motivation, in this paper we present a novel method to perform non-parametric density estimation using Deep Learning (DL) where we exploit the approximation power of NN in full. Unlike any existing density estimation techniques, our approach, named DeepPDF, estimates a target pdf from sampled data using our novel stochastic optimization that forces NN’s output to be identical to the output of a target pdf function, thereby approximating the latter by a NN.

The KDE method is one of the most expressive methods that is flexible enough to approximate a broad diapason of target pdfs. It represents the estimated pdf through a sum of kernel functions, with kernel for each sample point. A naive implementation of such an approach has complexity of , where is number of sample points and is number of query points. Although faster KDE extensions exist (e.g. OBrien16csda ), they are still relatively slow. Specifically, query stage of state-of-the-art fastKDE method OBrien16csda has quasi-linear time complexity . Additionally, most KDE methods require to specify kernel type and bandwidth for better performance. Optimal values of these parameters are pdf-specific and parameter fine-tuning needs to be done. Moreover, the target pdf may have a complex structure with different parts requiring different optimal KDE parameters. In such case, standard KDE methods will return a statistically sub-optimal solution.

In contrast, the presented herein DeepPDF approach (see also Figure (a)a) allows to approximate the target pdf via a NN of arbitrary architecture and takes advantage of approximation robustness embedded within DL methods. Our approach trains a NN using a Probabilistic Surface Optimization (PSO). The PSO, being one of our main contributions, turns to advantage the stochastic nature of each data sample point in order to push surface that is represented by the NN. Likewise, this surface is pushed at points sampled from another known distribution. We use specially discovered stochastic pdf loss to push the neural surface in the opposite directions at data samples and at known distribution samples, where balance (minimum of pdf loss) is achieved only when the NN surface is identical to the target density. Such a balance is a substantial attribute of pdf loss which implicitly forces the estimator function (the surface) to have total integral of 1 and to be non-negative, releasing us from intricate dealing with these constraints explicitly. Moreover, training DeepPDF can be done in batch mode where at each iteration only a small number of sample points is used, lowering the requirement on memory size. DeepPDF is highly expressive and able to use facilities and theory developed lately for efficient and accurate DL training, including fast-converging optimizers, regularization techniques and execution on GPU cards. The pdf value at query point can be calculated by just computing output of trained NN with query point as input, which can be done very fast with complexity only depending on the NN’s size, in contrast to quasi-linear complexity of fastKDE query stage.

To summarize, our main contributions in this paper are as follows: (a) we develop a Probabilistic Surface Optimization (PSO) and pdf loss that force any approximator function to converge to the target density; (b) we use PSO to approximate a target density using a NN; (c) we train a NN via PSO in batch mode using exponential learning rate decay for better convergence, naming the entire algorithm DeepPDF; and (d) we analyze different DL aspects of DeepPDF. Additionally, for simplicity the derivations are gathered together under Appendix, at the end of the paper.

## 2 Related work

Statistical density estimation may be divided into two different branches - parametric and non-parametric. Parametric methods assume data to come from a probability distribution of a specific family, and infer parameters of that family, for example via minimizing the negative log-probability of data samples. Non-parametric approaches are distribution-free in the sense that they do not take any assumption over data population a priori. Instead they infer the distribution density totally from data.

The main advantage of parametric approaches is their statistical efficiency. Given the assumption of a specific distribution family is correct, parametric methods will produce more accurate density estimation for the same number of samples compared to non-parametric techniques. However, as was also observed in our previous work Kopitkov18iros

, in case the assumption is not entirely valid for a given population, the estimation accuracy will be poor, making parametric methods not statistically robust. For example, one of the most flexible distribution families is a Gaussian Mixture Model (GMM)

McLachlan88 . One of its structure parameters is the number of mixtures. Using a high number of mixtures, it can represent multi-modal populations with high accuracy. Yet, in case the real unknown distribution has even higher number of modes, or sometimes even an infinite number (see Section 5.2 and Figure (c)c), the performance of a GMM will be low.

To handle the problem of unknown number of mixture components in parametric techniques, Bayesian statistics can be applied to model a prior over parameters of the chosen family. Models such as Dirichlet process mixture (DPM) and specifically Dirichlet process Gaussian mixture model (DPGMM)

Antoniak74aos ; Sethuraman82sdtrt ; Gorur10jcst

can represent uncertainty about the learned distribution parameters and as such can be viewed as infinite mixture models. Although such hierarchical models are more statistically robust (flexible), they still require to manually select a base distribution for DPM, limiting its robustness. Also, Bayesian inference applied in these techniques is more theoretically intricate and computationally expensive

MacEachern98 .

On the other hand, non-parametric approaches can infer distributions of arbitrary form. Methods such as data histogram and kernel density estimation (KDE) Scott15book ; Silverman18 use frequencies of different points within data samples in order to conclude how a population pdf looks like. In general, these methods require more samples but also provide more robust estimation by not taking any prior assumptions. Both histogram and KDE require selection of parameters - bin width for histogram and kernel type/bandwidth for KDE which in many cases require manual parameter search Silverman18 . Although an automatic parameter deduction was proposed in several studies Duong05sjs ; Heidenreich13asta ; OBrien16csda , it is typically computationally expensive and its performance is not always optimal. Moreover, one of the major weaknesses of the KDE technique is its time complexity during query stage. Even the most efficient KDE methods (e.g. fastKDE OBrien16csda ) require above linear complexity () in the number of query points . In contrast, the method presented herein, DeepPDF, is a robust non-parametric approach that calculates the pdf value of a query point by a single forward pass of a NN, making query complexity to be linear in . Such query complexity supremacy makes DeepPDF a much better alternative when many query points require evaluation. Moreover, as we will see in Section 5, DeepPDF provides comparable and sometimes even a much higher approximation accuracy.

Using NNs for density estimation was studied for several decades Smolensky86rep ; Bishop94tr ; Bengio00nips ; Hinton06 ; Uria2013nips . The Mixture Density Network (MDN) presented by Bishop94tr represents parameters of a GMM through a NN, taking the advantage of its flexibility and expressiveness. Additionally, Williams at el. Williams96nc

introduced an efficient representation to parametrize a positive-definite covariance matrix of a Gaussian distribution through a NN. In image classification CNN networks produce discrete class probabilities

Krizhevsky12nips for each image. Yet, most of the work was done in the context of parametric approaches, arguably, due to their relative analytical simplicity.

Another body of research in DL-based density estimation was explored in Larochelle11aistats ; Uria2013nips ; Germain15icml

. The described autoregressive methods NADE, RNADE and MADE decompose the joint distribution of a multivariate data into a product of simple conditional densities where a specific variable ordering needs to be selected for better performance. Although these approaches provide high statistical robustness, NADE and MADE are purposed to infer density of discrete binary multi-dimensional data which limits their applicability to real world data. Further, RNADE was developed to overcome this limitation, inferring each simple conditional density through MDN. It provides a Gaussian Mixture approximation for each simple density, thus introducing assumptions about data. Instead, our approach focuses on continuous multi-dimensional data without taking any a priori data bias.

A unique work combining DL and non-parametric inference was done by Baird et al. Baird05ijcnn . The authors represent a target pdf via Jacobian determinant of a bijective NN that has an implicit property of non-negativity and total integral to be 1. Additionally, their pdf learning algorithm has similarity to our pdf loss described in Section 3. Although the authors did not connect their approach to virtual physical forces that are pushing a neural surface, their algorithm can be seen as a simple instance of the more general DeepPDF method that we contribute.

Additionally, in Nam15toias the authors describe a loss for pdf ratio estimation which is highly similar to ours. Although our pdf loss was developed independently and through different mathematical concepts, Nam et.al. came to the same conclusion by using theory of density ratio estimation Sugiyama12book . Thus, insights from Nam15toias are highly valuable and relevant also to our approach. The contributions of both our paper and of Nam15toias are primarily different in the focused problem. While in Nam15toias the authors estimate a ratio between two pdf densities, we use our loss to infer an entire pdf function which is a more challenging task.

Further, in Saremi18arxiv the authors present a method, Deep Energy Estimator Network (DEEN), to infer energy function of the data pdf - an unnormalized model that is equal to (negative) log probability density function, up to a constant. Based on previous research of score matching Hyvarinen05mlr

and its connection to denoising autoencoders

Hyvarinen08nc ; Raphan11nc , a new approach Saremi18arxiv is presented that uses NN as a model for energy approximation. The applied loss can be shown as a force that shapes the surface (the energy approximation) to be parabolic around each data point , eventually providing final surface that in some sense resembles unnormalized KDE estimation. It is done by optimizing the slope of a surface for each input dimension independently, enforcing the slope around each data point to be zero at , to be positive at nearby smaller points and to be negative at nearby larger points . Such scheme will have an optimum when surface is parabolic around . Eventually, since the parabolic shape is enforced around each data point, the final estimated energy function will have a structure very close to the real data density, thus approximating it. Note that since the applied loss deals only with the slope of energy function, the final approximation will be very far from its normalized version - that is, the normalization constant can be arbitrary big, probably depending on the initialization of NN’s weights. In contrast, the presented herein method, DeepPDF, provides pdf estimation with an overall integral being very close to the true value 1 since our method forces approximation to be point-wise equal to the real pdf function. Hence, although not explicitly normalized, the pdf model produced by DeepPDF is approximately normalized.

Finally, recently Generative Adversarial Networks (GANs) Goodfellow14nips ; Radford15arxiv ; Ledig16arxiv became popular to generate new data samples (e.g. photo-realistic images). GAN learns a generative model of data samples, thus implicitly learning also the data distribution. Additionally, this technique is applied for distribution transformation and for feature representation learning of high-dimensional measurements, making it extremely important for image processing problems. In contrast, our DeepPDF method learns data distribution explicitly and in theory can also be used for data generation. Nonetheless, there is a strong connection between the herein proposed pdf loss and recently developed novel GAN losses in Mohamed16arxiv ; Zhao16arxiv ; Mao17iccv ; Mroueh17nips ; Gulrajani17nips ; Arjovsky17arxiv , which we are going to investigate in our future work.

## 3 Probabilistic surface optimization

### 3.1 Derivation

In this paper we present an approach to learn an arbitrary probability density function (pdf), without making any assumption on the specific distribution family. Concretely, define a multivariate random variable

whose pdf we want to learn and samples from which are available to us. Define a neural network of arbitrary structure (see more details below) with weights that gets as input and returns a scalar. Our goal is for to approximate . To that end, we develop a probabilistic optimization method, named Probabilistic Surface Optimization (PSO), that uses stochastic pdf loss as described below.

Both and functions can be seen as surfaces in : dimensions from support and one additional probabilistic dimension for probability values. We look for an optimization technique with loss (w.r.t. ) that will force surface to be similar to surface . Moreover, we look for a loss with the property that given enough data samples from the optimization convergence point will coincide with for any . We will derive such an optimization method in several simple steps.

Our method is based on the main observation that performing optimization for loss is identical to pushing the virtual surface , parametrized by , at point with some point-wise force. Indeed, consider a gradient descent (GD) update of a single optimization step for data sample , with where is the step size and . Then the height change (differential) of the surface at another point can be approximated via a first-order Taylor expansion as (see derivation in Appendix A)

 df(X′)≜f(X′;θi)−f(X′;θi−1)=−δ⋅g(X′,X,θi−1), (1)
 g(X1,X2,θ)≜∂f(X1;θ)∂θT⋅∂f(X2;θ)∂θ, (2)

where the gradient similarity function expresses gradient correlation at two points. Thus, at the point where GD optimization was performed, the surface change will be , i.e. proportional to the L2 norm of gradient at this point. In other words, performing such a GD step can be thought of as pushing the surface down at point by . Changing sign will simply change the direction of the differential/force, pushing the surface up. Further, the above first-order Taylor expansion was observed empirically to be very accurate for learning rate

below 0.01. During the typical training process the learning rate is less than 0.01 for most of the optimization epochs, thus the Eq. (

1) provides a good estimation for the real differential.

Intuitively, when pushing a physical surface at point , surface height at another point also changes, according to elasticity and physical properties of the surface. Typically, such change diminishes with larger distance between and . We argue that the same is true for the NN: according to Eq. (1), at point also changes due to optimization of loss at , with differential at being proportional to . Therefore, is responsible for correlation in height change at different points. The gradient similarity function may be seen as a filter, deciding how the push at point affects surface at point . In our experiments we saw that typically there is an opposite trend between and the Euclidean distance . Yet, does not go entirely to zero for large , suggesting that there exist (small) side influence even between far away points.

Next, consider an example optimization algorithm where at each iteration we sample and , with being some other known distribution (e.g. Gaussian). In case the loss is

 L(θ,XU,XD)=−f(XU;θ)+f(XD;θ), (3)

a typical GD (or any other optimization method) update at each iteration will push up (along probabilistic dimension) the surface at point and will push down the surface at point . Such loss, also known as the critic loss of Wasserstein GAN Arjovsky17arxiv , is partially reasonable for pdf estimation since also the target pdf function on average returns high values for data coming from distribution and low values for data coming from other distributions. Yet, such simple loss does not converge as we will see below.

Further, since the considered loss in Eq. (3) is probabilistic with pushed points and being sampled, we can calculate what will be (first-order Taylor approximation) the average differential at a specific point when performing a single GD step (see Appendix B):

 E[df(X)]=δ⋅∫[PU(X′)−PD(X′)]⋅g(X′,X,θ)dX′. (4)

The above equation can be thought of as a convolution of the signal around point according to the convolution operator . At each such point there are two opposite forces, up force and down force , that push the surface with total force . Note these two forces express the impact for each of the two terms in Eq. (3), the up term and the down term .

Since does not change along the optimization (it does not involve ), it forces the surface to constantly move which prevents the optimization convergence. This can be simply shown in case is a Dirac delta function as follows (for a general function see proof in Appendix C).

The Dirac delta assumption may be seen as taking model to its expressiveness upper bound, where for all points disjoint subsets of are allocated to represent surface around each . This will make arbitrary flexible since each surface area can be changed independently of others. Yet, this expressiveness limit is not achievable since it will require of infinite size. However, the expressiveness assumption can be useful for convergence analysis since if the most flexible model can not converge then for sure the typical approximation model (e.g. neural network) will not converge also.

Given this assumption, from Eq. (4) we will have . It can be interpreted as the opposite forces and are pushing the surface at each point , without any side influence between different points. Since both forces are point-wise constant along the optimization process, the surface will be pushed asymptotically to an infinite height for points with ; at the rest of the points, will be pushed to a negative infinite height. Clearly, such an optimization algorithm will never converge to anything meaningful, unless properly regularized (e.g. see Arjovsky17arxiv ; Gulrajani17nips ).

Note that the derived above asymptotic values of ’s outputs, , typically cannot be really achieved due to limited expressiveness of a given NN architecture. Instead, usually during the optimization of loss in Eq. (3), the is increased to very large positive heights at areas of density modes, and is decreased to very large negative heights at areas of density modes. Such increase/decrease of height is unbounded due to unequal forces as was explained above. Eventually, the optimization fails due to numerical instability that involves too large/small numbers.

One way to enforce the convergence of simple loss in Eq. (3) is by adapting/changing

density along the optimization. Indeed, this is the main idea behind the contrastive divergence (CD) method presented in

Hinton02nc and further improved in Liu17arxiv . In CD, a point is approximately sampled from the current pdf estimation , by performing Monte Carlo with Langevin dynamics Hinton02nc or Stein Variational Gradient Descent (SVGD) Liu17arxiv ; Liu16nips . As explained above, in case at specific we have , the surface will be pushed up. It in its turn will increase at since . Eventually, such optimization will converge when . A similar idea was also applied in GAN setting in Kim16arxiv , where sample is generated by the generator NN and where generator output’s density is forced to approximate current pdf estimation via KL minimization between two.

Alternatively, consider the following pdf loss:

 Lpdf(θ,XU,XD)=−f(XU;θ)⋅PD(XU)+f(XD;θ)⋅[f(XD;θ)]mf, (5)

where is an analytically calculated pdf value at according to down distribution . We use square brackets to specify magnitude function terms that only produce magnitude coefficients but whose gradient w.r.t. is not calculated (stop gradient

Tensorflow_url ). Such terms do not change the direction of the gradient, affecting only its magnitude.

The loss in Eq. (5) is still pushing the surface up at points (via term ) and down at points (via term ). However, the expected differential has changed due to the new loss terms (see Appendix D):

 E[df(X)]=δ⋅∫[FU(X′)−FD(X′;θ)]⋅g(X′,X,θ)dX′, (6)

with forces and being changed to:

 FU(X′)=PU(X′)⋅PD(X′), (7)
 FD(X′;θ)=PD(X′)⋅f(X′;θ). (8)

The above differential will become zero only at , and this is where GD will converge, using an appropriate learning rate decay. Note, that unlike loss in Eq. (3), the down force is now a function of and can adapt along optimization to stabilize the opposite up force .

The PSO optimization is based on pushing the probabilistic surface, represented by a NN , through forces and (see also Figure (a)a). It does so by applying the pdf loss defined in Eq. (5) iteratively, where this loss can also be written in batch mode, with multiple samples from data densities (see Section 4). Additionally, pdf loss can be also derived in an alternative way as a sampled approximation of (see Appendix E):

 L(θ)=EX∼PD[[PU(X)−f(X;θ)]2], (9)

similarly to the ratio estimation methods in Sugiyama12book . Although such derivation is faster, it lacks the intuition about the physical forces and the virtual surface. Yet, this intuition is very helpful for the below analysis of learning stability.

### 3.2 PSO Analysis

Here we analyze PSO convergence by assuming, for simplicity, that is a Dirac delta function. In other words, we assume that there is no correlation and side-influence among pushed points. As explained above, this can also be seen as an expressiveness limit assumption for extremely flexible surface where a push at affects only the surface height at . We emphasize this assumption is undertaken only to simplify the converegence analysis, while in our experiments was not constrained in any way. Nevertheless, the provided results (Section 5) support the below analysis. A more general analysis is part of future research.

Considering Eqs. (5)-(8), we can analyze the work done by PSO in two different perspectives. First is the typical optimization analysis of loss in Eq. (5), while second is the analysis of dynamical physical forces on surface in some physical system using Eqs. (7) and (8). Both perspectives are two different sides of the same coin. Below we will use each of these perspectives to analyze the properties of PSO.

Considering an arbitrary point and using a physical perspective, the balance between up and down forces at the point happens when: . Convergence to such balance can be explained as follows. In case , we can see from Eqs. (7) and (8) that , so the surface at will be pushed up by the pdf loss. If, on the other hand, will be true, then it applies and the opposite will happen, i.e. the surface will be pushed down. Eventually, when balance is achieved, both up and down forces will be equal and cancel each other; the surface at this point will not be affected anymore. Thus, PSO will indeed optimize to be in a time limit.

Intuitively, considering the above analysis and Eq. (4) we would like the gradient similarity to resemble kernel function with bandwidth being close to average distance between training points. In this way at each point the surface will be modified only by pushes at nearby training points with whom has similar statistical frequency (pdf value). Yet, in typical NN the gradient similarity doesn’t have the kernel shape, although it has rough opposite correlation with distance between two points. This in its turn raises an interesting research direction for NN architecture with more kernel-like function . We leave this direction for the future research.

### 3.3 Choosing PD

For we can use any candidate distribution that covers the support of a target distribution . This can be concluded from the following reasons. Consider points . From a physical perspective, Eqs. (7) and (8) tell us that both and are zeros at such points. From an optimization perspective, we can conclude the same by noting that the second term in Eq. (5), , will never be activated at such points since cannot be sampled from down distribution . Moreover, the first term, , has a zero multiplier inside, thus nullifying it. Therefore, at such points there will be no optimization done and

typically will be just interpolation of surface heights at points around

.

Further, consider points . The up force will not be activated since cannot be sampled from up distribution . The force will push surface down at such but using optimization perspective we can see that once surface height becomes negative, , the force will change its direction and push surface up due to sign canceling in down term in Eq. (5). Thus, for such points the will oscillate around zero and given an appropriate learning rate decay it will eventually converge to zero which is also the value of at this point.

Concluding from above, the only requirement on distribution is to wrap support of . Using such , the surface at points will converge to , and at points it will converge to zero. Empirically we observed that the above ”wrap” requirement on is typically enough for proper learning convergence. However, some other constraints on selecting may also exist. For instance, if there is an area where we may have many samples from but no samples from . Then the force balance in this area will not be reached and the surface will be pushed to infinity, causing instability in the entire optimization process. Intuitively, for proper convergence the ratio should be bounded in some dynamical range. A more detailed investigation is required to clarify this point which we consider as future work.

Typically limits of ’s support can be easily derived from the training dataset and a proper candidate for can be picked up. Furthermore, in case we do not know ’s support range and still want to learn a target pdf without experiencing optimization divergence, we can use a ”support-safe” version of pdf loss:

 Lsprt−safepdf(θ,XU,XD)=−f(XU;θ)⋅PD(XU)⋅[sign(Pmax−f(XU;θ))]mf+f(XD;θ)⋅[f(XD;θ)]mf, (10)

where the new term constrains the maximal height of surface to be . Note that such a constraint will prevent to accurately approximate target at points .

Finally, the density estimation can also be done via a variant of the pdf loss without the magnitude function terms:

 Lpdf(θ,XU,XD)=−f(XU;θ)⋅PD(XU)+12[f(XD;θ)]2, (11)

which has exactly the same optimization gradient as the standard pdf loss from Eq. (5).

## 4 Deep PDF approach

As explained above, the described PSO and pdf loss will force a NN to approximate . We can also use PSO in batch regime by sampling sets of points from distributions and , and , and updating appropriately by minimizing loss . Using batch of points for a single update can be seen as many points on the surface being pushed up/down in order to reach a force balance. Such an update should be much more accurate since the number of points in batch will reduce the stochastic nature of the pdf loss, and make real differential closer to its expected value in Eq. (6). The entire batch learning algorithm, named DeepPDF, is detailed more scrupulously in Appendix F. Below, we provide details on different deep learning aspects of the presented approach.

#### Learning rate and convergence

In our experiments we saw that convergence to the target pdf happens very fast, requiring us only to apply ”good” learning rate decay policy. Empirically observed, during the first several thousand iterations the surface reaches high similarity with surface , and its heights at various points continue to oscillate near the ground truth heights of target surface. Such oscillation around the target surface can be explained by the fact that at each iteration only points are pushed up/down. These points are sampled at each iteration from / and up/down forces are applied only at them during this iteration. Thus, considering only one iteration the true forces are very stochastic and may be far from their averaged expressions in Eqs. (7) and (8), but still are not too far since becomes similar to the target. Intuitively, we can see that the proper thing to do is to wait till starts to oscillate around some surface and then slowly decrease the learning rate

. Such a heuristic policy can be achieved for example by exponential decay learning rate which at first stage is constant letting

to get closer to , sequentially reducing learning rate to its minimal value via , where is the start learning rate, is decay rate, is number of steps before each learning rate reduction and is iteration index. We used Adam optimizer Kingma14arxiv with start learning rate and decay rate . In our experiments we observed that using and provides better convergence and higher density estimation accuracy. Allowing PSO to take more time to converge empirically was seen to provide more accurate approximation. Still, it is up to the network architect to define learning rate decay policy according to time vs accuracy preference.

Convergence of pdf loss is stochastic as was explained above. Empirically observed, the pdf loss from Eq. (5) does not provide clear convergence metric since it is too noisy and just oscillates around zero for most part of the optimization. Yet, the value of pdf loss without stop-gradient (Eq. (11)) provides better insight about the current optimization convergence, since it has direct connection to the squared error in Eq. (24). It reduces monotonically during the optimization and may be used to decide when accuracy stops improving (e.g. for early stopping). Alternatively, it is possible to use learning rate with exponential decay and stop training when learning rate get very close to and does not change anymore. Further, value of Eq. (11) is also useful for choosing the most accurate model between several trained candidates.

#### Overfitting

Like any other machine learning method, DeepPDF can overfit to training dataset. This can happen for example, if the used neural network

is highly expressive and the amount of available points from is small. In such case will converge to zero-height surface with peaks at available training points. In a sense, this would not be a mistake since the distribution of such a small and scarce training dataset has indeed flat pdf with peaks around training points. However, in some cases data is very difficult to generate and we need to do the best with a little data that we have.

In order to detect overfitting, the typical approach also applicable here is to have a separate validation dataset of data generated from , and check its performance (for example, via pdf loss in Eq. (11)). Once validation performance starts to decrease, overfitting takes place.

To deal with the overfitting problem, the network architect can decrease expressiveness of the network by reducing its size or adding L2 regularization on weights . Empirically we saw that such methods can indeed help. Reducing expressiveness of the network will of course increase the approximation error between and . Yet, we cannot expect high approximation accuracy when there is only scarce data amount available.

Still, there is one additional solution to the problem of a little data. We can perform data augmentation to add some diversity noise into our samples, based on our knowledge about the real distribution . This technique is very popular in image classification field where it proved to be very useful Wong16arxiv; Perez17arxiv

, producing object classifier that is robust to various camera light conditions and which is less prone to the overfitting. In case of the density estimation such method will increase variability of sample points and reduce the probability pike around each such point. Yet, the specific data augmentation process will also introduce bias (prior knowledge) in our pdf inference - the properties of the output pdf surface (e.g. its smoothness) will be affected by it. If such prior knowledge is correct and if data is indeed scarce, the data augmentation can tremendously improve accuracy of the density estimation.

#### Positive constraint

Since the network is an approximation of a non-negative function

, it makes sense to explicitly constrain its output to be positive/non-negative. For example, we can use activation functions Relu or Softplus at the end of the network. Alternatively, we can use an exponential function whose outputs are positive. This will be especially appealing since the input of an exponential function will be the log-probability function

. Such log-probability is very useful, for example, in estimation applications.

However, using such activation functions (especially exponential) can damage the numerical stability of learning. For example, will return infinity for . During training the network signal is not stable and can go above 88 depending on learning rate, causing training process to fail. Such a problem can be dealt with by initializing weights in a different way or by reducing learning rate. Yet, such handling is time-consuming and by itself may slow down the learning process.

Alternatively, we can use with an unconstrained output. As was explained in Section 3, the pdf loss will eventually force it to be non-negative at points sampled from both and . Of course a little oscillation around zero may occur. Also, at other points, where , the may return negative values since these areas are not affected by PSO optimization (at least not directly). But this can be easily handled by a proxy pdf function:

 ¯f(X;θ)={0,if f(X;θ)<0 or PD(X)=0f(X;θ),otherwise (12)

#### Inner network architecture

Expressiveness of a NN , i.e. how well it can approximate a specific surface , depends on its inner structure. Number, type and size of inner layers, all these will eventually affect ’s elasticity and the final approximation accuracy. Yet, while the optimal inner architecture of the network depends on the target pdf and its surface properties, in Section 5 we show that even a relatively simple network can approximate various target pdfs with high accuracy, being competitive to the state-of-the-art KDE methods. Still, finding an inner architecture that works perfectly for all target surfaces is an open research question.

Additionally, we tested the usage of batch-normalization (BN) and of dropout within our FC layers. We saw empirically that BN prevents DeepPDF from learning the target pdf. It can be explained by the fact that BN changes the distribution of each batch while DeepPDF basically relies on this distribution to approximate the target pdf. Similarly, dropout was seen to damage accuracy of density estimation. Thus, BN and dropout harm performance of our method and should be avoided for proper DeepPDF convergence.

#### Batch size vs. estimation quality

As was explained above the batch size can be interpreted as amount of points on surface where we push up/down during a single iteration. The bigger is and the more points are pushed at once, the closer are the real applied up/down forces to their expected values in Eqs. (7) and (8). Thus, bigger will produce more stable, fast and accurate approximation results, which was also observed empirically. Up to memory resource constraints, we suggest to use large batches for best performance.

## 5 Experimental evaluation

We evaluate our DeepPDF approach on three target densities, whose analytical expressions can be found in Appendix H. The inferred pdfs conceptually represent complicated multi-modal and discontinuous densities, which need to be inferred in real world applications (e.g. estimating measurement model in robotics domain).

In all scenarios, network

contains 3 FC layers of size 1024 with Relu non-linearity, followed by an output FC layer of size 1 with no activation function. Also, we use uniform distribution for

. During learning the batch size is 1000 and overall size of training samples for all evaluated methods is . The effect of smaller dataset size on the inference accuracy is part of future investigation. The learning rate was decayed through the same decay policy for all scenarios, without use of early stopping. The overall number of iterations was around for each scenario. Once trained, we use pdf proxy function from Eq. (12) as our pdf function approximation. DeepPDF was trained with Adam optimizer Kingma14arxiv using GeForce GTX 1080 Ti GPU card.

For comparison, we infer the same densities with KDE method, using two KDE implementations: First is KernelDensity provided by sklearn library Sklearn_KDE_url and represents a standard KDE approach. Second is fastKDE OBrien16csda which is a state-of-the-art KDE extension that provides better accuracy and computational efficiency. Optimal parameters for Sklearn-KDE’s were found via grid search, while fastKDE performs automatic parameter search.

Additionally, we experiment with a simplified version of PSO in Appendix G to demonstrate its power to infer probability of a specific single point.

### 5.1 Learning 2D distributions

Here we use DeepPDF to estimate density of two 2D distributions shown in Figures (b)b (Cosine) and (a)a (Columns). The corresponding density approximations are shown in Figures (c)c and (b)b. As can be seen, both distributions were inferred with high accuracy.

In Table 1 we compare performance of all the three approaches: DeepPDF, Sklearn-KDE and fastKDE. The used performance metric is average L2 distance , where is an approximation function and points form a grid in ’s support. As can be seen, all methods provide comparable accuracy. For Columns pdf the accuracy of DeepPDF is 36 times higher than other techniques, probably because there are density areas with different optimal bandwidths. KDE methods are using the same bandwidth for all inner kernels and cannot approximate such target pdf optimally. However, there is no such problem in DeepPDF since it is highly flexible and capable of approximating a wide variety of functions without the need to specify bandwidth. For Cosine pdf the accuracy of fastKDE is only 1.9 times lower than DeepPDF. The reason for such relatively comparable accuracy is the smoothness of the Cosine pdf that implies single optimal bandwidth for entire pdf surface.

Also note that query stage of both KDE techniques is very slow - it takes about two days for Sklearn-KDE and about an hour for fastKDE to evaluate 66000 points. In contrast, once trained DeepPDF can evaluate such query number in matter of seconds.

In context of KDE methods, the sample points can be viewed as parameters that represent target pdf approximation function, with such representation being inefficient for the query stage. Thus, the above KDE approaches can be seen as being parametrized by parameter vector of size identical to the size of training dataset (

in our experiments) multiplied by data dimension. On other hand, the size of representation parameter vector in our DeepPDF method is only about ( for 2D distributions). Moreover, post-training investigation of on test points showed that only small subset of weights has non-zero gradient on at least one test point. That is, only several tens of thousands of parameters are used to represent surface , suggesting that most of can be pruned away. Yet, with more compact representation DeepPDF successfully infers the target pdf with even higher accuracy relatively to KDE methods. This supports one of the main arguments of this paper that neural network is more flexible to approximate probability densities than sum of kernels.

### 5.2 Learning 3D range measurement model

Further, we evaluate the above density techniques to infer a 3D distribution. We use a RangeMsr distribution that is similar to the measurement model of a range sensor which is highly adapted in robotics domain. Each sample point has a structure where and are uniformly sampled from range and symbolically represent robot location. Additionally, is sampled from and symbolize distance measurement between robot and some oracle located at origin, plus some noise. Using specific , the RangeMsr distribution is visualized in 3D space in Figure (c)c. As can be seen, such distribution has infinitely many modes and approximating it with simple methods like GMM would fail.

In Figure (d)d we can see that DeepPDF’s approximation of RangeMsr is indeed very accurate. In Table 1 we compare L2 performance of DeepPDF vs KDE methods. As can be seen, our method performs much more accurate estimation than the standard KDE implementation by sklearn library Sklearn_KDE_url , and even is superior to a state-of-the-art fastKDE approach OBrien16csda .

## 6 Conclusions and future work

In this paper we presented a novel non-parametric density estimation technique: DeepPDF. It uses a neural network representation of a target pdf which produces high expressiveness, allowing it to approximate an arbitrary multi-dimensional pdf given sample points of thereof. Albeit it requires to design the neural network structure for pdf appropximator, the same simple composition of fully-connected layers with Relu non-linearity was able to approximate all experimented densities with high accuracy. Moreover, once trained, DeepPDF can be efficiently queried for a pdf value at a specific point by a simple network forward pass, making it a much better alternative to the compared KDE methods. We enforce the non-negativity of DeepPDF pdf approximation via proxy function from Eq. (12). The total integral is not explicitly constrained to be 1, yet we verified it via numerical integration and saw it to be almost identical to the correct value 1.

Further, we believe the described probabilistic surface optimization (PSO) is a conceptually novel approach to infer data density and can be applied in other statistical frameworks besides density estimation, such as new data generation. As future research directions, we will investigate both DeepPDF theoretical and empirical convergence properties, and its dependence on data dimension and batch size. Additionally, we will explore the relation between the accuracy of DeepPDF density estimation and size of the training dataset.

## References

• (1) C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152–1174, 1974.
• (2) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
• (3) Leemon Baird, David Smalenberger, and Shawn Ingkiriwang. One-step neural network inversion with pdf learning and emulation. In 2005 IEEE International Joint Conference on Neural Networks, IJCNN’05, volume 2, pages 966–971. IEEE, 2005.
• (4) Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 400–406, 2000.
• (5) C.M. Bishop. Mixture density networks. Technical report, Aston University, Birmingham, 1994.
• (6)
• (7) Tarn Duong and Martin L Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485–506, 2005.
• (8) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In Intl. Conf. on Machine Learning (ICML), pages 881–889, 2015.
• (9) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
• (10) Dilan Görür and Carl Edward Rasmussen. Dirichlet process gaussian mixture models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4):653–664, 2010.
• (11) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (NIPS), pages 5769–5779, 2017.
• (12) Nils-Bastian Heidenreich, Anja Schindler, and Stefan Sperlich. Bandwidth selection for kernel density estimation: a review of fully automatic selectors. AStA Advances in Statistical Analysis, 97(4):403–433, 2013.
• (13) G.E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
• (14) Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
• (15) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
• (16) Aapo Hyvärinen. Optimal approximation of signal priors. Neural Computation, 20(12):3087–3110, 2008.
• (17) Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
• (18) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• (19) D. Kopitkov and V. Indelman. Robot localization through information recovered from cnn classificators. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS). IEEE, October 2018. Accepted.
• (20) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
• (21) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics

, pages 29–37, 2011.
• (22) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al.

Photo-realistic single image super-resolution using a generative adversarial network.

arXiv preprint, 2016.
• (23) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In NIPS, pages 2378–2386, 2016.
• (24) Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized mle. arXiv preprint arXiv:1707.00797, 2017.
• (25) S. N. MacEachern and P. Muller. Estimating mixture of dirichlet process models. Journal of Computational and Graphical Statistics, 7:223–238, 1998.
• (26) Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In

Intl. Conf. on Computer Vision (ICCV)

, pages 2813–2821. IEEE, 2017.
• (27) G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
• (28) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
• (29) Youssef Mroueh and Tom Sercu. Fisher gan. In Advances in Neural Information Processing Systems (NIPS), pages 2510–2520, 2017.
• (30) Hyunha Nam and Masashi Sugiyama.

Direct density ratio estimation with convolutional neural networks with application in outlier detection.

IEICE TRANSACTIONS on Information and Systems, 98(5):1073–1079, 2015.
• (31) Travis A O’Brien, Karthik Kashinath, Nicholas R Cavanaugh, William D Collins, and John P O’Brien. A fast and objective multidimensional kernel density estimation method: fastkde. Computational Statistics & Data Analysis, 101:148–160, 2016.
• (32)
• (33) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
• (34) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.
• (35) Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
• (36) David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
• (37) Jayaram Sethuraman and Ram C Tiwari. Convergence of dirichlet measures and the interpretation of their parameter. In Statistical decision theory and related topics III, pages 305–315. Elsevier, 1982.
• (38) Bernard W Silverman. Density estimation for statistics and data analysis. Routledge, 2018.
• (39) Sklearn KernelDensity.
• (40) Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1. The MIT press, Cambridge, MA, 1986.
• (41) Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
• (42) TensorFlow.
• (43) Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems (NIPS), pages 2175–2183, 2013.
• (44) Peter M Williams. Using neural networks to model conditional multivariate densities. Neural Computation, 8(4):843–854, 1996.
• (45) Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

## Appendix A: Surface Change after Gradient Descent Update of L(θ,X)=f(X;θ)

Performing a single gradient descent step w.r.t. loss at a specific optimized point results in a weights update , where is step size and . After such update, the surface height at point , , can be approximated via a Taylor expansion as:

 f(X;θi)=f(X;θi−1)+(θi−θi−1)T⋅∂f(x;θ)∂θ|x=X,θ=θi−1++12⋅(θi−θi−1)T⋅Hθ⋅(θi−θi−1)+⋯==f(X;θi−1)−δ⋅(∂f(x;θ)∂θ|x=X,θ=θi−1)T⋅∂f(x;θ)∂θ|x=X,θ=θi−1++12δ2⋅(∂f(x;θ)∂θ|x=X,θ=θi−1)T⋅Hθ⋅(∂f(x;θ)∂θ|x=X,θ=θi−1)+⋯, (13)

where:

 Hθ≜∂2f(x;θ)∂θ2|x=X,θ=θi−1. (14)

Above we can see the first two elements of Taylor expansion where the second element depends on a quadratic step size . The (learning) step size is typically less than one. Further, assuming it is very small number (), can be approximated good enough by a first-order Taylor expansion as:

 f(X;θi)=f(X;θi−1)−δ⋅(∂f(x;θ)∂θ|x=X,θ=θi−1)T⋅∂f(x;θ)∂θ|x=X,θ=θi−1, (15)

and define

 df(X)≜f(X;θi)−f(X;θi−1)=−δ⋅g(X,X,θi−1), (16)
 g(X1,X2,θ)≜∂f(X1;θ)∂θT⋅∂f(X2;θ)∂θ, (17)

where function calculates the cross-gradient similarity between two different points and , i.e. the dot product between gradients at these two points. Further, is the L2 norm of gradient at point .

Similarly, the change of surface height at point , which is different from the optimized point (i.e. ), can be approximated as:

 df(X′)=−δ⋅g(X′,X,θi−1). (18)

Note that the cross-gradient similarity function is symmetric, i.e. . Thus, from the above we can see that for any two arbitrary points and , pushing at one point will change height at another point according to the cross-gradient similarity . Therefore, is responsible for correlation in height change at different points. Given is small (or zero), optimizing loss (pushing surface) at point will not affect the surface at point . The opposite is also true: if is large, then pushing the surface at point will have a big impact at point .

## Appendix B: Surface Change after Gradient Descent Update of Simple Loss

In case of simple loss where and are sampled from and respectively, consider weights update . Similarly to Eq. (13), the change of surface height at some point can be approximated as a first-order Taylor expansion:

 df(X)=δ⋅g(XU,X,θi−1)−δ⋅g(XD,X,θi−1). (19)

Further, considering the stochastic nature of and , the average differential can be calculated as:

 E[df(X)]=δ⋅∫PU(X′)⋅g(X′,X,θi−1)dX′−−δ⋅∫PD(X′)⋅g(X′,X,θi−1)dX′=δ⋅∫[PU(X′)−PD(X′)]⋅g(X′,X,θi−1)dX′. (20)

Again, we see that has the role of a filter that controls correlations between points. In contrast, the term depends only on point .

## Appendix C: Proof of Divergence of Simple Loss

In case of simple loss the average differential ,

 E[df(X)]=δ⋅∫[PU(X′)−PD(X′)]⋅g(X′,X,θi−1)dX′, (21)

contains signal term and filter term . Since the signal term does not depend on and is unchanging along the optimization, there is no stable point for the optimization convergence. Thus, the learning process diverges. This can be proved by contradiction as follows.

Assume, in contradiction, that the above simple loss converges at optimization step to the loss’s local minimum. Then, for the following optimization steps, , and so , will remain (almost) constant - GD will stay in the minimum area. Hence, the integral in Eq. (21) becomes a constant for each given point , since does not change anymore. This constant is non-zero, unless both up and down densities are identical ; or the learned is flat ( for any two points). Yet, such specific cases are degenerate and we exclude them.

Concluding from the above, after optimization step the became non-zero constant for each . Therefore, on average the surface height continues to change by a non-zero constant at each point , going towards . Yet, such change of surface is impossible since is (almost) constant after . We got a contradiction, meaning that the above convergence assumption is wrong; that is, the simple loss cannot converge due to non-zero signal .

## Appendix D: Surface Change after Gradient Descent Update of PDF Loss

In case of pdf loss where and are sampled from and respectively, consider weights update . Similarly to Eq. (13), the change of surface height at some point can be approximated as first-order Taylor expansion:

 df(X)=δ⋅PD(XU)⋅g(XU,X,θi−1)−δ⋅f(XD;θ)⋅g(XD,X,θi−1). (22)

Further, considering stochastic nature of and , the average differential can be calculated as:

 E[df(X)]=δ⋅∫PU(X′)⋅PD(X′)⋅g(X′,X,θi−1)dX′−−δ⋅∫PD(X′)⋅f(X′;θ)⋅g(X′,X,θi−1)dX′==δ⋅∫PD(X′)⋅[PU(X′)−f(X′;θ)]⋅g(X′,X,θi−1)dX′. (23)

## Appendix E: PDF Loss as Empirical Approximation of Continuous Squared Loss

Consider a squared loss between NN surface and a target function , weighted by the pdf of down distribution:

 L(θ)=EX∼PD[[PU(X)−f(X;θ)]2]=∫PD(X)⋅[PU(X)−f(X;θ)]2dX. (24)

Similarly to the ratio estimation methods in Sugiyama12book , we can show that the above loss can be minimized by minimizing our pdf loss, as follows:

 L(θ)=∫PD(X)⋅[PU(X)]2+PD(X)⋅[f(X;θ)]2−2⋅PD(X)⋅PU(X)⋅f(X;θ)dX. (25)

Since the first term doesn’t depend on , we can rewrite the loss as:

 L(θ)=∫PD(X)⋅[f(X;θ)]2−2⋅PD(X)⋅PU(X)⋅f(X;θ)dX==∫PD(X)⋅[f(X;θ)]2dX−2⋅∫PU(X)⋅PD(X)⋅f(X;θ)dX==EX∼PD[f(X;θ)2]−2⋅EX∼PU[PD(X)⋅f(X;θ)]. (26)

The above can be approximated by samples from distribution and samples from distribution as:

 L(θ)=1NN∑i=1f(XiD;θ)2−2⋅1NN∑i=1PD(XiU)⋅f(XiU;θ) (27)

where for notation simplicity we can rewrite it as one-sample version:

 L(θ)=f(XD;θ)2−2⋅PD(XU)⋅f(XU;θ). (28)

Next, note that the above loss has the same gradient w.r.t. as the following loss:

 L(θ)=2⋅f(XD;θ)⋅[f(XD;θ)]mf−2⋅PD(XU)⋅f(XU;θ), (29)

where specifies magnitude function terms that only produce magnitude coefficients but whose gradient w.r.t. is not calculated (stop gradient in Tensorflow Tensorflow_url ). Thus, both losses are equal in context of the gradient-based optimization.

After changing order of terms in Eq. (29), and deviding by (e.g. moving it into learning rate), we will finally get the pdf loss: