 # Training Neural Networks for Likelihood/Density Ratio Estimation

Various problems in Engineering and Statistics require the computation of the likelihood ratio function of two probability densities. In classical approaches the two densities are assumed known or to belong to some known parametric family. In a data-driven version we replace this requirement with the availability of data sampled from the densities of interest. For most well known problems in Detection and Hypothesis testing we develop solutions by providing neural network based estimates of the likelihood ratio or its transformations. This task necessitates the definition of proper optimizations which can be used for the training of the network. The main purpose of this work is to offer a simple and unified methodology for defining such optimization problems with guarantees that the solution is indeed the desired function. Our results are extended to cover estimates for likelihood ratios of conditional densities and estimates for statistics encountered in local approaches.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

The likelihood ratio of two probability densities is a function that appears in a variety of problems in Engineering and Statistics. Characteristic examples ,  constitute Hypothesis testing, Signal detection, Sequential hypothesis testing, Sequential detection of changes, etc. Many of these problems also use the likelihood ratio under a transformed form with the most frequent example being the log-likelihood ratio. In all these problems the main assumption is that the corresponding probability densities are available under some functional form. What we aim in this work is to replace this requirement with the availability of data sampled from each of the densities of interest.

As we mentioned, the computation of the likelihood ratio function relies on the knowledge of the probability densities which, for the majority of applications, is an unrealistic assumption. One can instead propose parametric families of densities and, with the help of available data, estimate the parameters and form the likelihood ratio function. However, with the advent of Data Science and Deep Learning there is a phenomenal increase in need for processing data coming from images, videos etc. For most of these cases it is very difficult to propose any meaningful parametric family of densities that could reliably describe their statistical behavior. Therefore, these techniques tend to be unsuitable for most of these datasets.

If parametric families cannot be employed one can always resort to nonparametric density estimation  and then form the likelihood ratio. These approaches are purely data-driven but require two different approximations, namely one for each density. Additionally, most nonparametric density estimators involve critical parameters, as the “bandwidth”, that trade magnitude accuracy versus support accuracy with the corresponding tuning process not being well defined.

If we are only interested in the likelihood ratio, or its transformation with some fixed function, we could ask ourselves whether it is possible to directly estimate it, without passing through the double estimation of the two individual densities. Furthermore, motivated by their wide popularity and success in approximating nonlinear functions , it is only natural to be interested in developing a methodology based on neural networks. This is exactly the goal of this article, namely, to develop likelihood ratio estimation methods based on neural networks.

Of course, there already exists a substantial literature addressing the problem of density ratio estimation [5, 6, 7] (and references therein). We must also mention the usage of these methods in several applications as covariate shift corrections , density ratio estimation with dimensionality reduction , change detection , approximate likelihood estimation 

, etc. The focus and main tool in all these publications is density ratio estimation and no effort is made to estimate any transformation of this ratio or any other meaningful statistic that occurs in Detection and Hypothesis testing. With our present work we intend to address these missing parts by developing a unified methodology which can be used to directly estimate these functions.

Our paper is organized as follows: Section I contains the Introduction, In Section II we describe how a basic optimization problem must be defined in order to accept as solution a desired function. Section III contains a number of data-driven versions of the optimizations of Section II that are related to important problems in Detection and Hypotheses testing theory. In Section IV we apply our methodology to specific problems from Hypothesis testing, Classification and Sequential change detection and compare the different possibilities. Finally, in Section V we present our concluding remarks.

## Ii A General Optimization Problem

Let us begin our analysis by introducing a simple optimization problem which will serve as our basis for the final, data-driven version. Suppose that

is a random vector with probability density

and consider the following cost function

 (1)

where denotes expectation with respect to ; are scalar functions of the scalar and are scalar functions of the vector with taking values in a known interval . We are interested in the following optimization problem

 (2)

Our goal is to design the two functions so that the solution of (2) is of the form which is described next.

###### Problem 1.

If is a known scalar function of the scalar , we would like to design the two functions so that the global minimizer in (2) is equal to .

Problem 1 is not difficult and with the next lemma we introduce a necessary condition which constitutes the first step towards its solution.

###### Lemma 1.

In order for the optimization problem in (2) to accept as solution the function , it is necessary that satisfy

 (3)

for all real .

###### Proof:

Because is a function of the desired minimization can be performed as follows

 (4)

In other words, we can apply the minimization directly onto the function under the expectation and perform it for each fixed . By fixing , from (4) we see that the optimum will depend only on the value of the function suggesting that the optimum is in fact of the form . Because of this observation we can drop the dependence of on and solve, instead, the following problem for each fixed

 minu{ϕ(u)+rψ(u)}. (5)

We know that all critical points of satisfy the equation111Although not explicitly stated as assumption, wherever needed, we assume that derivatives and gradients of functions exist and are of sufficient smoothness.

 ϕ′(u(r))+rψ′(u(r))=0. (6)

Since we would like to be a critical point, it is clear that (6) must be true with replaced by , namely Eq. (3). This completes the proof. ∎

###### Remark 1.

In Lemma 1 it is particularly appealing the fact that (3) is independent from the probability density and the function and it only depends on .

###### Corollary 1.

If the transformation is strictly monotone then (3) is equivalent to

 ϕ′(z)+ω−1(z)ψ′(z)=0, (7)

for all , where denotes the inverse function of .

###### Proof:

Eq. (7) follows immediately after defining , which implies . ∎

Assumption: From now on we concentrate on strictly monotone transformations and, therefore, we rely on (7). Furthermore, without loss of generality, we assume that is strictly increasing. Guaranteeing that the only solution in (6) is the desired is possible by imposing additional constraints on the pair . The next lemma provides the required conditions.

###### Lemma 2.

Select a function that satisfies for all and define with

 ψ′(z)=ρ(z), and ϕ′(z)=−ω−1(z)ρ(z), (8)

then has a unique minimum for at . Furthermore, if , i.e.  and is selected to satisfy

 ρ′(z)>0  and  ω−1(z)ρ′(z)+(ω−1(z))′ρ(z)<0, (9)

for all then the function is strictly convex in .

###### Proof:

Using (8) we can write

 ϕ′(z)+rψ′(z)=(r−ω−1(z))ρ(z).

The roots of the equation are the critical points of . Clearly is the root of interest. In order to guarantee that no other root exists we need for all . For this to be true, if we assume that is continuous, it is necessary and sufficient that does not change sign. Because we assumed that is strictly increasing, the same monotonicity property is inherited by the inverse function therefore, in order for to be the only minimum of , its derivative which is equal to , must be negative for and positive for . This is assured when . No other extrema are possible unless we allow as approaches the end points of and, in this case, these extrema will be (local) maxima.

If then we can also force the function to be convex in . For this property to hold we need both functions to be convex, namely, . These two inequalities, after taking into account (8), are equivalent to (9). This completes the proof. ∎ Fig. 1: Convex and non-convex forms of the function ϕ(z)+rψ(z) that exhibit a single minimum.

Characteristic examples of convex (blue) and non-convex (red) functions enjoying a unique minimum are captured in Fig. 1. The non-convex function can exhibit local maxima at the end points of its domain which in this example are 0 and . Regarding the functions , they are defined in (8). This means that their explicit computation requires integration of and respectively. We have the following interesting point related to this issue.

###### Remark 2.

For applying our method, as we explain in Section III-A, we do not necessarily need closed form expressions for . This is because for the solution of the corresponding minimization problem we usually employ some form of gradient descent algorithm which requires only the derivatives and not the functions themselves. As we can see from (8) these derivatives are defined in terms of which are known.

For different selections of the function , let us present examples of pairs , derived by applying the previous analysis.

### Ii-a Examples for ω(r)=r

We are interested in identifying where no transformation is involved. This corresponds to selecting and . From our previous discussion, as long as , the optimum solution of (5) is and the minimizer is unique.

Case : The function of interest can take any real value. We have and , therefore we need to consider . According to (8), Lemma 2, we must define

 ϕ′(z)=−zρ(z),  ψ′(z)=ρ(z). (10)

In all the examples that follow, one can verify that the corresponding function is strictly negative (except at individual points).

• If we select , with this yields and . For , , , . For , , , .

• If we select , this yields and . This example corresponds to functions that are not both available in closed form. However they can still be used to define an optimization problem whose solution is numerically computable.

Case : The function of interest is positive. This means and , consequently we need to consider . It is straightforward to see that all the previous examples are still valid with the simplification and . We have an additional example that applies to this case.

• If we select then, and .

Since it is also possible to impose on the constraints in (9) so that the resulting function is convex in . These conditions take the form

 ρ′(z)>0  and  ρ(z)+zρ′(z)<0. (11)

We can verify that in Example 1) we enjoy convexity when and that, we also enjoy convexity with Examples 2) and 3).

As we mentioned in the Introduction, in the existing literature the focus is primarily on the estimation of the likelihood ratio. In [5, 6] a general methodology is developed for defining optimization problems based on the Bregman divergence . In fact, we can show that there is a one-to-one correspondence between our function and the function adopted in  for the definition of the Bregman divergence, making the two approaches equivalent. However this equivalence applies only to with which covers the case of the likelihood ratio. Our method, as we emphasize repeatedly, is far more general. Not only we can consider any nonlinear transformation but, even in the case , we can have which, as we will see, allows us to estimate statistics encountered in local approaches that are not likelihood ratios. For the existing literature, we would like to add that, the most popular optimization problem for likelihood ratio estimation is the mean square error

 J(u)=E0[(u(X)−r(X))2]=2E0[u2(X)2−r(X)u(X)]+C. (12)

Since the term does not contain the unknown function , it can be omitted from the cost. The cost in (12) corresponds to and it is a special case of Example 1) with .

### Ii-B Examples for ω(r)=logr

In this case we focus on meaning that , therefore . Also we must have namely suggesting that which means that . As before must be strictly negative in order to have a single minimum and, according to (8) in Lemma 2, we must define

 ψ′(z)=ρ(z),  ϕ′(z)=−ezρ(z). (13)

In the examples that follow, we can verify that the corresponding function is indeed negative.

• If with , this produces

 ϕ(z)=11−α{e(1−α)z−1},  ψ(z)=1α{e−αz−1}.

If then , , . If then , and .

• If then, and .

• If this yields and . The two functions can be written in terms of the Exponential integral or with the help of a power series expansion, but they do not enjoy closed form expressions. On the other hand, their derivatives are simple and can be clearly used in an iterative algorithm to numerically compute the solution of the corresponding optimization.

Since we can also consider the case of being convex. According to (9), is required for to satisfy

 ρ′(z)>0  and  ρ(z)+ρ′(z)<0. (14)

In Example 1), (14) is valid for . In Examples 2) and 3) the particular selection of also satisfies (14), thus giving rise to convex functions. There is no equivalence in the literature for this estimation problem and the corresponding estimation method appears here for the first time.

### Ii-C Examples for ω(r)=rr+1

For we are interested in the transformation . If denotes likelihood ratio then is the posterior probability (for equal priors) which we can estimate directly. Again we have and therefore we must consider while . For the functions , from (8) we have

 ψ′(z)=ρ(z),   ϕ′(z)=−z1−zρ(z). (15)
• If we select , this yields and . This particular combination leads to the cross-entropy loss method  which, with the hinge loss method (presented in the next section) constitute the most popular techniques for solving the classification problem. However, we must point out that from our analysis it becomes clear that the cross-entropy can also be used in other problems since it provides estimates of a strictly monotone transformation of the likelihood ratio.

• If we select , for , this yields and . For we obtain and and for the two functions become and .

### Ii-D Examples for ω(r)=sign(logr)

Here we are interested in estimating the sign of . This function appears in binary classification. Unfortunately is not continuous and not strictly increasing. The means to bypass this problem is to approximate the sign with a function that enjoys these necessary properties. We propose two different approximations that result in drastically different choices for .

#### Monotone Loss

As a first approximation we propose where a parameter. We note that . Using this approximation we can write for

 sign(logr)≈tanh(c2logr)=rc−1rc+1=ω(r). (16)

As we mentioned, we have exact equality for . Let us perform our analysis by assuming that is bounded. We note that . Since this means that and suggesting that we need to consider . According to our general theory, in order to enjoy a unique solution in the minimization problem, we must select for and, following (8), we must set

 ψ′(z)=ρ(z),   ϕ′(z)=−(1+z1−z)1cρ(z). (17)
• If we select , this yields and . If now we let we obtain and . Actually, we can discard the two constants since they do not affect the minimization problem and we can use instead which results in linear loss.

• We can generalize the previous example. If we consider any and define according to (17), then let , this will yield with being strictly increasing in since . This class of optimization problems was first proposed in  by our group as a means to solve the binary classification problem. However, here it is derived as a special case for estimating the function .

#### Hinge Loss

As a second approximation we can use , which is clearly increasing and continuous and converges to as . This suggests that

 sign(logr)≈sign(logr)|logr|c=ω(r), (18)

and . We have but now instead of the interval we had in the previous case. This means that the functions must be defined on the whole real line . We will present only one example that leads to a very well known optimization problem.

• Following (8), if we select then . If we now let , we obtain the limiting form for the derivatives which take the form and . By integrating we arrive at and . The optimization based on this particular pair is called the hinge loss method . The hinge loss and the cross-entropy loss method, as we mentioned, are very popular in the classification problem where they are known to have excellent performance , .

## Iii Data-driven Estimation based on Neural Networks

In the previous section we introduced Problem 1 and described how to design the corresponding optimization problem so that it enjoys a predefined solution. In this section we apply this idea under a pure data-driven scenario. We target estimates for the likelihood and log-likelihood ratio function, but also for other known functions encountered in Detection, Hypothesis testing and Classification.

To accommodate a data-driven setup we intend to apply two simultaneous approximations. The first consists in replacing statistical expectations with averages over available data and the second in replacing the general function with the output of a neural network222Although in our presentation we mention only neural networks, exactly the same analysis applies to any other function approximation method as for example kernel based. We can even accommodate the classical parametric density method. For example in the case of Gaussian modeling we can directly estimate the parameters of the likelihood ratio instead of the means and covariance matrices of the two densities. with

summarizing the network parameters. The first approximation is supported by the Law of Large Numbers whose validity, as we recall,

does not require independence of the data samples. For the second approximation, if the neural network is sufficiently rich then , it can closely approximate any nonlinear function. Under these two approximations, the optimization problem in (2) is transformed into a classical optimization over the parameters .

Even if we adopt the double approximation we just mentioned, due to the presence of which is unknown, it is still unclear how the cost function can be evaluated when we only have data. In what follows we introduce characteristic problems and explain how this difficulty can be resolved.

### Iii-a Estimation of the Likelihood Ratio and its Transformations

In the cost defined in (1), let denote the likelihood ratio of two probability densities and . Under this assumption we can rewrite the cost function as

 J(u)=E0[ϕ(u(X))+r(X)ψ(u(X))]=E0[ϕ(u(X))]+E1[ψ(u(X))], (19)

where denote expectation with respect to the densities respectively. Based on (19) we can propose the following problem of interest.

###### Problem 2.

We are given two sets of data samples and with the first set sampled from and the second from . The two densities are unknown and we are interested in finding a neural network based estimate of the transformation of the likelihood ratio function, where is a known scalar function of the scalar .

Following our methodology of Section II we first select the functions which are appropriate for the desired transformation , then we apply the two approximations mentioned in the beginning of this section, this yields

 J(u)≈^J(θ)=1n0n0∑i=1ϕ(u(X0i,θ))+1n1n1∑i=1ψ(u(X1i,θ)). (20)

Clearly, each data average approximates the corresponding expectation. Also with we limit to a class which can be represented by the output of a neural network. The optimization problem over the function defined in (2) is now transformed into a classical optimization over the parameters , that is,

 minu(X)J(u)≈minθ^J(θ)=minθ{1n0n0∑i=1ϕ(u(X0i,θ))+1n1n1∑i=1ψ(u(X1i,θ)).}. (21)

When solved, (21) will produce an optimum which corresponds to a neural network that approximates the desired function, namely, (provided the optimization algorithm does not converge to a local minimum, please also see the next remark). We realize that the new cost function uses only the two datasets and the pair of functions which, as we pointed out, depend only on and require no knowledge of . Regarding (21) and its solution we have the following remark.

###### Remark 3.

The local extrema in the general optimization problem become potential local extrema for the data-driven version, it is therefore imperative to remove them. As we have seen, this is achieved by assuring that the function has a single minimum for every . We must, however, emphasize that this very desirable characteristic established for the general optimization problem is not necessarily carried over to the data-driven version in (21). This is due to the structure of the neural network which gives rise to nonlinear costs that can still exhibit local minima.

The optimization problem in (21) can be used to train the corresponding neural network. A possibility is to apply the gradient descent

 θt=θt−1−μ{1n0n0∑i=1ϕ′(u(X0i,θt−1))∇θu(X0i,θt−1)+1n1n1∑i=1ψ′(u(X1i,θt−1))∇θu(X1i,θt−1)}, (22)

where denotes gradient with respect to and, as we can see, in each iteration we use all available data. Alternatively, when we can employ the stochastic gradient

 θt=θt−1−μ{ϕ′(u(X0t,θt−1))∇θu(X0t,θt−1)+ψ′(u(X1t,θt−1))∇θu(X1t,θt−1)} (23)

where in each iteration we use a pair of samples . In (23), when the data are exhausted we recycle them until the algorithm converges. In both updates denotes the step-size (learning rate).

From (22), (23) we observe that we only need the derivatives to perform the required training. Consequently, as pointed out in Remark 2, the functions are not needed explicitly for the algorithmic implementation of our method. Regarding the function we must assure that the output of the neural network captures the interval

precisely. For example, when we estimate the likelihood ratio function then the output must be positive, suggesting that we need to apply a proper nonlinearity (the ReLU or ELU) in the last layer of the network. When, on the other hand, we estimate the log-likelihood ratio, which can be any real number, no such nonlinearity is necessary.

#### Estimation of the Kullback-Leibler and the Mutual Information Number

A side product of our method for estimating log-likelihood ratio functions is the estimation of the Kullback-Leibler (K-L) information number and the mutual information number. The K-L information number between two densities is defined as . If is the resulting neural network obtained from solving (21) for the estimation of , then

 I(f1,f0)≈^I(f1,f0)=1n1n1∑i=1u(X1i,θo). (24)

constitutes a straightforward estimate of the K-L information number.

For the mutual information, suppose we have two random vectors that follow the joint density , then the mutual information is defined as

 I(X;Y)=E[logf(X,Y)fX(X)fY(Y)], (25)

where denote the marginal densities of and respectively. To estimate we first estimate the log-likelihood ratio between the density which treats and as statistically independent and . This leads to the definition of the next problem.

###### Problem 3.

We are given a single set of data pairs that follow the unknown joint density . We would like to use these data to obtain a neural network estimate of the log-likelihood ratio function .

To apply our method we must have two data sets. Clearly, the first is the existing which is sampled from but was also need a second set sampled from . Since the -components of the available pairs follow and the -components , we can form a second set by simply combining each with all the samples. This suggests the following form for (20)

 ^J(θ)=1n2n∑j=1n∑i=1ϕ(u(Xi,Yj,θ))+1nn∑i=1ψ(u(Xi,Yi,θ)), (26)

where the neural network has as input both vectors . Once we optimize in (26) and is the resulting neural network, we estimate the mutual information as

 I(X;Y)≈^I(X;Y)=1nn∑i=1u(Xi,Yi,θo).

### Iii-B Estimation of the Log-Likelihood Ratio of Conditional Densities

For a time series it is often required to compute the log-likelihood ratio of all the data samples that are available up to time . In on-line processing this computation must be repeated every time a new sample is acquired. We assume that the data can follow two hypotheses described by two probability densities . Since , and similarly for , this results in the update

 Lt=Lt−1+logf1(xt|xt−1,…,x1)f0(xt|xt−1,…,x1), (27)

where is the log-likelihood ratio of the samples up to time and , are the conditional densities of given the past samples .

The Markovian is clearly one of the most important models for time series. If we assume that under both densities the process is homogeneous Markov of order then for we can write

 f0(xt|xt−1,…,x1)=f0(xt|xt−1,…,xt−k),

and a similar expression applies for , suggesting that the update in (27) for , becomes

 Lt=Lt−1+logf1(xt|xt−1,…,xt−k)f0(xt|xt−1,…,xt−k). (28)

Using the definition of the conditional probability density we can also write

 logf1(xt|xt−1,…,xt−k)f0(xt|xt−1,…,xt−k)=logf1(xt,…,xt−k)f0(xt,…,xt−k)−logf1(xt−1,…,xt−k)f0(xt−1,…,xt−k). (29)

The previous equality implies that we can apply the following approximation to the log-likelihood function of the conditional densities

 logf1(xt|xt−1,…,xt−k)f0(xt|xt−1,…,xt−k)≈uk+1(xt,…,xt−k,θk+1)−uk(xt−1,…,xt−k,θk) (30)

where

 (31)

and denotes the approximation of the th order log-likelihood ratio with summarizing the coefficients of the corresponding neural network.

If we are given a realization of the process under and a realization under , then the two functions in (31) can be obtained with the help of two separate optimization problems as described in Section III-A. The data vectors are defined as , for the estimate of the st order log-likelihood ratio and , for the estimate of the th order.

Once , have been designed, following (28), we can use them to update an estimate of

 ^Lt=^Lt−1+uk+1(xt,…,xt−k,θk+1)−uk(xt−1,…,xt−k,θk).

This idea can be very useful in hypothesis testing for the implementation of the sequential probability ratio test (SPRT) . We recall that SPRT is used to sequentially decide between two hypotheses. With the previous update we can implement it without the need to know the corresponding densities.

Estimates of the log-likelihood ratio of conditional probability densities also allow for the recursive computation of the cumulative sum (CUSUM) statistic  which is used to sequentially detect a change in the statistical behavior of a process. If denotes the CUSUM statistic and its estimate then for the two quantities we can write the following update for

 (32)

where . If we are willing to sacrifice the first time instances then we can initialize with . We also recall that corresponds to the pre- and to the post-change probability density.

### Iii-C Estimation of Local Statistics

In Detection theory there is interest in problems where the nominal probability density is given but the alternative is known only as a parametric density. In this case a possible solution for testing/detection can be the application of the local approach (an alternative would be the GLRT, which we discuss in Section IV-C) and the usage of the locally most powerful (LMP) test [1, Pages 51–52],[2, Pages 82–83].

The local formulation, as we pointed out, relies on a parametric density where the nominal density corresponds (without loss of generality) to , i.e.

 f0(X)=f1(X,0). (33)

If we are interested in distinguishing between and with the alternative being unknown, one can examine the case , which corresponds to a “difficult” scenario due to the closeness of the two densities. The “small” size of allows for the analysis of the limiting form of the likelihood ratio, after Taylor expanding it

 f1(X,ϑ)f0(X)≈1+ϑ⊺∇ϑf1(X,ϑ)|ϑ=0f0(X). (34)

If we attempt to distinguish between and having a value where is a known direction and is unknown but “small”, then for testing we can use the local statistic

 r(X)=δ⊺∇ϑf1(X,ϑ)|ϑ=0f0(X), (35)

which must be compared against a constant threshold to make a decision. This corresponds to the LMP test [2, 1].

We would like to consider this problem under a data-driven setup. In order to free ourselves from the need of a specific parametric family, we focus on the case where (35) results in the following special form