 # Constructive Setting of the Density Ratio Estimation Problem and its Rigorous Solution

We introduce a general constructive setting of the density ratio estimation problem as a solution of a (multidimensional) integral equation. In this equation, not only its right hand side is known approximately, but also the integral operator is defined approximately. We show that this ill-posed problem has a rigorous solution and obtain the solution in a closed form. The key element of this solution is the novel V-matrix, which captures the geometry of the observed samples. We compare our method with three well-known previously proposed ones. Our experimental results demonstrate the good potential of the new approach.


## 1 Introduction

The estimation of the ratio of two probability densities from a given collection of data is a fundamental technique for solving many applied problems, including data adaptation [1, 2], conditional probability and regression estimation, estimation of mutual information, change-point detection, and many others.

Several approaches have been proposed and studied for the solution of this problem. Among them, we find the Kernel Mean Matching (KMM) procedure, the unconstrained Least-Squares Importance Filtering (uLSIF) algorithm, and the Kullback-Leibler Importance Estimation Procedure (KLIEP). For some of the proposed approaches, the convergence of the obtained estimates to the actual solution was proven under certain assumptions about the (unknown) solution.

In this paper we introduce direct constructive methods of density ratio estimation. As opposed to existing approaches, ours is based on the definition of the density ratio function itself. We show that density ratio estimation requires the solution of an ill-posed integral equation in which not only the right-hand side of the equation is approximately defined, but so is the operator of the equation. The solution of such an equation essentially depends on our new concept of “V-matrix”, which is directly computed from data. This type of matrix was not used in previously proposed approaches.

The paper is organized as follows. In Sections 2 and 3 we outline the necessary basic concepts of theoretical statistics and show their relation to direct constructive density ratio estimation. In Sections 4 through 7, we derive the concept of V-matrix and direct methods of density ratio estimation based on this concept. Section 8 is devoted to experimental results.

## 2 Statistical Foundations

In order to simplify the notations in this paper, we write $x=(x^1,\ldots,x^d)$ for a multidimensional point and $f(t)=f(t^1,\ldots,t^d)$ for a multidimensional function. Likewise, we use the following notation for multidimensional integrals:

$$\int_u^s f(t)\,dg(t)\;\equiv\;\int_{u^1}^{s^1}\!\cdots\int_{u^d}^{s^d} f(t^1,\ldots,t^d)\,dg(t^1,\ldots,t^d).$$

### 2.1 Empirical Distribution Function

The basic concept of theoretical statistics is the cumulative distribution function (CDF) $F(x)$ of a random vector $X=(X^1,\ldots,X^d)$:

$$F(x^1,\ldots,x^d)=P\{(X^1\le x^1)\cap\ldots\cap(X^d\le x^d)\}.$$

It is known that any CDF is well-approximated by the empirical cumulative distribution function (ECDF) constructed on the basis of an i.i.d. sample distributed according to $F(x)$:

$$x_1,\ldots,x_\ell\overset{\text{i.i.d.}}{\sim}F(x).$$

The empirical cumulative distribution function has the form

$$F_\ell(x^1,\ldots,x^d)=\frac{1}{\ell}\sum_{i=1}^{\ell}\prod_{k=1}^{d}\theta(x^k-x_i^k),\tag{1}$$

where $\theta(\cdot)$ is the step function defined as $\theta(z)=1$ if $z\ge 0$ and $\theta(z)=0$ otherwise. For simplicity, we refer to the multidimensional ECDF from now on as

$$F_\ell(x)=\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x-x_i),\tag{2}$$

where $\theta(x-x_i)$ stands for the product $\prod_{k=1}^{d}\theta(x^k-x_i^k)$.
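As a concrete illustration, the ECDF of (1)–(2) is simply the fraction of sample points dominated coordinate-wise by the evaluation point. The sketch below (numpy; the function name and the toy data are our own, chosen for illustration) computes it directly:

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF (2): F_l(x) = (1/l) * sum_i prod_k theta(x^k - x_i^k)."""
    sample = np.atleast_2d(sample)
    # theta(z) = 1 for z >= 0, so the product of step functions is the
    # indicator that x_i <= x in every coordinate
    return np.mean(np.all(sample <= x, axis=1))

# Toy 2-D sample: two of the three points are dominated by (0.6, 0.8)
pts = np.array([[0.1, 0.2], [0.5, 0.7], [0.9, 0.4]])
print(ecdf(pts, np.array([0.6, 0.8])))  # prints 0.6666...
```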

For $d=1$, the Dvoretzky–Kiefer–Wolfowitz inequality states that $F_\ell$ converges fast to $F$, namely, for any natural $\ell$ and any $\epsilon>0$,

$$P\{\sup_x|F_\ell(x)-F(x)|>\epsilon\}\le 2\exp(-2\epsilon^2\ell).$$

For $d>1$, fast convergence of $F_\ell$ to $F$ also takes place:

$$P\{\sup_x|F_\ell(x)-F(x)|>\epsilon\}\le 2\exp(-c^*\epsilon^2\ell),\tag{3}$$

where, according to VC theory, $c^*$ can be set to

$$c^*=1-\frac{(d-1)\ln\ell}{\epsilon^2\ell}.$$

### 2.2 Constructive Setting of the Density Estimation Problem

The probability density function $p(x)$ (if it exists) is defined as the derivative of the CDF:

$$p(x^1,\ldots,x^d)=\frac{\partial^d F(x^1,\ldots,x^d)}{\partial x^1\cdots\partial x^d}.\tag{4}$$

According to the definition of density function (4), the problem of density estimation from a given collection of data is the problem of solving the (multidimensional) integral equation

$$\int_{-\infty}^{x}p(t)\,dt=F(x)\tag{5}$$

when the cumulative distribution function $F(x)$ is unknown but an i.i.d. sample

$$x_1,\ldots,x_\ell\sim F(x)$$

is given.

The constructive setting of this problem is to solve equation (5) using approximation (2) instead of $F(x)$. As follows from (3), $F_\ell(x)$ converges to $F(x)$ uniformly with probability one.

### 2.3 Ill-Posed Problems

It is known that solving a linear operator equation

$$Af=F\tag{6}$$

such as (5) is an ill-posed problem: small deviations in the right-hand side may lead to large deviations of the solution.

In what follows, we assume that the operator $A$ maps functions $f$ from a normed space $E_1$ to functions $F$ in a normed space $E_2$.

In the 1960’s, Tikhonov proposed the regularization method for solving ill-posed problems; it uses approximations $F_\ell$ such that

$$\|F-F_\ell\|_{E_2}^2\to 0\ \ \text{as}\ \ \ell\to\infty.$$

According to this method, in order to solve an ill-posed problem, one has to minimize the functional

$$R(f)=\|Af-F_\ell\|_{E_2}^2+\gamma_\ell\,\Omega(f),$$

where $\gamma_\ell>0$ is a regularization constant and $\Omega(f)$ is a regularizing functional defined in the space $E_1$; the functional $\Omega(f)$ must possess the following properties:

• $\Omega(f)$ is non-negative;

• there exists a constant $c_0$ for which the solution of (6) belongs to the set $\{f:\Omega(f)\le c_0\}$;

• for every $c>0$, the set of functions $\{f:\Omega(f)\le c\}$ is compact.

It was shown that, if the rate of convergence of $\|F-F_\ell\|_{E_2}$ to zero (as $\ell\to\infty$) is not greater than the rate of convergence of $\gamma_\ell$, then

$$\|f_0-f_\ell\|_{E_1}\to 0.$$

For the stochastic case, in which $F_\ell$ converges in probability to $F$, it was shown that, if $\gamma_\ell\to 0$ as $\ell\to\infty$, then for arbitrary $\delta>0$ there is a $\nu=\nu(\delta)$ such that the following inequality holds for sufficiently large $\ell$ [12, 3]:

$$P\{\|f_0-f_\ell\|_{E_1}>\delta\}\le P\{\|F-F_\ell\|_{E_2}>\sqrt{\gamma_\ell\,\nu}\}.\tag{7}$$

In particular, for density estimation, inequalities (3) and (7) imply that

$$P\{\|p_0-f_\ell\|_{E_1}>\delta\}\le c\exp(-c^*\gamma_\ell\,\nu\,\ell).$$

Therefore, if

$$\gamma_\ell\xrightarrow[\ell\to\infty]{}0\ \ \text{and}\ \ \ell\gamma_\ell\xrightarrow[\ell\to\infty]{}\infty,$$

the sequence of solutions $f_\ell$ converges in probability to the solution of equation (5).

## 3 Constructive Setting of the Density Ratio Estimation Problem

In what follows, we consider the problem of estimating the ratio of two probability densities $p_1(x)$ and $p_2(x)$ (assuming $p_2(x)>0$):

$$r(x)=\frac{dF_1(x)/dx}{dF_2(x)/dx}=\frac{p_1(x)}{p_2(x)}.\tag{8}$$

According to definition (8), the problem of estimating the density ratio from data is the problem of solving the (multidimensional) integral equation

$$\int_{-\infty}^{x}r(t)\,dF_2(t)=F_1(x)\tag{9}$$

when the distribution functions $F_1(x)$ and $F_2(x)$ are unknown but samples $x_1,\ldots,x_\ell\sim F_1(x)$ and $x_1',\ldots,x_n'\sim F_2(x)$ are given.

The constructive setting of this problem is to solve equation (9) using the empirical distribution functions

$$F_{1,\ell}(x)=\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x-x_i)\ \ \ \text{and}\ \ \ F_{2,n}(x)=\frac{1}{n}\sum_{i=1}^{n}\theta(x-x_i')$$

instead of the actual cumulative distribution functions $F_1(x)$ and $F_2(x)$.

### 3.1 Stochastic Ill-Posed Problems

Note that density ratio estimation leads to a more complicated ill-posed equation than density estimation, since the operator $A_n$, which depends on the approximation $F_{2,n}$, is also defined approximately:

$$(A_n r)(x)=\int_{-\infty}^{x}r(t)\,dF_{2,n}(t)=\frac{1}{n}\sum_{i=1}^{n}r(x_i')\,\theta(x-x_i').$$

In this situation, in order to solve equation (9), we will also minimize the regularization functional

$$r_{\ell,n}=\operatorname*{argmin}_{r\in\mathcal{F}}\left[\,\|A_n r-F_{1,\ell}\|_{E_2}^2+\gamma_{\ell,n}\,\Omega(r)\,\right].\tag{10}$$

Let the sequence of operators $A_n$ converge in probability to $A$ in the following operator norm:

$$\|A-A_n\|=\sup_{r\in\mathcal{F}}\frac{\|Ar-A_n r\|_{E_2}}{\sqrt{\Omega(r)}}.$$

Then, for arbitrary constants $C_1$ and $C_2$, there exists $\gamma_0>0$ such that for any $\gamma_{\ell,n}\le\gamma_0$ the following inequality holds for sufficiently large $\ell$ and $n$ [3, 13]:

$$P\{\|r_0-r_{\ell,n}\|_{E_1}>\delta\}\le P\{\|A-A_n\|>C_2\sqrt{\gamma_{\ell,n}}\}+P\{\|F_1-F_{1,\ell}\|_{E_2}>C_1\sqrt{\gamma_{\ell,n}}\}.\tag{11}$$

It was shown that, if the solution $r_0$ of equation (9) belongs to a set of smooth functions (in fact, to a set of continuous functions with bounded variation), then

$$\|A-A_n\|\le\|F_2-F_{2,n}\|_{E_2}.\tag{12}$$

For sufficiently large $\ell$ and $n$, inequalities (3), (11), and (12) imply that

$$P\{\|r_0-r_{\ell,n}\|_{E_1}>\delta\}\le c_2\exp(-c_2^*C_2^2\,n\,\gamma_{\ell,n})+c_1\exp(-c_1^*C_1^2\,\ell\,\gamma_{\ell,n}).$$

Therefore, in the set of smooth functions, the sequence of solutions $r_{\ell,n}$ converges in probability to the solution of equation (9) provided that

$$\gamma_{\ell,n}\xrightarrow[m\to\infty]{}0\ \ \text{and}\ \ m\,\gamma_{\ell,n}\xrightarrow[m\to\infty]{}\infty,\tag{13}$$

where $m=\min(\ell,n)$.

## 4 V-Matrices

Let us rewrite the first term of functional (10):

$$\rho^2=\left\|\frac{1}{n}\sum_{i=1}^{n}r(x_i')\,\theta(x-x_i')-\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x-x_i)\right\|_{E_2}^2.$$

Using the $L_2$ norm on $[0,u]$ as the norm in space $E_2$, we obtain:

$$\rho^2=\int_0^u\left[\frac{1}{n}\sum_{i=1}^{n}r(x_i')\,\theta(x-x_i')-\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x-x_i)\right]^2dx.\tag{14}$$

Expression (14) can be written as

$$\rho^2=\int_0^u a^2\,dx-2\int_0^u ab\,dx+\int_0^u b^2\,dx,\tag{15}$$

where

$$a=\frac{1}{n}\sum_{i=1}^{n}r(x_i')\,\theta(x-x_i'),\quad x_1',\ldots,x_n'\overset{\text{i.i.d.}}{\sim}F_2(x),\qquad b=\frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x-x_i),\quad x_1,\ldots,x_\ell\overset{\text{i.i.d.}}{\sim}F_1(x).$$

The last term in (15) does not depend on $r$ and thus can be ignored for minimization on $r$. In what follows, we use the notation

$$(t_1\vee t_2)=\max(t_1,t_2).$$

The first two terms of (15) are

$$\int_0^u a^2\,dx=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}r(x_i')\,r(x_j')\int_0^u\theta(x-x_i')\,\theta(x-x_j')\,dx=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}r(x_i')\,r(x_j')\prod_{k=1}^{d}\left[u^k-(x_i'^k\vee x_j'^k)\right],$$

$$-2\int_0^u ab\,dx=-\frac{2}{n\ell}\sum_{i=1}^{n}\sum_{j=1}^{\ell}r(x_i')\int_0^u\theta(x-x_i')\,\theta(x-x_j)\,dx=-\frac{2}{n\ell}\sum_{i=1}^{n}\sum_{j=1}^{\ell}r(x_i')\prod_{k=1}^{d}\left[u^k-(x_i'^k\vee x_j^k)\right].$$

Let us denote by $v_{ij}''$ the values

$$v_{ij}''=\prod_{k=1}^{d}\left[u^k-(x_i'^k\vee x_j'^k)\right],\qquad x_i',x_j'\sim F_2(x),$$

and by $V''$ the $(n\times n)$-dimensional matrix with elements $v_{ij}''$. Also, let us denote by $v_{ij}'$ the values

$$v_{ij}'=\prod_{k=1}^{d}\left[u^k-(x_i'^k\vee x_j^k)\right],\qquad x_i'\sim F_2(x),\ x_j\sim F_1(x),$$

and by $V'$ the $(n\times\ell)$-dimensional matrix with elements $v_{ij}'$.

Therefore, up to a multiplicative constant $n^2/2$ and an additive constant, the first term of functional (10) has the following form in vector-matrix notation:

$$\frac{1}{2}\,\vec r^{\,\top}V''\vec r-\frac{n}{\ell}\,\vec r^{\,\top}V'\vec 1+\text{const},\tag{16}$$

where by $\vec r$ we denote the $n$-dimensional vector $(r(x_1'),\ldots,r(x_n'))^\top$ and by $\vec 1$ the $\ell$-dimensional vector of ones.

Matrices $V''$ and $V'$ reflect the geometry of the observed data.
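The elements $v_{ij}''$ and $v_{ij}'$ can be computed directly from the two samples. The following sketch (numpy; the function name and the uniform toy data on $[0,1]^3$ are our own assumptions) builds both matrices for data supported on $[0,u]$:

```python
import numpy as np

def v_matrix(a, b, u):
    """V-matrix with elements prod_k [u^k - max(a_i^k, b_j^k)].

    a: (n, d) and b: (m, d) samples supported on [0, u]; u: (d,) upper corner.
    Returns an (n, m) matrix.
    """
    # max(a_i^k, b_j^k) for all pairs (i, j): shape (n, m, d)
    pairwise_max = np.maximum(a[:, None, :], b[None, :, :])
    return np.prod(u - pairwise_max, axis=2)

# V'' compares the F2-sample with itself; V' compares it with the F1-sample.
rng = np.random.default_rng(0)
x2 = rng.uniform(size=(5, 3))   # x'_1, ..., x'_n ~ F2
x1 = rng.uniform(size=(4, 3))   # x_1, ..., x_l   ~ F1
u = np.ones(3)                  # support [0, 1]^3
V2 = v_matrix(x2, x2, u)        # V'': (n x n), symmetric
V1 = v_matrix(x2, x1, u)        # V' : (n x l)
```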

## 5 Regularizing Functional

We define the regularizing functional to be the square of the norm in Hilbert space:

$$\Omega(r)=\|r\|_H^2=\langle r,r\rangle_H.\tag{17}$$

We will consider two different concepts of Hilbert space and its norm, which correspond to two different types of prior knowledge about the solution $r(x)$. (Recall that the solution must have a finite norm in the corresponding space.)

Let us first define (17) using the empirical $L_2$ norm:

$$\Omega(r)=\int r(t)^2\,dF_{2,n}(t)=\frac{1}{n}\sum_{i=1}^{n}r(x_i')^2=\frac{1}{n}\,\vec r^{\,\top}\vec r.\tag{18}$$

We also define (17) as a norm in a Reproducing Kernel Hilbert Space (RKHS).

### 5.1 Reproducing Kernel Hilbert Space and Its Norm

An RKHS $H$ is defined by a positive definite kernel function $k(x,y)$ and an inner product $\langle\cdot,\cdot\rangle_H$ for which the following reproducing property holds true:

$$\langle f(x),k(x,y)\rangle_H=f(y),\quad\forall f\in H.\tag{19}$$

Note that any positive definite function $k(x,y)$ has an expansion in terms of its eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$:

$$k(x,y)=\sum_{i=1}^{\infty}\lambda_i\,\phi_i(x)\,\phi_i(y).\tag{20}$$

Let us consider the set of functions

$$f(x,c)=\sum_{i=1}^{\infty}c_i\,\phi_i(x)\tag{21}$$

and the inner product

$$\langle f(x,c^*),f(x,c^{**})\rangle_H=\sum_{i=1}^{\infty}\frac{c_i^*\,c_i^{**}}{\lambda_i}.\tag{22}$$

Then, kernel (20), set of functions (21), and inner product (22) define an RKHS. Indeed,

$$\langle f(x),k(x,y)\rangle_H=\left\langle\sum_{i=1}^{\infty}c_i\,\phi_i(x),\sum_{i=1}^{\infty}\lambda_i\,\phi_i(x)\,\phi_i(y)\right\rangle_H=\sum_{i=1}^{\infty}\frac{c_i\,\lambda_i\,\phi_i(y)}{\lambda_i}=f(y).$$

According to the representer theorem, the solution of (10) has the form

$$r(x)=\sum_{i=1}^{n}\alpha_i\,k(x_i',x),\qquad x_1',\ldots,x_n'\sim F_2(x).\tag{23}$$

Therefore, using (19) and (23), the norm in RKHS has the form

$$\langle r(x),r(x)\rangle_H=\sum_{i,j=1}^{n}\alpha_i\,\alpha_j\,k(x_i',x_j')=\vec\alpha^{\,\top}K\vec\alpha,\tag{24}$$

where $K$ denotes the $(n\times n)$-dimensional matrix with elements $k(x_i',x_j')$ and $\vec\alpha$ denotes the $n$-dimensional vector $(\alpha_1,\ldots,\alpha_n)^\top$. We choose this norm as a regularization functional:

$$\Omega(r)=\vec\alpha^{\,\top}K\vec\alpha.$$

## 6 Solving the Minimization Problem

In this section we obtain solutions of minimization problem (10) for a fixed value of the regularization constant $\gamma$.

### 6.1 Solution at Given Points (DRE-V)

We rewrite minimization problem (10) in an explicit form using (16) and (18). This leads to the following optimization problem:

$$L=\min_{\vec r}\left[\frac{1}{2}\,\vec r^{\,\top}V''\vec r-\frac{n}{\ell}\,\vec r^{\,\top}V'\vec 1+\frac{\gamma}{n}\,\vec r^{\,\top}\vec r\right].\tag{25}$$

The minimum of this functional has the form

$$\vec r=\frac{n}{\ell}\left(V''+\frac{\gamma}{n}I\right)^{-1}V'\vec 1,\tag{26}$$

which can be computed by solving the corresponding system of linear equations.

In order to obtain a more accurate solution, we can take into account our prior knowledge that

$$r(x_i')\ge 0,\qquad i=1,\ldots,n.$$

Any standard quadratic programming package can be used to find the solution (the vector $\vec r$) under these non-negativity constraints.

Optimization problem (25) has a structure similar to that of the Kernel Mean Matching (KMM) method. The major difference is that our method uses V-matrices, which describe the geometry of the observed data.
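A minimal sketch of the unconstrained closed-form solution (26) might look as follows (numpy; the helper names are ours, and the uniform test data, for which the true ratio is identically one, are our own choice of illustration):

```python
import numpy as np

def v_matrix(a, b, u):
    # V-matrix elements: prod_k [u^k - max(a_i^k, b_j^k)]
    return np.prod(u - np.maximum(a[:, None, :], b[None, :, :]), axis=2)

def dre_v(x2, x1, u, gamma):
    """Values r(x'_i) via the closed form (26): (n/l)(V'' + (gamma/n) I)^{-1} V' 1."""
    n, l = len(x2), len(x1)
    V2 = v_matrix(x2, x2, u)            # V'' : (n x n)
    V1 = v_matrix(x2, x1, u)            # V'  : (n x l)
    A = V2 + (gamma / n) * np.eye(n)    # regularized system matrix
    return np.linalg.solve(A, (n / l) * V1.sum(axis=1))  # V' @ 1 = row sums

rng = np.random.default_rng(1)
x1 = rng.uniform(size=(100, 1))  # sample from F1 = U[0, 1]
x2 = rng.uniform(size=(100, 1))  # sample from F2 = U[0, 1]; true ratio is 1
r_hat = dre_v(x2, x1, np.ones(1), gamma=1.0)
```

The linear solve replaces an explicit matrix inverse, which is both cheaper and numerically safer.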

### 6.2 Solution in RKHS (DRE-VK)

In order to ensure smoothness of the solution in (10), we look for a function in an RKHS defined by a kernel. According to the representer theorem, the function has the form

$$r(x)=\sum_{i=1}^{n}\alpha_i\,k(x_i',x).\tag{27}$$

We rewrite the minimization problem (10) in explicit form using (16) and (24). Since $\vec r=K\vec\alpha$, we obtain the following optimization problem:

$$L=\min_{\vec\alpha\in\mathbb{R}^n}\left[\frac{1}{2}\,\vec\alpha^{\,\top}KV''K\vec\alpha-\frac{n}{\ell}\,\vec\alpha^{\,\top}KV'\vec 1+\gamma\,\vec\alpha^{\,\top}K\vec\alpha\right].\tag{28}$$

The minimum of this functional has the form

$$\vec\alpha=\frac{n}{\ell}\left(V''K+\gamma I\right)^{-1}V'\vec 1,\tag{29}$$

which can also be computed by solving the corresponding system of linear equations.

Optimization problem (28) has a structure similar to that of the unconstrained Least-Squares Importance Filtering (uLSIF) method, where, instead of $V''$ and $V'$, one uses identity matrices, and, instead of the regularization functional $\vec\alpha^{\,\top}K\vec\alpha$, one uses $\vec\alpha^{\,\top}\vec\alpha$.

In order to use our prior knowledge about the non-negativity of the solution, one can restrict the set of functions in (27) with the conditions

$$\alpha_i\ge 0\ \ \text{and}\ \ k(x_i',x)\ge 0,\qquad i=1,\ldots,n.$$
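A sketch of the RKHS variant (27)–(29) might read as follows (numpy; the Gaussian kernel, bandwidth, data, and helper names are our own assumptions, not the paper's prescription):

```python
import numpy as np

def v_matrix(a, b, u):
    # V-matrix elements: prod_k [u^k - max(a_i^k, b_j^k)]
    return np.prod(u - np.maximum(a[:, None, :], b[None, :, :]), axis=2)

def rbf(a, b, sigma):
    # Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def dre_vk(x2, x1, u, gamma, sigma):
    """Coefficients alpha from (29); returns the smooth ratio estimate (27)."""
    n, l = len(x2), len(x1)
    V2, V1 = v_matrix(x2, x2, u), v_matrix(x2, x1, u)
    K = rbf(x2, x2, sigma)
    alpha = np.linalg.solve(V2 @ K + gamma * np.eye(n),
                            (n / l) * V1.sum(axis=1))
    # r(x) = sum_i alpha_i k(x'_i, x), evaluable at arbitrary points x
    return lambda x: rbf(np.atleast_2d(x), x2, sigma) @ alpha

rng = np.random.default_rng(2)
x1 = rng.uniform(size=(80, 2))
x2 = rng.uniform(size=(80, 2))
r = dre_vk(x2, x1, np.ones(2), gamma=0.1, sigma=0.5)
vals = r(rng.uniform(size=(10, 2)))  # smooth estimate at new points
```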

### 6.3 Linear INK-Splines Kernel

In what follows, we describe a kernel that generates linear splines with an infinite number of knots. This smoothing kernel has good approximating properties and has no free parameters.

We start by obtaining the kernel for functions defined on the interval $[0,u]$. The multidimensional kernel for functions on $[0,u^1]\times\cdots\times[0,u^d]$ is the product of unidimensional kernels.

According to its definition, a $p$-th order spline with $m$ knots

$$0<t_1<\cdots<t_m<u$$

has the form

$$f(x)=\sum_{i=0}^{p}b_i^*x^i+\sum_{i=1}^{m}b_i(x-t_i)_+^p,\tag{30}$$

where

$$(x-t_i)_+=\begin{cases}x-t_i,&\text{if }x>t_i,\\0,&\text{otherwise.}\end{cases}$$

For $p=1$, expression (30) provides a piecewise linear function, whereas for $p>1$, it provides a piecewise polynomial function.

Now, let the number of knots $m\to\infty$. Then (30) becomes

$$f(x)=\sum_{i=0}^{p}b_i^*x^i+\int_0^u b(t)(x-t)_+^p\,dt.\tag{31}$$

It is possible to consider function (31) as an inner product in an RKHS with kernel

$$K_p(x,y)=\sum_{i=0}^{p}x^iy^i+\int_0^u(x-t)_+^p(y-t)_+^p\,dt.$$

For the case of linear splines with an infinite number of knots, i.e., $p=1$ in (31), this kernel has a simple closed-form expression:

$$K_1(x,y)=1+xy+\frac{1}{2}|x-y|(x\wedge y)^2+\frac{1}{3}(x\wedge y)^3,$$

where we denote $(x\wedge y)=\min(x,y)$.

As mentioned, the multidimensional linear INK-spline is the coordinate-wise product of linear INK-splines:

$$K_{1,d}(x,y)=\prod_{k=1}^{d}K_1(x^k,y^k).$$
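The closed form of $K_1$ and its coordinate-wise product are straightforward to implement; in the sketch below (numpy; names and test points are ours) we also form a small Gram matrix, which should be symmetric positive semidefinite since $K_1$ is a valid kernel:

```python
import numpy as np

def ink1(x, y):
    # 1-D linear INK-spline kernel: 1 + xy + |x - y| (x ^ y)^2 / 2 + (x ^ y)^3 / 3
    m = np.minimum(x, y)
    return 1.0 + x * y + 0.5 * np.abs(x - y) * m ** 2 + m ** 3 / 3.0

def ink1_d(x, y):
    # Multidimensional kernel: coordinate-wise product of unidimensional kernels
    return np.prod(ink1(np.asarray(x, float), np.asarray(y, float)))

pts = np.linspace(0.1, 0.9, 5)
gram = ink1(pts[:, None], pts[None, :])   # Gram matrix on points in [0, 1]
eigs = np.linalg.eigvalsh(gram)           # all eigenvalues should be >= 0
```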

## 7 Selection of Regularization Constant

According to the results presented in Section 3, the regularization constant should satisfy (13) for large $\ell$ and $n$. For finite $\ell$ and $n$, we choose $\gamma$ as follows.

### 7.1 Cross-Validation for DRE-VK

For problem (28), we choose the regularization constant $\gamma$ using $k$-fold cross-validation based on the minimization of the least-squares criterion:

$$J_0=\frac{1}{2}\int\left[r(x)-r_0(x)\right]^2p_2(x)\,dx=\frac{1}{2}\int r(x)^2p_2(x)\,dx-\int r(x)\,p_1(x)\,dx+\text{const}.$$

We partition data sets

$$X=\{x_1,\ldots,x_\ell\overset{\text{i.i.d.}}{\sim}F_1(x)\}\ \ \ \text{and}\ \ \ X'=\{x_1',\ldots,x_n'\overset{\text{i.i.d.}}{\sim}F_2(x)\}$$

into $k$ disjoint sets

$$Z_i=\{z_{1,i},\ldots,z_{\ell/k,i}\}\ \ \ \text{and}\ \ \ Z_i'=\{z_{1,i}',\ldots,z_{n/k,i}'\}.$$

Using samples $X\setminus Z_i$ and $X'\setminus Z_i'$, a solution to the DRE-VK minimization problem

$$\vec\alpha^{\,\gamma}=\frac{n}{\ell}\left(V_i''K_i+\gamma I\right)^{-1}V_i'\vec 1\tag{32}$$

is constructed for constant $\gamma$. After that, the solution $r^\gamma$ is evaluated by the empirical least-squares criterion on the held-out samples $Z_i$ and $Z_i'$:

$$J_i^\gamma=\frac{1}{2}\sum_{j=1}^{n/k}r^\gamma(z_{j,i}')^2-\frac{n}{\ell}\sum_{j=1}^{\ell/k}r^\gamma(z_{j,i}).$$

This procedure is performed for each fold $i=1,\ldots,k$.

Then, from a set of regularization constant candidates $\Gamma$, we select the $\gamma^*$ that minimizes the least-squares criterion over all folds, i.e.,

$$\gamma^*=\operatorname*{argmin}_{\gamma\in\Gamma}\sum_{i=1}^{k}J_i^\gamma.$$

We obtain the final solution using the selected $\gamma^*$ and the unpartitioned samples $X$ and $X'$:

$$\vec\alpha=\frac{n}{\ell}\left(V''K+\gamma^*I\right)^{-1}V'\vec 1.\tag{33}$$

### 7.2 Cross-Validation for DRE-V

The aforementioned procedure for the selection of $\gamma$ is not readily available for the estimation of the values of the density ratio function at given points. However, we take advantage of the fact that finding a minimum of

$$L=\min_{\vec\alpha\in\mathbb{R}^n}\left[\frac{1}{2}\,\vec\alpha^{\,\top}V''V''\vec\alpha-\frac{n}{\ell}\,\vec\alpha^{\,\top}V'\vec 1+\frac{\gamma}{n}\,\vec\alpha^{\,\top}V''\vec\alpha\right]\tag{34}$$

leads to the same solution at the given points as (26) if the same value of $\gamma$ is used for both (26) and (34).

Indeed, the minimum of (34) is reached at

$$\vec\alpha^{\,\gamma}=\frac{n}{\ell}\left(V''V''+\frac{\gamma}{n}V''\right)^{-1}V'\vec 1.\tag{35}$$

Consequently, the solution at the given points is:

$$\vec r=V''\vec\alpha=\frac{n}{\ell}\,V''\left(V''V''+\frac{\gamma}{n}V''\right)^{-1}V'\vec 1=\frac{n}{\ell}\left(V''+\frac{\gamma}{n}I\right)^{-1}V'\vec 1.$$

In order to choose $\gamma$ for DRE-V, we use the same least-squares cross-validation procedure described in the last section, but using (35) instead of (32) and (33).
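The fold loop of Sections 7.1–7.2 can be sketched for DRE-V as follows (numpy; the candidate grid, the fold splitting, and a tiny jitter term added for numerical stability are our own assumptions, not part of the paper's formulas):

```python
import numpy as np

def v_matrix(a, b, u):
    # V-matrix elements: prod_k [u^k - max(a_i^k, b_j^k)]
    return np.prod(u - np.maximum(a[:, None, :], b[None, :, :]), axis=2)

def dre_v_alpha(x2, x1, u, gamma):
    # Coefficients of (35); r(x) is then evaluated as sum_i alpha_i v(x, x'_i),
    # where v(x, x'_i) = prod_k [u^k - (x^k v x_i'^k)].  The jitter term is our
    # addition to keep the system well-conditioned.
    n, l = len(x2), len(x1)
    V2, V1 = v_matrix(x2, x2, u), v_matrix(x2, x1, u)
    A = V2 @ V2 + (gamma / n) * V2 + 1e-9 * np.eye(n)
    return np.linalg.solve(A, (n / l) * V1.sum(axis=1))

def select_gamma(x2, x1, u, grid, k=5):
    """k-fold CV: pick the gamma minimizing the summed criterion J over folds."""
    idx2, idx1 = np.arange(len(x2)), np.arange(len(x1))
    folds2, folds1 = np.array_split(idx2, k), np.array_split(idx1, k)
    scores = []
    for gamma in grid:
        j = 0.0
        for f2, f1 in zip(folds2, folds1):
            t2, t1 = np.setdiff1d(idx2, f2), np.setdiff1d(idx1, f1)
            a = dre_v_alpha(x2[t2], x1[t1], u, gamma)
            r2 = v_matrix(x2[f2], x2[t2], u) @ a  # estimate on held-out F2 points
            r1 = v_matrix(x1[f1], x2[t2], u) @ a  # estimate on held-out F1 points
            j += 0.5 * np.sum(r2 ** 2) - (len(x2) / len(x1)) * np.sum(r1)
        scores.append(j)
    return grid[int(np.argmin(scores))]

rng = np.random.default_rng(3)
x1, x2 = rng.uniform(size=(60, 1)), rng.uniform(size=(60, 1))
best = select_gamma(x2, x1, np.ones(1), grid=[0.01, 0.1, 1.0, 10.0])
```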

## 8 Experiments

In this section we report experimental results for the methods of density ratio estimation introduced in this paper: DRE-V (solution at given points) and DRE-VK (smooth solutions in RKHS). For the latter we instantiate two versions: DRE-VK-INK, which uses the linear INK-splines kernel described in Section 6.3; and DRE-VK-RBF, which uses the Gaussian RBF kernel

$$k(x,y)=\exp\left(-\frac{\|x-y\|_2^2}{2\sigma^2}\right).$$

In all cases the regularization constant $\gamma$ is chosen by 5-fold cross-validation. For DRE-VK-RBF, the extra “smoothing” parameter $\sigma$ is cross-validated along with $\gamma$.

For comparison purposes, we also run experiments for the Kernel Mean Matching (KMM) procedure, the unconstrained Least-Squares Importance Filtering (uLSIF) algorithm, and the Kullback-Leibler Importance Estimation Procedure (KLIEP). For uLSIF and KLIEP, we use the code provided on the authors’ website, leaving its parameters set to their default values with the exception of the number of folds of uLSIF, which is set to 5 (in our experiments, 5-fold uLSIF performed better than the default leave-one-out uLSIF). For KMM, we follow a previously published implementation. KLIEP, uLSIF, and KMM use Gaussian RBF kernels. KLIEP and uLSIF select $\sigma$ by cross-validation, whereas KMM estimates $\sigma$ by the median distance between the input points.

We choose not to enforce the non-negativity of the solutions of DRE-V and DRE-VK-* in order to conduct a fair comparison with respect to uLSIF. We note that KMM and KLIEP do enforce

$$r(x_i')\ge 0,\qquad i=1,\ldots,n.$$

### 8.1 Experimental Setting

The experiments were conducted on synthetic data described in Table 1. Models 4 and 6 were taken from the experimental evaluation of previously proposed methods.

For each model, we sample $\ell$ points from $p_1(x)$ and another $n$ points from $p_2(x)$, with the sample sizes varying over one grid of values for unidimensional data and another for 20-dimensional data. For each size, we perform 20 independent draws.

We evaluate the estimated density ratio at the points sampled from $F_2$, since most applications require the estimation of the density ratio only at those points. Accordingly, we compare the estimate $\vec r$ to the actual density ratio

$$\vec r_0=\left[\frac{p_1(x_1')}{p_2(x_1')},\ldots,\frac{p_1(x_n')}{p_2(x_n')}\right]^\top$$

using the normalized root mean squared error (NRMSE).
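The explicit NRMSE formula is not reproduced above; one common convention, which we assume here purely for illustration, normalizes the RMSE by the root mean square of the true ratio values:

```python
import numpy as np

def nrmse(r_hat, r0):
    # RMSE of the estimate divided by the root mean square of the target
    # (a common normalization; the paper's exact choice is not shown above).
    r_hat, r0 = np.asarray(r_hat, float), np.asarray(r0, float)
    return float(np.sqrt(np.mean((r_hat - r0) ** 2) / np.mean(r0 ** 2)))

print(nrmse([1.0, 1.0], [1.0, 2.0]))  # sqrt(0.5 / 2.5) ~= 0.4472
```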

### 8.2 Results

Experimental results are summarized in Tables 2 and 3. For each sample size, we report the mean NRMSE and its standard deviation over the 20 independent draws.

Overall, the direct constructive methods of density ratio estimation proposed in this paper achieve lower NRMSE than the previously proposed methods uLSIF, KLIEP, and KMM.

Among the methods proposed in this paper, the ones providing smooth estimates in RKHS (DRE-VK-*) perform better than DRE-V. It is worth noting that the linear INK-splines kernel tends to provide estimates that perform as well as or better than those provided by the RBF kernel.

We believe that the advantage in accuracy of the methods proposed in this paper is due to 1) the information provided by the V-matrices about the geometry of the data, and 2) the smoothness requirements introduced by the RKHS.