1 Introduction
The estimation of the ratio of two probability densities from a given collection of data is a fundamental technique for solving many applied problems, including data adaptation [1, 2], conditional probability and regression estimation [3], estimation of mutual information [4], change-point detection [5], and many others.
Several approaches have been proposed and studied for the solution of this problem (see [6], [7], [8], [9] and references therein). Among them, we find the Kernel Mean Matching (KMM) procedure [7], the unconstrained Least-Squares Importance Fitting (uLSIF) algorithm [10], and the Kullback-Leibler Importance Estimation Procedure (KLIEP) [11]. For some of the proposed approaches, the convergence of the obtained estimates to the actual solution was proven under some assumptions about the (unknown) solution.
In this paper we introduce direct constructive methods of density ratio estimation. As opposed to existing approaches, ours is based on the definition of the density ratio function itself. We show that density ratio estimation requires the solution of an ill-posed integral equation for which not only the right-hand side of the equation is approximately defined, but so is the operator of the equation. The solution of such an equation essentially depends on our new concept of “V-matrix”, which is directly computed from data. This type of matrix was not used in previously proposed approaches.
The paper is organized as follows. In Sections 2 and 3 we outline the necessary basic concepts of theoretical statistics and show their relation to direct constructive density ratio estimation. In Sections 4 through 7, we derive the concept of V-matrix and direct methods of density ratio estimation based on this concept. Section 8 is devoted to experimental results.
2 Statistical Foundations
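In order to simplify the notation in this paper, we denote a multidimensional function $f(x^1, \ldots, x^d)$ by $f(x)$. Likewise, we use the following notation for multidimensional integrals:
$$\int_{-\infty}^{x} f(t)\,dt = \int_{-\infty}^{x^1}\cdots\int_{-\infty}^{x^d} f(t^1, \ldots, t^d)\,dt^1\cdots dt^d.$$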
2.1 Empirical Distribution Function
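The basic concept of theoretical statistics is the cumulative distribution function (CDF)
$$F(x) = P\{X^1 \le x^1, \ldots, X^d \le x^d\}$$
of a random vector $X = (X^1, \ldots, X^d)$. It is known that any CDF is well-approximated by the empirical cumulative distribution function (ECDF) $F_\ell(x)$ constructed on the basis of an i.i.d. sample $X_1, \ldots, X_\ell$ distributed according to $F(x)$.
The empirical cumulative distribution function has the form
$$F_\ell(x) = \frac{1}{\ell}\sum_{i=1}^{\ell}\theta(x - X_i), \tag{1}$$
where $\theta(x - X_i)$ is the step-function
$$\theta(x - X_i) = \prod_{k=1}^{d}\theta(x^k - X_i^k), \tag{2}$$
and the one-dimensional step-function is defined as $\theta(u) = 1$ if $u \ge 0$ and $\theta(u) = 0$ otherwise.
For $d = 1$, the Dvoretzky-Kiefer-Wolfowitz inequality states that $F_\ell(x)$ converges fast to $F(x)$, namely, for any natural $\ell$ and any $\varepsilon > 0$,
$$P\Big\{\sup_x |F_\ell(x) - F(x)| > \varepsilon\Big\} \le 2\exp(-2\varepsilon^2\ell).$$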
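As an illustration of the ECDF in (1) and of the Dvoretzky-Kiefer-Wolfowitz bound, the following Python sketch evaluates a multidimensional ECDF at a point and computes the one-dimensional bound $2\exp(-2\varepsilon^2\ell)$. The function names `ecdf` and `dkw_bound` are hypothetical, and the step-function is taken to equal 1 at zero; both are assumptions of this sketch rather than part of the original text.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of a sample of shape (l, d), evaluated at a single point x."""
    sample = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    # theta(x - X_i) equals 1 only if every coordinate of X_i is <= that of x
    # (assumption: the step-function equals 1 at zero).
    inside = np.all(sample <= x, axis=1)
    return inside.mean()

def dkw_bound(l, eps):
    """Upper bound on P{ sup_x |F_l(x) - F(x)| > eps } for d = 1."""
    return 2.0 * np.exp(-2.0 * eps**2 * l)

rng = np.random.default_rng(0)
sample = rng.uniform(size=(1000, 1))   # i.i.d. sample from Uniform(0, 1)
print(ecdf(sample, [0.3]))             # should be close to F(0.3) = 0.3
print(dkw_bound(1000, 0.05))           # chance that the sup-deviation exceeds 0.05
```

The bound depends only on the sample size and the tolerance, not on the underlying distribution, which is what makes the ECDF a usable stand-in for the unknown CDF.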
2.2 Constructive Setting of the Density Estimation Problem
According to the definition of density function (4), the problem of density estimation from a given collection of data is the problem of solving the (multidimensional) integral equation
$$\int_{-\infty}^{x} p(t)\,dt = F(x) \tag{5}$$
when the cumulative distribution $F(x)$ is unknown but an i.i.d. sample $X_1, \ldots, X_\ell$ is given.
2.3 Ill-Posed Problems
It is known that solving a linear operator equation
$$Af = F \tag{6}$$
such as (5) is an ill-posed problem: small deviations in the right-hand side $F$ may lead to large deviations of the solution $f$.
In what follows, we assume that the operator $A$ maps functions from a normed space $E_1$ to functions in a normed space $E_2$.
In the 1960s, Tikhonov proposed the regularization method for solving ill-posed problems; it uses approximations $F_\delta$ such that
$$\|F - F_\delta\|_{E_2} \le \delta.$$
According to this method, in order to solve an ill-posed problem, one has to minimize the functional
$$R_\gamma(f, F_\delta) = \|Af - F_\delta\|_{E_2}^2 + \gamma_\delta\,\Omega(f),$$
where $\gamma_\delta > 0$ is a regularization constant and $\Omega(f)$ is a regularizing functional defined in the space $E_1$; the functional must possess the following properties:

$\Omega(f)$ is nonnegative;

there exists a $c$ for which the solution of (6) is in $\{f : \Omega(f) \le c\}$;

for every $c$, the set of functions $\{f : \Omega(f) \le c\}$ is compact.
It was shown for that, if the rate of convergence of (as ) is not greater than the rate of convergence of , then
3 Constructive Setting of the Density Ratio Estimation Problem
In what follows, we consider the problem of estimating the ratio $r(x)$ of two probability densities $p_1(x)$ and $p_2(x)$ (assuming $p_2(x) > 0$):
$$r(x) = \frac{p_1(x)}{p_2(x)}. \tag{8}$$
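According to definition (8), the problem of estimating the density ratio from data is the problem of solving the (multidimensional) integral equation
$$\int_{-\infty}^{x} r(t)\,dF_2(t) = F_1(x) \tag{9}$$
when the distribution functions $F_1(x)$ and $F_2(x)$ are unknown but samples $X'_1, \ldots, X'_{\ell_1} \sim F_1(x)$ and $X''_1, \ldots, X''_{\ell_2} \sim F_2(x)$ are given.
The constructive setting of this problem is to solve equation (9) using the empirical distribution functions
$$F_{\ell_1}(x) = \frac{1}{\ell_1}\sum_{i=1}^{\ell_1}\theta(x - X'_i), \qquad F_{\ell_2}(x) = \frac{1}{\ell_2}\sum_{j=1}^{\ell_2}\theta(x - X''_j)$$
instead of the actual cumulative distributions $F_1(x)$ and $F_2(x)$.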
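To make the constructive setting concrete, the block below writes out what equation (9) becomes once empirical distribution functions are substituted for the unknown ones. The sample notation ($X'_1, \ldots, X'_{\ell_1}$ drawn from the numerator distribution, $X''_1, \ldots, X''_{\ell_2}$ from the denominator distribution) and the step-function $\theta$ are assumed here for illustration; the first equality itself follows from integrating against a discrete measure.

```latex
\int_{-\infty}^{x} r(t)\, dF_{\ell_2}(t)
  = \frac{1}{\ell_2}\sum_{j=1}^{\ell_2} r(X''_j)\,\theta(x - X''_j)
  \;\approx\;
  \frac{1}{\ell_1}\sum_{i=1}^{\ell_1} \theta(x - X'_i)
  = F_{\ell_1}(x).
```

Only the values $r(X''_j)$ at the denominator sample enter the empirical equation, which is why the unknown function can be represented by a finite vector of its values at those points.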
3.1 Stochastic Ill-Posed Problems
Note that density ratio estimation leads to a more complicated ill-posed equation than density estimation, since the operator $A$, which depends on $F_2(x)$, is also defined approximately:
In this situation, in order to solve equation (9), we will also minimize the regularization functional
(10) 
Let the sequence of operators converge in probability to in the following operator norm:
Then, for arbitrary , there exists such that for any the following inequality holds for [3, 13]:
(11) 
It was shown [3] that, if the solution of the equation (9) belongs to a set of smooth functions (in fact, to a set of continuous functions with bounded variation), then
(12) 
Therefore, in the set of smooth functions, the sequence converges in probability to the solution of equation (9) provided that
(13) 
where .
4 V-Matrices
Let us rewrite the first term of functional (10):
Using the norm in space , we obtain:
(14) 
Expression (14) can be written as
(15) 
where
The last term in (15) does not depend on and thus can be ignored for minimization on . In what follows, we use the notation
The first two terms of (15) are
Let us denote by the values
and by the dimensional matrix of elements . Also, let us denote by the values
and by the dimensional matrix of elements .
Therefore, the first term of functional (10) has the following form in vector-matrix notation:
(16) 
where by we denote the dimensional vector and by the dimensional vector .
The two V-matrices defined above reflect the geometry of the observed data.
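The following Python sketch shows one way such matrices can be computed. It assumes that the data lie in the unit cube $[0, 1]^d$ and that the squared distance between distribution functions is an unweighted $L_2$ integral over that cube, in which case each element $\int \theta(x - a_i)\,\theta(x - b_j)\,dx$ factorizes into $\prod_k (1 - \max(a_i^k, b_j^k))$. The names `v_matrix` and `ecdf_discrepancy`, and the scaling of the three terms, are assumptions of this sketch and may differ from the normalization used in (14) through (16).

```python
import numpy as np

def v_matrix(a, b):
    """V[i, j] = integral over [0, 1]^d of theta(x - a_i) * theta(x - b_j) dx.

    Assumption: uniform weight on the unit cube, so the integral equals
    prod_k (1 - max(a_i^k, b_j^k)).
    """
    a = np.asarray(a, dtype=float)   # shape (n, d)
    b = np.asarray(b, dtype=float)   # shape (m, d)
    pairwise_max = np.maximum(a[:, None, :], b[None, :, :])   # shape (n, m, d)
    return np.prod(1.0 - pairwise_max, axis=2)

def ecdf_discrepancy(r, x_num, x_den):
    """Squared L2 distance between the r-weighted ECDF of x_den and the ECDF of x_num."""
    l1, l2 = len(x_num), len(x_den)
    quad = r @ v_matrix(x_den, x_den) @ r / l2**2                 # quadratic term in r
    cross = r @ v_matrix(x_den, x_num).sum(axis=1) / (l1 * l2)    # linear term in r
    const = v_matrix(x_num, x_num).sum() / l1**2                  # does not depend on r
    return quad - 2.0 * cross + const

x_num = np.array([[0.3], [0.7]])
x_den = np.array([[0.5]])
print(ecdf_discrepancy(np.array([1.0]), x_num, x_den))
```

The matrices depend only on the relative positions of the sample points, so they can be computed once and reused for every value of the regularization constant.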
5 Regularizing Functional
We define the regularizing functional to be the square of the norm in Hilbert space:
(17) 
We will consider two different concepts of Hilbert space and its norm, which correspond to two different types of prior knowledge about the solution. (Recall that the solution must have a finite norm in the corresponding space.)
Let us first define (17) using the norm in Hilbert space:
(18) 
We also define (17) as a norm in a Reproducing Kernel Hilbert Space (RKHS).
5.1 Reproducing Kernel Hilbert Space and Its Norm
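An RKHS is defined by a positive definite kernel function $k(x, y)$ and an inner product $\langle \cdot, \cdot \rangle$ for which the following reproducing property holds true:
$$\langle f(x), k(x, y) \rangle = f(y). \tag{19}$$
Note that any positive definite function $k(x, y)$ has an expansion in terms of its eigenvalues $\lambda_i \ge 0$ and eigenfunctions $\phi_i(x)$:
$$k(x, y) = \sum_{i=1}^{\infty}\lambda_i\,\phi_i(x)\,\phi_i(y). \tag{20}$$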
6 Solving the Minimization Problem
In this section we obtain solutions of minimization problem (10) for a fixed value of the regularization constant $\gamma$.
6.1 Solution at Given Points (DRE-V)
We rewrite minimization problem (10) in an explicit form using (16) and (18). This leads to the following optimization problem:
(25) 
The minimum of this functional has the form
(26) 
which can be computed by solving the corresponding system of linear equations.
In order to obtain a more accurate solution, we can take into account our prior knowledge that
Any standard quadratic programming package can be used to find the solution (vector ).
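A minimal sketch of the unconstrained solution at given points is shown below. It assumes data in $[0, 1]^d$, an unweighted $L_2$ distance between distribution functions, and a regularizer equal to the squared Euclidean norm of the vector of ratio values; under those assumptions the minimizer solves a linear system of the form $(V'' + \gamma I)\,r = (\ell_2/\ell_1)\,V'\mathbf{1}$, where $V''$ and $V'$ are the denominator-denominator and denominator-numerator V-matrices. The function names and the exact scaling of $\gamma$ are hypothetical and may differ from (25) and (26). The nonnegativity constraint is not enforced here; doing so would require a quadratic programming solver.

```python
import numpy as np

def v_matrix(a, b):
    """prod_k (1 - max(a_i^k, b_j^k)) for samples a (n, d) and b (m, d) in [0, 1]^d."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.prod(1.0 - np.maximum(a[:, None, :], b[None, :, :]), axis=2)

def dre_v(x_num, x_den, gamma):
    """Estimate the density ratio at the points x_den (sketch with an assumed scaling)."""
    l1, l2 = len(x_num), len(x_den)
    v_dd = v_matrix(x_den, x_den)            # l2 x l2, denominator vs denominator
    v_dn = v_matrix(x_den, x_num)            # l2 x l1, denominator vs numerator
    rhs = (l2 / l1) * v_dn.sum(axis=1)       # (l2 / l1) * V' * vector of ones
    return np.linalg.solve(v_dd + gamma * np.eye(l2), rhs)

rng = np.random.default_rng(0)
x_num = rng.beta(2.0, 2.0, size=(100, 1))    # numerator sample (illustrative choice)
x_den = rng.uniform(size=(100, 1))           # denominator sample (illustrative choice)
r_hat = dre_v(x_num, x_den, gamma=1e-2)
print(r_hat[:5])
```

When both samples coincide and the regularization constant is very small, the estimate returned by this sketch is close to 1 at every point, as expected for identical distributions.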
6.2 Solution in RKHS (DRE-VK)
In order to ensure smoothness of the solution in (10), we look for a function in an RKHS defined by a kernel. According to the representer theorem [14], the function has the form
(27) 
We rewrite the minimization problem (10) in explicit form using (16) and (24). Since , we obtain the following optimization problem:
(28) 
The minimum of this functional has the form
(29) 
which can also be computed by solving the corresponding system of linear equations.
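The RKHS variant can be sketched in the same way. The code below assumes data in $[0, 1]^d$, an expansion of the ratio over kernel functions centered at the denominator sample, a regularizer of the form $\alpha^T K \alpha$, and a Gaussian kernel parametrized as $\exp(-\|x - y\|^2 / (2\sigma^2))$. Under those assumptions the coefficients solve $(V''K + \gamma I)\,\alpha = (\ell_2/\ell_1)\,V'\mathbf{1}$. All names, the kernel parametrization, and the scaling of $\gamma$ are hypothetical and may differ from (28) and (29).

```python
import numpy as np

def v_matrix(a, b):
    """prod_k (1 - max(a_i^k, b_j^k)) for samples in [0, 1]^d."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.prod(1.0 - np.maximum(a[:, None, :], b[None, :, :]), axis=2)

def rbf_kernel(a, b, sigma):
    """Gaussian kernel matrix, assumed parametrization exp(-||x - y||^2 / (2 sigma^2))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma**2))

def fit_dre_vk(x_num, x_den, gamma, sigma):
    """Return expansion coefficients alpha over kernels centered at x_den."""
    l1, l2 = len(x_num), len(x_den)
    k = rbf_kernel(x_den, x_den, sigma)
    v_dd = v_matrix(x_den, x_den)
    rhs = (l2 / l1) * v_matrix(x_den, x_num).sum(axis=1)
    return np.linalg.solve(v_dd @ k + gamma * np.eye(l2), rhs)

def predict_ratio(x, x_den, alpha, sigma):
    """Evaluate r(x) = sum_i alpha_i * k(X''_i, x) at new points x of shape (n, d)."""
    return rbf_kernel(x, x_den, sigma) @ alpha

rng = np.random.default_rng(0)
x_num = rng.beta(2.0, 2.0, size=(80, 1))     # illustrative numerator sample
x_den = rng.uniform(size=(80, 1))            # illustrative denominator sample
alpha = fit_dre_vk(x_num, x_den, gamma=1e-2, sigma=0.2)
print(predict_ratio(np.array([[0.1], [0.5], [0.9]]), x_den, alpha, sigma=0.2))
```

Unlike the solution at given points, this estimate can be evaluated anywhere in the domain, at the cost of choosing a kernel and, for the Gaussian kernel, a width parameter.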
Optimization problem (28) has a structure similar to that of the unconstrained Least-Squares Importance Fitting (uLSIF) method [10], where, instead of and , one uses identity matrices and , and, instead of regularization functional , one uses .
In order to use our prior knowledge about the nonnegativity of the solution, one can restrict the set of functions in (27) with conditions
6.3 Linear INK-Splines Kernel
In what follows, we describe a kernel that generates linear splines with an infinite number of knots [3]. This smoothing kernel has good approximating properties and has no free parameters.
We start by obtaining the kernel for functions defined on the interval . The multidimensional kernel for functions on is the product of unidimensional kernels.
According to its definition, a th order spline with knots
has the form
(30) 
where
For , expression (30) provides a piecewise linear function, whereas for , it provides a piecewise polynomial function.
Now, let the number of knots . Then (30) becomes
(31) 
It is possible to consider function (31) as an inner product in an RKHS:
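For the case of linear splines with an infinite number of knots, i.e., splines of order one in (31), this kernel has a simple closed-form expression:
$$k(x, y) = 1 + xy + \frac{1}{2}|x - y|\,(x \wedge y)^2 + \frac{1}{3}(x \wedge y)^3,$$
where we denote $x \wedge y = \min(x, y)$.
As mentioned, the multidimensional linear INK-spline is the coordinate-wise product of linear INK-splines:
$$k_d(x, y) = \prod_{k=1}^{d} k(x^k, y^k).$$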
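A short Python sketch of this kernel follows. It uses the closed form commonly given for linear splines with an infinite number of knots on a nonnegative interval, $1 + xy + \frac{1}{2}|x - y|\min(x, y)^2 + \frac{1}{3}\min(x, y)^3$, which is assumed here to be the expression intended in this section. The function names are hypothetical.

```python
import numpy as np

def ink_spline_1d(x, y):
    """Linear INK-spline kernel for scalars or arrays with nonnegative entries."""
    m = np.minimum(x, y)
    return 1.0 + x * y + 0.5 * np.abs(x - y) * m**2 + m**3 / 3.0

def ink_spline(a, b):
    """Gram matrix of the coordinate-wise product kernel for a (n, d) and b (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    per_coordinate = ink_spline_1d(a[:, None, :], b[None, :, :])   # shape (n, m, d)
    return np.prod(per_coordinate, axis=2)

rng = np.random.default_rng(0)
points = rng.uniform(size=(6, 2))
gram = ink_spline(points, points)
print(np.linalg.eigvalsh(gram).min())   # nonnegative up to rounding for a valid kernel
```

The kernel has no width parameter to tune, so only the regularization constant needs to be selected when it is used.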
7 Selection of Regularization Constant
According to the results presented in Section 3, the regularization constant should satisfy (13) for large and . For finite and , we choose as follows.
7.1 Cross-Validation for DRE-VK
For problem (28), we choose the regularization constant $\gamma$ using $k$-fold cross-validation based on the minimization of the least-squares criterion [10]:
We partition data sets
into disjoint sets
Using samples and , a solution to the DRE-VK minimization problem
(32) 
is constructed for constant $\gamma$. After that, the solution is evaluated by the empirical least-squares criterion on samples and :
This procedure is performed for each fold.
Then, from a set of regularization constant candidates, we select the $\gamma$ that minimizes the least-squares criterion over all folds, i.e.,
We obtain the final solution using the selected $\gamma$ and the unpartitioned samples:
(33) 
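The selection procedure above can be sketched generically in Python. The sketch assumes the least-squares criterion has the form used in the uLSIF literature, $\frac{1}{2}$ times the mean of $r^2$ over held-out denominator points minus the mean of $r$ over held-out numerator points. It takes the estimator as a user-supplied callable `fit(x_num, x_den, gamma)` that returns a function evaluating the estimated ratio. The names `ls_criterion`, `select_gamma`, and `fit` are hypothetical.

```python
import numpy as np

def ls_criterion(r, x_num, x_den):
    """Assumed least-squares criterion: 0.5 * mean r(x_den)^2 - mean r(x_num)."""
    return 0.5 * np.mean(r(x_den) ** 2) - np.mean(r(x_num))

def select_gamma(fit, x_num, x_den, gammas, n_folds=5, seed=0):
    """Return the candidate gamma with the lowest average held-out criterion.

    x_num and x_den are NumPy arrays of shape (l1, d) and (l2, d).
    """
    rng = np.random.default_rng(seed)
    folds_num = np.array_split(rng.permutation(len(x_num)), n_folds)
    folds_den = np.array_split(rng.permutation(len(x_den)), n_folds)
    scores = []
    for gamma in gammas:
        total = 0.0
        for k in range(n_folds):
            # Train on all folds except the k-th, evaluate on the k-th.
            train_num = np.concatenate([f for j, f in enumerate(folds_num) if j != k])
            train_den = np.concatenate([f for j, f in enumerate(folds_den) if j != k])
            r = fit(x_num[train_num], x_den[train_den], gamma)
            total += ls_criterion(r, x_num[folds_num[k]], x_den[folds_den[k]])
        scores.append(total / n_folds)
    return gammas[int(np.argmin(scores))]
```

After the constant is selected, the estimator is refit once on the full, unpartitioned samples with that value.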
7.2 Cross-Validation for DRE-V
The aforementioned procedure for $\gamma$ selection is not readily available for estimation of values of the density ratio function at given points. However, we take advantage of the fact that finding a minimum of
(34) 
leads to the same solution at given points as (26) if the same value of $\gamma$ is used for both (26) and (34).
8 Experiments
In this section we report experimental results for the methods of density ratio estimation introduced in this paper: DRE-V (solution at given points) and DRE-VK (smooth solutions in RKHS). For the latter we instantiate two versions: DRE-VK-INK, which uses the linear INK-splines kernel described in Section 6.3; and DRE-VK-RBF, which uses the Gaussian RBF kernel
In all cases the regularization constant $\gamma$ is chosen by 5-fold cross-validation. For DRE-VK-RBF, the extra “smoothing” parameter $\sigma$ is cross-validated along with $\gamma$.
For comparison purposes, we also run experiments for the Kernel Mean Matching (KMM) procedure [7], the unconstrained Least-Squares Importance Fitting (uLSIF) algorithm [10], and the Kullback-Leibler Importance Estimation Procedure (KLIEP) [11]. For uLSIF and KLIEP, we use the code provided on the authors’ website (http://sugiyama-www.cs.titech.ac.jp/~sugi/software/), leaving its parameters set to their default values with the exception of the number of folds of uLSIF, which is set to 5 (in our experiments, 5-fold uLSIF performed better than the default leave-one-out uLSIF). For KMM, we follow the implementation used in the experimental section of [7]. KLIEP, uLSIF, and KMM use Gaussian RBF kernels. KLIEP and uLSIF select $\sigma$ by cross-validation, whereas KMM estimates $\sigma$ by the median distance between the input points.
We choose not to enforce the nonnegativity of the solutions of DRE-V and DRE-VK* in order to conduct a fair comparison with respect to uLSIF. We note that KMM and KLIEP do enforce nonnegativity.
8.1 Experimental Setting
The experiments were conducted on synthetic data described in Table 1. Models 4 and 6 were taken from the experimental evaluation of previously proposed methods [10].
Model #  Dim.  Supp.  

1  1  Beta  Uniform  
2  1  Beta  Uniform  
3  1  Beta  Beta  
4  1  Gaussian  Gaussian  
5  1  Laplace  Laplace  
6  20  Gaussian  Gaussian  
7  20  Laplace  Laplace  
For each model, we sample $m$ points from $p_1(x)$ and another $m$ points from $p_2(x)$, with $m$ varying in $\{50, 100, 200\}$ for unidimensional data and $\{100, 200, 500\}$ for 20-dimensional data. For each $m$, we perform 20 independent draws.
We evaluate the estimated density ratio at the points sampled from , since most applications require the estimation of the density ratio only at those points. Accordingly, we compare the estimate to the actual density ratio
using the normalized root mean squared error (NRMSE):
8.2 Results
Experimental results are summarized in Tables 2 and 3. For each sample size, we report the mean NRMSE and the standard deviation over the independent draws.
M. #  Sample size  uLSIF        KLIEP        KMM
1     50           0.74 (0.15)  0.61 (0.12)  1.40 (0.68)
1     100          0.78 (0.12)  0.62 (0.18)  1.60 (1.10)
1     200          0.72 (0.16)  0.64 (0.13)  0.89 (0.61)
2     50           0.50 (0.28)  0.39 (0.08)  0.98 (0.46)
2     100          0.47 (0.27)  0.32 (0.11)  0.68 (0.32)
2     200          0.27 (0.19)  0.30 (0.12)  0.36 (0.10)
3     50           0.78 (0.74)  0.44 (0.21)  1.10 (0.70)
3     100          0.55 (0.23)  0.33 (0.19)  0.67 (0.39)
3     200          0.32 (0.24)  0.26 (0.19)  0.31 (0.09)
4     50           2.90 (5.50)  1.30 (1.70)  2.00 (2.40)
4     100          0.87 (0.48)  0.55 (0.27)  1.40 (0.60)
4     200          0.61 (0.19)  0.42 (0.13)  2.00 (0.59)
5     50           0.77 (0.51)  0.68 (0.46)  1.50 (0.77)
5     100          0.82 (0.24)  0.41 (0.23)  1.70 (0.42)
5     200          0.55 (0.19)  0.32 (0.10)  2.00 (0.84)
6     100          0.76 (0.06)  1.10 (1.60)  0.85 (0.13)
6     200          0.76 (0.06)  1.20 (0.51)  0.80 (0.07)
6     500          0.75 (0.04)  0.84 (0.23)  0.89 (0.07)
7     100          0.68 (0.03)  0.67 (0.02)  0.83 (0.11)
7     200          0.68 (0.02)  0.67 (0.02)  0.86 (0.07)
7     500          0.67 (0.01)  0.66 (0.01)  0.93 (0.08)
Overall, the direct constructive methods of density ratio estimation proposed in this paper achieve lower NRMSE than the previously proposed methods uLSIF, KLIEP, and KMM.
Among the methods proposed in this paper, the ones providing smooth estimates in RKHS (DRE-VK*) perform better than DRE-V. It is worth noting that the linear INK-splines kernel tends to provide estimates that perform as well as or better than the ones provided by the RBF kernel.
We believe that the advantage in accuracy of the methods proposed in this paper is due to 1) the information provided by the V-matrices about the geometry of the data, and 2) the smoothness requirements introduced by RKHS.
M. #  Sample size  DRE-V        DRE-VK-INK   DRE-VK-RBF   Others’ Best
1     50           0.72 (0.19)  0.59 (0.15)  0.61 (0.17)  0.61 (0.12)
1     100          0.69 (0.18)  0.57 (0.20)  0.65 (0.23)  0.62 (0.18)
1     200          0.62 (0.15)  0.52 (0.14)  0.51 (0.18)  0.64 (0.13)
2     50           0.27 (0.09)  0.33 (0.15)  0.24 (0.11)  0.39 (0.08)
2     100          0.27 (0.13)  0.28 (0.16)  0.27 (0.18)  0.32 (0.11)
2     200          0.18 (0.05)  0.19 (0.08)  0.19 (0.10)  0.27 (0.19)
3     50           0.34 (0.21)  0.34 (0.15)  0.40 (0.30)  0.44 (0.21)
3     100          0.25 (0.22)  0.22 (0.11)  0.24 (0.16)  0.33 (0.19)
3     200          0.19 (0.11)  0.15 (0.07)  0.16 (0.07)  0.26 (0.19)
4     50           1.20 (1.50)  0.81 (0.80)  0.90 (0.94)  1.30 (1.70)
4     100          0.63 (0.34)  0.43 (0.20)  0.54 (0.36)  0.55 (0.27)
4     200          0.45 (0.14)  0.32 (0.16)  0.43 (0.18)  0.42 (0.13)
5     50           0.50 (0.22)  0.65 (0.36)  0.68 (0.50)  0.68 (0.46)
5     100          0.43 (0.11)  0.55 (0.19)  0.51 (0.31)  0.41 (0.23)
5     200          0.35 (0.15)  0.41