The estimation of the ratio of two probability densities from a given collection of data is a fundamental technique for solving many applied problems, including data adaptation [1, 2], conditional probability and regression estimation, estimation of mutual information, change-point detection, and many others.
Several approaches have been proposed and studied for the solution of this problem. Among them, we find the Kernel Mean Matching (KMM) procedure, the unconstrained Least-Squares Importance Fitting (uLSIF) algorithm, and the Kullback-Leibler Importance Estimation Procedure (KLIEP). For some of the proposed approaches, convergence of the obtained estimates to the actual solution was proven under certain assumptions about the (unknown) solution.
In this paper we introduce direct constructive methods of density ratio estimation. As opposed to existing approaches, ours is based on the definition of the density ratio function itself. We show that density ratio estimation requires the solution of an ill-posed integral equation in which not only the right-hand side of the equation is defined approximately, but so is the operator of the equation. The solution of such an equation essentially depends on our new concept of "V-matrix", which is computed directly from data. This type of matrix was not used in previously proposed approaches.
The paper is organized as follows. In Sections 2 and 3 we outline the necessary basic concepts of theoretical statistics and show their relation to direct constructive density ratio estimation. In Sections 4 through 7, we derive the concept of the V-matrix and direct methods of density ratio estimation based on this concept. Section 8 is devoted to experimental results.
2 Statistical Foundations
In order to simplify the notations in this paper, we denote a function of the vector argument $x = (x^1, \dots, x^d)$ simply by $f(x)$. Likewise, we use the following shorthand for multidimensional integrals:
$$\int f(x)\,dx = \int \cdots \int f(x^1, \dots, x^d)\,dx^1 \cdots dx^d.$$
2.1 Empirical Distribution Function
The basic concept of theoretical statistics is the cumulative distribution function (CDF) of a random vector $X = (X^1, \dots, X^d)$:
$$F(x) = P\{X \le x\} = P\{X^1 \le x^1, \dots, X^d \le x^d\}.$$
It is known that any CDF is well approximated by the empirical cumulative distribution function (ECDF) constructed on the basis of an i.i.d. sample $X_1, \dots, X_\ell$ distributed according to $F(x)$. The empirical cumulative distribution function has the form
$$F_\ell(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \theta(x - X_i),$$
where $\theta(\cdot)$ is the multidimensional step function, defined as the product of coordinate-wise step functions:
$$\theta(z) = \prod_{k=1}^{d} \theta(z^k), \qquad \theta(z^k) = \begin{cases} 1, & z^k \ge 0,\\ 0, & z^k < 0. \end{cases}$$
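As a concrete illustration (our sketch, assuming NumPy and a sample stored as an $(\ell \times d)$ array), the multidimensional ECDF is simply the mean of coordinate-wise indicator products:

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF F_l(x) = (1/l) * sum_i theta(x - X_i), where the
    multidimensional step function theta(x - X_i) equals 1 iff every
    coordinate of X_i is <= the corresponding coordinate of x."""
    sample = np.atleast_2d(sample)            # shape (l, d)
    x = np.atleast_1d(x)                      # shape (d,)
    indicators = np.all(sample <= x, axis=1)  # theta(x - X_i) for each i
    return indicators.mean()

# Example: four one-dimensional points; two of them satisfy X_i <= 0.5
X = np.array([[0.1], [0.4], [0.6], [0.9]])
val = ecdf(X, [0.5])  # -> 0.5
```

The same function works unchanged for multidimensional samples, since `np.all(..., axis=1)` implements the coordinate-wise product of step functions.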
For $d = 1$, the Dvoretzky-Kiefer-Wolfowitz inequality states that $F_\ell(x)$ converges fast to $F(x)$, namely, for any natural $\ell$ and any $\varepsilon > 0$,
$$P\left\{\sup_x |F(x) - F_\ell(x)| > \varepsilon\right\} \le 2\exp\{-2\varepsilon^2 \ell\}.$$
For $d > 1$, fast convergence of $F_\ell(x)$ to $F(x)$ also takes place:
$$P\left\{\sup_x |F(x) - F_\ell(x)| > \varepsilon\right\} \le 2\exp\left\{-\left(2\varepsilon^2 - a\,\frac{\ln \ell}{\ell}\right)\ell\right\},$$
where, according to VC theory, the coefficient $a$ can be set proportional to the dimensionality $d$ of the space (the VC dimension of the set of orthants $\{x : x \le t\}$).
2.2 Constructive Setting of the Density Estimation Problem
According to the definition of the density function (4), the problem of density estimation from a given collection of data is the problem of solving the (multidimensional) integral equation
$$\int_{-\infty}^{x} p(t)\,dt = F(x)$$
when the cumulative distribution $F(x)$ is unknown but an i.i.d. sample $X_1, \dots, X_\ell$ distributed according to it is given.
2.3 Ill-Posed Problems
It is known that solving a linear operator equation
$$Af = F$$
such as (5) is an ill-posed problem: small deviations in the right-hand side $F$ may lead to large deviations of the solution $f$.
In what follows, we assume that the operator $A$ maps functions $f$ from a normed space $E_1$ into functions $F$ in a normed space $E_2$.
In the 1960s, Tikhonov proposed the regularization method for solving ill-posed problems; it uses approximations $F_\delta$ of the right-hand side such that
$$\|F_\delta - F\|_{E_2} \le \delta, \qquad \delta \to 0.$$
According to this method, in order to solve an ill-posed problem, one has to minimize the functional
$$R_\gamma(f) = \|Af - F_\delta\|_{E_2}^2 + \gamma\,W(f),$$
where $\gamma > 0$ is a regularization constant and $W(f)$ is a regularizing functional defined in the space $E_1$; the functional $W(f)$ must possess the following properties:
there exists a $c_0$ for which the solution of (6) is contained in the set $\{f : W(f) \le c_0\}$;
for every $c > 0$, the set of functions $\{f : W(f) \le c\}$ is compact.
It was shown that, if $\gamma = \gamma(\delta) \to 0$ as $\delta \to 0$ and the rate of convergence of $\delta^2$ is not greater than the rate of convergence of $\gamma(\delta)$, then the minimizers of (7) converge to the desired solution of (6).
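The effect of the regularizing term can be seen on a small toy system (our example, not from the paper): for a nearly singular operator, the plain solution is destroyed by a tiny perturbation of the right-hand side, while the Tikhonov-regularized minimizer $f_\gamma = (A^\top A + \gamma I)^{-1} A^\top F_\delta$ stays close to the true solution.

```python
import numpy as np

# Nearly singular operator: the unperturbed right-hand side [2, 2.0001]
# corresponds to the exact solution f = [1, 1].
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
F_delta = np.array([2.0001, 2.0])   # slightly perturbed right-hand side

# Naive solution: the small perturbation is hugely amplified.
f_naive = np.linalg.solve(A, F_delta)           # roughly [3.0, -1.0]

# Tikhonov-regularized solution stays near the true [1, 1].
gamma = 1e-4
f_reg = np.linalg.solve(A.T @ A + gamma * np.eye(2), A.T @ F_delta)
```

The choice `gamma = 1e-4` matches the scale of the perturbation; the convergence statement above says how $\gamma$ must be driven to zero as the perturbation vanishes.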
3 Constructive Setting of the Density Ratio Estimation Problem
In what follows, we consider the problem of estimating the ratio of two probability densities $p_1(x)$ and $p_0(x)$ (assuming $p_0(x) > 0$):
$$R(x) = \frac{p_1(x)}{p_0(x)}.$$
According to definition (8), the problem of estimating the density ratio from data is the problem of solving the (multidimensional) integral equation
$$\int_{-\infty}^{x} R(t)\,dF_0(t) = F_1(x)$$
when the distribution functions $F_0(x)$ and $F_1(x)$ are unknown but i.i.d. samples $X_1, \dots, X_{\ell_0}$ from $F_0$ and $X'_1, \dots, X'_{\ell_1}$ from $F_1$ are given.
The constructive setting of this problem is to solve equation (9) using the empirical distribution functions $F_{0,\ell_0}(x)$ and $F_{1,\ell_1}(x)$, constructed from the samples, instead of the actual cumulative distributions $F_0(x)$ and $F_1(x)$.
3.1 Stochastic Ill-Posed Problems
Note that density ratio estimation leads to a more complicated ill-posed equation than density estimation, since the operator $A$, which depends on $F_0$, is also defined approximately:
$$(A_{\ell_0} R)(x) = \int \theta(x - t)\,R(t)\,dF_{0,\ell_0}(t).$$
In this situation, in order to solve equation (9), we will also minimize the regularized functional
$$R_\gamma(R) = \left\| A_{\ell_0} R - F_{1,\ell_1} \right\|^2 + \gamma\,W(R).$$
Let the sequence of operators $A_{\ell_0}$ converge in probability to $A$ in the following operator norm:
$$\|A_{\ell_0} - A\| = \sup_{f \neq 0} \frac{\|A_{\ell_0} f - A f\|_{E_2}}{\|f\|_{E_1}}.$$
Then, in the set of smooth functions, the sequence of minimizers of (10) converges in probability to the solution of equation (9) provided that $\gamma = \gamma_\ell \to 0$ slowly enough, namely so that both $\|F_{1,\ell_1} - F_1\| / \sqrt{\gamma_\ell}$ and $\|A_{\ell_0} - A\| / \sqrt{\gamma_\ell}$ converge to zero in probability.
4 The V-Matrix

Let us rewrite the first term of functional (10):
$$\rho(R) = \left\| \int \theta(x - t)\,R(t)\,dF_{0,\ell_0}(t) - F_{1,\ell_1}(x) \right\|^2.$$
Using the $L_2$ norm with measure $\mu$ in the space $E_2$, we obtain:
$$\rho(R) = \int \left( \frac{1}{\ell_0}\sum_{i=1}^{\ell_0} R(X_i)\,\theta(x - X_i) - \frac{1}{\ell_1}\sum_{j=1}^{\ell_1} \theta(x - X'_j) \right)^{\!2} d\mu(x).$$
Expression (14) can be written as
$$\frac{1}{\ell_0^2}\sum_{i,i'=1}^{\ell_0} R(X_i)R(X_{i'})\int \theta(x - X_i)\theta(x - X_{i'})\,d\mu(x) - \frac{2}{\ell_0\ell_1}\sum_{i=1}^{\ell_0}\sum_{j=1}^{\ell_1} R(X_i)\int \theta(x - X_i)\theta(x - X'_j)\,d\mu(x) + \frac{1}{\ell_1^2}\sum_{j,j'=1}^{\ell_1} \int \theta(x - X'_j)\theta(x - X'_{j'})\,d\mu(x).$$
The last term in (15) does not depend on $R$ and thus can be ignored for minimization on $R$. In what follows, we use the notation
$$V(t, t') = \int \theta(x - t)\,\theta(x - t')\,d\mu(x).$$
The first two terms of (15) are
$$\frac{1}{\ell_0^2}\sum_{i,i'=1}^{\ell_0} R(X_i)R(X_{i'})\,V(X_i, X_{i'}) - \frac{2}{\ell_0\ell_1}\sum_{i=1}^{\ell_0}\sum_{j=1}^{\ell_1} R(X_i)\,V(X_i, X'_j).$$
Let us denote by $V_{ii'}$ the values $V(X_i, X_{i'})$, $i, i' = 1, \dots, \ell_0$, and by $V$ the $(\ell_0 \times \ell_0)$-dimensional matrix of elements $V_{ii'}$. Also, let us denote by $V^*_{ij}$ the values $V(X_i, X'_j)$, $i = 1, \dots, \ell_0$, $j = 1, \dots, \ell_1$, and by $V^*$ the $(\ell_0 \times \ell_1)$-dimensional matrix of elements $V^*_{ij}$.
Therefore, the first term of functional (10) has the following form in vector-matrix notation:
$$\frac{1}{\ell_0^2}\,R^\top V R - \frac{2}{\ell_0\ell_1}\,R^\top V^* 1_{\ell_1},$$
where by $R$ we denote the $\ell_0$-dimensional vector $(R(X_1), \dots, R(X_{\ell_0}))^\top$ and by $1_{\ell_1}$ the $\ell_1$-dimensional vector of ones.
Matrices $V$ and $V^*$ reflect the geometry of the observed data.
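To make the definition concrete, here is a hedged sketch (ours) computing $V$ and $V^*$ for the illustrative choice of $\mu$ as the uniform measure on $[0,1]^d$; in that case the integral $\int \theta(x-t)\theta(x-t')\,d\mu(x)$ factorizes coordinate-wise into $\prod_k \bigl(1 - \max(t^k, t'^k)\bigr)$. The function name `v_matrix` is ours.

```python
import numpy as np

def v_matrix(S, T):
    """V-matrix between samples S (m, d) and T (n, d) lying in [0, 1]^d,
    assuming mu is the uniform measure on the unit cube:
    V(t, t') = prod_k (1 - max(t^k, t'^k))."""
    m = np.maximum(S[:, None, :], T[None, :, :])  # pairwise coordinate maxima
    return np.prod(1.0 - m, axis=2)

X = np.array([[0.2], [0.5]])   # sample from F_0 (denominator)
Xp = np.array([[0.4]])         # sample from F_1 (numerator)

V = v_matrix(X, X)             # (2, 2) matrix for the quadratic term
V_star = v_matrix(X, Xp)       # (2, 1) matrix for the linear term
```

For other measures $\mu$ (e.g., the empirical measure of the data), only the closed form of the one-dimensional integral changes; the pairwise structure of the computation stays the same.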
5 Regularizing Functional
We define the regularizing functional to be the square of the norm of the function in a Hilbert space:
$$W(R) = \|R\|_H^2.$$
We will consider two different concepts of Hilbert space and its norm, which correspond to two different types of prior knowledge about the solution. (Recall that the solution must have a finite norm in the corresponding space: $\|R\|_H < \infty$.)
Let us first define (17) using the norm in the Hilbert space $L_2(\mu)$:
$$W(R) = \int R^2(x)\,d\mu(x).$$
We also define (17) as a norm in a Reproducing Kernel Hilbert Space (RKHS).
5.1 Reproducing Kernel Hilbert Space and Its Norm
An RKHS is defined by a positive definite kernel function $K(x, x')$ and an inner product $\langle \cdot, \cdot \rangle$ for which the following reproducing property holds true:
$$\langle K(x, \cdot), f(\cdot) \rangle = f(x).$$
Note that any positive definite function $K(x, x')$ has an expansion in terms of its eigenvalues $\lambda_k$ and eigenfunctions $\phi_k$:
$$K(x, x') = \sum_{k=1}^{\infty} \lambda_k\,\phi_k(x)\,\phi_k(x').$$
Let us consider the set of functions
$$f(x) = \sum_{k=1}^{\infty} c_k\,\phi_k(x)$$
and the inner product
$$\langle f, g \rangle = \sum_{k=1}^{\infty} \frac{c_k d_k}{\lambda_k}, \qquad g(x) = \sum_{k=1}^{\infty} d_k\,\phi_k(x),$$
for which the reproducing property above holds and $\|f\|^2 = \sum_k c_k^2 / \lambda_k$.
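To make the norm concrete, here is a small sketch (ours), with a Gaussian kernel used purely as an illustrative positive definite kernel: for a finite expansion $f = \sum_i \alpha_i K(x_i, \cdot)$, evaluation and the squared RKHS norm reduce to matrix-vector products with the Gram matrix.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel: an illustrative positive definite kernel."""
    return np.exp(-np.abs(a - b) ** 2 / (2 * sigma ** 2))

x = np.array([0.0, 1.0, 2.0])          # expansion points
K = rbf(x[:, None], x[None, :])        # Gram matrix K_ij = K(x_i, x_j)
alpha = np.array([1.0, -0.5, 0.25])    # expansion coefficients

f_at_x = K @ alpha          # reproducing property: f(x_j) = sum_i alpha_i K(x_i, x_j)
sq_norm = alpha @ K @ alpha  # squared RKHS norm ||f||^2 = alpha^T K alpha
```

Positive definiteness of the kernel guarantees that `sq_norm` is non-negative for any coefficient vector, which is what makes $\|f\|^2$ usable as a regularizing functional.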
6 Solving the Minimization Problem
In this section we obtain solutions of minimization problem (10) for a fixed value of the regularization constant $\gamma$.
6.1 Solution at Given Points (DRE-V)
At the given points $X_1, \dots, X_{\ell_0}$, we minimize over the vector $R = (R(X_1), \dots, R(X_{\ell_0}))^\top$ the functional
$$T(R) = \frac{1}{\ell_0^2}\,R^\top V R - \frac{2}{\ell_0\ell_1}\,R^\top V^* 1_{\ell_1} + \gamma\,R^\top R.$$
The minimum of this functional has the form
$$R = \frac{\ell_0}{\ell_1}\left( V + \gamma\,\ell_0^2\,I \right)^{-1} V^* 1_{\ell_1},$$
which can be computed by solving the corresponding system of linear equations.
In order to obtain a more accurate solution, we can take into account our prior knowledge that the density ratio is non-negative, i.e., impose the constraints $R(X_i) \ge 0$, $i = 1, \dots, \ell_0$. Any standard quadratic programming package can be used to find the resulting solution (vector $R$).
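As a sketch (not the paper's code), the unconstrained DRE-V computation amounts to a few lines of linear algebra. The comments make the assumption explicit: the minimized functional combines the vector-matrix expression derived earlier with the regularizer $\gamma R^\top R$, which yields the linear system below; the helper name `dre_v` is ours.

```python
import numpy as np

def dre_v(V, V_star, gamma):
    """Assumed functional: T(R) = (1/l0^2) R^T V R - (2/(l0*l1)) R^T V* 1
    + gamma R^T R. Setting its gradient to zero gives the linear system
    (V + gamma * l0^2 * I) R = (l0 / l1) V* 1."""
    l0, l1 = V_star.shape
    rhs = (l0 / l1) * V_star.sum(axis=1)   # (l0/l1) * V* 1
    return np.linalg.solve(V + gamma * l0 ** 2 * np.eye(l0), rhs)

# Toy input: the two samples coincide, so the true ratio is 1 everywhere.
V = np.array([[0.8, 0.5],
              [0.5, 0.5]])
V_star = V.copy()              # X' coincides with X, hence V* = V
R = dre_v(V, V_star, 1e-6)     # values close to 1 at both points
```

With the non-negativity constraints added, the same quadratic objective would instead be handed to a QP solver, as the text notes.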
6.2 Solution in RKHS (DRE-VK)
By the representer theorem, the minimum of this functional has the form
$$R(x) = \sum_{i=1}^{\ell_0} \alpha_i\,K(X_i, x),$$
where the vector of expansion coefficients $\alpha = (\alpha_1, \dots, \alpha_{\ell_0})^\top$ can also be computed by solving the corresponding system of linear equations.
Optimization problem (28) has a structure similar to that of the unconstrained Least-Squares Importance Fitting (uLSIF) method, where, instead of the matrices $V$ and $V^*$, one uses identity matrices, and, instead of the RKHS regularization functional, one uses the squared Euclidean norm $\alpha^\top \alpha$ of the coefficient vector.
In order to use our prior knowledge about the non-negativity of the solution, one can restrict the set of functions in (27) with the conditions $R(X_i) \ge 0$, $i = 1, \dots, \ell_0$, which turns the problem into a quadratic program.
6.3 Linear INK-Splines Kernel
In what follows, we describe a kernel that generates linear splines with an infinite number of knots (INK-splines). This smoothing kernel has good approximating properties and has no free parameters.
We start by obtaining the kernel for functions defined on the interval $[0, 1]$. The multidimensional kernel for functions on $[0, 1]^d$ is the product of unidimensional kernels.
According to its definition, an $r$-th order spline with knots $t_1, \dots, t_m$ has the form
$$f(x) = \sum_{k=0}^{r} a_k x^k + \sum_{s=1}^{m} b_s\,(x - t_s)_+^r.$$
For $r = 1$, expression (30) provides a piecewise linear function, whereas for $r > 1$, it provides a piecewise polynomial function.
Now, let the number of knots $m \to \infty$, with the knots uniformly spread over $[0, 1]$. Then the inner product of expansions (30) becomes
$$K_r(x, x') = \sum_{k=0}^{r} x^k {x'}^k + \int_0^1 (x - t)_+^r\,(x' - t)_+^r\,dt.$$
It is possible to consider function (31) as an inner product in an RKHS. For the case of linear splines with an infinite number of knots, i.e., $r = 1$ in (31), this kernel has a simple closed-form expression:
$$K(x, x') = 1 + x x' + \frac{|x - x'|\,(x \wedge x')^2}{2} + \frac{(x \wedge x')^3}{3},$$
where we denote $x \wedge x' = \min(x, x')$.
As mentioned, the multidimensional linear INK-spline kernel is the coordinate-wise product of unidimensional linear INK-spline kernels:
$$K(x, x') = \prod_{k=1}^{d} K(x^k, {x'}^k).$$
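A short sketch (ours) of the linear INK-spline kernel, implementing the unidimensional closed form $1 + uv + |u-v|\min(u,v)^2/2 + \min(u,v)^3/3$ and the coordinate-wise product for the multidimensional case; the closed form should be treated as our reconstruction of the expression in the text.

```python
import numpy as np

def ink_spline_1d(u, v):
    """Linear INK-spline kernel on [0, 1] (reconstructed closed form):
    K(u, v) = 1 + u*v + |u - v| * min(u, v)^2 / 2 + min(u, v)^3 / 3."""
    m = np.minimum(u, v)
    return 1.0 + u * v + np.abs(u - v) * m ** 2 / 2.0 + m ** 3 / 3.0

def ink_spline(x, y):
    """Multidimensional linear INK-spline: product over coordinates."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return np.prod([ink_spline_1d(a, b) for a, b in zip(x, y)])

k_val = ink_spline([0.5, 0.5], [0.5, 0.5])
```

Note that, as the text emphasizes, the kernel has no free parameters, unlike the Gaussian RBF kernel with its width parameter.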
7 Selection of Regularization Constant
7.1 Cross-Validation for DRE-VK
We partition the data sets $\{X_1, \dots, X_{\ell_0}\}$ and $\{X'_1, \dots, X'_{\ell_1}\}$ into $n$ disjoint subsets (folds) each. For each fold $k$, using the samples outside the fold, a solution to the DRE-VK minimization problem is constructed for each candidate constant $\gamma$. After that, the solution is evaluated by the empirical least-squares criterion on the samples inside the fold. This procedure is performed for each $k = 1, \dots, n$.
Then, from a set of regularization constant candidates, we select the $\gamma$ that minimizes the least-squares criterion averaged over all folds.
We obtain the final solution using the selected $\gamma$ and the unpartitioned samples $\{X_i\}$ and $\{X'_j\}$.
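The fold-based selection of $\gamma$ can be sketched as follows; `fit` and `score` are placeholder callables standing in for the DRE-VK training step and the empirical least-squares criterion, and are our notation, not the paper's.

```python
import numpy as np

def select_gamma(X, Xp, gammas, fit, score, n_folds=5):
    """Pick the gamma whose average held-out score over n folds is lowest.
    `fit(X_tr, Xp_tr, g)` returns a trained model; `score(model, X_te, Xp_te)`
    returns its empirical criterion on the held-out fold."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    folds_p = np.array_split(np.arange(len(Xp)), n_folds)
    avg = []
    for g in gammas:
        errs = []
        for k in range(n_folds):
            tr = np.setdiff1d(np.arange(len(X)), folds[k])
            tr_p = np.setdiff1d(np.arange(len(Xp)), folds_p[k])
            model = fit(X[tr], Xp[tr_p], g)
            errs.append(score(model, X[folds[k]], Xp[folds_p[k]]))
        avg.append(np.mean(errs))
    return gammas[int(np.argmin(avg))]

# Dummy demonstration: a criterion minimized at gamma = 0.1.
best = select_gamma(
    np.zeros((10, 1)), np.zeros((10, 1)),
    gammas=[0.01, 0.1, 1.0],
    fit=lambda X_tr, Xp_tr, g: g,                        # placeholder "training"
    score=lambda model, X_te, Xp_te: (model - 0.1) ** 2,  # placeholder criterion
)
```

In an actual run, `fit` would solve the DRE-VK linear system on the held-in folds and `score` would evaluate the resulting density ratio estimate on the held-out folds.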
7.2 Cross-Validation for DRE-V
The aforementioned procedure for the selection of $\gamma$ is not readily available for the estimation of values of the density ratio function at given points. However, we take advantage of the fact that finding a minimum of the DRE-VK functional (34) simultaneously yields the values of the solution at the given points. Indeed, the minimum of (34) is reached at a function of the form $R(x) = \sum_i \alpha_i K(X_i, x)$, whose values at the points $X_1, \dots, X_{\ell_0}$ are directly available. Consequently, the solution at the given points is the vector of these values, and the cross-validation procedure of the previous subsection can be reused to select $\gamma$.
8 Experimental Results

In this section we report experimental results for the methods of density ratio estimation introduced in this paper: DRE-V (solution at given points) and DRE-VK (smooth solutions in RKHS). For the latter we instantiate two versions: DRE-VK-INK, which uses the linear INK-splines kernel described in Section 6.3; and DRE-VK-RBF, which uses the Gaussian RBF kernel
$$K(x, x') = \exp\left\{ -\frac{\|x - x'\|^2}{2\sigma^2} \right\}.$$
In all cases the regularization constant $\gamma$ is chosen by 5-fold cross-validation. For DRE-VK-RBF, the extra "smoothing" parameter $\sigma$ is cross-validated along with $\gamma$.
For comparison purposes, we also run experiments for the Kernel Mean Matching (KMM) procedure, the unconstrained Least-Squares Importance Fitting (uLSIF) algorithm, and the Kullback-Leibler Importance Estimation Procedure (KLIEP). For uLSIF and KLIEP, we use the code provided on the authors' website (http://sugiyama-www.cs.titech.ac.jp/~sugi/software/), leaving its parameters set to their default values with the exception of the number of folds of uLSIF, which is set to 5 (in our experiments, 5-fold uLSIF performed better than the default leave-one-out uLSIF). For KMM, we follow the implementation used in the experimental section of the original publication. KLIEP, uLSIF, and KMM use Gaussian RBF kernels. KLIEP and uLSIF select $\sigma$ by cross-validation, whereas KMM estimates $\sigma$ by the median distance between the input points.
We choose not to enforce the non-negativity of the solutions of DRE-V and DRE-VK-* in order to conduct a fair comparison with respect to uLSIF. We note that KMM and KLIEP do enforce non-negativity of their solutions.
8.1 Experimental Setting
For each model, we sample points from the denominator density and another set of points from the numerator density, with the sample size varying over a range of values for unidimensional data and for 20-dimensional data. For each sample size, we perform 20 independent draws.
We evaluate the estimated density ratio at the points sampled from the denominator density, since most applications require the estimation of the density ratio only at those points. Accordingly, we compare the estimate to the actual density ratio at these points using the normalized root mean squared error (NRMSE). For each sample size, we report the mean NRMSE and the standard deviation over the 20 independent draws.
Overall, the direct constructive methods of density ratio estimation proposed in this paper achieve lower NRMSE than the previously proposed methods uLSIF, KLIEP, and KMM.
Among the methods proposed in this paper, the ones providing smooth estimates in RKHS (DRE-VK-*) perform better than DRE-V. It is worth noting that the linear INK-splines kernel tends to provide estimates that perform as well as or better than those provided by the RBF kernel.
We believe that the advantage in accuracy of the methods proposed in this paper is due to 1) the information provided by the V-matrices about the geometry of the data, and 2) the smoothness requirements introduced by RKHS.
[Table: mean NRMSE for each model number (M. #), comparing DRE-V, DRE-VK-INK, DRE-VK-RBF, and the best result among the other methods (uLSIF, KLIEP, KMM).]