1 Introduction
Analyzing the input-output relationship from samples is one of the central challenges in machine learning. The most common approach is regression, which estimates the conditional mean of output y given input x. However, the conditional mean alone is not informative enough when the conditional density p(y|x) possesses multimodality, asymmetry, or heteroscedasticity (i.e., input-dependent variance) as a function of y. In such cases, it would be more appropriate to estimate the conditional density itself (Figure 2).

The most naive approach to conditional density estimation (CDE) would be ε-neighbor kernel density estimation (ε-KDE), which performs standard KDE using only the samples whose input is close to the test input. However, ε-KDE does not work well in high-dimensional problems because the number of nearby samples is too small. To avoid this small-sample problem, KDE may be applied twice to estimate p(x, y) and p(x) separately, and the estimated densities may be plugged into the decomposition p(y|x) = p(x, y)/p(x) to estimate the conditional density. However, taking the ratio of two estimated densities significantly magnifies the estimation error and is thus not reliable. To overcome this problem, an approach that directly estimates the density ratio p(x, y)/p(x) without separate estimation of the densities p(x, y) and p(x) has been explored (Sugiyama et al., 2010). This method, called least-squares CDE (LSCDE), was proved to possess the optimal non-parametric learning rate in the minimax sense, and its solution can be computed efficiently and analytically. Nevertheless, estimating conditional densities in high-dimensional problems is still challenging.

A natural idea to cope with the high dimensionality is to perform dimensionality reduction (DR) before CDE. Sufficient DR (Li, 1991; Cook and Ni, 2005) is a framework of supervised DR aimed at finding the subspace of input x that contains all information on output y, and a method based on conditional-covariance operators in reproducing kernel Hilbert spaces has been proposed (Fukumizu et al., 2009). Although this method possesses superior theoretical properties, it is not easy to use in practice because no systematic model selection method is available for its kernel parameters. To overcome this problem, an alternative sufficient DR method based on squared-loss mutual information (SMI) has been proposed recently (Suzuki and Sugiyama, 2013). This method involves non-parametric estimation of SMI that is theoretically guaranteed to achieve the optimal estimation rate, and all tuning parameters can be systematically chosen in practice by cross-validation with respect to the SMI approximation error.
Given such state-of-the-art DR methods, performing DR before LSCDE would be a promising approach to improving the accuracy of CDE in high-dimensional problems. However, such a two-step approach is not preferable because DR in the first step is performed without regard to CDE in the second step, and thus a small error incurred in the DR step can be significantly magnified in the CDE step.
In this paper, we propose a single-shot method that integrates DR and CDE. Our key idea is to formulate the sufficient DR problem in terms of the squared-loss conditional entropy (SCE), which includes the conditional density in its definition, so that LSCDE is executed while DR is performed. Therefore, when DR is completed, the final conditional density estimator has already been obtained without an additional CDE step (Figure 1). We demonstrate the usefulness of the proposed method, named least-squares conditional entropy (LSCE), through experiments on benchmark datasets, humanoid robot control simulations, and computer art.
2 Conditional Density Estimation with Dimensionality Reduction
In this section, we describe our proposed method for conditional density estimation with dimensionality reduction.
2.1 Problem Formulation
Let X ⊂ R^{d_x} and Y ⊂ R^{d_y} be the input and output domains with dimensionality d_x and d_y, respectively, and let p(x, y) be a joint probability density on X × Y. Assume that we are given n independent and identically distributed (i.i.d.) training samples {(x_i, y_i)}_{i=1}^{n} drawn from the joint density. The goal is to estimate the conditional density p(y|x) from these samples.
Our implicit assumption is that the input dimensionality d_x is large, but its "intrinsic" dimensionality, denoted by d_z, is rather small. More specifically, let W and W⊥ be d_z × d_x and (d_x − d_z) × d_x matrices such that the stacked matrix [W; W⊥] is orthogonal. Then we assume that x can be decomposed into the component z = Wx and its perpendicular component z⊥ = W⊥x so that y and z⊥ are conditionally independent given z:

p(y | z, z⊥) = p(y | z).   (1)

This means that z is the relevant part of x, and the rest z⊥ does not contain any information on y. The problem of finding W is called sufficient dimensionality reduction (Li, 1991; Cook and Ni, 2005).
2.2 Sufficient Dimensionality Reduction with SCE
Let us consider a squared-loss variant of conditional entropy, named squared-loss CE (SCE):

SCE(Y|Z) := −(1/2) ∬ p(y|z)² p(z) dz dy.   (2)

By expanding the squared term in Eq.(2), we obtain

SCE(Y|Z) = −(1/2) ∬ r(z, y) p(z, y) dz dy,   (3)

where the density ratio r(z, y) is defined as

r(z, y) := p(z, y) / p(z) = p(y|z).   (4)
Then we have the following theorem (its proof is given in Appendix A), which forms the basis of our proposed method:
Theorem 1. SCE(Y|Z) ≥ SCE(Y|X).

This theorem shows that SCE(Y|Z) is lower-bounded by SCE(Y|X), and the equality holds if and only if

p(y|z) = p(y|x).

This is equivalent to the conditional independence (1), and therefore sufficient dimensionality reduction can be performed by minimizing SCE(Y|Z) with respect to W:
W* = argmin_{W ∈ G_{d_z}(R^{d_x})} SCE(Y|Z).   (5)

Here, G_{d_z}(R^{d_x}) denotes the Grassmann manifold, which is the set of d_z × d_x row-orthonormal matrices without overlaps:

G_{d_z}(R^{d_x}) := { W ∈ R^{d_z × d_x} | W W⊤ = I_{d_z} } / ∼,

where I_{d_z} denotes the d_z × d_z identity matrix and ∼ represents the following equivalence relation: W and W′ are written as W ∼ W′ if their rows span the same subspace. Since ∬ p(z, y) dz dy = 1, SCE(Y|Z) is, up to a constant, the negative Pearson divergence (Pearson, 1900) from p(z, y) to p(z), which is a member of the f-divergence class (Ali and Silvey, 1966; Csiszár, 1967)
with the squared-loss function. On the other hand, ordinary conditional entropy (CE), defined by

CE(Y|Z) := −∬ p(z, y) log p(y|z) dz dy,

is the negative Kullback-Leibler divergence (Kullback and Leibler, 1951) from p(z, y) to p(z). Since the Kullback-Leibler divergence is also a member of the f-divergence class (with the log-loss function), CE and SCE have similar properties. Indeed, the above theorem also holds for ordinary CE. However, the Pearson divergence is known to be more robust against outliers (Sugiyama et al., 2012), since it does not involve the log function, which is very sharp near zero. Furthermore, as shown below, SCE can be approximated analytically, and thus its derivative can also be easily computed. This is a critical property for developing a dimensionality reduction method, because we want to minimize SCE with respect to W, where the gradient is highly useful in devising an optimization algorithm. For this reason, we adopt SCE instead of CE below.

2.3 SCE Approximation
Since SCE(Y|Z) in Eq.(5) is unknown in practice, we approximate it using the samples {(z_i, y_i)}_{i=1}^{n}, where z_i = W x_i.

For any function g, the trivial inequality (g − r)² ≥ 0 yields g r − g²/2 ≤ r²/2, and thus we have

g(z, y) r(z, y) − (1/2) g(z, y)² ≤ (1/2) r(z, y)².   (6)

If we set g = r, the inequality holds with equality.

If we multiply both sides of the above inequality by p(z), integrate over z and y, and negate them, we have

SCE(Y|Z) = min_g ∬ [ (1/2) g(z, y)² p(z) − g(z, y) p(z, y) ] dz dy,   (7)

where the minimization with respect to g is performed over functions of z and y. For more general discussions on divergence bounding, see Keziou (2003) and Nguyen et al. (2010).
Let us consider a linear-in-parameter model for g:

g_θ(z, y) = θ⊤ φ(z, y),

where θ is a b-dimensional parameter vector and φ(z, y) is a b-dimensional vector of basis functions. If the expectations over the densities p(z) and p(z, y) are approximated by sample averages and the regularizer (λ/2) θ⊤θ (λ > 0) is included, the above minimization problem yields

θ̂ := argmin_θ [ (1/2) θ⊤ Ĝ θ − ĥ⊤ θ + (λ/2) θ⊤ θ ],   (8)

where

Ĝ := (1/n) Σ_{i=1}^{n} ∫ φ(z_i, y) φ(z_i, y)⊤ dy  and  ĥ := (1/n) Σ_{i=1}^{n} φ(z_i, y_i).

The solution is analytically given by

θ̂ = (Ĝ + λ I_b)^{−1} ĥ,

which yields ĝ(z, y) = θ̂⊤ φ(z, y). Then, from Eq.(7), an approximator of SCE(Y|Z) is obtained analytically as

ŜCE(Y|Z) = (1/2) θ̂⊤ Ĝ θ̂ − ĥ⊤ θ̂.

We call this method least-squares conditional entropy (LSCE).
2.4 Model Selection by Cross-Validation

The above approximator ŜCE(Y|Z) depends on the choice of models, i.e., the basis functions φ and the regularization parameter λ. Such a model can be objectively selected by cross-validation as follows:

1. The training dataset D = {(x_i, y_i)}_{i=1}^{n} is divided into K disjoint subsets D_1, …, D_K of (approximately) the same size.

2. For each model M in the candidate set:

   (a) For k = 1, …, K:

       i. For model M, the LSCE solution θ̂_k is computed from D \ D_k (i.e., all samples except D_k).

       ii. The upper bound of SCE obtained by θ̂_k is evaluated using the hold-out data D_k:

           CV_k(M) := (1/2) θ̂_k⊤ Ĝ_k θ̂_k − ĥ_k⊤ θ̂_k,

           where Ĝ_k and ĥ_k are computed in the same way as Ĝ and ĥ but from the hold-out samples, and |D_k| denotes the cardinality of D_k used in these averages.

   (b) The average score is computed as CV(M) := (1/K) Σ_{k=1}^{K} CV_k(M).

3. The model that minimizes the average score is chosen: M̂ := argmin_M CV(M).

4. For the chosen model M̂, the LSCE solution θ̂ is computed from all samples and the approximator ŜCE(Y|Z) is computed.
In the experiments, we use .
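The selection procedure above can be sketched as a generic K-fold loop. The names `kfold_select`, `fit`, and `score` are hypothetical; any model-fitting routine (such as an LSCE solver scored by its held-out objective) can be plugged in.

```python
import numpy as np

def kfold_select(Z, Y, candidates, fit, score, K=5, seed=0):
    """Generic K-fold model selection: train on K-1 folds, score on the
    held-out fold, and return the candidate with the smallest average
    held-out score (mirroring steps 1-3 above)."""
    idx = np.random.default_rng(seed).permutation(len(Z))
    folds = np.array_split(idx, K)
    best, best_score = None, np.inf
    for cand in candidates:
        scores = []
        for k in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(Z[tr], Y[tr], cand)          # fit on the training folds
            scores.append(score(model, Z[folds[k]], Y[folds[k]]))
        mean = float(np.mean(scores))
        if mean < best_score:
            best, best_score = cand, mean
    return best, best_score
```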
2.5 Dimensionality Reduction with SCE
Now we solve the following optimization problem by gradient descent:

Ŵ = argmin_{W ∈ G_{d_z}(R^{d_x})} ŜCE(Y|Z).   (9)
As shown in Appendix B, the gradient ∂ŜCE/∂W can be computed analytically in terms of θ̂, Ĝ, ĥ, and their derivatives with respect to W.
In the Euclidean space, the above gradient gives the steepest direction. However, on a manifold, the natural gradient (Amari, 1998) gives the steepest direction.

The natural gradient ∇̃ŜCE at W is the projection of the ordinary gradient ∂ŜCE/∂W onto the tangent space of G_{d_z}(R^{d_x}) at W. If the tangent space is equipped with the canonical metric ⟨Δ1, Δ2⟩ = tr(Δ1⊤ Δ2), the natural gradient is given as follows (Edelman et al., 1998):

∇̃ŜCE = ∂ŜCE/∂W − (∂ŜCE/∂W) W⊤ W,

where W⊥ is a (d_x − d_z) × d_x matrix such that [W; W⊥] is an orthogonal matrix.

Then the geodesic from W in the direction of the negative natural gradient over G_{d_z}(R^{d_x}) can be expressed using t ≥ 0 as

W(t) = [ I_{d_z}  O ] exp( −t [ [ O, ∇̃ŜCE W⊥⊤ ], [ −W⊥ (∇̃ŜCE)⊤, O ] ] ) [ W ; W⊥ ],

where "exp" for a matrix denotes the matrix exponential and O denotes the zero matrix of the appropriate size. Note that the derivative of W(t) at t = 0 coincides with the negative natural gradient; see Edelman et al. (1998) for details. Thus, line search along the geodesic in the natural gradient direction is equivalent to finding the minimizer from {W(t) | t ≥ 0}.
Once W is updated, SCE is re-estimated with the new W and gradient descent is performed again. This entire procedure is repeated until W converges. Because performing cross-validation at every step of the SCE re-estimation is computationally expensive, our implementation performs cross-validation only once every 5 gradient updates. Furthermore, to find a better local optimum, this gradient descent procedure is executed 20 times with randomly chosen initial solutions, and the one achieving the smallest value of ŜCE is chosen.
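A minimal sketch of one manifold descent step follows. Note a deliberate simplification: the paper follows the exact geodesic via a matrix exponential (Edelman et al., 1998), whereas this sketch substitutes a tangent-space projection followed by a QR retraction, which agrees with the geodesic only to first order; the function name is hypothetical.

```python
import numpy as np

def grassmann_step(W, grad, step):
    """One descent step for W with orthonormal rows (W @ W.T = I):
    project the Euclidean gradient onto the tangent space, move against it,
    then re-orthonormalize the rows with a QR retraction."""
    tangent = grad - grad @ W.T @ W   # tangent-space projection of the gradient
    W_new = W - step * tangent        # descent step
    Q, _ = np.linalg.qr(W_new.T)      # reduced QR restores row orthonormality
    return Q.T
```

The returned matrix spans (approximately) the updated subspace; any rotation of its rows represents the same point on the Grassmann manifold.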
2.6 Conditional Density Estimation with SCE
Since the maximum of the left-hand side of Eq.(6) is attained at g = r in the current derivation, the optimal g is actually the conditional density p(y|z) itself. Therefore, ĝ(z, y) obtained by LSCE is a conditional density estimator. This actually implies that the upper-bound minimization procedure described in Section 2.3 is equivalent to least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010), which minimizes the squared error:

(1/2) ∬ ( g(z, y) − p(y|z) )² p(z) dz dy.

Then, in the same way as in the original LSCDE, we may post-process the solution to make the conditional density estimator non-negative and normalized:

p̂(y|x) = max(0, ĝ(ẑ, y)) / ∫ max(0, ĝ(ẑ, y′)) dy′,   (10)

where ẑ = Ŵ x. Note that, even if the solution is post-processed as in Eq.(10), the optimal estimation rate of the LSCDE solution is still maintained (Sugiyama et al., 2010).
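The post-processing of Eq.(10) can be sketched numerically for a one-dimensional output: clip negative model outputs to zero, then rescale so the estimate integrates to one. The trapezoidal rule below stands in for the analytic normalization integral, and the function name is hypothetical.

```python
import numpy as np

def normalize_conditional_density(raw, y_grid):
    """Eq.(10)-style post-processing on a 1-D grid of y values: clip negative
    outputs to zero, then divide by the (trapezoidal) integral so the result
    is a proper density over y_grid."""
    clipped = np.maximum(raw, 0.0)
    area = np.sum(0.5 * (clipped[1:] + clipped[:-1]) * np.diff(y_grid))
    return clipped / area
```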
2.7 Basis Function Design
In practice, we use the following Gaussian function as the jth basis:

φ_j(z, y) = exp( −( ‖z − u_j‖² + ‖y − v_j‖² ) / (2σ²) ),   (11)

where (u_j, v_j) denotes the jth Gaussian center located at the jth sample (z_j, y_j). When the sample size n is too large, we may use only a subset of the samples as Gaussian centers. σ denotes the Gaussian bandwidth, which is chosen by cross-validation as explained in Section 2.4. We may use different bandwidths for z and y, but this would increase the computation time for model selection. In our implementation, we normalize each element of z and y to have unit variance in advance and then use a common bandwidth for z and y.
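The per-element normalization described above can be sketched as follows. The paper selects the shared bandwidth by cross-validation; the median pairwise distance used here is only a common stand-in heuristic, included to make the sketch self-contained, and the function name is hypothetical.

```python
import numpy as np

def standardize_and_bandwidth(Z, Y):
    """Scale each element of z and y to unit variance (as in Section 2.7),
    then pick one shared Gaussian bandwidth via the median pairwise distance
    of the standardized (z, y) pairs."""
    Zs = Z / Z.std(axis=0)
    Ys = Y / Y.std(axis=0)
    X = np.hstack([Zs, Ys])
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    sigma = float(np.median(dists[np.triu_indices(len(X), k=1)]))
    return Zs, Ys, sigma
```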
2.8 Discussions
We have proposed to minimize SCE for dimensionality reduction:

min_W SCE(Y|Z),  where  SCE(Y|Z) = −(1/2) ∬ ( p(z, y) / p(z) )² p(z) dz dy.

On the other hand, in the previous work (Suzuki and Sugiyama, 2013), squared-loss mutual information (SMI) was maximized for dimensionality reduction:

max_W SMI(Z, Y),  where  SMI(Z, Y) = (1/2) ∬ ( p(z, y) / (p(z) p(y)) − 1 )² p(z) p(y) dz dy.

This shows that the essential difference is whether p(y) is included in the denominator of the density ratio. Thus, if p(y) is uniform, the proposed dimensionality reduction method using SCE reduces to the existing method using SMI. However, if p(y) is not uniform, the density ratio function p(z, y)/(p(z) p(y)) included in SMI may fluctuate more than the ratio p(z, y)/p(z) included in SCE. Since a smoother function can in general be estimated more accurately from a small number of samples, the proposed method using SCE is expected to work better than the existing method using SMI. We will experimentally demonstrate this effect in Section 3.
3 Experiments
In this section, we experimentally investigate the practical usefulness of the proposed method.
3.1 Illustration
We consider the following dimensionality reduction schemes:
 None:

No dimensionality reduction is performed.
 LSMI:

Dimensionality reduction is performed by maximizing an SMI approximator called least-squares MI (LSMI) using natural gradients over the Grassmann manifold (Suzuki and Sugiyama, 2013).
 LSCE (proposed):

Dimensionality reduction is performed by minimizing the proposed LSCE criterion using natural gradients over the Grassmann manifold.
 True (reference)

The “true” subspace is used (only for artificial data).
After dimensionality reduction, we execute the following conditional density estimators:
 ε-KDE:

ε-neighbor kernel density estimation, where the tuning parameters are chosen by least-squares cross-validation.
 LSCDE:

Least-squares conditional density estimation (Sugiyama et al., 2010).
Note that the proposed method, which is the combination of LSCE and LSCDE, does not explicitly require the CDE step after DR because LSCDE is executed inside LSCE.
First, we illustrate the behavior of plain LSCDE (None/LSCDE) and the proposed method (LSCE/LSCDE). The datasets illustrated in Figure 2 have five-dimensional input x and one-dimensional output y. The first dimension of the input and the output of the samples are plotted in the graphs; the other four dimensions of x are just standard normal noise. The results show that plain LSCDE does not perform well due to the irrelevant noise dimensions of x, while the proposed method gives much better estimates.
3.2 Artificial Datasets
For , , , and , where
denotes the normal distribution with mean
and covariance matrix , we consider the following artificial datasets: (a)

and .
 (b)

and .


Figure 3. Left column: the mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. Middle column: histograms of the output y. Right column: the mean and standard error of the conditional density estimation error over 20 runs.

The left column of Figure 3 shows the dimensionality reduction error between the true W and its estimate Ŵ for different sample sizes n, measured by

Error_DR = ‖ Ŵ⊤Ŵ − W⊤W ‖_F,

where ‖·‖_F denotes the Frobenius norm. LSMI and LSCE perform similarly for dataset (a), while LSCE clearly outperforms LSMI for dataset (b). To explain this difference, we plot the histograms of y in the middle column of Figure 3. They show that the profile of the histogram (which is a sample approximation of p(y)) for dataset (b) is much sharper than that for dataset (a). As discussed in Section 2.8, the density ratio used in LSMI contains p(y) in its denominator. Thus, for dataset (b), the density ratio p(z, y)/(p(z) p(y)) would be highly non-smooth and hard to approximate. On the other hand, the density ratio used in SCE is p(z, y)/p(z), in which p(y) is not included. Therefore, p(z, y)/p(z) would be smoother than p(z, y)/(p(z) p(y)) and easier to estimate.
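The dimensionality reduction error above can be computed in a few lines. A sketch with a hypothetical function name; both arguments are assumed to have orthonormal rows:

```python
import numpy as np

def dr_error(W_hat, W_true):
    """||W_hat^T W_hat - W^T W||_F for row-orthonormal matrices: W^T W is the
    orthogonal projector onto the spanned subspace, so the error depends only
    on the subspaces, not on the particular orthonormal bases chosen."""
    return np.linalg.norm(W_hat.T @ W_hat - W_true.T @ W_true, ord="fro")
```

Because only the projector enters, rotating the rows of an estimate within the same subspace leaves the error unchanged.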
The right column of Figure 3 plots the conditional density estimation error between the true p(y|x) and its estimate p̂(y|x), evaluated by the squared loss over a set of test samples that have not been used for training. The graphs show that LSCDE overall outperforms ε-KDE for both datasets. For dataset (a), LSMI/LSCDE and LSCE/LSCDE perform equally well; both are much better than no dimensionality reduction (None/LSCDE) and comparable to the method with the true subspace (True/LSCDE). For dataset (b), LSCE/LSCDE outperforms LSMI/LSCDE and None/LSCDE, and is comparable to the method with the true subspace (True/LSCDE).
3.3 Benchmark Datasets
Next, we use UCI benchmark datasets (Bache and Lichman, 2013). We randomly select samples from each dataset for training, and the rest are used to measure the conditional density estimation error in the test phase. Since the dimensionality d_z of the subspace is unknown, we choose it by cross-validation. The results are summarized in Table 1, showing that the proposed method, LSCE/LSCDE, works well overall. Table 2 describes the dimensionalities selected by cross-validation, showing that both LSCE and LSMI reduce the dimensionality significantly. For "Housing", "Auto MPG", "Energy", and "Stock", LSMI/LSCDE tends to reduce the dimensionality more aggressively than LSCE/LSCDE.
Dataset  LSCE  LSMI  No reduction  Scale  

LSCDE  KDE  LSCDE  KDE  LSCDE  KDE  
Housing  100  
Auto MPG  100  
Servo  50  
Yacht  80  
Physicochem  500  
White Wine  400  
Red Wine  300  
Forest Fires  100  
Concrete  300  
Energy  200  
Stock  100  
2 Joints  100  
4 Joints  200  
9 Joints  500  
Sumie 1  200  
Sumie 2  250  
Sumie 3  300 
Data set  LSCE  LSMI  

LSCDE  KDE  LSCDE  KDE  
Housing  
Auto MPG  
Servo  
Yacht  
Physicochem  
White Wine  
Red Wine  
Forest Fires  
Concrete  
Energy  
Stock  
2 Joints  
4 Joints  
9 Joints  
Sumie 1  
Sumie 2  
Sumie 3 
3.4 Humanoid Robot
We evaluate the performance of the proposed method on humanoid robot transition estimation. We use a simulator of the upper-body part of the humanoid robot CB-i (Cheng et al., 2007) (see Figure 4). The robot has 9 controllable joints: the shoulder pitch, shoulder roll, and elbow pitch joints of the right arm; the shoulder pitch, shoulder roll, and elbow pitch joints of the left arm; and the waist yaw, torso roll, and torso pitch joints.
The posture of the robot is described by an 18-dimensional real-valued state vector s, which corresponds to the angle and angular velocity of each joint in radians and radians per second, respectively. We can control the robot by sending an action command a to the system. The action command is a 9-dimensional real-valued vector, which corresponds to the target angle of each joint. When the robot is at state s and receives action a, the physical control system of the simulator calculates the amount of torque to be applied to each joint. These torques are calculated by the proportional-derivative (PD) controller as

τ_i = K_p^i (a_i − θ_i) − K_d^i θ̇_i,

where θ_i, θ̇_i, and a_i denote the current angle, the current angular velocity, and the received target angle of the ith joint, respectively. K_p^i and K_d^i denote the position and velocity gains for the ith joint, respectively. We set common gains for all joints, except that different values are used for the elbow pitch joints. After the torques are applied to the joints, the physical control system updates the state of the robot to s′.
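The PD control law can be written as a one-line function. The gains are left as parameters here, since the paper's specific gain settings are not recoverable from the text above.

```python
def pd_torque(angle, velocity, target, kp, kd):
    """Proportional-derivative torque for one joint:
    tau = kp * (target - angle) - kd * velocity."""
    return kp * (target - angle) - kd * velocity
```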
In the experiment, we randomly choose the action vector and simulate a noisy control system by adding a bimodal Gaussian noise vector. More specifically, the action for the ith joint is first drawn from a uniform distribution. The drawn action is then contaminated by Gaussian noise with mean 0 and standard deviation 0.034 with probability 0.6, and by Gaussian noise with mean 0.087 and standard deviation 0.034 with probability 0.4. By repeatedly controlling the robot in this way, we obtain a set of transition samples {(s_i, a_i, s′_i)}. Our goal is to learn the system dynamics as the state transition probability p(s′ | s, a) from these samples. Thus, as a conditional density estimation problem, the state-action pair (s, a) is regarded as the input variable x, while the next state s′ is regarded as the output variable y. Such state-transition probabilities are highly useful in model-based reinforcement learning (Sutton and Barto, 1998).

We consider three scenarios: using only 2 joints (right shoulder pitch and right elbow pitch), only 4 joints (in addition, right shoulder roll and waist yaw), and all 9 joints. Since each joint contributes an angle and an angular velocity to the state, we have d_x = 6 and d_y = 4 for the 2-joint case, d_x = 12 and d_y = 8 for the 4-joint case, and d_x = 27 and d_y = 18 for the 9-joint case. We generate transition samples for each case, randomly choose a subset for training, and use the rest for evaluating the test error. The results are also summarized in Table 1, showing that the proposed method performs well in all three cases. Table 2 describes the dimensionalities selected by cross-validation, showing that the humanoid robot's transitions are highly redundant.
3.5 Computer Art
Finally, we consider a transition estimation problem in sumi-e style brush drawing for non-photorealistic rendering (Xie et al., 2012). Our aim is to learn the brush dynamics as a state transition probability from real artists' stroke-drawing samples.

From a video of real brush strokes, we extract footprints and identify the corresponding 3-dimensional actions (see Figure 5). The state vector consists of six measurements: the angle of the velocity vector and the heading direction of the footprint relative to the medial axis of the drawing shape; the ratio of the offset distance from the center of the footprint to the nearest point on the medial axis over the radius of the footprint; the relative curvatures of the nearest current point and the next point on the medial axis; and a binary signal indicating whether the brush is driving in reverse. Thus, the state transition probability has (6 + 3 =) 9-dimensional input (current state and action) and 6-dimensional output (next state). We collect transition samples in total, randomly choose subsets for training, and use the rest for testing.
4 Conclusion
We proposed a new method for conditional density estimation in high-dimensional problems. The key idea of the proposed method is to perform sufficient dimensionality reduction by minimizing the squared-loss conditional entropy (SCE), which can be estimated by least-squares conditional density estimation. Thus, dimensionality reduction and conditional density estimation are carried out simultaneously in an integrated manner. We have also shown that SCE and squared-loss mutual information (SMI) are similar but differ in that the output density is included in the denominator of the density ratio in SMI. This means that estimation of SMI is hard when the output density fluctuates sharply, while the proposed method using SCE does not suffer from this problem. The effectiveness of the proposed method was demonstrated through extensive experiments including humanoid robot transition estimation and computer art.
References
 Ali and Silvey (1966) S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1):131–142, 1966.
 Amari (1998) S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Bache and Lichman (2013) K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
 Cheng et al. (2007) G. Cheng, S. Hyon, J. Morimoto, A. Ude, J. G. Hale, G. Colvin, W. Scroggin, and S. C. Jacobsen. CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10):1097–1114, 2007.
 Cook and Ni (2005) R. D. Cook and L. Ni. Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100(470):410–428, 2005.

 Csiszár (1967) I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
 Edelman et al. (1998) A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 Fukumizu et al. (2009) K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871–1905, 2009.
 Keziou (2003) A. Keziou. Dual representation of divergences and applications. Comptes Rendus Mathématique, 336(10):857–862, 2003.
 Kullback and Leibler (1951) S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22:79–86, 1951.
 Li (1991) K. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–342, 1991.
 Nguyen et al. (2010) X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Pearson (1900) K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.

 Sugiyama et al. (2010) M. Sugiyama, I. Takeuchi, T. Kanamori, T. Suzuki, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010), volume 9 of JMLR Workshop and Conference Proceedings, pages 781–788, Sardinia, Italy, May 13–15, 2010.
 Sugiyama et al. (2012) M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
 Sutton and Barto (1998) R. S. Sutton and G. A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
 Suzuki and Sugiyama (2013) T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squaredloss mutual information estimation. Neural Computation, 3(25):725–758, 2013.
 Xie et al. (2012) N. Xie, H. Hachiya, and M. Sugiyama. Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. In J. Langford and J. Pineau, editors, Proceedings of 29th International Conference on Machine Learning (ICML2012), pages 153–160, Edinburgh, Scotland, Jun. 26–Jul. 1 2012.
Appendix A Proof of Theorem 1
The quantity SCE(Y|X) is defined analogously to Eq.(2), with z replaced by x. Then we have

Let , and . Then the final term can be expressed as

where , and are used. Therefore,

We can also express SCE(Y|Z) in terms of SCE(Y|X) as