1 Introduction
We address the problem of learning from conditional distributions, where the goal is to learn a function that links conditional distributions to target variables. Specifically, we are given input samples x ∈ X and their corresponding responses y ∈ Y. For each x, there is an associated conditional distribution p(z|x) over z ∈ Z. However, we cannot access the entire conditional distributions directly; rather, we only observe a limited number of samples, or in the extreme case only one sample, from each conditional distribution p(z|x). The task is to learn a function f which links the conditional distribution to the target y by minimizing the expected loss:
(1)  min_{f ∈ F} L(f) := E_{x,y}[ ℓ(y, E_{z|x}[f(z, x)]) ]
where ℓ(y, ·) is a convex loss function. The function space F can be very general, but we focus on the case when F is a reproducing kernel Hilbert space (RKHS) in the main text, namely F = { f(z, x) = ⟨f, ψ(z, x)⟩ }, where ψ(z, x) is a suitably chosen (nonlinear) feature map. Please refer to Appendix E for the extension to arbitrary function approximators, e.g., random features and neural networks.
The problem of learning from conditional distributions appears in many different tasks. For example:


Learning with invariance.
Incorporating priors on invariance into the learning procedure is crucial for computer vision
(Niyogi et al., 1998), speech recognition (Anselmi et al., 2013) and many other applications. The goal of invariance learning is to estimate a function which minimizes the expected risk while at the same time preserving consistency over a group of operations.
Mroueh et al. (2015) show that this can be accomplished by solving the following optimization problem (2), where the hypothesis space is the RKHS corresponding to a kernel with implicit feature map, and the coefficient on the invariance term is the regularization parameter. Obviously, the above optimization (2) is a special case of (1). In this case, the conditional distribution, given by some normalized Haar measure over the group of operations, captures the possible variations of each data point. Due to computation and memory constraints, one can only afford to generate a few virtual samples from each data point.

Policy evaluation in reinforcement learning. Policy evaluation is a fundamental task in reinforcement learning. Given a policy π(a|s), which is a distribution over the action space conditioned on the current state s, the goal is to estimate the value function V(s) over the state space. V is the fixed point of the Bellman equation
where R(s, a) is a reward function and γ is the discount factor. Therefore, the value function can be estimated from data by minimizing the mean-square Bellman error (Baird, 1995; Sutton et al., 2008):
(3) Restricting the value function to lie in some RKHS, this optimization is clearly a special case of (1). Here, given the state s and the action a, the successor state s' comes from the transition probability p(s'|s, a). Due to the online nature of MDPs, we usually observe only one successor state for each action given s, i.e., only one sample from the conditional distribution given (s, a).

Optimal control in linearly-solvable MDPs. The optimal control in a certain class of MDPs, i.e., linearly-solvable MDPs, can be achieved by solving the linear Bellman equation (Todorov, 2006, 2009)
(4) where the first term denotes the immediate cost and the conditional distribution denotes the passive dynamics without control. Given the solution of (4), the trajectory of the optimal control can be calculated accordingly. Therefore, the solution can be estimated from data by optimizing
(5) Restricting the function to lie in some RKHS, this optimization is a special case of (1). Here, given a state, the successor state comes from the passive dynamics. Similar to policy evaluation, we usually observe only one successor state given the current state, i.e., only one sample from the conditional distribution.

Hitting time and stationary distribution in stochastic processes. Estimating the hitting time and the stationary distribution of a stochastic process are both important problems in social network applications and MCMC sampling techniques. Denote the transition probability as p(s'|s). The hitting time of a target state starting from s is defined as the first time the process visits it. Hence, the expected hitting time h(s) satisfies
(6) Based on properties of the stochastic process, the stationary distribution can then be obtained from the expected hitting times. The hitting time can be learned by minimizing:
(7) where the objective is defined piecewise according to whether the target state has been reached. Similarly, when restricting the expected hitting time to lie in some RKHS, this optimization is a special case of (1). Due to the stochasticity of the process, we only obtain one successor state s' from the current state s, i.e., only one sample from the conditional distribution p(s'|s).
Challenges.
Despite the prevalence of learning problems in the form of (1), solving such problems remains very challenging for two reasons: (i) we often have limited samples, or in the extreme case only one sample, from each conditional distribution, making it difficult to accurately estimate the conditional expectation; (ii) the conditional expectation is nested inside the loss function, making the problem quite different from the traditional stochastic optimization setting. This type of problem is called compositional stochastic programming, and very few results have been established in this domain.
Related work.
A simple option to address (1) is the sample average approximation (SAA): one draws n samples of the input and, for each, m samples from the corresponding conditional distribution, and solves the resulting empirical objective instead. To ensure an excess risk of ε, both n and m need to be at least as large as O(1/ε²), making the overall number of samples required O(1/ε⁴); see (Nemirovski et al., 2009; Wang et al., 2014) and references therein. Hence, when the number of samples m from each conditional distribution is small, SAA would provide poor results.
A second option is to resort to stochastic gradient methods (SGD). One can construct a biased stochastic estimate of the gradient using an estimate of the inner conditional expectation. To ensure convergence, the bias of the stochastic gradient must be small, i.e., a large number of samples from the conditional distribution is needed.
Another commonly used approach is to first represent the conditional distributions by the so-called kernel conditional embedding, and then perform a supervised learning step on the embedded conditional distributions
(Song et al., 2013; Grunewalder et al., 2012a). This two-step procedure suffers from poor statistical sample complexity and computational cost. The kernel conditional embedding estimation costs O(m³), where m is the number of sample pairs. To achieve small error in the conditional kernel embedding estimation, m needs to be large.¹ (¹ With appropriate assumptions on the joint distribution, a better rate can be obtained (Grunewalder et al., 2012a). However, for a fair comparison, we do not introduce such extra assumptions.)
Recently, Wang et al. (2014) solved a related but fundamentally distinct problem of the form,
(8) 
where
is a smooth function parameterized by some finite-dimensional parameter. The authors provide an algorithm that combines stochastic gradient descent with a moving-average estimate of the inner expectation, and achieves an overall O(1/ε^3.5)
sample complexity for smooth convex loss functions. The algorithm does not require the loss function to be convex, but it cannot directly handle a random variable
with infinite support. Hence, such an algorithm does not apply to the more general and difficult situation that we consider in this paper.
Our approach and contribution.
To address the above challenges, we propose a novel approach called dual kernel embedding. The key idea is to reformulate (1) into a min-max or saddle point problem by utilizing the Fenchel duality of the loss function. We observe that with a smooth loss function and continuous conditional distributions, the dual variables form a continuous function of x and y. Therefore, we can parameterize it as a function in some RKHS induced by any universal kernel, where the information about the marginal distribution and the conditional distribution can be aggregated via a kernel embedding of the joint distribution. Furthermore, we propose an efficient algorithm based on stochastic approximation to solve the resulting saddle point problem over RKHS spaces, and establish a finite-sample analysis of the generic learning-from-conditional-distributions problem.
Compared to previous applicable approaches, an advantage of the proposed method is that it requires only one sample from each conditional distribution. Under mild conditions, the overall sample complexity reduces to O(1/ε²), in contrast to the O(1/ε⁴) complexity required by SAA or kernel conditional embedding. As a by-product, even in the degenerate case (8), this implies an O(1/ε²) sample complexity when the inner function is linear, which already surpasses the result obtained in (Wang et al., 2014) and is known to be unimprovable. Furthermore, our algorithm is generic for the family of problems of learning from conditional distributions, and can be adapted to problems with different loss functions and hypothesis function spaces.
Our proposed method also offers new insights into several related applications. In the reinforcement learning setting, our method provides the first algorithm that truly minimizes the mean-square Bellman error (MSBE) with both theoretical guarantees and sample efficiency. We show that the existing gradient-TD2 algorithm (Sutton et al., 2009; Liu et al., 2015) is a special case of our algorithm, and that the residual gradient algorithm (Baird, 1995) is derived by optimizing an upper bound of the MSBE. In the invariance learning setting, our method also provides a unified view of several existing methods for encoding invariance. Finally, numerical experiments on both synthetic and real-world datasets show that our method can significantly improve over the previous state-of-the-art performances.
2 Preliminaries
We first introduce our notation on Fenchel duality, kernels and kernel embeddings. Let X be some input space and k(x, x') be a positive definite kernel function. For notational simplicity, we denote the feature map of a kernel as φ,
and use k(x, ·) and φ(x) interchangeably (and similarly for a second kernel when one is needed). Then k induces an RKHS H, which has the reproducing property f(x) = ⟨f, φ(x)⟩_H for all f ∈ H, where ⟨·,·⟩_H is the inner product and ‖·‖_H is the norm in H. We denote the set of all continuous functions on X as C(X) and ‖·‖_∞ as the maximum norm. We call k a universal kernel if H is dense in C(Ω) for any compact set Ω ⊆ X, i.e., for any ε > 0 and g ∈ C(Ω), there exists h ∈ H such that ‖g − h‖_∞ ≤ ε. Examples of universal kernels include the Gaussian kernel, the Laplacian kernel, and so on.
Convex conjugate and Fenchel duality.
Let ℓ be a real-valued function; its convex conjugate is defined as ℓ*(u) = sup_v { uv − ℓ(v) }.
When ℓ(y, ·) is proper, convex and lower semicontinuous for any y, its conjugate function is also proper, convex and lower semicontinuous. More importantly, the two are dual to each other, i.e., (ℓ*)* = ℓ, which is known as Fenchel duality (Hiriart-Urruty and Lemaréchal, 2012; Rifkin and Lippert, 2007). Therefore, we can represent ℓ by its convex conjugate as ℓ(v) = sup_u { uv − ℓ*(u) }.
It can be shown that the supremum is achieved at u ∈ ∂ℓ(v), or equivalently v ∈ ∂ℓ*(u).
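This duality is easy to verify numerically. The following sketch (assuming nothing beyond the square loss ℓ(v) = v²/2, whose conjugate is ℓ*(u) = u²/2) checks on a grid that sup_u { uv − ℓ*(u) } recovers ℓ(v), and that the supremum is attained at u = ℓ'(v) = v:

```python
import numpy as np

# Square loss l(v) = v^2/2 has conjugate l*(u) = u^2/2.
# Fenchel duality: l(v) = sup_u { u*v - l*(u) }, attained at u = l'(v) = v.
def l(v):
    return 0.5 * v ** 2

def l_conj(u):
    return 0.5 * u ** 2

v = 1.7
u_grid = np.linspace(-5, 5, 200001)
dual_vals = u_grid * v - l_conj(u_grid)
u_star = u_grid[np.argmax(dual_vals)]

print(dual_vals.max(), l(v))   # both close to 1.445
print(u_star)                  # close to v = 1.7
```

The same grid check works for any proper, convex, lower semicontinuous loss once its conjugate is known in closed form.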
Function approximation using RKHS.
Let be a bounded ball in the RKHS, and we define the approximation error of the RKHS as the error from approximating continuous functions in by a function , , (Bach, 2014; Barron, 1993)
(9) 
One can immediately see that the approximation error decreases as the radius of the ball increases, and vanishes as the radius goes to infinity. If the target is restricted to the set of uniformly bounded continuous functions, then the approximation error is also bounded. The approximation property, i.e., the dependence on the radius, remains an open question for general RKHSs, but has been carefully established for special kernels. For example, with the kernel
induced by the sigmoidal activation function, an explicit rate is known
for the Lipschitz continuous function space (Bach, 2014).² (² The rate is also known to be unimprovable by DeVore et al. (1989).)
Hilbert space embedding of distributions.
Hilbert space embeddings of distributions (Smola et al., 2007) are mappings of distributions into potentially infinite dimensional feature spaces,
(10) 
where the distribution is mapped to its expected feature map, i.e., to a point in a feature space. Kernel embeddings of distributions have rich representational power. Some feature maps can make the mapping injective (Sriperumbudur et al., 2008), meaning that if two distributions are different, they are mapped to two distinct points in the feature space. For instance, the feature spaces of many commonly used kernels, such as the Gaussian RBF kernel, generate injective embeddings. We can also embed the joint distribution over a pair of variables using two kernels as
where the joint distribution is mapped to a point in a tensor product feature space. Based on embedding of joint distributions, kernel embedding of conditional distributions can be defined as
as an operator (Song et al., 2013). With the conditional embedding, we can obtain conditional expectations easily, i.e., (11)
Both the joint distribution embedding and the conditional distribution embedding can be estimated from samples of the joint or conditional distribution, respectively (Smola et al., 2007; Song et al., 2013), as
empirical averages of feature products, with the conditional embedding additionally requiring the inverse of a regularized Gram matrix. Due to this inverse, the kernel conditional embedding estimation requires O(m³) cost, where m is the number of sample pairs.
3 Dual Embedding Framework
In this section, we propose a novel and sample-efficient framework to solve problem (1). Our framework leverages Fenchel duality and the feature space embedding technique to bypass the difficulties of the nested expectation and the need for overwhelmingly many samples from the conditional distributions. We start by introducing the interchangeability principle, which plays a fundamental role in our method.
Lemma (interchangeability principle). Let z be a random variable on Z and assume for any z, the function g(·, z) is a proper³ and upper semicontinuous⁴ concave function. Then
E_z[ max_u g(u, z) ] = max_{u(·) ∈ G(Z)} E_z[ g(u(z), z) ],
where G(Z) is the entire space of functions defined on the support of z. (³ We say a concave function is proper when its effective domain is nonempty and it is nowhere +∞. ⁴ We say a function is upper semicontinuous when its strict sublevel sets are open; similarly, it is lower semicontinuous when its strict superlevel sets are open.) The result implies that one can replace the expected value of pointwise optima by the optimum value over a function space. For the proof of the lemma, please refer to Appendix A. More general results on the interchange between maximization and integration can be found in (Rockafellar and Wets, 1998, Chapter 14) and (Shapiro and Dentcheva, 2014, Chapter 7).
3.1 Saddle Point Reformulation
Let the loss function ℓ(y, ·) in (1) be proper, convex and lower semicontinuous for any y. We denote its convex conjugate by ℓ*(y, ·), which is also proper, convex and lower semicontinuous. Using Fenchel duality, we can reformulate problem (1) as
(12) 
Note that by the concavity and upper semicontinuity of the inner objective, for any given pair (x, y), the corresponding maximizer of the inner function always exists. Based on the interchangeability principle, we can further rewrite (12) as
(13) 
where the dual space is the entire function space on the domain of (x, y). We emphasize that the max operators in (12) and (13) have different meanings: the one in (12) is taken over a single scalar variable, while the one in (13) is over all possible functions u(·).
We have now eliminated the nested expectation in the problem of interest, and converted it into a stochastic saddle point problem with an additional dual function space to optimize over. By definition, the objective is always concave in the dual function for any fixed primal function. Since the primal function enters the objective linearly, the objective is also convex in the primal function for any fixed dual function. Our reformulation (13) is indeed a convex-concave saddle point problem.
Figure 1: Panels (a)–(c) show results after increasing numbers of iterations. The output is observed
with a Gaussian distribution conditioned on the location x. Given the samples, the task is to recover the true function. The blue dashed curve is the ground truth; the cyan curve is the observed noisy signal. The red curve is the recovered signal, and the green curve denotes the dual function with the observed values plugged in at each corresponding position. Indeed, the dual function emphasizes the difference between the recovered signal and the observations at every location. The interaction between primal and dual results in the recovery of the denoised signal.
An example.
Let us illustrate this through a concrete example. Let f* be the true function, with a noisy output y observed given each input x. We can recover the true function by solving the optimization problem
with the square loss, whose convex conjugate is again quadratic. Invoking the saddle point reformulation, this leads to
a saddle point problem where the dual function fits the discrepancy between the prediction and the observation, and thus promotes the performance of the primal function by emphasizing the positions where they differ. See Figure 1 for an illustration of the interaction between the primal and dual functions.
3.2 Dual Continuation
Although the reformulation in eq:dual_opt_exchange gives us more structure of the problem, it is not yet tractable in general. This is because the dual function can be an arbitrary function which we do not know how to represent. In the following, we will introduce a tractable representation for (13).
First, we define the optimal dual function: for any pair (x, y), it attains the inner maximum.
Note the optimal dual function is well-defined since the optimal set is nonempty. Furthermore, it is related to the conditional distribution through the derivative of the loss evaluated at the conditional expectation. This can be derived simply from the convexity of the loss function and Fenchel's inequality; see (Hiriart-Urruty and Lemaréchal, 2012) for a more formal argument. Depending on the properties of the loss function, we can further derive the following (see proofs in Appendix A). Suppose the relevant conditional expectation and distribution are continuous in x for any y; then:

(Discrete case) If the loss function is continuously differentiable in for any , then is unique and continuous in for any ;

(Continuous case) If the loss function is continuously differentiable in , then is unique and continuous in on .
These assumptions are satisfied widely in real-world applications. For instance, for the policy evaluation problem in (3), the corresponding optimal dual function is continuous as long as the reward function is continuous, which is true for many reinforcement learning tasks.
The fact that the optimal dual function is a continuous function has interesting consequences. As we mentioned earlier, the space of dual functions can be arbitrary and difficult to represent. Now we can simply restrict the parametrization to the space of continuous functions, which is tractable and still contains the global optimum of the optimization problem in eq:dual_opt_exchange. This also provides us the basis for using an RKHS to approximate these dual functions, and simply optimizing over the RKHS.
3.3 Feature Space Embedding
In the rest of the paper, we assume the conditions described in Proposition 3.2 always hold. For the sake of simplicity, we focus only on the case when the domain is a continuous set. Hence, from Proposition 3.2, the optimal dual function is indeed continuous. As an immediate consequence, we lose nothing by restricting the dual function space to the space of continuous functions. Recall that with a universal kernel, we can approximate any continuous function with arbitrarily small error. Thus we approximate the dual space by a bounded ball in the RKHS induced by a universal kernel with implicit feature map. Note that this ball is a subset of the continuous function space, and hence a subset of the dual space. To distinguish the inner product of the primal function space from that of the dual RKHS, we use separate notation for the latter.
We can rewrite the saddle point problem in eq:dual_opt_exchange as
(14) 
where, by the definition of the dual RKHS parameterization, the joint embedding of the distribution over (x, z) replaces the conditional expectation. The new saddle point approximation (14) based on the dual kernel embedding allows us to efficiently represent the dual function and circumvents the fundamental difficulty of insufficient samples from the conditional distribution. There is no need to access the conditional distribution, the conditional expectation, or the conditional embedding operator anymore, thereby reducing both the statistical and computational complexity.
Specifically, given a pair of samples drawn from the joint distribution, we can now easily construct an unbiased stochastic estimate of the gradient, namely,
with respect to the primal and dual variables, respectively. For simplicity of notation, we use the same symbol to denote the subgradient as well as the gradient. With the unbiased stochastic gradient, we are now able to solve the approximation problem (14) by resorting to the powerful mirror descent stochastic approximation framework (Nemirovski et al., 2009).
3.4 SampleEfficient Algorithm
The algorithm is summarized in Algorithm 1. At each iteration, the algorithm performs a projected gradient step both for the primal variable and dual variable based on the unbiased stochastic gradient. The proposed algorithm avoids the need for overwhelmingly large sample sizes from the conditional distributions when estimating the gradient. At each iteration, only one sample from the conditional distribution is required in our algorithm!
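To make the primal-dual updates concrete, here is a minimal sketch of the scheme on a least-squares instance of (1), min_f E_x[(f(x) − E[z|x])²], for which the conjugate of the square loss gives ℓ*(u) = u²/4. For brevity, simple linear features stand in for the RKHS parameterizations of both the primal and dual functions, and the data-generating model, step size and iteration count are all illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Saddle-point form of min_w max_v E_{x,z}[ u(x)(f(x) - z) - u(x)^2/4 ],
# which equals min_f E_x[(f(x) - E[z|x])^2] after maximizing out u.
# Both f and u use linear features phi(x) = (1, x) as a finite-dimensional
# stand-in for the RKHS parameterization.
def phi(x):
    return np.array([1.0, x])

w = np.zeros(2)            # primal: f(x) = w @ phi(x)
v = np.zeros(2)            # dual:   u(x) = v @ phi(x)
eta, T = 0.02, 50000
w_avg, n_avg = np.zeros(2), 0

for t in range(T):
    x = rng.uniform(-1, 1)
    z = 1 + 2 * x + 0.1 * rng.standard_normal()   # one sample; E[z|x] = 1 + 2x
    p = phi(x)
    f, u = w @ p, v @ p
    w -= eta * u * p                    # unbiased primal (descent) step
    v += eta * (f - z - 0.5 * u) * p    # unbiased dual (ascent) step
    if t >= T // 2:                     # average the later iterates
        w_avg += w
        n_avg += 1

print(w_avg / n_avg)   # approaches the true coefficients (1, 2)
```

Note that each iteration consumes exactly one sample z from the conditional distribution, mirroring the one-sample-per-iteration property of Algorithm 1.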
Throughout our discussion, we make the following standard assumptions: there exist constant scalars such that the feature maps and stochastic gradients are bounded for any sample;
and there exists a constant bound on the dual iterates. These two assumptions
basically state that the variance of our stochastic gradient estimate is always bounded. Note that we do not assume any strong convexity or concavity of the saddle point problem, or Lipschitz smoothness. Hence, we set the output as the average of intermediate solutions weighted by the learning rates, as is often done in the literature, to ensure the convergence of the algorithm.
Define the accuracy of any candidate solution to the saddle point problem as
(15) 
We have the following convergence result: under the assumptions above, the solution after t steps of the algorithm with diminishing stepsizes satisfies:
(16) 
where the constants depend on the bounds in the assumptions. The above theorem implies that our algorithm achieves an overall O(1/√t) convergence rate, which is known to be unimprovable already for traditional stochastic optimization with a general convex loss function (Nemirovski et al., 2009). We further observe that if the feature map is uniformly bounded and the conjugate loss is uniformly Lipschitz continuous, then the objective is Lipschitz continuous with respect to the dual function, i.e.
Let be the optimal solution to (1). Invoking the Lipschitz continuity of and using standard arguments of decomposing the objective, we have
Combining the proposition and theorem above, we finally conclude that under the conditions therein,
(17) 
There is clearly a delicate trade-off between the optimization error and the approximation error. Using a larger ball for the dual RKHS will increase the optimization error but decrease the approximation error. When the ball is moderately large (which is expected in the situation when the optimal dual function has small magnitude), our dual kernel embedding algorithm can achieve an overall O(1/ε²) sample complexity when solving learning problems in the form of (1). For the analysis details, please refer to Appendix C.
4 Applications
In this section, we discuss in detail how the dual kernel embedding can be applied to solve several important learning problems in machine learning, e.g., learning with invariance and reinforcement learning, which are special cases of the optimization (1). By simple verification, these examples satisfy our assumptions for the convergence of the algorithm. We tailor the proposed algorithm to the respective learning scenarios and unify several existing algorithms for each learning problem into our framework. Due to space limits, we only focus on algorithms with kernel embeddings. Extended algorithms with random features, doubly SGD, neural networks, as well as their hybrids, can be found in Appendix E.1, E.2 and E.3.
4.1 Learning with Invariant Representations
Invariance learning. The goal is to solve the optimization (2), which learns a function in an RKHS. Applying the dual kernel embedding, we end up solving the saddle point problem
where is the dual RKHS with the universal kernel introduced in our method.
Remark. The proposed algorithm bears some similarities to virtual sample techniques (Niyogi et al., 1998; Loosli et al., 2007) in the sense that both create examples with prior knowledge to incorporate invariance. In fact, the virtual sample technique can be viewed as optimizing an upper bound of the objective (2), obtained by simply moving the conditional expectation outside the loss, where the inequality comes from the convexity of the loss.
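This upper bound is easy to illustrate with a quick Monte Carlo check (the perturbation model and candidate predictor below are hypothetical): by Jensen's inequality, averaging the prediction over virtual samples before applying the convex loss can only give a smaller value than averaging the loss itself, which is what the virtual sample technique optimizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each input x has 50 virtual perturbations
# z ~ N(x, 0.3^2), g is some candidate predictor, loss is the square loss.
x = rng.uniform(-1, 1, (20000, 1))
z = x + 0.3 * rng.standard_normal((20000, 50))
y = np.sin(np.pi * x)
pred = np.sin(np.pi * z)                 # predictions on virtual samples

# Conditional expectation inside the loss (objective (2)) vs. outside
# (the virtual-sample objective).
obj = ((pred.mean(axis=1, keepdims=True) - y) ** 2).mean()
upper = ((pred - y) ** 2).mean()

print(obj, upper)
assert obj <= upper   # Jensen: virtual-sample objective is an upper bound
```

The gap between the two quantities is exactly the average conditional variance of the predictions, so the looser the invariance (the noisier z), the cruder the virtual-sample surrogate becomes.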
Remark. The learning problem (2) can be understood as learning in an RKHS with a Haar-integral kernel, generated by averaging the feature map with respect to the Haar measure, with the implicit feature map given by the conditional expectation of the original features. The Haar-integral kernel can be viewed as a special case of Hilbertian metrics on probability measures, on which the output of the function should be invariant (Hein and Bousquet, 2005). Therefore, other kernels defined for distributions, e.g., the probability product kernel (Jebara et al., 2004), can also be used to incorporate invariance.
Remark. Robust learning with contaminated samples can also be viewed as incorporating an invariance prior with respect to the perturbation distribution into the learning procedure. Therefore, rather than resorting to robust optimization techniques (Bhattacharyya et al., 2005; Ben-Tal and Nemirovski, 2008), the proposed algorithm for learning with invariance serves as a viable alternative for robust learning.
4.2 Reinforcement Learning
Policy evaluation. The goal is to estimate the value function of a given policy by minimizing the mean-square Bellman error (MSBE) (3). With the value function in an RKHS with feature map, we apply the dual kernel embedding, which leads to the saddle point problem
(18) 
In the optimization (18), we simplify the dual to be a function over the state-action pair, due to the fact that the reward is deterministic given the state and action in our setting. If the reward is a random variable sampled from some distribution, then the dual function should be defined over states, actions and rewards.
Remark. The algorithm can be extended to the off-policy setting. Let the behavior policy differ from the target policy and define the importance weight as their ratio; then the objective will be adjusted by the importance weight,
where the successor state and actions come from the behavior policy. With the dual kernel embedding, we can derive a similar algorithm for the off-policy setting, with an extra importance weight to adjust the sample distribution.
Remark. We used different RKHSs for the primal and dual functions. If we use the same finite set of basis functions to parameterize both the value function and the dual function, our saddle point problem (18) reduces to exactly the objective of gradient-TD2 proposed in (Sutton et al., 2009). Moreover, the update rules of gradient-TD2 can also be derived by conducting the proposed Embedding-SGD with such a parameterization. For details of the derivation, please refer to Appendix D.
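To see this reduction concretely, here is a minimal sketch of the gradient-TD2 updates (Sutton et al., 2009) on a hypothetical two-state chain; the tabular features, step sizes and toy MRP are illustrative choices. With reward 1 everywhere and discount 0.5, the true values solve V = 1 + 0.5 V', giving V = 2 in both states:

```python
import numpy as np

# gradient-TD2 on a toy chain: s0 -> s1 -> s0, reward 1, gamma = 0.5.
# True values: V(s0) = V(s1) = 2.
phi = np.eye(2)               # tabular (one-hot) features
gamma, alpha, beta = 0.5, 0.1, 0.1
theta = np.zeros(2)           # value-function weights (primal)
w = np.zeros(2)               # auxiliary weights (dual)

s = 0
for _ in range(5000):
    s_next = 1 - s            # deterministic transition
    r = 1.0
    f, f_next = phi[s], phi[s_next]
    delta = r + gamma * theta @ f_next - theta @ f   # TD error
    theta += alpha * (f - gamma * f_next) * (w @ f)  # primal update
    w += beta * (delta - w @ f) * f                  # dual update
    s = s_next

print(theta)   # approaches [2, 2]
```

Here the auxiliary weights w play exactly the role of the dual function: they track the expected TD error, and the value update is driven by that accumulated estimate rather than a single-sample TD error.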
From this perspective, gradient-TD2 is simply a special case of the proposed Embedding-SGD applied to policy evaluation with a particular parameterization. However, in view of our framework, there is really no need to restrict to the same finite parametric model for the value and dual functions, and one should not. As further demonstrated in our experiments, with different nonparametric models, the performance can be improved significantly. See details in Section 5.2.
The residual gradient (RG) algorithm (Baird, 1995) tries to apply stochastic gradient descent directly to the MSBE with a finite parametric form of the value function, resulting in the gradient
Due to the conditional expectation inside the gradient expression, obtaining an unbiased estimator of the gradient requires two independent samples of the successor state
given the current state and action, which is not practical. To avoid this "double sample" problem, Baird (1995) suggests using a single-sample gradient instead. In fact, the resulting algorithm actually optimizes an upper bound of the MSBE (3), which follows from the convexity of the square loss.
Our algorithm is also fundamentally different from the TD algorithm, even in the finite-state case. The TD algorithm updates the state-value function directly by an estimate of the temporal difference based on one pair of samples, while our algorithm updates the state-value function based on an accumulated estimate of the temporal difference, which intuitively is more robust.
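The single-sample bias can be checked numerically. In a hypothetical one-step setting where s' | s ~ N(s, 1), V(s') = s', r = 0 and γ = 0.9, the expectation of the single-sample squared TD error exceeds the true MSBE exactly by γ² Var(V(s')):

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-sample squared TD error overestimates the mean-square Bellman error:
#   E_{s'}[(r + g*V(s') - V(s))^2]
#     = (r + g*E[V(s')] - V(s))^2 + g^2 * Var(V(s')).
# Hypothetical toy setting: V(s') = s', s' | s ~ N(0, 1), r = 0, g = 0.9.
gamma, V_s = 0.9, 0.3
s_next = rng.standard_normal(200000)

single_sample = ((gamma * s_next - V_s) ** 2).mean()   # residual-gradient target
msbe = (gamma * s_next.mean() - V_s) ** 2              # true Bellman error

print(single_sample, msbe, msbe + gamma ** 2 * 1.0)
# first and third values nearly coincide; msbe alone is much smaller
```

The inflation term γ² Var(V(s')) penalizes value functions that vary across successor states, which is why optimizing the single-sample surrogate can converge to the wrong fixed point in stochastic environments.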
Optimal control. The goal is to estimate the function by minimizing the error in the linear Bellman equation (5). With the function in an RKHS with feature map, we apply the dual kernel embedding, which leads to the saddle point problem
(19) 
With the learned function, we can recover the optimal control via its conditional distribution.