# Distribution regression model with a Reproducing Kernel Hilbert Space approach

In this paper, we introduce a new distribution regression model for probability distributions. This model is based on a Reproducing Kernel Hilbert Space (RKHS) regression framework, where universal kernels are built using Wasserstein distances for distributions belonging to W2(Ω), with Ω a compact subspace of R. We prove the universal kernel property of such kernels and use this setting to perform regression with distribution inputs. Different regression models are first compared with the proposed one on simulated functional data. We then apply our regression model to transient evoked otoacoustic emission (TEOAE) data, with the response distributions as inputs and age as the real-valued response.


## 1 Introduction

Regression analysis is a predictive modeling technique that has been widely studied over the last decades, with the goal of investigating relationships between predictors and responses (inputs and outputs) in regression models; see for instance [1, 2] and references therein. When the inputs belong to functional spaces, different strategies have been investigated and applied in several domains of functional data analysis [3, 4]. Extensions of the Reproducing Kernel Hilbert Space (RKHS) framework have recently become popular, both to carry the results of statistical learning theory over to the regression of functional data and to develop estimation procedures for function-valued functions [5, 6]. This framework is particularly important in statistical learning theory because of the so-called Representer theorem, which states that the minimizer of a regularized empirical risk over an RKHS can be written as a linear combination of the kernel function evaluated at the training points.

In our framework, we aim to solve the regression problem with inputs belonging to probability distribution spaces: the predictors are probability distributions and the responses are real values. Specifically, we consider the model

 yi=f(μi)+ϵi, (1)

where the μi are probability distributions on Ω, the yi are real numbers, and the ϵi represent an independent and identically distributed Gaussian noise. As in classical regression models, this setting requires estimating the unknown function f from the observations (μi, yi), i = 1, …, n.

The framework of mean embeddings of distributions into an RKHS [8] recently became popular. It solves the learning problem of distribution regression in a two-stage sampled setting and uses the analytical solution of a kernel ridge regression problem to regress from probability measures to real-valued observations. Specifically, the authors embed a distribution into an RKHS H induced by a kernel defined on the set of distribution inputs. The regression function is then the composition of an unknown function and an element of H′, where H′ is the RKHS induced by a kernel defined on the set of mean embeddings of distributions into H. The relation between the random distribution and the real-valued response can then be learnt by applying the Representer theorem directly to the regularized empirical risk over H′.

In what follows, we will consider kernels built using the Wasserstein distance. Details on Wasserstein distances and their links with optimal transport problems can be found in [9]. Some kernels based on this metric have been developed in [10, 11]. We focus here on the work of [12], in which the authors built a family of positive definite kernels. Within the setting of this paper, we construct an RKHS corresponding to this kind of kernels in order to apply the RKHS theory. More specifically, for inputs belonging to Wasserstein spaces, the authors of [12] built a class of positive definite kernels that are functions of Wasserstein distances. Moreover, in the framework of [13], the authors provided a family of universal kernels, the Gaussian-type RBF-kernels. This result is very useful for our purposes because, following [14, 15], an RKHS can easily be built from a universal kernel. Hence, using the universality properties presented in Section 3, we define a new method to construct the RKHS associated with a universal kernel drawn from the family of positive definite kernels of [12]. We then obtain, via the Representer theorem, a particular estimator of the unknown function in the regression model with distribution inputs.

The paper is structured as follows. In Section 2, we first recall important concepts about kernels on Wasserstein spaces: we give a brief introduction to Wasserstein spaces on R and explain how positive definite kernels are constructed in [12]. Section 3 deals with the proposed setting of distribution regression models; we motivate there the use of universal kernels and build an estimator for the learning problem. We then assess the numerical performance of this method in Section 4. The tests are first performed on simulated data to compare our model with state-of-the-art ones. We then study the relationship between age and hearing sensitivity using TEOAE recordings, which are acquired by stimulating the ear with a very short but strong broadband stimulus; the recordings are the responses emitted by the ear over a range of frequencies. More precisely, we predict the age of the subjects on which the TEOAE data were acquired using the proposed distribution regression model. Conclusions are finally drawn in Section 5.

## 2 Kernel on Wasserstein space W2(R)

### 2.1 The Wasserstein space on R

Let W2(R) denote the set of probability measures on R with a finite moment of order two. For two probability distributions μ, ν in W2(R), we denote by Π(μ,ν) the set of all probability measures π over the product set R × R with first (resp. second) marginal μ (resp. ν).
The transportation cost with quadratic cost function, which we call the quadratic transportation cost, between these two measures μ and ν is defined as:

 T2(μ,ν)=infπ∈Π(μ,ν)∫|x−y|2dπ(x,y). (2)

This transportation cost allows us to endow the set W2(R) with a metric by defining the quadratic Monge–Kantorovich (or quadratic Wasserstein) distance between μ and ν as

 W2(μ,ν)=T2(μ,ν)1/2.

A probability measure π in Π(μ,ν) achieving the infimum in (2) is called an optimal coupling. This vocabulary transfers to a random vector (X1, X2) with distribution π. We will call W2(R), endowed with the distance W2, the Wasserstein space. More details on Wasserstein spaces and their links with optimal transport problems can be found in [9].

For distributions in W2(R), the Wasserstein distance can be written in a simpler way as follows. For any μ ∈ W2(R), we denote by F−1μ the quantile function associated to μ. Given a uniform random variable U on [0,1], F−1μ(U) is a random variable with law μ. Then, for every μ and ν, the random vector (F−1μ(U), F−1ν(U)) is an optimal coupling (see [16]), where F−1μ is defined as

 F−1μ(t)=inf{u,Fμ(u)≥t}. (3)

In this case, the Wasserstein distance admits the simple expression:

 W22(μ,ν)=E(F−1μ(U)−F−1ν(U))2. (4)
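As an illustration of the quantile representation (4), the one-dimensional Wasserstein distance between two empirical samples of equal size is straightforward to compute: sorting the samples pairs the empirical quantiles. Below is a minimal sketch (the function name and the equal-sample-size restriction are our own simplifications):

```python
import numpy as np

def w2_squared_empirical(x, y):
    """Squared 2-Wasserstein distance between two empirical distributions
    on R with the same number of samples: sorting the samples evaluates
    the empirical quantile functions on a common uniform grid."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 20000)   # sample of N(0, 1)
b = rng.normal(1.0, 1.0, 20000)   # sample of N(1, 1)
# For these two Gaussians, W2^2 = (m1 - m2)^2 + (s1 - s2)^2 = 1
print(w2_squared_empirical(a, b))
```

For samples of different sizes, one would instead interpolate the two empirical quantile functions on a common grid before averaging.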

Topological properties of Wasserstein spaces are reviewed in [9]. Hereafter, compactness will be required and will be obtained as follows: if Ω is a compact subset of R, then the Wasserstein space W2(Ω) is also compact. In this paper, we therefore consider Wasserstein spaces W2(Ω), where Ω is a compact subset of R, endowed with the Wasserstein distance W2. For any μ ∈ W2(Ω), we denote by Fμ|Ω the distribution function restricted to the compact subset Ω. We also define F−1μ|Ω as:

 F−1μ|Ω(t)=inf{u∈Ω,Fμ|Ω(u)≥t},∀t∈[a,b]. (5)

Given a uniform random variable V on [0,1], F−1μ|Ω(V) is a random variable with law μ. Inheriting the properties above, for every μ and ν in W2(Ω), the random vector (F−1μ|Ω(V), F−1ν|Ω(V)) is an optimal coupling. Accordingly, we use throughout this paper the following expression for the Wasserstein distance between μ and ν in W2(Ω):

 W22(μ,ν)=E(F−1μ|Ω(V)−F−1ν|Ω(V))2. (6)

### 2.2 Kernel

Constructing a positive definite kernel on the Wasserstein space is not obvious and was recently done in [12]. For the sake of completeness, we briefly recall this construction.

###### Theorem 2.1.

Let Θ = (γ, l, H) with parameters such that γ ∈ R, l > 0 and 0 < H ≤ 1, and let kΘ be defined as

 kΘ(μ,ν):=γ2exp(−W2(μ,ν)2H/l). (7)

Then kΘ is a positive definite kernel.

The proof of this theorem directly follows from Theorem 2.2 and Proposition 2.3 below. In this paper we use Theorem 2.1 to study the properties of such kernels in the RKHS regression framework.

The following theorem, which can be found in [17] (see also Theorem III.1 in [12]), provides a generic way to construct kernels using completely monotone functions.

###### Theorem 2.2.

(Schoenberg) Let F be a completely monotone function and k a negative definite kernel. Then F∘k is a positive definite kernel.

The following proposition, which can be found in [12], finally gives conditions on the exponent H under which powers of the Wasserstein distance yield a negative definite kernel.

###### Proposition 2.3.

The function (μ,ν) ↦ W2(μ,ν)2H is a negative definite kernel if and only if 0 < H ≤ 1.

###### Proof.

The proof of Theorem 2.1 follows immediately from Theorem 2.2 and Proposition 2.3. Applying Proposition 2.3, we deduce that W2(μ,ν)2H is a negative definite kernel for all H in (0,1].
We can easily see that x ↦ e−λx with λ positive is a completely monotone function. Let us then consider the mapping F as follows:

 F: R+ → R+, x ↦ γ2e−λx,

with λ = 1/l > 0. Then F is also a completely monotone function. From Theorem 2.2, kΘ = F(W2(·,·)2H) is a positive definite kernel. ∎
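To make positive definiteness concrete, the sketch below (our own illustration, not code from the paper) builds the Gram matrix of kΘ with H = 1 on a few one-dimensional Gaussian inputs, for which W22 has the closed form (mμ−mν)2 + (σμ−σν)2 used later in Section 4, and checks that the matrix has no negative eigenvalue:

```python
import numpy as np

def k_theta(p, q, gamma=1.0, l=1.0):
    # p and q are (mean, std) pairs of 1-D Gaussians; closed-form W2^2
    w2_sq = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return gamma ** 2 * np.exp(-w2_sq / l)

rng = np.random.default_rng(1)
dists = list(zip(rng.uniform(-2, 2, 20), rng.uniform(0.1, 2, 20)))
K = np.array([[k_theta(p, q) for q in dists] for p in dists])

# Positive definiteness of the Gram matrix: smallest eigenvalue is
# nonnegative up to floating-point roundoff.
print(np.linalg.eigvalsh(K).min())
```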

## 3 Regression

### 3.1 Setting

In this section, we aim to define a regression function with distribution inputs. The problem of distribution regression consists in estimating an unknown function f using observations (μi, yi) in W2(Ω) × R for i = 1, …, n. We recall the model (1):

 yi=f(μi)+ϵi. (8)

To provide a general form for functions defined on distributions, we will use the RKHS framework. Let kΘ be the kernel defined in Theorem 2.1. For a fixed valid parameter Θ, we define the space F0 as follows:

 F0:=span{kΘ(∙,μ):μ∈W2(Ω)}.

The space F0 is endowed with the inner product

 ⟨fn,gm⟩F0=n∑i=1m∑j=1αiβjkΘ(μi,νj),

where fn = n∑i=1 αi kΘ(∙,μi) and gm = m∑j=1 βj kΘ(∙,νj). The norm in F0 corresponding to this inner product is

 ∥fn∥2F0=n∑i=1n∑j=1αiαjkΘ(μi,μj). (9)

Let C(W2(Ω)) be the space of all continuous real-valued functions from W2(Ω) to R. The set F consists of all functions in C(W2(Ω)) which are uniform limits of functions of the form fn above. We want F to approximate C(W2(Ω)) as well as possible; by the universal approximating property, a universal kernel is precisely one for which F = C(W2(Ω)). Hence we will prove in the following section that kΘ is a universal kernel, so that F0 is dense in C(W2(Ω)) and F = C(W2(Ω)).
It then follows that for all f, g in F, the inner product is well defined by the formula

 ⟨f,g⟩F:=limn→∞⟨fn,gn⟩F0. (10)

Coming back to our problem, we want to estimate the unknown function f by minimizing the regularized empirical risk over the RKHS F. To this end, we solve the minimization problem

 ^f=argminf∈F(n∑i=1|yi−f(μi)|2+λ∥f∥2F), (11)

where λ > 0 is the regularization parameter. Using the Representer theorem, this leads to the following expression for the estimator ^f:

 ^f:μ↦^f(μ):=n∑j=1^αjkΘ(μ,μj), (12)

where the coefficients ^αj are obtained from the training data, as detailed below.

### 3.2 Universal kernel

First, we recall the definition of a universal kernel, then state the main theorem ensuring the universality of the positive definite kernels of Theorem 2.1.

###### Definition 3.1.

Let C(X) be the space of continuous bounded functions on a compact domain X. A continuous kernel k on the domain X is called universal if the space of all functions induced by k is dense in C(X), i.e., for all f ∈ C(X) and every ϵ > 0 there exists a function g induced by k with ∥f − g∥∞ ≤ ϵ.

For more information on universal kernels and RKHS, we refer to [13], [14] and [15].

###### Theorem 3.2.

Let the parameter Θ = (γ, l, H) in (7) be such that H = 1, γ ≠ 0 and l > 0. Then the kernel kΘ defined in (7) is universal.

The proof of this theorem relies on the two following propositions, Proposition 3.3 and Proposition 3.4.

###### Proposition 3.3.

Let Fμ|Ω, with μ ∈ W2(Ω), be the distribution function restricted to a compact subset Ω of R, and let F−1μ|Ω be defined by (5). Then F−1μ|Ω is continuous if and only if Fμ|Ω is strictly increasing on Ω, and F−1μ|Ω is strictly increasing if and only if Fμ|Ω is continuous on ran(Fμ|Ω), the range of Fμ|Ω.

See e.g. [18] for a proof of Proposition 3.3.

###### Proposition 3.4.

Let X be a compact metric space and H a separable Hilbert space such that there exists a continuous and injective map ρ : X → H. For σ > 0, the Gaussian-type RBF-kernel kσ is a universal kernel, where

 kσ(x,x′):=exp(−σ2∥ρ(x)−ρ(x′)∥2H),x,x′∈X.

See part of Theorem 2.2 in [13] for a proof of Proposition 3.4.

###### Proof.

Proof of Theorem 3.2.
From Proposition 3.3, under the conditions that the restricted distribution function Fμ|Ω is continuous and strictly increasing on Ω, there exists a continuous and injective map

 ρ: W2(Ω) → L2[a,b], μ ↦ ρ(μ) := F−1μ|Ω.

We consider the Wasserstein space W2(Ω), metrized by the Wasserstein distance W2, with Ω a compact subset of R, and we let L2[a,b] be the usual space of square integrable functions on [a,b]. The map ρ of Proposition 3.4 is then exactly given by ρ(μ) = F−1μ|Ω for all μ ∈ W2(Ω). We have

 kΘ(μ,ν)=γ2exp(−∥F−1μ|Ω−F−1ν|Ω∥2L2[a,b]/l),

which is a universal kernel by Proposition 3.4, since by (6) the L2 norm of the quantile difference is exactly the Wasserstein distance W2(μ,ν). This completes the proof of Theorem 3.2. ∎

The minimization problem in (11) can be solved explicitly using the representer theorem of [19]. Note that Schölkopf and Smola [20] give a simple proof of a more general version of the theorem. Define the matrix C = (cij) as follows:

 cij=γ2exp(−W22(μi,μj)/l)

and let Y = (y1,…,yn)T and α = (α1,…,αn)T.
Writing (11) in matrix form, we obtain

 minα trace((Y−Cα)(Y−Cα)T)+λtrace(CααT), (13)

where the trace operation is defined as

 trace(A)=n∑i=1aii

for any square matrix A = (aij) of size n.
Taking the derivative of (13) with respect to the vector α, we find that the minimizer ^α satisfies the system of linear equations

 (C+λI)α=Y. (14)

Hence

 ^f(μ)=n∑j=1^αjkΘ(μ,μj), (15)

with

 ^α=(C+λI)−1Y. (16)
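A compact implementation of (14)–(16) only needs the Gram matrix C and one linear solve. The sketch below is an illustration, not the authors' code: it assumes the inputs are one-dimensional Gaussians encoded as (mean, std) pairs, so that W22 is available in closed form.

```python
import numpy as np

def gram(A, B, gamma=1.0, l=1.0):
    """Gram matrix c_ij = gamma^2 exp(-W2^2(mu_i, mu_j)/l) for rows of
    (mean, std) pairs, using the closed-form W2 between 1-D Gaussians."""
    w2_sq = (A[:, None, 0] - B[None, :, 0]) ** 2 + (A[:, None, 1] - B[None, :, 1]) ** 2
    return gamma ** 2 * np.exp(-w2_sq / l)

def fit(train, y, gamma=1.0, l=1.0, lam=1e-3):
    C = gram(train, train, gamma, l)
    return np.linalg.solve(C + lam * np.eye(len(train)), y)   # eq. (14)/(16)

def predict(alpha, train, test, gamma=1.0, l=1.0):
    return gram(test, train, gamma, l) @ alpha                # eq. (15)

# Toy usage: learn f(mu) = mean of mu from 30 Gaussian inputs
train = np.stack([np.linspace(-2, 2, 30), np.full(30, 0.5)], axis=1)
alpha = fit(train, train[:, 0])
print(predict(alpha, train, np.array([[0.3, 0.5]])))
```

The prediction at an interior point such as (0.3, 0.5) stays close to the true value 0.3, since the target is smooth and the training grid is dense.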

## 4 Numerical Simulations and Real data application

### 4.1 Simulation

#### 4.1.1 Overview of the simulation procedure

In this section, we investigate the regression model for predicting the regression function from distributions. In particular, we want to estimate the unknown function f in model (8) using the proposed estimator (15), so we first describe how the parameters in this formula are optimized. We then compare the regression model based on the RKHS induced by our universal kernel to more classical kernel functions operating on projections of the probability measures onto finite dimensional spaces. We address the input–output map given by

 f(ν)=mν/(0.05+σν), (17)

where ν is a Gaussian distribution of mean mν and variance σν2. We compare this ground truth function f with the predicted function ^f given by:

 ^f(ν)=γ2n∑j=1^αjexp[−W22(ν,μj)/l], (18)

where the Wasserstein distance between two Gaussian distributions is calculated in closed form using:

 W22(μ,ν) =(mμ−mν)2+(σμ−σν)2,

where μ = N(mμ, σμ2) and ν = N(mν, σν2).
Each vector ^α is estimated using Eq. (16), which depends on the parameter λ. Thus our proposed estimator depends on the three parameters γ, λ and l. To understand the effects of these parameters on ^f, we define reference values of γ, λ and l. We then generate a training set of normal distributions (μj, f(μj)), j = 1, …, n, with n the size of the training set.
From this training set, we fit two regression models, which we call "Wasserstein" and "Legendre" and for which we provide more details below. We then evaluate the quality of the two regression models on a test set of size nt of the form (νt,i, f(νt,i)), i = 1, …, nt, where each νt,i is generated in the same way as above. As quality criterion for our regression models we use the root mean square error (RMSE):

 RMSE2(^f,f)=(1/nt)nt∑i=1[f(νt,i)−^f(νt,i)]2.
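The whole simulation loop can be sketched in a few lines. Here the target (17) and the closed-form Gaussian W2 are taken from the text, while the parameter values and the sampling ranges of the means and standard deviations are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_gaussians(n):
    # rows are (mean, std) pairs of the input Gaussian distributions
    return np.stack([rng.uniform(-1, 1, n), rng.uniform(0.1, 1, n)], axis=1)

def f(nu):
    return nu[:, 0] / (0.05 + nu[:, 1])      # target function (17)

def gram(A, B, gamma=1.0, l=1.0):
    w2_sq = (A[:, None, 0] - B[None, :, 0]) ** 2 + (A[:, None, 1] - B[None, :, 1]) ** 2
    return gamma ** 2 * np.exp(-w2_sq / l)

train, test = sample_gaussians(100), sample_gaussians(500)
alpha = np.linalg.solve(gram(train, train) + 0.01 * np.eye(100), f(train))  # (16)
pred = gram(test, train) @ alpha                                            # (15)
rmse = np.sqrt(np.mean((f(test) - pred) ** 2))
print(rmse)  # compare with the spread np.std(f(test)) of the target
```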

#### 4.1.2 Detail on the regression models

We refer to our model as "Wasserstein" and briefly introduce the "Legendre" regression model. The Wasserstein model uses the estimator

 ^f(νt,i)=γ2n∑j=1^αjexp[−((mνj−mνt,i)2+(σνj−σνt,i)2)/l], (19)

where the νt,i belong to the testing set of size nt, and the νj belong to the training set of size n. The estimator depends on the three parameters γ, λ and l.

The Legendre model is based on kernel functions operating on finite dimensional linear projections of the distributions. For a Gaussian distribution μ with mean m and variance σ2, restricted to the support [0,1], we compute for i = 0, …, θ−1:

 ai(μ)=∫01(1/(√(2π)σ))exp(−(t−m)2/(2σ2))pi(t)dt,

where pi is the i-th normalized Legendre polynomial on [0,1]. The integer θ is called the order of the decomposition. The kernel kL then operates on the vectors (a0(ν), …, aθ−1(ν)) and is of the form

 kL(ν1,ν2)=γ2exp[−θ−1∑i=0|ai(ν1)−ai(ν2)|/li]. (20)

Thus the estimated regression function in this case is calculated as

 ^f(νt,i)=γ2n∑j=1^αjexp[−θ−1∑k=0|ak(νt,i)−ak(νj)|/lk].

We consider two orders of the decomposition. We fix lk = l for all k, so this estimator also depends on the three parameters γ, λ and l.
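The projection step of the Legendre baseline can be sketched as follows; the quadrature grid, the shift of the Legendre polynomials to [0, 1] and their normalization are our own implementation choices:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coeffs(m, sigma, theta=4, n_grid=2000):
    """Coefficients a_i(mu) = int_0^1 density_mu(t) p_i(t) dt, where p_i is
    the i-th Legendre polynomial shifted to [0, 1] and normalized so that
    int_0^1 p_i(t)^2 dt = 1, computed by a simple Riemann sum."""
    t = np.linspace(0.0, 1.0, n_grid)
    dens = np.exp(-(t - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    out = []
    for i in range(theta):
        c = np.zeros(i + 1)
        c[i] = 1.0                                      # select P_i
        p = np.sqrt(2 * i + 1) * legendre.legval(2.0 * t - 1.0, c)
        out.append(np.sum(dens * p) * (t[1] - t[0]))    # Riemann sum
    return np.array(out)

# p_0 = 1, so a_0 is the total mass of the density on [0, 1]:
print(legendre_coeffs(0.5, 0.1))   # first entry close to 1
```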

#### 4.1.3 Result

In the simulations, we study the effects of the parameters γ, λ and l on the RMSE between the predicted function and the exact function over the testing set. We also consider two sizes of testing set to see how the RMSE changes. We detail the choice of the optimal parameters only for the "Wasserstein" model.

##### Case of testing set size nt=500

We first consider the RMSE in the case nt = 500, for different fixed values of γ, with λ > 0 running over 30 values from 0.005 to 30 and l > 0 over 25 values from 0.005 to 20. The values of the RMSE for the different choices of γ are shown in Figures 1, 2 and 3.

Figure 1: In the case nt=500, fixing γ=1/2, we run λ>0 over 30 values from 0.005 to 30 and l>0 over 25 values from 0.005 to 20. The two graphs follow the values of λ (left) and l (right). The RMSE is minimized, below 0.08, for 0<λ<15 and l large enough. For small values of l the RMSE is large for all λ>0, so we avoid these values of l.

Figure 2: In the case nt=500, fixing γ=1, we run λ>0 over 30 values from 0.005 to 30 and l>0 over 25 values from 0.005 to 20. The variation of the RMSE does not change significantly compared with the case γ=1/2, but the RMSE is smaller. The RMSE is minimized in two regimes: first, 0<λ<1 with l large enough; second, λ>1 for all l>1. The RMSE is large for small l.

Figure 3: In the case nt=500, fixing γ=10, we run λ>0 over 30 values from 0.005 to 30 and l>0 over 25 values from 0.005 to 20. The RMSE in this case is larger than in the two above cases γ=1/2 and γ=1.

Through the three choices of γ we observe the same qualitative impact on the RMSE variations, but the smallest RMSE is obtained in the case γ=1. In the remainder of this simulation, we therefore fix γ=1 and vary λ and l to see how the RMSE changes for a bigger testing set.

##### Case of testing set size nt=700

We now consider the RMSE in the case nt = 700 in Figure 4, for the fixed value γ=1, with λ running over 30 values from 0.005 to 30 and l over 25 values from 0.005 to 20. We want to see the effect of the testing set size on the RMSE. We then look directly at how the estimated regression function behaves as a function of the parameters λ and l. As is well known, oversmoothing and undersmoothing issues sometimes occur in learning problems: the error component may be small while the estimated function is oversmoothed or undersmoothed. See Figure 5 for a clearer comparison of our regression model with the exact function defined in (17).

Finally, we compare the RMSEs of the "Wasserstein" and "Legendre" models for the chosen values of γ, λ and l with nt = 700. Table 1 shows the values of the RMSE criterion for the "Wasserstein" and "Legendre" distribution regression models. In terms of RMSE, the "Wasserstein" model clearly outperforms the other models. The RMSE of the "Legendre" models slightly decreases when the order increases, but stays well above the RMSE of the "Wasserstein" model.

Figure 4: In the case nt=700, fixing γ=1, we run λ>0 over 30 values from 0.005 to 30 and l>0 over 25 values from 0.005 to 20. The RMSE is almost always below 0.06 when λ>1 and l>1. For 0<λ<1 a small RMSE can also be obtained when l is large enough. This figure illustrates the effect of the testing set size: for a large enough testing set we obtain a smaller RMSE under the optimal parameters γ, λ and l.

Figure 5: Regression function: exact versus estimated. The green line is the exact function, which has many variations; we seek optimal parameters yielding a smoother curve. From Figure 4 we can choose these parameters according to the RMSE; however, in some cases over-smoothing or under-smoothing occurs. The blue line shows the desired behavior when large enough values of λ and l are chosen.

Hence, from Figure 5 and Table 1, we see that by choosing the optimal parameters γ, λ and l we obtain a very good estimator, without the under-smoothing and over-smoothing issues of the learning problem, and our regression model performs well under the RMSE criterion.

Our interpretation of these results is that, because of the nature of the simulated data, working directly on distributions with the Wasserstein distance is more appropriate than using linear projections. In particular, two distributions with similar means and small variances are close to each other both with respect to the Wasserstein distance and in the value of the output function (17). However, the probability density functions of the two distributions can be very different from each other, with respect to the distance between the densities, when the ratio between the two variances is large. Hence linear projections based on probability density functions are inappropriate in the setting considered here.

### 4.2 Application on evolution of hearing sensitivity

An otoacoustic emission (OAE) is a sound generated from within the inner ear. OAEs can be measured with a sensitive microphone in the ear canal and provide a noninvasive measure of cochlear amplification (see the chapter "Hearing Basics" in [21]). Recording OAEs has become the main method for newborn and infant hearing screening (see the chapter "Early Diagnosis and Prevention of Hearing Loss" in [21]). There are two types of OAEs: spontaneous otoacoustic emissions (SOAEs), which can occur without external stimulation, and evoked otoacoustic emissions (EOAEs), which require an evoking stimulus. In this paper, we consider one type of EOAE, the transient-evoked OAE (TEOAE), in which the evoked response to a click covers the frequency range up to around 4 kHz. More precisely, each TEOAE reflects the ability of the cochlea to respond to certain frequencies in order to transform a sound into information that will be processed by the brain. To each observation is thus associated a curve (the oto-emission curve), which describes the response of the cochlea to a sound at several frequencies. The level of response depends on each individual, and each stimulus should be normalized, but the way each individual reacts is characteristic of their physiology. Hence to each individual is associated a curve which, after normalization, is considered as a distribution describing the repartition of the responses for frequencies ranging from 0 to 10 kHz. These distributions are shown in Figure 6 and Table 2.

Figure 6: Oto-emission curves. 48 TEOAE curves over frequencies ranging from 0 Hz to 10 kHz.

The relationship between age and hearing sensitivity was investigated in [23, 24]. The results show that as age increases, the presence of EOAEs by age group and the frequency peak in spectral analysis decrease, while the EOAE threshold increases. Differences in EOAE between age classes in humans have also been reported. These results convey the idea that the response evolves with age and that the effect of age on hearing issues is deeply related to changes in the cochlear properties. Hence our model uses these distributions as inputs and builds a regression model linking the age to the distributions representing the response of the cochlea at frequencies ranging from 0 Hz to 10 kHz. More precisely, we estimate the age for each level of response, normalized and treated as a distribution, using our proposed estimator as follows:

 ^f(μi)=γ2n∑j=1^αjexp(−∫01(F−1μi(t)−F−1μj(t))2dt/l), (21)

where F−1μ is defined as in (5) and the value of ^α is given by the optimal parameters in (16). We approximate the integral in (21) by the Riemann sum

 ∫01(F−1μi(t)−F−1μj(t))2dt≈(1/M)M∑m=1[F−1μi(m/M)−F−1μj(m/M)]2, (22)

where each F−1μi can be understood as an empirical quantile function of μi and M is the number of discretized frequencies. Recall that each individual is associated with a curve which, after a normalization preserving the relationships among the original data, is considered as a distribution μi. To calculate F−1μi, we sort the values of each curve in ascending order: denoting by Xμi the curve associated with the distribution μi, the sorted values Xμi(1) ≤ … ≤ Xμi(M) provide the empirical quantiles F−1μi(m/M) = Xμi(m). Hence we can rewrite formula (22) as

 ∫01(F−1μi(t)−F−1μj(t))2dt≈(1/M)M∑m=1(Xμi(m)−Xμj(m))2, (23)

where Xμi is the sorted curve associated with the distribution μi.
In our experiments, we set γ = 1 and l = 10 (the optimized values reported in Figure 9), with M the number of discretized frequencies. We aim to study the age in relation with the TEOAE curves of 48 subjects, recorded on a human population in South Africa, over the frequency range from 0 Hz to 10 kHz. Figure 7 shows the differences in age, from 15 to 50 years old. Following the estimator in (21), we take 47 distributions as training set to compute the estimated value of ^α and estimate the real age of the remaining individual, so that n = 47. The results comparing exact and predicted ages are shown in Figure 8 and Figure 9.

Figure 7: Histogram of real age in a human population. The ages are spread mainly from 15 to 30; there are few individuals aged from 35 to 40 or from 45 to 50, and no individual is aged from 40 to 45.

Figure 8: Histogram of the difference between real and predicted age for OAE. The first column, in which the difference between real and estimated age is very small (close to zero), indicates high accuracy; almost all ages from 20 to 35 lie in this column.

Figure 9: Real and predicted age. Using the optimized parameters γ=1, l=10 and λ>0 depending on the age class, we recover almost exactly the ages in the age class [20,30] with λ around 15. For instance, we predict very well the exact ages 20, 21, 23, 24, 25, 27, 29, with corresponding predicted ages 19.50, 21.02, 22.83, 23.87, 24.89, 27.15, 28.76.

Hence, as shown in Figures 7 and 9, our proposed estimator is effective at predicting age from TEOAE data. By choosing the optimal parameters γ, λ and l, we predict very well the exact ages in the age class [20,30], with negligible errors in the other age classes. This is quite reasonable: Figure 7 shows that the ages are concentrated between 20 and 30 years old, so our estimator learnt to predict ages in this class very well. Thus, using the distribution regression model, we investigated the relationship between age and the evoked responses to clicks covering the frequency range up to 10 kHz.

## 5 Discussion

In this paper, we have introduced a new estimator for regression models with distribution inputs. More precisely, we made effective use of the class of positive definite kernels built from the Wasserstein distance in [12], by proving that it yields a universal kernel. Studying the theory of universal kernels, we identified a very useful property of our universal kernel for building an RKHS. We then obtained a particular estimator from the Representer theorem for our distribution regression problem; this shows that the relation between a random distribution input and a real-valued response can be learnt by directly minimizing the regularized empirical risk over the RKHS. Our proposed estimator clearly outperforms state-of-the-art ones on simulated data. More interestingly, we successfully treated the TEOAE curve of each individual in a human population as a distribution after normalization. We then investigated the relationship between age and TEOAE, namely that the response evolves with age and that the effect of age on hearing issues is deeply related to changes of the cochlea. This is an interesting new approach in the field of biostatistics, describing the evolution of hearing capacity through a statistical tool, the distribution regression model. We believe that our paper tackles an important issue for data scientists willing to address regression problems with probability distributions as inputs. The extension of this work to distributions in general dimensions should be addressed in future work, using for instance the kernel built in [25].

## References

1. J.-M. Azaïs, "Le modèle linéaire par l'exemple," 2006.
2. M. H. Kutner, C. Nachtsheim, and J. Neter, Applied linear regression models. McGraw-Hill/Irwin, 2004.
3. J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied linear statistical models, vol. 4. Irwin Chicago, 1996.
4. J. O. Ramsay and B. W. Silverman, Applied functional data analysis: methods and case studies. Springer, 2007.
5. C. Preda, "Regression models for functional data by reproducing kernel Hilbert spaces methods," Journal of Statistical Planning and Inference, vol. 137, no. 3, pp. 829–840, 2007.
6. H. Kadri, E. Duflos, P. Preux, S. Canu, and M. Davy, "Nonlinear functional regression: a functional RKHS approach," in Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10), vol. 9, pp. 374–380, 2010.
7. A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
8. A. Smola, A. Gretton, L. Song, and B. Schölkopf, "A Hilbert space embedding for distributions," in International Conference on Algorithmic Learning Theory, pp. 13–31, Springer, 2007.
9. C. Villani, Optimal transport: old and new, vol. 338. Springer Science & Business Media, 2008.
10. S. Kolouri, Y. Zou, and G. K. Rohde, "Sliced Wasserstein kernels for probability distributions," CoRR, vol. abs/1511.03198, 2015.
11. G. Peyré, M. Cuturi, and J. Solomon, "Gromov-Wasserstein averaging of kernel and distance matrices," in ICML 2016, 2016.
12. F. Bachoc, F. Gamboa, J.-M. Loubes, and N. Venet, "A Gaussian process regression model for distribution inputs," IEEE Transactions on Information Theory, 2017.
13. A. Christmann and I. Steinwart, "Universal kernels on non-standard input spaces," in Advances in Neural Information Processing Systems, pp. 406–414, 2010.
14. B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet, "Universality, characteristic kernels and RKHS embedding of measures," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2389–2410, 2011.
15. C. A. Micchelli, Y. Xu, and H. Zhang, "Universal kernels," Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2651–2667, 2006.
16. W. Whitt, "Bivariate distributions with given marginals," The Annals of Statistics, pp. 1280–1289, 1976.
17. M. G. Cowling, "Harmonic analysis on semigroups," Annals of Mathematics, pp. 267–283, 1983.
18. P. Embrechts and M. Hofert, "A note on generalized inverses," Mathematical Methods of Operations Research, vol. 77, no. 3, pp. 423–432, 2013.
19. G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions," Journal of Mathematical Analysis and Applications, vol. 33, no. 1, pp. 82–95, 1971.
20. A. J. Smola and B. Schölkopf, Learning with kernels, vol. 4. Citeseer, 1998.
21. J. J. Eggermont, Hearing Loss: Causes, Prevention, and Treatment. Academic Press, 2017.
22. P. X. Joris, C. Bergevin, R. Kalluri, M. Mc Laughlin, P. Michelet, M. van der Heijden, and C. A. Shera, "Frequency selectivity in old-world monkeys corroborates sharp cochlear tuning in humans," Proceedings of the National Academy of Sciences, vol. 108, no. 42, pp. 17516–17520, 2011.
23. T. O-Uchi, J. Kanzaki, Y. Satoh, S. Yoshihara, A. Ogata, Y. Inoue, and H. Mashino, "Age-related changes in evoked otoacoustic emission in normal-hearing ears," Acta Oto-Laryngologica, vol. 114, no. sup514, pp. 89–94, 1994.
24. L. Collet, A. Moulin, M. Gartner, and A. Morgon, "Age-related changes in evoked otoacoustic emissions," Annals of Otology, Rhinology & Laryngology, vol. 99, no. 12, pp. 993–997, 1990.
25. F. Bachoc, A. Suvorikova, J.-M. Loubes, and V. Spokoiny, "Gaussian process forecast with multidimensional distributional entries," arXiv preprint arXiv:1805.00753, 2018.