On functional logistic regression via RKHS's

In this work we address the problem of functional logistic regression, relying on the theory of RKHS's. The most common approach in the literature is to directly extend the multiple logistic model, replacing the inner product in R^d with the inner product in L^2[0,1]. In contrast, we propose to use the inner product of the RKHS associated with the process. It is a well-known fact that the Gaussian homoscedastic model for binary classification in R^d entails the logistic model. We analyze under which conditions the L^2 and RKHS models hold. In this regard, the RKHS model can be seen as a generalization of the L^2 one, since it requires weaker conditions on the mean functions. In addition, this new approach is especially suitable for performing variable selection on the curves (in the sense that the finite-dimensional logistic model is a particular case). Besides, we carefully analyze whether the maximum likelihood estimators of both functional logistic models exist.


1 Introduction

Logistic regression

Throughout this work we consider the problem of defining a suitable functional extension of the classical logistic regression model. The idea behind logistic regression already appeared at the end of the nineteenth century (a complete historical overview can be found in Cramer (2003, Ch. 9)) and has become quite popular since then. For instance, this methodology is frequently used to deal with biomedical problems. In spite of the name “regression”, this technique is often used for binary classification problems. One of the main advantages of the logistic regression model in comparison with other standard classifiers is that it provides estimates of the probabilities of belonging to each class. In addition, it allows us to endow problems with categorical response with a linear structure on the regressors.

Hilbe (2009) is a rather complete book about logistic regression.

The logistic model is a particular case of the wider family of generalized linear models (we refer to McCullagh and Nelder (1989) for details), which presents some interesting characteristics. According to Hosmer et al. (2013, p. 52), one of its most appealing features is that the coefficients of the model are easily interpretable in terms of the values of the predictors. This technique stems from the attempt to apply well-known linear regression procedures to problems with categorical responses, like binary classification, or non-Gaussian distributions. There is no point in imposing that the categorical response Y is linear in the predictors x, but we might instead assume that log(p(x)/(1−p(x))) is linear in x (where p(x) is the probability of class 1 given x). In this quotient the logarithm could be replaced with other link functions. However, an important aspect of the logarithm-based model is that it holds whenever the predictor variable X in both classes is Gaussian with a common covariance matrix.

This finite-dimensional model has been widely studied. Apart from the already mentioned references, Efron (1975) provides a comparison between logistic predictors and Fisher discriminant analysis under Gaussianity of the predictors. In addition, Munsiwamy and Wakweya (2011) give a quite user-friendly overview of asymptotic results for the estimators (first proved in Fahrmeir and Kaufmann (1985) and Fahrmeir and Kaufmann (1986)).

The motivations for extending logistic regression to functional data are quite obvious. A historical overview of several approaches to functional logistic regression can be found in Mousavi and Sørensen (2018). We start by establishing the framework of the problem in this functional context. The goal is to explore the relationship between a dichotomous response variable Y, taking values in {0,1}, and a functional predictor X = X(t), t ∈ [0,1]. We will assume throughout that X | Y = i (for i = 0, 1) are L^2-stochastic processes with trajectories in L^2[0,1] and a common continuous covariance function K(s,t). Thus the random variable Y conditional on a realization x of the process follows a Bernoulli distribution with parameter P(Y=1 | X=x), and the prior probability of class 1 is denoted by p = P(Y=1). In this setting, the most common functional logistic regression (FLR) model is

 P(Y=1|X=x) = 1 / (1 + exp{−β0 − ⟨β,x⟩2}), (1)

where β0 ∈ R, β ∈ L^2[0,1], and ⟨·,·⟩2 denotes the inner product in L^2[0,1]. This model is the direct extension of the d-dimensional one, where the product β′x in R^d is replaced by its functional counterpart.

The standard approach to this problem is to reduce the dimension of the curves using PCA. That is, the curves are projected onto the first eigenfunctions of the covariance operator K associated with the covariance function K(s,t) of the process, given by

 K(f)(t) = ∫_0^1 K(s,t) f(s) ds. (2)

Then standard finite-dimensional logistic regression is applied to the coefficients of these projections. Among others, this strategy has been explored by Escabias et al. (2004) and James (2002) from an applied perspective though, in fact, the latter reference deals with generalized linear models (and not only with logistic regression). These more general models are also studied by Müller and Stadtmüller (2005), but with a more mathematical focus.
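As a minimal illustration of this PCA-plus-logistic strategy (a sketch on simulated data under our own simplifying choices, not the exact procedures of the references above), one can project discretized curves onto the leading eigenvectors of the empirical covariance matrix and fit an ordinary logistic regression on the scores by Newton-Raphson:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n discretized Brownian trajectories on a regular grid of [0, 1].
n, grid_size = 200, 100
X = np.cumsum(rng.normal(scale=(1.0 / grid_size) ** 0.5, size=(n, grid_size)), axis=1)

# Toy labels from a logistic model driven by the endpoint of each curve.
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * X[:, -1])))

# "Functional PCA": eigendecomposition of the empirical covariance matrix,
# then projection of the centered curves onto the d leading eigenvectors.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
d = 3
scores = Xc @ eigvecs[:, -d:]  # np.linalg.eigh sorts eigenvalues increasingly

# Standard d-dimensional logistic regression on the scores (Newton-Raphson).
Z = np.hstack([np.ones((n, 1)), scores])
beta = np.zeros(d + 1)
for _ in range(25):
    prob = 1.0 / (1.0 + np.exp(-Z @ beta))
    W = prob * (1.0 - prob)
    # Small ridge term for numerical stability of the Newton step.
    hess = Z.T @ (Z * W[:, None]) + 1e-6 * np.eye(d + 1)
    beta += np.linalg.solve(hess, Z.T @ (y - prob))

accuracy = np.mean((1.0 / (1.0 + np.exp(-Z @ beta)) > 0.5) == y)
```

Note that the whole functional information is compressed into d projection coefficients before any logistic modelling takes place.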

A brief introduction to RKHS’s

Since the novel approach we will suggest for functional logistic regression is based on the theory of Reproducing Kernel Hilbert Spaces (RKHS’s), we will provide here a brief introduction to these spaces (see Berlinet and Thomas-Agnan (2004) and Appendix F of Janson (1997) for further details and references).

Given an L^2-process X = X(t), t ∈ [0,1], with trajectories in L^2[0,1], continuous covariance function K and continuous mean function m, we can define the auxiliary space

 H0(K) := {f ∈ L^2[0,1] : f(·) = ∑_{i=1}^n ai K(ti,·), ai ∈ R, ti ∈ [0,1], n ∈ N}.

That is, H0(K) is the space of all finite linear combinations of evaluations of K. It is easy to see that the representation in H0(K) in terms of these finite linear combinations is unique whenever K is strictly positive definite. This space is endowed with the inner product

 ⟨f,g⟩K = ∑_{i,j} ai bj K(ti,sj),

where f(·) = ∑_i ai K(ti,·) and g(·) = ∑_j bj K(sj,·).
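This inner product is easy to compute for any pair of finite combinations of kernel sections. The following sketch (using the Brownian kernel K(s,t) = min(s,t) as an illustrative choice, and helper names of our own) also verifies numerically the identity ⟨f, K(s0,·)⟩K = f(s0):

```python
import numpy as np

# Illustrative kernel: Brownian covariance K(s, t) = min(s, t).
def K(s, t):
    return np.minimum(s, t)

# For f = sum_i a_i K(t_i, .) and g = sum_j b_j K(s_j, .) in H0(K),
# the inner product is <f, g>_K = sum_{i,j} a_i b_j K(t_i, s_j).
def inner_K(a, t, b, s):
    a, t, b, s = map(np.asarray, (a, t, b, s))
    return float(a @ K(t[:, None], s[None, :]) @ b)

a, t = [1.0, -2.0], [0.2, 0.7]   # coefficients and points defining f
b, s = [0.5, 3.0], [0.4, 0.9]    # coefficients and points defining g

# Reproducing-type identity: <f, K(s0, .)>_K = f(s0).
s0 = 0.6
f_at_s0 = sum(ai * K(ti, s0) for ai, ti in zip(a, t))
assert abs(inner_K(a, t, [1.0], [s0]) - f_at_s0) < 1e-12
```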

Then, the RKHS associated with K, denoted H(K), is defined as the completion of H0(K). In other words, H(K) is made up of all functions obtained as pointwise limits of Cauchy sequences in H0(K) (see Berlinet and Thomas-Agnan (2004, p. 16)). Let us recall one of the most interesting properties of these spaces, the so-called reproducing property:

 ⟨f,K(s,·)⟩K = f(s),  for all f ∈ H(K), s ∈ [0,1].

It is worth mentioning that while H(K) is, in several aspects, a natural Hilbert space associated with the process X, typically the trajectories of the process themselves do not belong to H(K) with probability one (see, e.g., (Lukić and Beder, 2001, Cor. 7.1), (Pillai et al., 2007, Th. 11)). Then, one cannot directly write ⟨β,x⟩K for a realization x of the process. In order to make sense of this expression we will use the approach suggested by Parzen (1961a).

Let us denote by L^2(Ω) the space of real random variables with finite second moment, with norm ∥U∥ = E[U^2]^{1/2}. One can establish a natural congruence between H(K) and the linear span of the centered process, L(X), which is the L^2(Ω)-completion of the set of finite linear combinations ∑_i ai (X(ti) − m(ti)), where ai ∈ R and ti ∈ [0,1]. The space L(X) is endowed with the same norm as L^2(Ω), and it is clear that it is a closed subspace of L^2(Ω). Via Loève's isometry, we can see that H(K) is an isometric copy of L(X). This isometry ΨX is defined as (see Lukić and Beder (2001, Lemma 1.1))

 ΨX(U)(s) = E[U(X(s)−m(s))] = ⟨U, X(s)−m(s)⟩ ∈ H(K),  for U ∈ L(X). (3)

From this definition, the image of a random variable ∑_i ai (X(ti) − m(ti)) in L(X) is the function ∑_i ai K(ti,·) in H(K), defined for the same set of points ti and the same coefficients ai. Besides, we can identify ⟨x,β⟩K with ΨX^{−1}(β)(ω), for β ∈ H(K) and x = X(ω). The intuition behind the construction of this isometry is reminiscent of the definition of Itô's isometry, which is used to define the stochastic integral with respect to the Wiener measure (Brownian motion), overcoming the fact that the Brownian trajectories are not of bounded variation.

An RKHS proposal. Main contributions

In this work we propose a novel model for functional logistic regression problems, based on ideas borrowed from the theory of RKHS’s. To be more specific, our proposal is to study the following model, instead of (1),

 P(Y=1|X=x) = 1 / (1 + exp{−β0 − ⟨β,x⟩K}), (4)

where the inner product ⟨β,x⟩K stands for ΨX^{−1}(β), the inverse of Loève's isometry defined in Equation (3). Throughout this manuscript we motivate this model and study its main properties. Similarly to the finite-dimensional case, our model holds when the conditional distributions of the process given the two possible values of Y are Gaussian with the same covariance structure. Another interesting property of this new model is that, when the RKHS is finite dimensional (i.e., K is represented by a matrix) or for some particular choices of the slope function, the model amounts to a finite-dimensional logistic regression model for which the regressors are a finite number of projections X(t1), …, X(tp) of the trajectories of the process. Thus, the impact-point model studied by Lindquist and McKeague (2009) appears as a particular case of the RKHS-based model and, more generally, model (4) can be seen as a true extension of the finite-dimensional logistic regression model, which is recovered as a particular case of (4). As an important by-product, this provides a mathematical ground for variable selection in logistic regression, which is in fact the main goal of this work.

After defining the model we analyze an interesting feature of both functional logistic models (1) and (4), which does not occur in the finite-dimensional case: we will see that, for some common models for the process X, the maximum likelihood estimator of the slope function does not exist, with probability one. The family of processes for which this happens includes some interesting cases, like the Brownian motion and other related processes. To sort out this difficulty, we propose two sequential maximum likelihood approaches, based on Firth's estimator (e.g. Firth (1993)). The first version is a greedy iterative algorithm inspired by the greedy EM approach of Verbeek et al. (2003), proposed to deal with high-dimensional parameters. The second is merely a simplification of this algorithm. However, we also prove that the dimension of these sequential approximations should be restricted to a finite value; otherwise, the estimator might not exist asymptotically. This is not really an issue, since we are mainly interested in variable selection: it would be unreasonable to allow the number of selected variables to increase without restriction.

In order to assess the performance of this method we compare it with some proposals already existing in the literature for binary classification problems. We use both simulated and real examples in this comparison.

Organization of the paper

In Section 2 we present the RKHS-based model and the maximum likelihood function of the slope parameter. The existence of the maximum likelihood estimator for functional logistic models is carefully analyzed in Section 3. The particular proposals implemented in practice are analyzed in Section 4. Section 5 includes the empirical results.

2 RKHS-based functional logistic model

In this section we motivate why model (4) is meaningful. In Theorem 1 we show that the standard assumption that both conditional distributions X | Y = 0 and X | Y = 1 are Gaussian implies (4). We also analyze under which conditions model (1) is implied, and we clarify the difference between both approaches.

2.1 Conditional Gaussian distributions and functional logistic regression

In our functional setting, for i = 0, 1, we assume that X given Y = i is a Gaussian process with continuous trajectories, continuous mean function mi and continuous covariance function K (equal for both classes). Let P0 and P1 be the probability measures (i.e., the distributions) induced by the process under Y = 0 and Y = 1, respectively. Recall that when m0 and m1 both belong to H(K), we have that P0 and P1 are mutually absolutely continuous; see Theorem 5A of Parzen (1961a).

The following theorem provides a very natural motivation for the RKHS model (4) and clarifies some important properties of P0 and P1 in terms of the covariance operator.

Theorem 1.

Let X be as in the previous lines. Then,

• (a) if m0, m1 ∈ H(K), then P0 and P1 are mutually absolutely continuous and this Gaussian setting entails model (4),

 P(Y=1|X=x) = 1 / (1 + exp{−β0 − ⟨β,x⟩K}) ≡ 1 / (1 + exp{−β0 − Ψx^{−1}(β)}),

where Ψx is Loève's isometry (Eq. (3)), β = m1 − m0 and β0 = log(p/(1−p)) + (∥m0∥K^2 − ∥m1∥K^2)/2 (with p = P(Y=1)).

• (b) if m1 − m0 belongs to the image K(L^2[0,1]) of the covariance operator, then P0 and P1 are mutually absolutely continuous and model (1) holds.

• (c) if m1 − m0 ∉ K(L^2[0,1]), model (1) is never recovered, but different situations are possible. In particular, if m1 − m0 ∈ H(K), we recover scenario (a), but if m1 − m0 ∉ H(K), then P0 and P1 are mutually singular.

Proof.

(a) The conditional probability of the first class can be expressed in terms of the Radon-Nikodym derivative of P1 with respect to P0 (see Baíllo et al. (2011, Th. 1)) by

 P(Y=1|X) = p (dP1/dP0)(X) / [p (dP1/dP0)(X) + (1−p)] = (1 + ((1−p)/p) (dP0/dP1)(X))^{−1}. (5)

Now, let PG be the measure induced by a Gaussian process with covariance function K but zero mean function. According to Theorem 7A in Parzen (1961b) (or Theorem 5A of Parzen (1961a)), these Radon-Nikodym derivatives can be expressed in terms of the RKHS. That is, in this case

 (dPi/dPG)(X) = exp{⟨X,mi⟩K − (1/2)∥mi∥K^2},  i = 0, 1,

where ⟨X,mi⟩K stands for the inverse of Loève's isometry, ΨX^{−1}(mi). From the last two displayed equations (and using the chain rule for Radon-Nikodym densities), one can rewrite

 P(Y=1|X) = (1 + ((1−p)/p) exp{⟨X,m0−m1⟩K − (∥m0∥K^2 − ∥m1∥K^2)/2})^{−1}.

Then, rewriting this probability we obtain the logistic model of Equation (4).

(b) In this setting, Theorem 6.1 in Rao and Varadarajan (1963) gives the following expression:

 log((dP1/dP0)(x)) = ⟨x−m0, K^{−1}(m1−m0)⟩2 − (1/2)⟨m1−m0, K^{−1}(m1−m0)⟩2, (6)

for x ∈ L^2[0,1], where K^{−1}(m1−m0) denotes a preimage of m1−m0 under the covariance operator. Using Equation (5), it is easy to see that the L^2 functional logistic regression model (1) holds.

(c) Also as a consequence of Theorem 6.1 in Rao and Varadarajan (1963), if m1 − m0 ∉ K(L^2[0,1]) it is not possible to express the Radon-Nikodym derivative in terms of inner products in L^2[0,1]; equivalently, there is no continuous linear functional T and constant c ∈ R such that log((dP1/dP0)(x)) = T(x) + c. The last sentence of the statement is a consequence of Theorem 5A of Parzen (1961a). ∎

The following comments are relevant regarding the interpretation of Theorem 1.

Part (b) of this theorem has been recently observed by Petrovich et al. (2018), see Theorem 1 therein. Let us note, however, that it is not true that P0 and P1 must necessarily be orthogonal (mutually singular) whenever m1 − m0 ∉ K(L^2[0,1]). Part (c) of the above theorem clarifies this point.

On the other hand, in order to better interpret the above theorem in RKHS terms, let us recall that the space H(K) can also be defined as the image of the square root of the covariance operator K defined in (2) (e.g. Definition 7.2 of Peszat and Zabczyk (2007)),

 H(K) = {K^{1/2}(f), f ∈ L^2[0,1]},

where now the inner product is defined, for f, g ∈ H(K), as

 ⟨f,g⟩K = ⟨K^{−1/2}(f), K^{−1/2}(g)⟩2.

It can be seen that this definition of H(K) is equivalent to that given in Section 1. Then, from part (c) of the theorem it follows that the RKHS functional logistic regression model can be seen as a generalization of the usual functional logistic regression model, in the sense that the latter is recovered when a higher degree of smoothness on the mean functions is imposed (since clearly K(L^2[0,1]) ⊂ K^{1/2}(L^2[0,1]) = H(K)). Indeed, the functions in K(L^2[0,1]) are obtained by smoothing functions of L^2[0,1] against the covariance kernel of the process. The discussion of the next section makes clear that this difference is of key importance in practice and not merely a technicality.
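In a discretized setting this square-root characterization of the inner product can be checked numerically. The sketch below (an illustration with the Brownian kernel, not part of the original development) uses the fact that, for functions in the span of the kernel sections, ⟨f,g⟩K reduces in matrix terms to f′M^{−1}g, where M is the kernel matrix on the grid:

```python
import numpy as np

# Discretized Brownian kernel on a grid excluding t = 0 (where the kernel
# matrix would be singular).
N = 60
t = np.linspace(1.0 / N, 1.0, N)
M = np.minimum(t[:, None], t[None, :])

# For f, g in the span of the kernel sections (columns of M), the RKHS
# inner product reduces, in matrix terms, to f' M^{-1} g.
def inner_K(f, g):
    return float(f @ np.linalg.solve(M, g))

# Check: <K(t_i, .), K(t_j, .)>_K = K(t_i, t_j), i.e. the reproducing
# property survives discretization exactly (up to round-off).
i, j = 10, 40
assert abs(inner_K(M[:, i], M[:, j]) - M[i, j]) < 1e-8
```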

It is a well-known fact that in the finite-dimensional case the logistic model holds whenever the conditional distributions X | Y = i are Gaussian and homoscedastic but, in fact, the model is more general, in the sense that it also holds under other non-Gaussian assumptions on these conditional distributions. Clearly this is also the case for our functional logistic model in (4). Note also that, when H(K) is the finite-dimensional RKHS corresponding to a covariance matrix, (4) reduces to the usual finite-dimensional logistic model. Thus (4) looks like a quite natural extension to functional data of the classical logistic model. In fact, this connection between the functional model and the finite-dimensional one is even deeper, as we will show in the following section.

2.2 Finite RKHS model and variable selection

Dimension reduction in the functional logistic regression model may often be appropriate in terms of interpretability of the model and classification accuracy. This reduction must be done losing as little information as possible. We propose to perform variable selection on the curves. By variable selection we mean replacing each curve x = x(t) by the finite-dimensional vector (x(t1), …, x(tp)), for some points t1, …, tp chosen in an optimal way. In this section we analyze under which conditions it is possible to perform functional variable selection, which is only feasible under the RKHS model. In the following section we suggest how to do it: the idea is to incorporate the points t1, …, tp into the estimation procedure as additional parameters (in particular, into the modified maximum likelihood estimator we propose).

Whenever the slope function has the form

 β(·) = ∑_{j=1}^p βj K(tj,·), (7)

the model in (4) reduces to the finite-dimensional one,

 P(Y=1|X) = (1 + exp{−β0 − ∑_{j=1}^p βj X(tj)})^{−1}. (8)

The main difference between the standard finite-dimensional model and this one is that now the proper choice of the points t1, …, tp is part of the estimation procedure. In this sense, model (8) is truly functional, since we will use the whole trajectories to select the points. This fact leads to a critical difference between the functional and the multivariate problems, as we will see in Section 3. Then, our aim is to approximate the general model described by Equation (4) with finite-dimensional models as those of Equation (8). This amounts to getting an approximation of the slope function in terms of a finite linear combination of kernel evaluations K(tj,·). This model, for p = 1 and a particular type of Gaussian process X, is analyzed in Lindquist and McKeague (2009).
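A minimal sketch of what this means in practice (under our own simplifying assumptions of a single impact point p = 1, discretized Brownian-like trajectories and a plain Newton-Raphson fit, none of which is prescribed by the text above): each candidate point t is scored by the maximized log-likelihood of the scalar model (8), and the best-scoring point is selected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discretized toy trajectories (Brownian-like), with the class signal
# concentrated at the grid point true_idx.
n, grid_size = 300, 50
X = np.cumsum(rng.normal(scale=grid_size ** -0.5, size=(n, grid_size)), axis=1)
true_idx = grid_size // 2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * X[:, true_idx])))

def max_loglik_at_point(x, y, iters=30):
    """Fit the scalar model (8) with p = 1 on the covariate x by Newton-Raphson
    and return the maximized log-likelihood."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        W = p * (1.0 - p)
        hess = Z.T @ (Z * W[:, None]) + 1e-8 * np.eye(2)  # tiny ridge
        beta += np.linalg.solve(hess, Z.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-Z @ beta)), 1e-12, 1 - 1e-12)
    return float(y @ np.log(p) + (1 - y) @ np.log(1 - p))

# Variable selection for p = 1: profile the likelihood over candidate points.
logliks = [max_loglik_at_point(X[:, j], y) for j in range(grid_size)]
best_idx = int(np.argmax(logliks))
```

The whole trajectories enter the procedure only through the search over candidate points, which is exactly the sense in which model (8) is "truly functional".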

From the discussion above, it is clear that the differences between the RKHS model and the L^2 one are not minor technical questions. The functions of the type ∑_j βj K(tj,·) belong to H(K) but, in general, do not belong to K(L^2[0,1]). This fact implies that, within the setting of the RKHS model, it is possible to regress Y on any finite-dimensional projection (X(t1), …, X(tp)) of X, whereas this does not make sense if we consider the L^2 model. This feature is clearly relevant if one wishes to analyze properties of variable selection methods.

2.3 Maximum Likelihood estimation

The most common way to estimate the slope function in logistic models is to use the maximum likelihood estimator (MLE). In order to apply this technique, we need to derive the likelihood function. Let us assume that (X,Y) follows the RKHS logistic model described in Equation (4). That is,

 β0 + ΨX^{−1}(β) ≡ β0 + ⟨X,β⟩K = log(pβ,β0(X) / (1 − pβ,β0(X))),

where pβ,β0(x) = P(Y=1|X=x), β ∈ H(K) and β0 ∈ R. The random element (X,Y) takes values in the space L^2[0,1] × {0,1}, which is a measurable space with measure PX × μ, where PX is the distribution induced by the process and μ is the counting measure on {0,1}. We can define on this space the measure P(X,Y);β,β0, the joint probability induced by (X,Y) for a given slope function β and an intercept β0. Then we define

 fβ,β0(x,y) = (dP(X,Y);β,β0 / d(μ×PX))(x,y) = (d(P(Y|X);β,β0 PX) / d(μ×PX))(x,y) = fβ,β0(y|x) = (1 / (1+e^{−β0−⟨β,x⟩K}))^y (e^{−β0−⟨β,x⟩K} / (1+e^{−β0−⟨β,x⟩K}))^{1−y}.

Given this density function, the log-likelihood function for a given sample in L^2[0,1] × {0,1}, with n0 observations x01, …, x0n0 from class 0 and n1 observations x11, …, x1n1 from class 1, is

 Ln(β,β0) = (1/n0) ∑_{i=1}^{n0} log(e^{−β0−⟨β,x0i⟩K} / (1+e^{−β0−⟨β,x0i⟩K})) + (1/n1) ∑_{i=1}^{n1} log(1 / (1+e^{−β0−⟨β,x1i⟩K})). (9)

The maximum likelihood estimator is the pair (β̂, β̂0) that maximizes this function Ln. In order to study the asymptotic properties of this estimator, one needs to define the expected log-likelihood function,

 L(β,β0) = E[log fβ,β0(X,Y)], (10)

where E denotes the expectation with respect to the true distribution of (X,Y). When one uses maximum likelihood (ML), it is standard to prove that the true parameters that define the model are a maximum of this expected likelihood function. We prove that this is also the case in this setting.

Proposition 1.

The true value (β*, β*0) of the parameter in model (4) is the unique maximizer in H(K) × R of the expected log-likelihood function of Eq. (10).

Proof.

If β* and β*0 are the true slope function and intercept of the model, one can rewrite the expected log-likelihood function, evaluated at another pair (β, β0), as

 L(β,β0) = EX[EY|X[log fβ,β0(X,Y) | X]] = EX[pβ*,β*0(X) log(pβ,β0(X)) + (1−pβ*,β*0(X)) log(1−pβ,β0(X))].

Now the fact that (β*, β*0) is a maximum of L is straightforward, just following the same reasoning as for multiple logistic regression to check that L(β,β0) − L(β*,β*0) is always less than or equal to zero. If there is another pair (β, β0) that maximizes this function, then L(β,β0) − L(β*,β*0) equals zero. Given real numbers a, b ∈ (0,1), the function a log(b/a) + (1−a) log((1−b)/(1−a)) appearing in the integrand (with a = pβ*,β*0(X) and b = pβ,β0(X)) is always less than or equal to zero, and the inequality is strict unless a = b. Therefore pβ,β0(X) = pβ*,β*0(X) with probability one (if not, the expectation would be strictly negative, because of the set of positive measure where the two probabilities differ). Since the logistic function is injective, β0 + ⟨X,β⟩K = β*0 + ⟨X,β*⟩K with probability one. Both ⟨X,β⟩K and ⟨X,β*⟩K are random variables with zero mean, so β0 must coincide with β*0. Therefore, β agrees with β* in H(K), since Loève's isometry is injective. ∎

3 On the non-existence of MLE in logistic models

In the finite-dimensional setting, it is well known that the ML estimator does not exist when there is a hyperplane separating the observations of the two classes. This fact, which is presented in detail next, worsens dramatically in the case of functional data:

• For a wide class of processes (including the Brownian motion), the MLE just does not exist, with probability one.

• Under less restrictive conditions, but still in the Gaussian case, the probability of non-existence of the MLE tends to one when the sample size tends to infinity.

3.1 A brief overview of the finite dimensional case

Despite the fact that maximum likelihood estimation of the slope function in multiple logistic regression is widely used, it has an important issue that is sometimes overlooked. Given a sample x01, …, x0n0 drawn from population zero and another sample x11, …, x1n1 drawn from population one, the classical MLE in logistic regression is the vector (b̂0, b̂) that maximizes the log-likelihood

 Ln(b,b0) = (1/n0) ∑_{i=1}^{n0} log(e^{−b0−b′x0i} / (1+e^{−b0−b′x0i})) + (1/n1) ∑_{i=1}^{n1} log(1 / (1+e^{−b0−b′x1i})).

The existence and uniqueness of such a maximum was carefully studied by Albert and Anderson (1984) (and previously by Silvapulle (1981) and Gourieroux and Monfort (1981)). As stated in Theorem 1 of Albert and Anderson (1984), the latter expression can be made arbitrarily close to zero (note that the log-likelihood is always negative) whenever the samples of the two populations are linearly separable. In that case the maximum cannot be attained and then the MLE does not exist (the idea behind the proof is similar to that of Theorem 2 below). There is another scenario where this estimator does not exist: the samples are linearly separable except for some points of both populations that fall on the separating hyperplane (so-called “quasicomplete separation”). In this case the supremum of the log-likelihood function is strictly less than zero, but it is anyway unattainable.
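This mechanism is easy to reproduce numerically. In the following sketch (a toy separable sample of our own, not taken from the references above), scaling the slope along a separating direction pushes the always-negative log-likelihood toward its unattained supremum 0:

```python
import numpy as np

# A linearly separable scalar sample: all class-0 points lie below class 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def loglik(b0, b1):
    # Clipping keeps the logarithms finite for extreme slopes.
    p = np.clip(1.0 / (1.0 + np.exp(-(b0 + b1 * x))), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Scaling the slope along the separating direction drives the (negative)
# log-likelihood toward its supremum 0, which is never attained.
values = [loglik(0.0, c) for c in (1.0, 10.0, 100.0)]
```

Since the supremum 0 is approached but never reached for any finite slope, no maximizer exists.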

3.2 Non-existence of the MLE in functional settings

When moving from the finite-dimensional model to the functional one, the problem of the non-existence of the MLE is drastically worsened. We will show that, under some conditions, the maximum likelihood estimator for the slope function in the functional logistic regression model does not exist, with probability one. We confine ourselves to the RKHS-based model (4), although the result can be easily extended, with a completely similar proof, to the standard L^2-based model of Equation (1). This result can be added to the list of conceptual differences between Functional Data Analysis and finite-dimensional statistics.

Recall that, given a sample ((x1,y1), …, (xn,yn)) of size n, the log-likelihood function is, for β ∈ H(K) and β0 ∈ R,

 Ln(β,β0) = (1/n) ∑_{i=1}^n log(pβ,β0(xi)^{yi} (1−pβ,β0(xi))^{1−yi}).

One of the ways in which the linear separability condition mentioned above can be translated to functional data is presented hereunder.

Assumption 1 (SC).

The multivariate process (Z1(t), …, Zn(t)), t ∈ [0,1], satisfies the “Sign Choice” (SC) property when, for every possible choice of signs (s1, …, sn), where si is either + or −, we have that, with probability one, there exists some t0 ∈ [0,1] such that si Zi(t0) > 0 for all i = 1, …, n.

Now, the main result is as follows:

Theorem 2.

Let X = X(t), t ∈ [0,1], be a stochastic process with zero mean function. Denote by K the corresponding covariance function. Consider a logistic model (4) based on X. Let X1, …, Xn be independent copies of X. Assume that the n-dimensional process (X1, …, Xn) fulfills the SC property. Then, with probability one, the MLE of (β, β0) (maximizing the function Ln in Eq. (9)) does not exist, for any sample size n.

Proof.

Let (x1,y1), …, (xn,yn) be a random sample drawn from (X,Y). From the SC assumption, there is (with probability 1) a point t0 such that xi(t0) > 0 for all i such that yi = 1 (n1 indices in total) and xi(t0) < 0 for those (n0) indices with yi = 0. Recall that the sample log-likelihood function given in Equation (9) is

 Ln(β,β0) = (1/n1) ∑_{i:yi=1} log(1 / (1+e^{−β0−⟨β,xi⟩K})) + (1/n0) ∑_{i:yi=0} log(e^{−β0−⟨β,xi⟩K} / (1+e^{−β0−⟨β,xi⟩K})).

Note also that Ln(β,β0) < 0 for all (β,β0). Now, take a numerical sequence cm → ∞ and define

 βm(·) = cm K(t0,·).

Then, since we are identifying ⟨βm,xj⟩K with the inverse of Loève's isometry, for every j such that yj = 1, we have

 ⟨βm,xj⟩K = cm xj(t0) → ∞,  as m → ∞,

since we have taken t0 such that xj(t0) > 0 for those indices with yj = 1. Likewise, ⟨βm,xj⟩K goes to −∞ whenever yj = 0, since we have chosen t0 such that xj(t0) < 0 for those indices. As a consequence, Ln(βm,β0) → 0 as m → ∞. Therefore the likelihood function can be made arbitrarily close to its supremum, zero, which is not attained, so that the MLE does not exist. ∎

Remark 1.

A non-existence result for the MLE, analogous to that of Theorem 2, can also be obtained with a very similar reasoning for the L^2-based logistic model of Equation (1). The main difference in the proof would be the construction of βm which, in the L^2 case, should be obtained as an approximation to the identity (that is, a linear “quasi Dirac delta”) centered at the point t0.

Although the SC property could seem a somewhat restrictive assumption, the following proposition shows that it applies to some important and non-trivial situations.

Proposition 2.

(a) The n-dimensional Brownian motion fulfills the SC property.

(b) The same holds for any other n-dimensional process whose independent marginals have a distribution absolutely continuous with respect to that of the Brownian motion.

Proof.

(a) Given the n-dimensional Brownian motion Bn(t) = (B1(t), …, Bn(t)), where the Bi are independent copies of the standard Brownian motion B(t), t ∈ [0,1], take a sequence of signs (s1, …, sn) and define the event

 A = {for any given t there exists 0 < t0 < t such that si Bi(t0) > 0, for i = 1, …, n}. (11)

We may express this event by

 A = ⋂_{t∈(0,1]∩Q} At, (12)

where, for each t,

 At = {there exists 0 < t0 < t such that si Bi(t0) > 0, for i = 1, …, n}.

Now, the result follows directly from Blumenthal's 0-1 Law for n-dimensional Brownian processes (see, e.g., Mörters and Peres (2010, p. 38)). Such result establishes that for any event A ∈ F+(0) we have either P(A) = 0 or P(A) = 1. Here F+(0) denotes the germ σ-algebra of events depending only on the values of Bn(s) where s lies in an arbitrarily small interval on the right of 0. More precisely,

 F+(0) = ⋂_{t>0} F0(t), where F0(t) = σ(Bn(s), 0 ≤ s ≤ t).

From (11) and (12) it is clear that the above defined event A belongs to the germ σ-algebra F+(0). However, we cannot have P(A) = 0 since (from the symmetry of the Brownian motion) for any given t the probability of At is strictly positive. So, we conclude P(A) = 1, as desired.

(b) If Z is another process whose distribution is absolutely continuous with respect to that of the n-dimensional Brownian motion Bn, then the set A, defined by (11) and (12) in terms of Bn, has also probability one when it is defined in terms of the process Z. Recall that, from the definition of absolute continuity, P_{Bn}(Ac) = 0 implies P_Z(Ac) = 0 and therefore P_Z(A) = 1. ∎
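A quick Monte Carlo check (an illustration of ours, not part of the proof) shows that, already on a fairly fine grid, two independent discretized Brownian paths almost always visit a common time at which a prescribed sign choice holds; here the choice (+, +) is tested, the other choices being symmetric:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo illustration of the SC property for a 2-dimensional Brownian
# motion: check how often a common time exists where the sign choice (+, +)
# holds (the remaining sign choices behave identically by symmetry).
trials, grid_size = 200, 2000
hits = 0
for _ in range(trials):
    B = np.cumsum(rng.normal(scale=grid_size ** -0.5, size=(2, grid_size)), axis=1)
    if np.any((B[0] > 0) & (B[1] > 0)):
        hits += 1
fraction = hits / trials
```

On a continuum of times the event has probability one; on a grid the observed fraction falls just slightly short of that.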

Remark 2.

Following the comment in Mörters and Peres (2010) about processes with the strong Markov property, this result based on RKHS theory might be extended to Lévy processes whose covariance function is continuous (like the Poisson process on the real line). However, apart from the Brownian motion, this type of processes have discontinuous trajectories, and this situation is not considered in this work.

This property would be the functional counterpart of having a finite-dimensional problem where the supports of both classes (0 and 1) are linearly separable. However, in the functional setting this separability issue does not appear only in degenerate problems.

In practice this problem would rarely be encountered, since the curves are usually provided in a discretized fashion. Nevertheless, in the next section we suggest a couple of techniques that completely avoid the problem. From a theoretical perspective, and in view of Theorem 2, it is clear that there is no hope of obtaining a general convergence result for the standard MLE defined by the maximization of the function Ln in (9). That is, one should define a different estimator or impose some conditions on the process to avoid the SC property. For instance, Lindquist and McKeague (2009) prove consistency results for the model with a single impact point for processes built from a two-sided Brownian motion centered at the impact point (i.e., two independent Brownian motions starting at that point and running in opposite directions) plus a real random variable independent of it. Then, due to the independence assumption, it is clear that accumulation points (like 0 for the Brownian motion) are avoided.

3.3 Asymptotic non-existence for Gaussian processes

In the previous section we have seen that the problem of non-existence of the MLE is dramatically aggravated for functional data, where the regressors are not fixed in advance. But this is not the only issue with MLE in functional logistic regression. In this section we see that the probability that the MLE does not exist goes to one as the sample size increases, for any Gaussian process satisfying very mild assumptions.

We use the following notation: for a finite set T = {t1, …, tp} ⊂ [0,1] and a function (or trajectory) f, let f(T) = (f(t1), …, f(tp))′, and let ΣT be the p × p matrix whose (i,j) entry is given by K(ti,tj).

Theorem 3.

Let (x1,y1), …, (xn,yn) be a random sample of independent observations satisfying model (4). Assume that X is a Gaussian process such that K is continuous and ΣT is invertible for any finite set T ⊂ [0,1]. It holds

 lim_{n→∞} P(MLE exists) = 0.
Proof.

Let (β*, β*0) be the true values of the parameters. Since β* ∈ H(K), we have γ := ∥β*∥K^2 < ∞, so the critical value h(γ, β*0) is well defined and finite, where h is the function defined in Equation (6) of Candès and Sur (2018) (see Remark 3 below). Let (pn) be an increasing sequence of natural numbers such that pn/n → κ for some κ > h(γ, β*0). Consider the set Tn of pn equispaced points in [0,1], denote by ΣTn the corresponding covariance matrix, and define αTn = ΣTn^{−1} β*(Tn). Now, consider the following sequence of finite-dimensional logistic regression models

 P(Y=1|X) = 1 / (1 + exp{−β*0 − α′Tn X(Tn)}),

and the following sequence of events

 En={There exists α∈Rpn:α′xi(Tn)≥0, if yi=1;  α′xi(Tn)≤0, if yi=0}.

Recall that the event En amounts to the non-existence of the MLE for the corresponding finite-dimensional logistic regression model (see Albert and Anderson (1984)).

Now let us prove the validity of condition (3) in Candès and Sur (2018), which is required for the validity of Theorem 1 in that paper. In our case, such condition amounts to

 lim_{n→∞} Var(α′Tn X(Tn)) = lim_{n→∞} α′Tn ΣTn αTn = ∥β*∥K^2,

but this directly follows from Theorem 6E of Parzen (1959). Since pn/n → κ > h(∥β*∥K^2, β*0), we can apply Theorem 1 in Candès and Sur (2018) to deduce that lim_{n→∞} P(En) is equal to one.

Now we define the auxiliary sequence of events

 Ẽn = {There exists α ∈ R^{pn} : α′xi(Tn) > 0, if yi = 1;  α′xi(Tn) < 0, if yi = 0},

with strict inequalities. Assume that Ẽn happens, so that there exists a separating hyperplane defined by some α. Then, in the same spirit as in the proof of Theorem 2, it is possible to show that the supremum (zero) of the log-likelihood function Ln is approached, but not attained, along slope functions of the form c ∑_j αj K(tj,·) with c → ∞. As a consequence, for all n, if Ẽn happens, then the MLE for the RKHS functional logistic regression model does not exist. The result follows from the fact that lim_{n→∞} P(Ẽn) = lim_{n→∞} P(En) = 1, since the events En ∖ Ẽn have probability zero, because we are assuming that the process does not have degenerate marginals. ∎

Remark 3.

Theorem 1 in Candès and Sur (2018) is a remarkable result. It applies to finite-dimensional logistic regression models with a number $p_n$ of covariables, which is assumed to grow to infinity with the sample size $n$ in such a way that $p_n/n\to\kappa$. Of course, the sample is given by data $(x_i,y_i)$, $i=1,\dots,n$. Essentially, the result establishes that there is a critical value such that, if $\kappa$ is smaller than such critical value, one has $\lim_n P(\text{MLE exists})=1$; otherwise we have $\lim_n P(\text{MLE exists})=0$. Such critical value is given in terms of a function $g$ (which is mentioned in the proof of the previous result) whose definition is as follows. Let us write $Y\sim B(p)$ whenever $P(Y=1)=p=1-P(Y=-1)$ (note that, in the notation of Candès and Sur (2018), the model is defined for the case that the response variable is coded in $\{-1,1\}$). Given $\gamma>0$, let $V$ and $Y$ be random variables such that $V\sim N(0,\gamma^2)$ and $Y\,|\,V\sim B(\sigma(V))$, where $\sigma(v)=1/(1+e^{-v})$. Now, define $g(\gamma)=\min_{(t_0,t_1)\in\mathbb{R}^2}E\big[(t_0Y+t_1V-Z)_+^2\big]$, where $Z\sim N(0,1)$ is independent of $V$ and $Y$. Then, Theorem 1 in Candès and Sur (2018) proves that the above mentioned critical value for $\kappa$ is precisely $g(\gamma)$.

4 The estimation of β in practice

The problem of non-existence of the MLE can be circumvented if the goal is variable selection. The main idea behind the proof of Theorem 3 is that one can approximate the functional model with finite-dimensional approximations such as those in (8), with $p_n$ increasing as fast as desired. Therefore, if we constrain $p$ to be less than a fixed finite value, Theorem 3 does not apply.

In order to sort out the non-existence problem for a given sample (due to the SC property), it is enough to use a finite-dimensional estimator that is always defined, even for linearly separable samples. As mentioned, an extensive study of the existence and uniqueness conditions of the MLE for multiple logistic regression can be found in Albert and Anderson (1984). We suggest using Firth’s estimator, first proposed by Firth (1993), which is always finite and unique.

The author initially proposed this approach to reduce the bias of the standard MLE, but it was afterwards used to obtain ML-type estimators for linearly separable samples (see e.g. Heinze and Schemper (2002), where the technique is presented in a quite accessible manner). Besides, the reduction of the bias leads to better results in practice. The general idea of Firth’s procedure is to use a modification of the original sample responses to compute the usual ML-score equations. More precisely, each response $y_i$ is split into two new responses, $y_i$ with weight $1+h_i/2$ and $1-y_i$ with weight $h_i/2$, where the coefficients $h_i$ go to zero with the sample size. The idea behind this duplication is to avoid the problem of the linear separability of the sample (previously discussed), which leads to the non-existence of the MLE. It is easy to see that in the new modified sample with duplicated observations there is always overlapping between the two classes, so ML can be safely applied. The specific coefficients $h_i$ are the diagonal elements of the matrix $H=W^{1/2}\mathbf{X}(\mathbf{X}'W\mathbf{X})^{-1}\mathbf{X}'W^{1/2}$, where $W$ is the diagonal matrix with elements $\pi_i(1-\pi_i)$ and $\mathbf{X}$ is the matrix whose rows are $x_i'$. This estimator is implemented in the “brglm” function of the brglm R-package (see Kosmidis (2017)).
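The modified-score view of Firth’s correction can be sketched as a Newton-type iteration in which $y_i$ is replaced by $y_i+h_i(1/2-\pi_i)$ in the score equations. The following is an illustrative reimplementation under the formulas above, not a substitute for the brglm package:

```python
import numpy as np

def firth_logistic(X, y, n_iter=50):
    """Newton-type iterations for Firth's bias-reduced logistic regression.

    Uses the modified score X'(y - pi + h*(1/2 - pi)), where h holds the
    diagonal entries of the hat matrix H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}.
    """
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        W = pi * (1.0 - pi)
        XWX = X.T @ (W[:, None] * X)
        sqWX = np.sqrt(W)[:, None] * X
        h = np.diag(sqWX @ np.linalg.solve(XWX, sqWX.T))
        score = X.T @ (y - pi + h * (0.5 - pi))  # Firth-modified score
        beta = beta + np.linalg.solve(XWX, score)
    return beta

# Linearly separable toy sample: the ordinary MLE does not exist here,
# but Firth's estimator is finite (and the slope is positive).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = firth_logistic(x[:, None], y)
```

Because the class labels overlap in the weighted, duplicated sample, the penalized iterations converge to finite coefficients even though the original sample is completely separated.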

With this procedure we obtain estimators $\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_p$. An interesting observation is that the value of the independent term $\hat{\beta}_0$ can be used to estimate the Bayes error of any homoscedastic Gaussian problem with equiprobable classes ($P(Y=1)=1/2$), which can be expressed in terms of $\Phi$, the cumulative distribution function of a standard Gaussian random variable (see Theorem 2 of Berrendero et al. (2017)).

4.1 Greedy “max-max” algorithm

In view of the previous discussion, the objective is to find the points $t_1,\dots,t_p$ and the coefficients $\beta_0,\beta_1,\dots,\beta_p$, for a fixed $p$, that maximize the log-likelihood associated with the model of Equation (8).

We propose an iterative algorithm in which the points and the coefficients are updated alternately. This approach is reminiscent of the well-known Expectation-Maximization (EM) technique, which is typically applied to estimate mixtures in supervised classification (see e.g. Bishop (2016, Chapter 9)). In general, the EM algorithm is used to compute ML estimators for models with non-observable data. In our setting, these non-observable parameters would correspond to the points $t_1,\dots,t_p$. Estimating the coefficients $\beta_j$ given a set of points $t_j$ is straightforward (via Firth’s approach), and once the coefficients are known, the points can be obtained by maximizing the log-likelihood over $t_1,\dots,t_p$. The algorithm is as follows: first the coefficients are randomly initialized, and then we iterate the following two steps until convergence.

• (Maximization 1) Compute the set of points $t_1,\dots,t_p$ that maximizes the log-likelihood function for the current set of coefficients $\beta_0,\beta_1,\dots,\beta_p$.

• (Maximization 2) Then, use the points just obtained to compute the MLE of the coefficients (via Firth’s estimator).

One might also start by initializing the points $t_j$ and then go directly to the second step. Besides, in practice the maximization over the continuous interval $[0,1]$ is unfeasible, so it should be carried out over some grid. This is not usually an issue, since the sample curves are typically provided in a discretized fashion. Then one can directly search for the points in the grid provided by the sample.
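The two alternating maximizations can be sketched as follows for $p=1$ on synthetic discretized curves; the data, the grid, and the use of plain (unpenalized) Newton steps in place of Firth’s estimator are all simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic discretized sample: n Brownian-like curves on a grid of [0, 1],
# with labels generated from a logistic model at one (hidden) grid point.
m, n = 50, 100
grid = np.linspace(0.0, 1.0, m)
curves = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
p_true = 1.0 / (1.0 + np.exp(-3.0 * curves[:, 25]))
labels = (rng.random(n) < p_true).astype(float)

def loglik(beta0, beta1, z):
    eta = beta0 + beta1 * z
    return np.sum(labels * eta - np.logaddexp(0.0, eta))

def fit_coefs(z, n_iter=25, ridge=1e-8):
    """Newton iterations for univariate logistic regression (plain MLE);
    a small ridge term is added purely for numerical stability."""
    X = np.column_stack([np.ones_like(z), z])
    b = np.zeros(2)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ b))
        W = pi * (1.0 - pi)
        A = X.T @ (W[:, None] * X) + ridge * np.eye(2)
        b = b + np.linalg.solve(A, X.T @ (labels - pi))
    return b

j = rng.integers(m)                      # random initial point t_1
for _ in range(10):                      # alternate the two maximizations
    b0, b1 = fit_coefs(curves[:, j])     # (Maximization 2): refit coefficients
    j = int(np.argmax([loglik(b0, b1, curves[:, k]) for k in range(m)]))
    # (Maximization 1): best grid point for the current coefficients
```

Note that, as discussed above, the search over $t$ is restricted to the discretization grid on which the curves are observed.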

However, the proposed algorithm only ensures convergence to a local maximum, and this maximum strongly depends on the initial (random) points of the algorithm. Besides, the number of local maxima of the likelihood function is likely to increase with the dimension of the search space. That is, the accuracy of the results might deteriorate when $p$ increases. A possible solution is to run the algorithm several times on the same sample and keep the parameters attaining the maximum value of the likelihood. However, this could be computationally expensive, depending on the dimension. We therefore suggest adapting the greedy EM methodology proposed in Verbeek et al. (2003) for Gaussian mixtures. The idea is to start with a model of dimension one ($p=1$ in Equation (8)) to estimate a first point $\hat{t}_1$. Then, once this first point is fixed, add a second random point $t_2$ and restart the maximization iterations, now with $p=2$, using $(\hat{t}_1,t_2)$ as starting points. The algorithm continues adding points until it reaches the desired dimension $p$. This approach should work better than simply initializing all the points at random, since only one random point is added in each step. In addition, for small values of $p$ it is more likely to obtain meaningful points, since the likelihood function should have fewer local maxima.
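A hedged sketch of this greedy growth strategy is given below. It simplifies the scheme in two labeled ways: the next point is chosen deterministically as the best candidate on the grid (rather than added at random and refined by max-max iterations), and plain Newton steps stand in for Firth’s estimator; all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sample, as before: Brownian-like curves with a logistic label
# model at one grid point (column 10).
m, n = 20, 200
curves = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
p_true = 1.0 / (1.0 + np.exp(-2.0 * curves[:, 10]))
y = (rng.random(n) < p_true).astype(float)

def fit_loglik(cols, n_iter=30, ridge=1e-8):
    """Fit a logistic model on the selected columns (Newton iterations,
    small ridge for numerical stability); return the max log-likelihood."""
    X = np.column_stack([np.ones(n)] + [curves[:, j] for j in cols])
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ b))
        W = pi * (1.0 - pi)
        A = X.T @ (W[:, None] * X) + ridge * np.eye(X.shape[1])
        b = b + np.linalg.solve(A, X.T @ (y - pi))
    eta = X @ b
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Grow the model one point at a time, stopping when the likelihood gain
# drops below a threshold eps or the target dimension p_max is reached.
eps, p_max = 1.0, 3
selected = []
current = fit_loglik(selected)           # intercept-only model
while len(selected) < p_max:
    cand = [j for j in range(m) if j not in selected]
    lls = [fit_loglik(selected + [j]) for j in cand]
    if max(lls) - current < eps:         # stopping rule on the gain
        break
    selected.append(cand[int(np.argmax(lls))])
    current = max(lls)
```

The `eps` stopping rule here anticipates the threshold criterion discussed for choosing the number of retained points.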

In practical problems, it is also important to determine how many points one should retain. The common approach is to fix this value by cross-validation, whenever this is feasible. Another reasonable approach is to stop when the difference between the likelihoods obtained with $p$ and $p+1$ points is smaller than a threshold $\epsilon$, in a similar way as in Berrendero et al. (2018a) for the number of selected variables in a linear regression model.

Once the parameters are estimated for a given sample, one can also approximate the function $\beta$ in (8) as