Let be independently and identically distributed observable data on . The conditional distribution of given is assumed to be
is a one-dimensional cumulative distribution function. The inverse function is the link function in the terminology of generalized linear models. Denote the marginal distribution of by . The distribution function is typically the logistic, standard normal, or Gumbel distribution. The corresponding link functions are the logit, probit, and complementary log-log functions, respectively. For these three examples, the log-likelihood function of (1) is concave; see Wedderburn (1976).
We are interested in the situation where the data are highly imbalanced, that is, where the probability of success is almost zero. Examples include fraud detection, medical diagnosis, and political analysis; see e.g. Bolton & Hand (2002), Chawla et al. (2004), Jin et al. (2005), and King & Zeng (2001). For data without covariates, Poisson’s law of rare events is well known: if , then the probability distribution of converges to the Poisson distribution with the mean parameter. From this observation, for highly imbalanced data, it is natural to consider that the true parameter in (1) depends on , say , and as .
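Poisson's law of rare events can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes a binomial count with `n` trials and success probability `lam / n` (the names `n`, `lam`, and `k_max` are hypothetical) and reports the total variation distance to the Poisson distribution with mean `lam`, which should shrink as `n` grows.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k successes, computed in log space."""
    if k > n:
        return 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """Poisson(lam) probability of k events."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def tv_distance(n, lam, k_max=200):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    The sum is truncated at k_max; for small lam the neglected tail is negligible.
    """
    p = lam / n
    return 0.5 * sum(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
                     for k in range(k_max + 1))

# The distance decreases as the success probability lam/n tends to zero.
for n in (10, 100, 1000, 10000):
    print(n, tv_distance(n, lam=2.0))
```

The truncation at `k_max` is a convenience for small means; a larger mean would need a larger cut-off.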
Owen (2007) showed that the maximum likelihood estimator of the logistic regression model converges to that of an exponential family if is fixed and goes to infinity. This result is roughly derived as follows. Consider the model (1) with the logistic distribution . Take and for any fixed and . Then we obtain
. By Bayes’ theorem, the conditional density of given with respect to the distribution is, at least formally,
This is an exponential family with the sufficient statistic , and Owen’s result follows.
To be precise, Owen (2007) proved the convergence result under a different setting from here. He assumed that the true conditional distribution of given , , is any distribution . In our setting, is asymptotically equal to , and the density of with respect to should satisfy (3). In other words, our setting becomes misspecified unless this equality is satisfied. We discuss this point again in Section 5.
Warton & Shepherd (2010) pointed out that the likelihood of logistic regression converges to a Poisson point process model with a specific form of intensity. Indeed, by (2), the probability is approximately for any compact subset of . Therefore, by Poisson’s law of rare events, the number of observations for which and is approximately distributed according to the Poisson distribution with mean . This is the Poisson point process with the intensity measure .
In this paper, we consider the limit of various binomial regression models other than the logistic model. As expected from the result on logistic regression, the limit becomes a Poisson point process. A remarkable fact we prove is that the intensity measure of the point process should be a -exponential family for some real number . The -exponential family, also called the deformed exponential family or -family, has recently been much investigated in the literature of statistical physics and information geometry; see e.g. Amari (1985), Amari & Nagaoka (2000), Amari & Ohara (2011), Naudts (2002), Naudts (2010), and Tsallis (1988). The precise definition is given in Section 2. The proof relies on the theory of extreme values. For example, for the probit or complementary log-log link functions, the limit of binomial regression is the usual exponential family, as with the logit link. On the other hand, if is the Cauchy distribution, then the limit becomes a -exponential family with . If the uniform distribution is used, .
In related work, Ding et al. (2011) introduced the -logistic regression, which uses the -exponential family for the binary response, where . In Section 3, we show that the -logistic regression converges to the -exponential family if .
In Section 4, we study a penalized maximum likelihood estimator on the -exponential family of intensity measures. For some special cases, the estimator is reduced to a known admissible estimator for the Poisson mean parameter; see Ghosh & Yang (1988).
Some related problems are discussed in Section 5.
2 Imbalanced asymptotics of binomial regression
For each real number , define the -exponential function by
where and . This is the inverse of the Box-Cox transformation. Note that for if . The function is convex if and only if .
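The relation to the Box-Cox transformation can be illustrated numerically. The sketch below assumes the common convention in which the q-exponential is `[1 + (1 - q) z]_+ ** (1 / (1 - q))` for `q != 1` and the ordinary exponential for `q = 1`; this convention, and the names `exp_q` and `box_cox`, are assumptions and may differ from the definition used in the text.

```python
import math

def exp_q(z, q):
    """q-exponential under the assumed convention [1 + (1 - q) z]_+ ** (1 / (1 - q))."""
    if q == 1:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        # The positive part clips the base at zero, so the value is
        # 0 when q < 1 and +infinity when q > 1 (negative exponent).
        return 0.0 if q < 1 else math.inf
    return base ** (1.0 / (1.0 - q))

def box_cox(y, lam):
    """Box-Cox transformation of y > 0 with exponent lam (log y when lam == 0)."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# Round trip: applying exp_q after the Box-Cox transformation with
# exponent 1 - q should recover the original value y.
y = 3.0
for q in (0.0, 0.5, 1.0, 2.0):
    z = box_cox(y, 1.0 - q)
    print(q, z, exp_q(z, q))
```

Under this convention the Box-Cox exponent corresponds to `1 - q`, so `q = 1` recovers the usual logarithm and exponential.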
Consider the binomial regression model (1) and put the following assumption on the distribution function .
Assumption 1. There exist , and such that
as for each .
In extreme value theory, it is known that (5) is the only possible asymptotic form, provided that a limit exists; see e.g. de Haan & Ferreira (2006, Theorems 1.1.2 and 1.1.3). The number controls the lower tail structure of . For example, the logistic distribution satisfies Assumption 1 with , and . Other examples, including the normal and Cauchy distributions, are considered in Section 3.
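A lower-tail limit of this kind can be checked numerically. The sketch below assumes that the condition has the form "n F(a_n z + b_n) converges for each z", and uses one possible choice of normalizing sequences for the logistic and Cauchy distributions. Both the form and the sequences are assumptions made for illustration and need not coincide with those in the text. Under these assumptions the logistic limit is `exp(z)` and the Cauchy limit is `1 / (1 - z)` for `z < 1`.

```python
import math

def logistic_cdf(x):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def cauchy_cdf(x):
    """Standard Cauchy distribution function.

    Equal to 1/2 + atan(x)/pi, written with atan2 so that the lower tail
    is computed without cancellation.
    """
    return math.atan2(1.0, -x) / math.pi

def scaled_tail_logistic(z, n):
    """n * F(a_n z + b_n) with the assumed choice a_n = 1, b_n = -log n."""
    return n * logistic_cdf(z - math.log(n))

def scaled_tail_cauchy(z, n):
    """n * F(a_n z + b_n) with the assumed choice a_n = n/pi, b_n = -n/pi."""
    return n * cauchy_cdf((n / math.pi) * (z - 1.0))

# Compare the scaled tails with the candidate limits at a fixed z < 1.
z = -1.0
for n in (10**2, 10**4, 10**6):
    print(n,
          scaled_tail_logistic(z, n), math.exp(z),
          scaled_tail_cauchy(z, n), 1.0 / (1.0 - z))
```

The exponential limit reflects a light lower tail, while the power-law limit reflects the heavy lower tail of the Cauchy distribution.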
for by using the sequences and that satisfy (5). Denote the probability law of under the true parameter by .
Now the asymptotic form like (2) follows from the assumption. Indeed,
Therefore, as in the logistic regression, we expect that the binomial regression model with converges to the Poisson point process under Assumption 1.
We give a lemma before the main result.
Lemma 1. Let . Let be any compact subset of such that the function is finite over . Then the following equation holds:
The proof of Lemma 1 is given in the Appendix.
Theorem 1. Denote the observations for which by . Then, under , the set converges in law to the Poisson point process with the intensity measure
as . More precisely, we have
for any positive integer , non-negative integers and mutually disjoint compact subsets of such that is finite over .
Proof of Theorem 1.
Since is an independent and identically distributed sequence, the random vector for the disjoint compact subsets follows a multinomial distribution. Then, by Lemma 1 and Poisson’s law of rare events, converges to independent Poisson random variables with intensity. The proof is completed. ∎
The -exponential family of intensity measures is closely related to the -exponential family of probability measures as follows. Denote the total intensity by
Assume . Then the likelihood of is
where the base measure of is the counting measure on , and the base measure of for each is the distribution . In (11), the number of observed points is marginally distributed according to the Poisson distribution with intensity . Each point is independently distributed according to the -exponential family defined by the probability density function
with respect to . The -exponential family is also called the deformed exponential family or the -family; see Amari & Nagaoka (2000) for the -family, where should be distinguished from the regression coefficient . It is known that the density (12) can also be written as with appropriate and ; see e.g. Amari & Ohara (2011). However, we do not use this parametrization since the quantity remains in the whole likelihood (11).
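The split of the point process likelihood into a Poisson count and independent points can be illustrated on a finite covariate support. The sketch below is written under stated assumptions: a scalar covariate, intensity masses of the form `g_j * exp_q(alpha + beta * x_j)`, and the convention `[1 + (1 - q) z]_+ ** (1 / (1 - q))` for the q-exponential. The names `alpha`, `beta`, `support`, and `weights` are hypothetical.

```python
import math

def exp_q(z, q):
    """q-exponential under the assumed convention [1 + (1 - q) z]_+ ** (1 / (1 - q))."""
    if q == 1:
        return math.exp(z)
    base = 1.0 + (1.0 - q) * z
    if base <= 0.0:
        return 0.0 if q < 1 else math.inf
    return base ** (1.0 / (1.0 - q))

def intensity_decomposition(alpha, beta, q, support, weights):
    """Return the total intensity and the normalized point distribution.

    The intensity mass at support point x_j is assumed to be
    g_j * exp_q(alpha + beta * x_j); the parameters must be such that
    every mass is finite and the total is positive.
    """
    masses = [g * exp_q(alpha + beta * x, q) for x, g in zip(support, weights)]
    total = sum(masses)
    probs = [m / total for m in masses]
    return total, probs

# Example with a three-point covariate distribution and q = 2.
total, probs = intensity_decomposition(alpha=0.2, beta=-0.5, q=2.0,
                                       support=[0.0, 1.0, 2.0],
                                       weights=[0.5, 0.3, 0.2])
print(total)   # mean of the Poisson-distributed number of points
print(probs)   # distribution of each point given the number of points
```

Here `total` plays the role of the Poisson mean for the number of points, and `probs` is the normalized family from which each point is drawn.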
We conjecture that the maximum likelihood estimator of the binomial regression model converges to that of the Poisson process model under mild conditions. However, we only give experimental results in Section 3. Instead, we study the estimation problem of the limit model in Section 4. See also Section 5 for further discussion.
3 Examples
In this section, we give some examples of distributions satisfying Assumption 1, and experimental results on the maximum likelihood estimation.
Even if satisfies Assumption 1, the sequences and are not uniquely determined. A unified choice is known; see Galambos (1987, Theorems 2.1.4–2.1.6). In the following examples, however, one possible pair is given explicitly for each case.
For the logistic distribution and the Gumbel distribution on minimum values, we have
For the standard normal distribution, we have
See e.g. Galambos (1987, Section 2.3.2). For the Cauchy distribution, we have
We briefly study the -logistic regression proposed by Ding et al. (2011). For each real number , let , where denotes the -exponential function with and is uniquely determined by
We call the -logistic distribution. Uniqueness of follows from the strict monotonicity of the -exponential function. The distribution is symmetric in the sense that since by (16). We obtain the following theorem, whose proof is given in the Appendix.
Theorem 2. The -logistic distribution satisfies Assumption 1 with .
for and various ’s. For the binomial regression models, the estimated regression coefficient is normalized by (6). Table 1 suggests that convergence for the probit link is very slow, or that the estimator may not converge at all. For the others, the rate is satisfactory.
4 Estimation of the -exponential family of intensity measures
We deal with the estimation problem for the -exponential family of intensity measures (8). The maximum likelihood estimator is likely to fail to exist for a small sample size . We therefore propose a penalized maximum likelihood estimator.
We put the following assumption for simplicity.
Assumption 2. The covariate distribution is known. The support of , denoted by , is finite, and is not included in any hyperplane in . The observable data belongs to .
In practice, may be replaced with the empirical, or estimated, distribution based on the covariate sample of the original regression problem.
The parameter space is
The set is convex and unbounded since it is an intersection of half-spaces including the set . Furthermore, is open since is compact. In terms of convex analysis, corresponds to the polar set of ; see Barvinok (2002).
We consider a penalized log-likelihood function
where is a non-negative regularization parameter. If , (19) is the log-likelihood function; see (11). The penalty term represents pseudo-data of size distributed according to . The function (19) is concave with respect to if . Indeed, we can directly confirm that is concave if , and that is concave if .
We call the maximizer of (19) the additive-smoothing estimator.
This estimator has a desirable property as shown in the following example, even if .
Example 1. Let be a two-point distribution on defined by
where and . Denote the intensity at and by and , respectively. It is not difficult to show that corresponds one-to-one with , where is the set of positive numbers. Hence the model is equivalent to the independent Poisson observable model with intensity , regardless of . Then the penalized log-likelihood (19) becomes
where denotes the number of observations , . The additive-smoothing estimator is , . If , then and the estimator always exists. Furthermore, if , this estimator is known to be admissible with respect to the Kullback-Leibler loss function; see Ghosh & Yang (1988, Theorem 1). For the same reason, if has only points in , then the additive-smoothing estimator is admissible as long as .
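The two-point case can be sketched in code. The block below assumes one reading of the penalty: pseudo-data of total size `eps`, spread over the support points in proportion to the weights `g_j` of the covariate distribution. Under that reading the penalized objective for independent Poisson counts is maximized by adding `eps * g_j` to each count. This closed form is an assumption made for illustration and is not quoted from the text; the names `counts`, `weights`, and `eps` are hypothetical.

```python
import math

def additive_smoothing(counts, weights, eps):
    """Assumed closed form: estimated intensity n_j + eps * g_j at each support point."""
    return [n + eps * g for n, g in zip(counts, weights)]

def penalized_loglik(lams, counts, weights, eps):
    """Poisson log-likelihood (up to constants) plus eps pseudo-observations.

    Each term is (n_j + eps * g_j) * log(lam_j) - lam_j, which is concave in
    lam_j and maximized at lam_j = n_j + eps * g_j.
    """
    return sum((n + eps * g) * math.log(lam) - lam
               for lam, n, g in zip(lams, counts, weights))

# With a zero count at the first point, the unpenalized estimate (eps = 0)
# would sit on the boundary; the smoothed estimate stays strictly positive.
counts, weights, eps = [0, 3], [0.7, 0.3], 0.5
estimate = additive_smoothing(counts, weights, eps)
print(estimate)

# Perturbing the estimate in either direction lowers the penalized objective.
best = penalized_loglik(estimate, counts, weights, eps)
for delta in (-0.1, 0.1):
    shifted = [lam + delta for lam in estimate]
    print(delta, penalized_loglik(shifted, counts, weights, eps) < best)
```

The example shows why a positive `eps` guarantees existence: every estimated intensity is strictly positive even when a support point has no observations.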
Let and be any distribution satisfying Assumption 2. Then, since the model (11) is an exponential family, the pair is a sufficient statistic, where is the sample mean. Indeed, the additive-smoothing estimator should satisfy
For the maximum likelihood estimator, meaning , the second equation of (20) is consistent with the result of Owen (2007). From the theory of exponential families, the solution to (20) always exists if since belongs to the interior of the convex hull of ; see Barndorff-Nielsen (1978, Corollary 9.6). On the other hand, the maximum likelihood estimator fails to exist if is a boundary point.
For , we provide a similar result on existence. First consider the following example, in which the pair is no longer a sufficient statistic.
Let and be a three-point distribution on defined by for . Denote the number of observations by . We use and as a new parameter. Then the parameter space is and . The penalized log-likelihood is
where . The maximizer of (21) is
This always belongs to the parameter space if . On the other hand, the maximum likelihood estimator fails to exist if or .
In general, the following theorem holds. The proof is given in the Appendix.
Theorem 3. Let be any real number and . If Assumption 2 is satisfied, then the additive-smoothing estimator exists almost surely. It is unique if .
5 Discussion
5.1 Multinomial regression
So far we have studied binomial regression. There are variants of multinomial regression models. The multinomial -logistic regression proposed by Ding et al. (2011) can be proved to have a limit under imbalanced asymptotics in the same manner as Theorem 2. The author is not aware of more general results. This problem is left for future work.
5.2 Convergence of estimator
We did not study convergence properties of estimators such as the maximum likelihood estimator. Instead we considered the additive-smoothing estimator for the -exponential family of intensity measures in Section 4.
Owen (2007) showed that the maximum likelihood estimator of the logistic regression converges to that of the exponential family under imbalanced asymptotics. Then a natural conjecture is that the maximum likelihood estimator of the binomial regression model, which is the maximizer of
converges to that of the -exponential family. Note that estimation of is equivalent to that of via the formula (6). It would also be meaningful to study the convergence of statistical experiments; see van der Vaart (1998) for the terminology.
5.3 Misspecified case
We studied asymptotic properties of the binomial regression model under the assumption that the model (1) is true. On the other hand, Owen (2007) made a different assumption, namely that the true conditional distribution of the covariate given , , is fixed to some distribution . Under this assumption, our setting is asymptotically described as and by (11). In other words, if the true distributions do not satisfy this relation, the model is misspecified.
It is important to consider the robustness of estimators under misspecification. The problem is not so serious if the support of is included in that of , since then is absolutely continuous with respect to the estimated intensity measure , whenever belongs to the parameter space (18). Otherwise, however, is not absolutely continuous. In other words, the estimated intensity measure does not allow future data to fall into some region. In particular, if the support of is not assumed a priori, there is a risk of such a contradiction.
One may consider taking a distribution with full support in order to contain the support of . However, if , we cannot assume such a distribution since the parameter space (18) becomes .
One solution to this problem would be to use a parametric family of together with a Bayesian prior distribution. For example, let be the uniform distribution on the hypercube , and assume a prior density on . As long as the true has compact support, we have a chance to detect it since there is a sufficiently large such that the support of is included in that of .
5.4 Bayesian prediction
In the preceding subsection, we considered the Bayesian approach for treating the misspecified case. The approach should also be fruitful when the model is correctly specified.
In Section 4, we considered the additive-smoothing estimator of . This can be regarded as a maximum a posteriori estimator if the prior density
is adopted. An additive-smoothing Bayesian prediction can then be defined from the same prior.
In Example 1, we noted that, for special cases of and , the additive-smoothing estimator becomes an admissible estimator with respect to the Kullback-Leibler divergence, as shown by Ghosh & Yang (1988). For the prediction problem, a class of admissible predictive densities is investigated by Komaki (2004). Together with the additive-smoothing estimator, decision-theoretic properties of the additive-smoothing prediction are of interest.
The author thanks Saki Saito for helpful discussions in the exploratory stage.
Appendix A
A.1 Proof of Lemma 1
Denote the induced probability distribution of by . Let be . Then is compact since is. We have
To prove (7), it is enough to show that
By Assumption 1, we know for each . Hence it is enough to show that converges to uniformly in . Since is monotone in and is continuous in , uniform convergence follows from a general argument; see e.g. Galambos (1987, Lemma 2.10.1).
A.2 Proof of Theorem 2
For each real number , denote the set of distributions that satisfy Assumption 1 by .
For , elementary calculation shows that . This is the logistic distribution and belongs to .
For , we have , . This is the uniform distribution on and belongs to .
Let . It suffices to show that
Indeed, by the condition (16), if , then . Thus
Hence belongs to .
For , we first show that the support of has the infimum and that tends to as . Note that the -exponential function is continuous in , strictly increasing over , and remains over . Since for any , it must be for any by (16). Then only if . Conversely, if , it must be . Indeed, if , then by (16), but this contradicts . To prove as , due to (16), it is sufficient to show that as . This is shown as
Let and . It suffices to show that
By the definition of , we have
On the other hand, since as , we obtain
A.3 Proof of Theorem 3
In the following, we prove the theorem only for the case , that is, when no data are observed. The case can be proved similarly by noting that is contained in the convex hull of the support of .
Let be a discrete distribution with support and put , . By assumption, is not included in any hyperplane of . The parameter space (18) is written as
Note that is an open convex set and the origin always belongs to . The penalized log-likelihood is, since ,
By continuity of over , it is sufficient to show that if tends to a boundary point of or diverges. Note that if is a boundary point of , then belongs to for any since the origin does.
We prove the claim for first, and then .
Let . Fix any boundary point of . Then there is at least one such that . For such ’s, as . For the other ’s, is bounded as . Then, by (26), the function tends to as .
Let and fix any such that for any . Then it is necessary that for all . Since is not contained in a hyperplane, there is at least one such that . For such ’s, we have as . For the other ’s, . Therefore, by (26), the function tends to as , which completes the case .
Let . Fix any boundary point of . Then there is at least one such that . For such ’s, as . For the other ’s, is bounded as . Then, by (26), the function tends to as .
Finally, let and fix any such that for any . Then it is necessary that . Since is not contained in a hyperplane, there is at least one such that . For such ’s,