A note on estimation in a simple probit model under dependency

12/27/2017 ∙ by Haolei Weng, et al.

We consider a probit model without covariates in which the latent Gaussian variables have a compound symmetry covariance structure, with a single parameter characterizing the common correlation. We study the parameter estimation problem under this one-parameter probit model. Surprisingly, we demonstrate that the likelihood function does not yield consistent estimates of the correlation. We then formally prove that the parameter is nonestimable by deriving a non-vanishing minimax lower bound. This counter-intuitive phenomenon provides an interesting insight: one bit of information from each latent Gaussian variable is not sufficient to consistently recover their common correlation. On the other hand, we further show that trinary data generated from the same Gaussian variables allow the correlation to be estimated consistently, at the parametric convergence rate. We thus reveal a phase transition, with respect to the coarseness of the discretization of the latent Gaussian variables, in the estimability of the correlation.


1 Problem statement

We consider a simple one-parameter probit model under dependency. The observed binary data $X_1, \ldots, X_n$ are generated by thresholding a multivariate Gaussian vector with a compound symmetry covariance structure:

$$X_i = \mathbb{1}(Z_i > 0), \qquad (Z_1, \ldots, Z_n)^\top \sim N(0, \Sigma), \qquad \Sigma = (1-\rho)\, I_n + \rho\, \mathbf{1}_n \mathbf{1}_n^\top, \qquad \rho \in [0,1). \tag{1}$$

The single parameter ρ characterizes the dependency in the exchangeable sequence X_1, X_2, …. The covariance matrix Σ can be considered a special case of the spiked covariance form proposed in Johnstone & Lu (2004, 2009) for studying high-dimensional principal component analysis. The central question we focus on in this paper is:

As n → ∞, can ρ be consistently estimated from the sequence X_1, …, X_n?

In contrast to usual probit models, where Σ = I_n and the mean of each Z_i is parameterized by a function of covariates, model (1) puts the emphasis on the dependency structure of the observations, and is possibly one of the simplest models along this line.
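To make the setup concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the function name and sample size are hypothetical) that simulates binary data from model (1), using the one-factor representation that Lemma 1 below makes explicit:

```python
import numpy as np

def sample_probit_cs(n, rho, rng):
    """Draw X_1, ..., X_n from model (1): X_i = 1{Z_i > 0}, where the Z_i
    have unit variances and a common correlation rho (compound symmetry)."""
    w = rng.standard_normal()                        # shared factor
    eps = rng.standard_normal(n)                     # idiosyncratic noise
    z = np.sqrt(rho) * w + np.sqrt(1 - rho) * eps    # Cov(Z_i, Z_j) = rho
    return (z > 0).astype(int)

rng = np.random.default_rng(0)
x = sample_probit_cs(10_000, 0.5, rng)
# For a single draw of the shared factor, the empirical mean concentrates
# around a random limit rather than 1/2 -- the root of the trouble below.
print(x.mean())
```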

The model can be well motivated by network data analysis. For notational convenience, we rewrite the observations as a matrix X = (X_{ij}) and the latent Gaussian variables as Z = (Z_{ij}). The Bernoulli entry X_{ij} represents whether there exists an edge between nodes i and j. We consider undirected networks, where X_{ij} = X_{ji}, for simplicity, so that Σ is the covariance matrix of the $\binom{n}{2}$ latent edge variables (Z_{ij})_{i<j}. Due to the rich structures exhibited by different types of networked systems (Newman, 2003), there is an extensive literature on network modeling, including the Erdős–Rényi random graph model (Erdős & Rényi, 1959), the exponential random graph model (Robins et al., 2007), the stochastic blockmodel (Holland et al., 1983), and the latent space model (Hoff et al., 2002), among others. As an alternative, model (1) assumes the edges between nodes are generated from underlying Gaussian variables, with the covariance matrix capturing possible dependency among different edges. When ρ = 0, (1) is an example of the Erdős–Rényi random graph. We should emphasize that model (1) is not sophisticated enough to fit most real network data. One possible generalization is

$$X_{ij} = \mathbb{1}(Z_{ij} + \mu_{ij} > 0), \qquad 1 \leq i < j \leq n.$$

The mean parameter (μ_{ij}) is introduced to incorporate heterogeneity across different edges. It can be assumed to be of low rank, based on the hypothesis that the edge generation mechanism is driven by a few node-specific factors; this is in the same spirit as both the stochastic blockmodel and the latent space model. Furthermore, more general structure can be imposed on Σ. For instance, letting σ_{(i,j),(k,l)} be the covariance between Z_{ij} and Z_{kl}, one possible generalization is

$$\sigma_{(i,j),(k,l)} = \begin{cases} 1, & \{i,j\} = \{k,l\}, \\ \rho_1, & |\{i,j\} \cap \{k,l\}| = 1, \\ \rho_2, & \{i,j\} \cap \{k,l\} = \emptyset, \end{cases} \tag{2}$$

where ∅ denotes the empty set. The formulation (2) automatically enforces the symmetry constraint required for modeling undirected networks. In this case, the dependency between edges that share a common node and the dependency between edges that do not are characterized by two different parameters. To fix ideas, we will focus on the simplified model (1); nevertheless, our analysis sheds light on general probit modeling of networks. We start by investigating the likelihood approach to estimating ρ and showing its infeasibility. We then formally prove the nonestimability of ρ under model (1) and provide a solution that makes estimation possible. Finally, we discuss some implications of our results for binary data modeling.
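For concreteness, a small Python helper (again our own illustration, not from the paper) that builds the edge covariance matrix in (2) for an undirected graph on n nodes:

```python
import itertools
import numpy as np

def edge_covariance(n, rho1, rho2):
    """Covariance matrix of the latent edge variables (Z_ij)_{i<j} under (2):
    unit variances; rho1 for edge pairs sharing exactly one node; rho2 for
    edge pairs sharing no node."""
    edges = list(itertools.combinations(range(n), 2))
    m = len(edges)
    sigma = np.empty((m, m))
    for a, e in enumerate(edges):
        for b, f in enumerate(edges):
            shared = len(set(e) & set(f))
            sigma[a, b] = 1.0 if shared == 2 else (rho1 if shared == 1 else rho2)
    return sigma

print(edge_covariance(4, 0.3, 0.1).shape)  # (6, 6) for the 6 edges of K_4
```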

2 Likelihood methods

Since model (1) takes a simple parametric form, our first attempt is to check whether the maximum likelihood estimate is consistent. Before that, we give an alternative formulation of the model that will be useful in later discussions.

Lemma 1.

Model (1) can be reformulated as

$$X_i = \mathbb{1}\big(\sqrt{\rho}\, W + \sqrt{1-\rho}\, \epsilon_i > 0\big), \qquad i = 1, \ldots, n, \tag{3}$$

where W, ε_1, …, ε_n are independently and identically distributed as N(0,1).
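The reformulation is the standard one-factor representation of a compound symmetry covariance; a one-line second-moment check (our own, in the notation of (3)):

```latex
\[
\operatorname{Var}(Z_i)
  = \operatorname{Var}\!\big(\sqrt{\rho}\,W + \sqrt{1-\rho}\,\epsilon_i\big)
  = \rho + (1-\rho) = 1,
\qquad
\operatorname{Cov}(Z_i, Z_j)
  = \rho\,\operatorname{Var}(W) = \rho \quad (i \neq j),
\]
% by independence of W and the \epsilon_i's, matching \Sigma in (1).
```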

Let Φ be the cumulative distribution function of N(0,1), φ the corresponding density, and $S_n = \sum_{i=1}^n X_i$. According to Lemma 1, we can write down the likelihood function,

$$L_n(\rho) = \int_{-\infty}^{\infty} \Big[\Phi\Big(\sqrt{\tfrac{\rho}{1-\rho}}\, w\Big)\Big]^{S_n}\, \Big[1 - \Phi\Big(\sqrt{\tfrac{\rho}{1-\rho}}\, w\Big)\Big]^{n-S_n}\, \phi(w)\, dw.$$

Ideally, we would hope that a properly normalized log L_n(ρ) converges to a deterministic function whose maximizer is the true ρ. But the following result says this is not the case.

Proposition 1.

Under the model formulation (3), for any ρ ∈ (0,1), with probability 1,

$$\frac{1}{n}\log L_n(\rho) \;-\; \Big[\hat q_n \log \hat q_n + (1-\hat q_n)\log(1-\hat q_n)\Big] \;\longrightarrow\; 0, \qquad \hat q_n = \frac{S_n}{n},$$

as n → ∞.

Proposition 1 is essentially a first-order Laplace approximation. It shows that the normalized log-likelihood function under model (1) does not converge to a deterministic function and, more interestingly, that the limit does not depend on ρ for ρ ∈ (0,1). Figure 1 illustrates the normalized log-likelihood curves for one realization of the data at a large sample size. It is clear that the functions are essentially flat over most of (0,1). Interestingly, the functions exhibit a sharp transition around ρ = 0. This is justified by the straightforward calculation L_n(0) = 2^{−n}, so that n^{−1} log L_n(0) = −log 2. Hence the limit at ρ = 0 is deterministic and does not even depend on the data.

Figure 1: Plots of normalized log-likelihood functions under model (1) for two parameter settings (left and right panels); each curve is computed from one realization of the data.
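The flatness in Figure 1 is easy to reproduce numerically from the integral form of L_n(ρ) given above (a sketch under our reconstructed notation; the data realization here is hypothetical):

```python
import numpy as np
from scipy.special import log_ndtr, logsumexp

def norm_loglik(rho, s, n, grid=20001):
    """(1/n) log L_n(rho) for S_n = s out of n, by log-domain trapezoidal
    integration over the mixing variable w in [-8, 8]."""
    w = np.linspace(-8.0, 8.0, grid)
    t = np.sqrt(rho / (1.0 - rho)) * w
    # log integrand: S_n log Phi(tw) + (n - S_n) log(1 - Phi(tw)) + log phi(w)
    log_f = (s * log_ndtr(t) + (n - s) * log_ndtr(-t)
             - 0.5 * w**2 - 0.5 * np.log(2 * np.pi))
    return logsumexp(log_f, b=np.full(grid, w[1] - w[0])) / n

n, s = 2000, 1400  # a hypothetical realization with S_n / n = 0.7
for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(rho, round(norm_loglik(rho, s, n), 4))  # essentially constant in rho
```

The printed values all sit near $\hat q \log \hat q + (1-\hat q)\log(1-\hat q)$ with $\hat q = 0.7$, illustrating Proposition 1.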

While the normalized likelihood function converges (to first order) to this unfavorable object, that does not by itself mean that likelihood-based estimates are inconsistent, because higher-order terms in the expansion may still carry useful information about the parameter of interest. Unfortunately, this is not the case either, as demonstrated below by a second-order analysis of the likelihood function.

Proposition 2.

Under the model formulation (3), for any ρ ∈ (0,1), as n → ∞,

$$\sqrt{n}\; e^{-n[\hat q_n \log \hat q_n + (1-\hat q_n)\log(1-\hat q_n)]}\, L_n(\rho) \;-\; \sqrt{2\pi \hat q_n (1-\hat q_n)}\; \sqrt{\frac{1-\rho}{\rho}}\; \frac{\phi\big(\sqrt{\tfrac{1-\rho}{\rho}}\,\Phi^{-1}(\hat q_n)\big)}{\phi\big(\Phi^{-1}(\hat q_n)\big)} \;\longrightarrow\; 0 \quad \text{almost surely},$$

where φ is the probability density function of N(0,1).

Proposition 2 can be considered a second-order Laplace approximation result. Proposition 1 shows that the first-order term of L_n(ρ) is exponentially small. According to Proposition 2, after L_n(ρ) is scaled by the fully data-dependent and exponentially small term $e^{n[\hat q_n \log \hat q_n + (1-\hat q_n)\log(1-\hat q_n)]}$, the dominating term is of order n^{−1/2}. Moreover, ρ enters the precise constant of this second-order term: writing $u_n = \Phi^{-1}(\hat q_n)$, the constant attains its maximum at ρ = u_n²/(1+u_n²). Since u_n → √(ρ/(1−ρ)) W almost surely, replacing W² by its expectation 1 in this maximizer yields exactly ρ. However, since the second-order term is a random function of W, we may not expect the maximum likelihood estimate to be consistent for ρ.
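The mechanics behind both propositions can be seen from a heuristic Laplace calculation applied to the integral form of L_n(ρ) above (our own sketch of the standard argument, not the authors' proof):

```latex
\[
L_n(\rho) = \int_{\mathbb{R}} e^{n h(w)} \phi(w)\, dw,
\qquad
h(w) = \hat q_n \log \Phi(tw) + (1-\hat q_n) \log\{1-\Phi(tw)\},
\quad t = \sqrt{\tfrac{\rho}{1-\rho}}.
\]
% The maximizer solves \Phi(t w^{*}) = \hat q_n, so w^{*} = \Phi^{-1}(\hat q_n)/t and
%   h(w^{*}) = \hat q_n \log \hat q_n + (1-\hat q_n) \log(1-\hat q_n),
% which is free of \rho: this is the content of Proposition 1. The curvature
%   h''(w^{*}) = - t^2 \phi\{\Phi^{-1}(\hat q_n)\}^2 / \{\hat q_n (1-\hat q_n)\}
% enters the second-order Laplace term
%   L_n(\rho) \approx e^{n h(w^{*})}\, \phi(w^{*}) \sqrt{2\pi / (n\,|h''(w^{*})|)},
% which is where \rho (and the random factor W) reappear: Proposition 2.
```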

3 Nonestimability of ρ

The preceding discussions in Section 2 indicate that the likelihood approach may not yield consistent estimates of ρ. Since model (1) is a simple parametric model, this may further suggest that no consistent estimates exist at all. Indeed, this is formally proved in the theorem below.

Theorem 1.

Under model (1), given any strictly increasing function ℓ(·) with ℓ(0) = 0 and any 0 ≤ ρ_1 < ρ_2 < 1,

$$\inf_{\hat\rho}\; \max_{\rho \in \{\rho_1, \rho_2\}}\; \mathbb{E}\, \ell\big(|\hat\rho - \rho|\big) \;>\; 0,$$

where ρ̂ is any measurable function of X_1, X_2, … and the expectation is taken over X_1, X_2, … under the model with parameter ρ.

Remark 1.

Theorem 1 reveals that no consistent estimate of ρ exists even when the parameter space consists of only two elements. At first glance, this result seems counter-intuitive: model (1) has a simple structure with a single parameter, yet there is no way to reliably estimate that parameter from infinitely many components of the data sequence. On the other hand, if the observations were the hidden sequence Z_1, Z_2, … instead of X_1, X_2, …, √n-consistent estimates could easily be constructed. For instance, denote $D_i = (Z_{2i} - Z_{2i-1})/\sqrt{2}$ for i = 1, …, ⌊n/2⌋. Lemma 1 implies that the D_i are independently and identically distributed as N(0, 1−ρ). Hence $1 - \lfloor n/2 \rfloor^{-1} \sum_i D_i^2$ is √n-consistent for ρ.
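A quick Monte Carlo check of the latent-data estimator (our own illustration; the paired-difference construction above is one convenient choice among several that work):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.4
w = rng.standard_normal()
z = np.sqrt(rho) * w + np.sqrt(1 - rho) * rng.standard_normal(n)  # as in (3)

d = (z[1::2] - z[0::2]) / np.sqrt(2)  # i.i.d. N(0, 1 - rho): W cancels
rho_hat = 1 - np.mean(d**2)
print(rho_hat)  # close to 0.4, with error of order n^{-1/2}
```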

Remark 2.

One might argue that it is the strong dependency in X_1, X_2, … that causes the issue. If the dependency could somehow be weakened, would ρ become estimable? For example, we can let the correlation parameter decay to zero with the sample size at a prescribed rate, and ask whether any estimate recovers it consistently in relative error. With a proof similar to that of Theorem 1, it is possible to show that the corresponding minimax lower bound remains bounded away from zero. Hence ρ remains nonestimable.

4 A solution

The arguments in the last two paragraphs suggest that the nonestimability of ρ in model (1) is due to the thresholding operation in the binary data generating process. Specifically, ρ is estimable from the latent sequence Z_1, …, Z_n, but thresholding the Z_i to produce the X_i loses too much information for ρ to be consistently estimated. In a nutshell, one bit of information about each Z_i is not sufficient to recover the common correlation among them.

Interestingly, we demonstrate below that in fact a little more than one bit of information per variable is enough to obtain √n-consistent estimates. Given two constants c_1 < c_2, denote the intervals I_1 = (−∞, c_1], I_2 = (c_1, c_2], and I_3 = (c_2, ∞), and suppose we observe

$$Y_i = \sum_{k=1}^{3} k\, \mathbb{1}(Z_i \in I_k), \qquad i = 1, \ldots, n.$$

That is, for every i, instead of observing the sign of Z_i, we know which one of the three intervals Z_i belongs to. When c_1 = c_2 = 0, this reduces to model (1). The theorem below gives one √n-consistent estimate.
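Before stating the result, it is worth seeing why two thresholds suffice when one does not. Conditionally on the shared factor W in (3), the empirical interval frequencies give two probit equations, and taking their difference eliminates the unobserved W (a sketch in our reconstructed notation):

```latex
\[
\hat p_k := \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(Z_i \le c_k)
  \;\xrightarrow{\ \mathrm{a.s.}\ }\;
  \Phi\!\left(\frac{c_k - \sqrt{\rho}\,W}{\sqrt{1-\rho}}\right),
  \quad k = 1, 2,
\qquad
\Phi^{-1}(\hat p_2) - \Phi^{-1}(\hat p_1)
  \;\xrightarrow{\ \mathrm{a.s.}\ }\;
  \frac{c_2 - c_1}{\sqrt{1-\rho}} .
\]
% With a single threshold (binary data) the limit is one equation in the two
% unknowns (W, \rho), which matches the nonestimability of Theorem 1.
```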

Theorem 2.

For all 0 ≤ ρ < 1, consider the estimate

$$\hat\rho_n = 1 - \left(\frac{c_2 - c_1}{\Phi^{-1}(\hat p_2) - \Phi^{-1}(\hat p_1)}\right)^{2}, \tag{5}$$

where $\hat p_k = n^{-1}\sum_{i=1}^n \mathbb{1}(Z_i \leq c_k)$ for k = 1, 2 (a function of the trinary data alone), and Φ^{−1} is the inverse function of Φ. Then as n → ∞, $\sqrt{n}(\hat\rho_n - \rho) = O_P(1)$.

Figure 2 empirically verifies the consistency of ρ̂_n asserted in Theorem 2.
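A simulation in the spirit of Figure 2, using the reconstructed estimator (5); the thresholds, sample size, and repetition count here are our own choices:

```python
import numpy as np
from scipy.special import ndtri  # Phi^{-1}

def rho_hat(z, c1=-0.5, c2=0.5):
    """Estimator (5): uses only which of the intervals (-inf, c1], (c1, c2],
    (c2, inf) each Z_i falls into, i.e. the trinary data."""
    p1, p2 = np.mean(z <= c1), np.mean(z <= c2)
    return 1.0 - ((c2 - c1) / (ndtri(p2) - ndtri(p1))) ** 2

rng = np.random.default_rng(2)
rho, reps, n = 0.3, 1000, 5000
est = []
for _ in range(reps):
    w = rng.standard_normal()
    z = np.sqrt(rho) * w + np.sqrt(1 - rho) * rng.standard_normal(n)
    est.append(rho_hat(z))
print(np.mean(est), np.std(est))  # mean near 0.3; sd shrinks like n^{-1/2}
```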

Remark 3.

Following the preceding discussion, we can further consider a general setting in which there are K (K ≥ 2) consecutive intervals and Y_i records which interval Z_i falls into. It is not hard to show that a √n-consistent estimate of ρ can be constructed in a similar way to (5) whenever K ≥ 3. At a high level, we may regard the observed sequence as a discretized version of Z_1, Z_2, …; Theorems 1 and 2 together then characterize a phase transition in the estimability of ρ: the parameter is estimable if and only if K ≥ 3.

Figure 2: Plots of the average and standard deviation of ρ̂_n in (5) over 1000 repetitions; the upper and lower panels correspond to two different parameter settings.

5 Discussion

We have revealed an interesting phenomenon regarding the estimability of the single parameter in a simple probit model. Several important directions are left open. For instance, suppose the latent vector Z is modeled by the Gaussian copula family (Klaassen et al., 1997; Tsukahara, 2005), with the copula parameter having the compound symmetry structure in (1). What can we say about the estimability of ρ? As another example, suppose the covariance matrix in (1) is replaced by the more general one in (2) for network modeling. It is clear that both ρ_1 and ρ_2 are nonestimable from the one bit of information per edge in X. The question is how much more information is needed to estimate them consistently. Would the phase transition phenomenon discussed in Remark 3 continue to hold?

Our results also carry a few insightful implications. For example, modeling a dependent exchangeable binary sequence is subtle: we have shown that inference is impossible even under a simple one-parameter model. Regarding binary network modeling, it might not be desirable to assume dependency among all of the edges. Furthermore, converting a weighted network into a binary one may lose substantial information from the standpoint of parameter estimation.

Acknowledgement

The authors are grateful to Professor Anthony C. Davison and Professor Zhiliang Ying for their insightful comments which greatly improved the scope and presentation of this paper.

Appendix

Appendix A Notations

Recall that Φ and φ are the cumulative distribution function and probability density function of a standard normal, respectively, and Φ^{−1} is the inverse function of Φ. We will use this notation extensively. We also write a_n ≍ b_n to mean that a_n and b_n are of the same order. The next section collects a few useful lemmas that will be applied in the later proofs.

Appendix B Useful lemmas

Lemma 2.

The following statements hold:

(i) Φ is strictly log-concave
(ii)
(iii)
(iv)
(v) , as .

Proof.

Part (i) can be easily verified. Parts (ii) and (iii) can be shown by using the normal tail approximation Φ(−t) ∼ φ(t)/t as t → ∞. Part (iv) is due to the fact that the second derivative in question is negative everywhere and vanishes only in the limit. Part (v) is taken from (2.6) in Wong (2001) with a minor change. ∎

Lemma 3.

Consider model (1). As n → ∞,

Proof.

The proof is a direct adaptation of the calculations for the integral (2.1) in Wong (2001); we hence do not repeat the arguments. ∎

Lemma 4.

Consider model (1). As n → ∞,

Proof.

From Lemma 2 part (v), it is straightforward to confirm that

So there exists such that for . Hence it suffices to show is bounded. From the model formulation (3), it is easy to see that

We thus focus on bounding . For notational simplicity, we denote

First,

Also, for large ,

where we have used Hoeffding’s inequality in , and the inequality for large in . Hence we obtain . The proof will be completed if we can show . This is true because when is large enough. ∎

Appendix C Proof of Proposition 1

Proof.

It is straightforward to verify that when . Hence we can readily obtain the upper bound, if

(6)

Regarding the lower bound, if there exists an absolute constant such that

where is by a Taylor expansion and is due to Lemma 2 part (ii). Lemma 1 from the main text implies that , almost surely. Hence, almost surely as ,

These results, combined with the upper and lower bounds in (6) and (C), complete the proof. ∎

Appendix D Proof of Proposition 2

Proof.

We first restrict our analysis to the case . We have

We focus on the first integral above. Lemma 2 part (i) says that log Φ is strictly concave. Hence is increasing in . By a change of variable we obtain

Denote . Then yielding

(8)

Based on the Taylor expansions

(9)

with , the following holds

Next we bound the second term on the right-hand side of (8). With a few calculations we get

(10)

According to Lemma 2 parts (ii) and (iv), there exist absolute constants such that for any

(11)

Then, using the expansions we derived for and in (9), we can obtain

(12)

where is a positive constant only depending on . To bound the other term in (10), we need take the Taylor expansions to a higher order,

(13)
(14)

Plugging the above expansions, together with , into the numerator of the second term in (10), and using in the denominator, it is not hard to obtain

where the positive constants only depend on . To derive we have used Lemma 2 parts (ii) and (iii), and is due to (11). Combining the above upper bound with (