1 Problem statement
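We consider a simple one-parameter probit model under dependency. The observed binary data $y_1, \ldots, y_n$ are generated by thresholding a multivariate Gaussian vector with compound symmetry covariance structure:
$$y_i = 1(z_i > 0), \quad i = 1, \ldots, n, \qquad (z_1, \ldots, z_n)^T \sim N(0, \Sigma_\rho), \qquad \Sigma_\rho = (1-\rho) I_n + \rho 1_n 1_n^T. \tag{1}$$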
The single parameter $\rho$ characterizes the dependency in the exchangeable sequence $y_1, \ldots, y_n$. The covariance matrix can be considered as a special case of the spiked covariance form proposed in Johnstone & Lu (2004, 2009)
for studying high-dimensional principal component analysis. The central question we focus on in this paper is:
As $n \to \infty$, can $\rho$ be consistently estimated from the sequence $y_1, \ldots, y_n$?
In contrast to usual probit models, where $\rho = 0$ and the mean of the latent Gaussian vector is parameterized by a function of covariates, model (1) puts emphasis on the dependency structure of the observations, and is possibly one of the simplest models along this line.
The model can be well motivated from network data analysis. For notational convenience, we index the observations and the latent Gaussian variables by node pairs $(i, j)$ and arrange each collection in a matrix. The Bernoulli entry for the pair $(i, j)$ represents whether there exists an edge between nodes $i$ and $j$. We consider undirected networks, where the matrix of observations is symmetric, for simplicity. The covariance matrix belongs to . Due to the rich structures exhibited in different types of networked systems (Newman, 2003), there has been an extensive literature on network modeling, including the Erdős–Rényi random graph model (Erdős & Rényi, 1959), the exponential random graph model (Robins et al., 2007), the stochastic blockmodel (Holland et al., 1983), and the latent space model (Hoff et al., 2002), among others. As an alternative, model (1) assumes the edges between nodes are generated from underlying Gaussian variables, and the covariance matrix captures the possible dependency among different edges. When $\rho = 0$, (1) is an example of an Erdős–Rényi random graph. We should emphasize that model (1) is not sophisticated enough for fitting most real network data. One possible generalization is
The mean parameter is introduced to incorporate heterogeneity across different edges. It can be assumed to be of low rank, based on the hypothesis that the generation mechanism of edges is driven by a few node-specific factors. This is in the same spirit as both the stochastic blockmodel and the latent space model. Furthermore, a more general structure can be imposed on the covariance matrix. For instance, in terms of the covariance between the latent variables of two node pairs $(i, j)$ and $(k, l)$, one possible generalization is
where $\emptyset$ is the empty set. The formulation (2) automatically enforces the symmetry constraint, which is required for modeling undirected networks. In this case, the dependencies among edges that share common nodes and those that do not are characterized by two different parameters. To fix ideas, we will focus on the simplified model (1). Nevertheless, our analysis will shed light on general probit modeling of networks. We start by investigating the likelihood approach for estimating $\rho$ and showing its infeasibility. We then formally prove the nonestimability of $\rho$ under model (1) and provide a solution to make the estimation possible. Finally, we discuss some implications of our results regarding binary data modeling.
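To make the network reading of model (1) concrete, the following sketch simulates an undirected binary network whose edges come from thresholding equicorrelated Gaussian variables. It assumes that model (1) has unit latent variances, a zero threshold and $0 \le \rho < 1$, and it uses the standard shared-factor construction of an equicorrelated Gaussian vector. The function name and its arguments are illustrative and are not taken from the paper.

```python
import numpy as np

def simulate_probit_network(m, rho, rng):
    """Simulate an m-node undirected network under the assumed form of model (1).

    Each of the m(m-1)/2 node pairs gets a latent N(0, 1) variable, any two
    latent variables have correlation rho, and an edge exists when the
    latent variable is positive.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch assumes 0 <= rho < 1")
    n_edges = m * (m - 1) // 2
    shared = rng.standard_normal()            # one factor shared by all edges
    noise = rng.standard_normal(n_edges)      # edge-specific noise
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    adj = np.zeros((m, m), dtype=int)
    upper = np.triu_indices(m, k=1)           # node pairs (i, j) with i < j
    adj[upper] = (z > 0).astype(int)
    return adj + adj.T                        # symmetric, zero diagonal

rng = np.random.default_rng(0)
A = simulate_probit_network(50, 0.3, rng)
print(A.sum() // 2, "edges out of", 50 * 49 // 2)
```

With `rho = 0` the edges are independent fair coin flips, which is the Erdős–Rényi special case mentioned above. For larger `rho` the edge density of a single simulated network varies widely from one draw to the next.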
2 Likelihood methods
Since model (1) takes a simple parametric form, our first attempt is to check if the maximum likelihood estimate is consistent. Before that we give an alternative formulation of the model that will be useful in later discussions.
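Lemma 1. Model (1) can be reformulated as
$$y_i = 1\big(\sqrt{\rho}\,\xi_0 + \sqrt{1-\rho}\,\xi_i > 0\big), \quad i = 1, \ldots, n, \tag{3}$$
where $\xi_0, \xi_1, \ldots, \xi_n$ are independently and identically distributed as $N(0, 1)$.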
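The reformulation can be checked numerically. The sketch below assumes the standard shared-factor representation $z_i = \sqrt{\rho}\,\xi_0 + \sqrt{1-\rho}\,\xi_i$ with $0 \le \rho < 1$, which gives unit variances and common correlation $\rho$. The function name is hypothetical and the paper's own notation may differ.

```python
import numpy as np

def simulate_binary_sequence(n, rho, rng):
    """Return (y, z, xi0): binary data, latent variables and the shared factor.

    Assumes z_i = sqrt(rho) * xi0 + sqrt(1 - rho) * xi_i with i.i.d. N(0, 1)
    variables xi0, xi_1, ..., xi_n, and y_i = 1(z_i > 0).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch assumes 0 <= rho < 1")
    xi0 = rng.standard_normal()               # shared factor
    xi = rng.standard_normal(n)               # idiosyncratic terms
    z = np.sqrt(rho) * xi0 + np.sqrt(1.0 - rho) * xi
    y = (z > 0).astype(int)
    return y, z, xi0

rng = np.random.default_rng(1)
y, z, xi0 = simulate_binary_sequence(10_000, 0.4, rng)
print("shared factor:", xi0, "fraction of ones:", y.mean())
```

Conditional on the shared factor the $y_i$ are independent, which is what makes the likelihood a one-dimensional integral.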
be the cumulative distribution function of, , and
According to Lemma 1, we can write down the likelihood function,
Ideally, we hope that a properly normalized log-likelihood converges to a deterministic function whose maximizer is the true value of $\rho$. But the following result says this is not the case.
Proposition 1 is essentially a first-order Laplace approximation. It shows that the normalized log-likelihood function in (1) does not converge to a deterministic function, and more interestingly the limiting function is invariant to $\rho$ for $\rho \in (0, 1)$. Figure 1 illustrates the normalized log-likelihood function curves when and the sample size . It is clear that the functions are pretty flat over most of the parameter range. Interestingly, the functions have a sharp transition at around $\rho = 0$. This can be justified by a straightforward calculation showing that the likelihood at $\rho = 0$ equals $2^{-n}$ regardless of the data. Hence the limit at $\rho = 0$ is deterministic and does not even depend on the data.
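The flatness of the normalized log-likelihood can be reproduced numerically. The sketch below assumes the likelihood takes the one-dimensional integral form implied by the shared-factor representation, $\int \Phi(a t)^{s} \{1 - \Phi(a t)\}^{n-s} \phi(t)\,dt$ with $a = \sqrt{\rho/(1-\rho)}$ and $s$ the number of ones. The paper's displayed likelihood may be written differently. The grid-based integration and all names are illustrative choices.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def log_norm_cdf(x):
    """log Phi(x) elementwise, floored to avoid log(0) far in the lower tail."""
    cdf = 0.5 * _erfc(-np.asarray(x, dtype=float) / math.sqrt(2.0))
    return np.log(np.maximum(cdf, 1e-300))

def normalized_loglik(rho, s, n, grid_size=4001, t_max=8.0):
    """(1/n) * log-likelihood for s ones among n observations, 0 <= rho < 1.

    The integral over the shared factor is approximated by a Riemann sum on a
    fixed grid, computed on the log scale to avoid underflow. The grid is
    adequate for moderate n; very large n would need a finer grid.
    """
    a = math.sqrt(rho / (1.0 - rho))
    t = np.linspace(-t_max, t_max, grid_size)
    log_integrand = (s * log_norm_cdf(a * t) + (n - s) * log_norm_cdf(-a * t)
                     - 0.5 * t**2 - 0.5 * math.log(2.0 * math.pi))
    peak = log_integrand.max()
    dt = t[1] - t[0]
    integral_scaled = np.sum(np.exp(log_integrand - peak)) * dt
    return (peak + math.log(integral_scaled)) / n

for rho in (0.0, 0.1, 0.5, 0.9):
    print(rho, normalized_loglik(rho, s=600, n=2000))
```

In this sketch the value at `rho = 0` is exactly $-\log 2$ whatever `s` is, while for `rho` away from zero the values are close to one another, which matches the qualitative behaviour described for Figure 1.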
While the normalized likelihood function converges (in first order) to an unfavorable object, it does not necessarily mean that likelihood based estimates are not consistent, because the higher order terms in the limiting function may contain useful information for the parameter of interest. Unfortunately, this is not true either, as demonstrated below by a second order analysis of the likelihood function.
Proposition 2 can be considered as a second-order Laplace approximation result. Proposition 1 shows that the first-order term of is exponentially small. According to Proposition 2, after is scaled by a fully data-dependent and exponentially small term , the dominating term is of order . Moreover, the number comes into play in the precise constant of the second-order term of , and the constant attains its maximum at . If we replace with its expectation, then would equal . However, since the second-order term is a random function of , we may not expect the maximum likelihood estimate to be consistent for $\rho$.
3 Nonestimability of $\rho$
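The analysis in Section 2 indicates that the maximum likelihood estimate cannot be expected to be consistent. Given that model (1) is a simple parametric model, it may further imply that no consistent estimates exist. Indeed, this is formally proved in the theorem below.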
Under model (1), given any strictly increasing function with and any ,
where is any measurable function and the expectation is taken over .
Theorem 1 reveals that no consistent estimates of $\rho$ exist even when the parameter space consists of only two elements. At first glance, this result seems counter-intuitive. Model (1) has a simple structure with a single parameter, but there is no way to reliably estimate the parameter from infinitely many components of the data sequence. On the other hand, if we observe the hidden Gaussian sequence instead of the binary one, $\sqrt{n}$-consistent estimates can be easily constructed. For instance, denote . Lemma 1 implies that are independently and identically distributed as . Hence is $\sqrt{n}$-consistent for $\rho$.
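As an illustration of why the latent sequence is informative, the following sketch estimates $\rho$ from the hidden Gaussian variables. The paper's own construction is not recoverable from this text, so the sketch uses one simple choice as an assumption: under the shared-factor representation, centering removes the shared factor, and the sample variance of the latent sequence estimates $1 - \rho$. The function name is hypothetical.

```python
import numpy as np

def estimate_rho_from_latent(z):
    """Estimate rho as one minus the sample variance of the latent sequence.

    Under z_i = sqrt(rho) * xi0 + sqrt(1 - rho) * xi_i, the centered values
    z_i - mean(z) equal sqrt(1 - rho) * (xi_i - mean(xi)), so the sample
    variance converges to 1 - rho at the usual root-n rate.
    """
    z = np.asarray(z, dtype=float)
    return 1.0 - np.var(z, ddof=1)

rng = np.random.default_rng(2)
rho = 0.6
for n in (100, 10_000, 1_000_000):
    xi0 = rng.standard_normal()
    z = np.sqrt(rho) * xi0 + np.sqrt(1.0 - rho) * rng.standard_normal(n)
    print(n, abs(estimate_rho_from_latent(z) - rho))
```

The estimate does not depend on the realized shared factor, which is exactly the piece of information that the binary data cannot separate from $\rho$.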
One might argue that it is the strong dependency in causing the issue. If the dependency can be somehow weakened, will become estimable? For example, we can consider with a constant and , and ask if there is any estimate such that . With a similar proof as the one for Theorem 1, it is possible to show
Hence remains nonestimable.
4 A solution
The arguments in the last two paragraphs suggest that the nonestimability of $\rho$ in model (1) is due to the thresholding operation in the binary data generating process. Specifically, $\rho$ is estimable from the latent Gaussian sequence, but thresholding it to produce the binary observations discards too much information for $\rho$ to be consistently estimated. In a nutshell, one bit of information about each latent variable is not sufficient to recover the common correlation among them.
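A small simulation can illustrate the information loss. It assumes the shared-factor representation of model (1), under which the binary observations are conditionally independent Bernoulli variables with success probability $\Phi(\sqrt{\rho/(1-\rho)}\,\xi_0)$ given the shared factor $\xi_0$. The sketch draws the fraction of ones in many independent sequences and shows that its spread does not shrink as $n$ grows. Names are illustrative.

```python
import math
import numpy as np

def fraction_of_ones(n, rho, n_rep, rng):
    """Fractions of ones in n_rep independent binary sequences of length n."""
    a = math.sqrt(rho / (1.0 - rho))
    xi0 = rng.standard_normal(n_rep)          # one shared factor per sequence
    # Given xi0, each y_i is Bernoulli with probability Phi(a * xi0).
    p = 0.5 * np.vectorize(math.erfc)(-a * xi0 / math.sqrt(2.0))
    return rng.binomial(n, p) / n

rng = np.random.default_rng(3)
for rho in (0.2, 0.5):
    for n in (100, 10_000, 1_000_000):
        fractions = fraction_of_ones(n, rho, n_rep=5000, rng=rng)
        print(rho, n, fractions.std())
```

Each observed sequence yields essentially one draw from a non-degenerate distribution on $(0, 1)$, so increasing $n$ sharpens the estimate of that single draw without revealing which $\rho$ generated it.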
Interestingly, we demonstrate below that in fact a little more than one bit of information is enough to obtain $\sqrt{n}$-consistent estimates. Given two constants , denote . Suppose
That is, for every $i$, instead of observing the sign of the latent variable $z_i$, we know which one of the three intervals it belongs to. When , it reduces to the model (1). The theorem below gives one $\sqrt{n}$-consistent estimate.
Theorem 2. For all , consider the estimate
where , and $\Phi^{-1}$ is the inverse function of $\Phi$. Then as $n \to \infty$, .
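The exact form of the estimate (5) is not recoverable from this text, so the sketch below shows one natural estimator of this type rather than the paper's formula. It assumes two known thresholds `tau1 < tau2` and the shared-factor representation. Given the shared factor, the proportions of latent variables below each threshold converge to $\Phi\{(\tau_k - \sqrt{\rho}\,\xi_0)/\sqrt{1-\rho}\}$, so the difference of their $\Phi^{-1}$ transforms converges to $(\tau_2 - \tau_1)/\sqrt{1-\rho}$, which can be solved for $\rho$. All names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def trinarize(z, tau1, tau2):
    """Code each latent value as 0 (below tau1), 1 (between) or 2 (above tau2)."""
    return np.digitize(z, [tau1, tau2])

def estimate_rho_trinary(y, tau1, tau2):
    """Estimate rho from trinary codes using the two empirical proportions."""
    y = np.asarray(y)
    eps = 0.5 / y.size                        # keep proportions inside (0, 1)
    p1 = min(max(float(np.mean(y == 0)), eps), 1.0 - eps)
    p2 = min(max(float(np.mean(y <= 1)), eps), 1.0 - eps)
    inv = NormalDist().inv_cdf
    gap = inv(p2) - inv(p1)                   # approx (tau2 - tau1) / sqrt(1 - rho)
    if gap <= 0.0:
        return float("nan")                   # no observation in the middle interval
    return 1.0 - ((tau2 - tau1) / gap) ** 2

rng = np.random.default_rng(4)
rho, tau1, tau2 = 0.3, -0.5, 0.5
for n in (1_000, 100_000):
    z = np.sqrt(rho) * rng.standard_normal() + np.sqrt(1.0 - rho) * rng.standard_normal(n)
    print(n, estimate_rho_trinary(trinarize(z, tau1, tau2), tau1, tau2))
```

The shared factor shifts both transformed proportions by the same amount and cancels in their difference, which is why the third category restores estimability in this sketch.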
Following the preceding discussion, we can further consider a general setting where the real line is partitioned into several consecutive intervals and each observation records which interval the corresponding latent variable falls into. It is not hard to show that a $\sqrt{n}$-consistent estimate for $\rho$ can be constructed in a similar way as in (5). At a high level, we may consider the observed sequence as a discretized version of the latent Gaussian sequence. Theorems 1 and 2 together characterize a phase transition regarding the estimability of $\rho$. That is, $\rho$ is estimable if and only if the discretization uses at least three intervals.
We have revealed an interesting phenomenon regarding the estimability of a single parameter in a simple probit model. Several important directions are left open. For instance, suppose the latent vector is modeled by the Gaussian copula family (Klaassen et al., 1997; Tsukahara, 2005) with the parameter having compound symmetry covariance structure as in (1). What can we say about the estimability of $\rho$? As another example, suppose the covariance matrix in (1) is replaced by the more general one in (2) in network modeling. It is clear that both parameters in (2) are nonestimable from one bit of information per latent variable. The question is how much more information is needed to consistently estimate them. Would the phase transition phenomenon we discussed in Remark 3 continue to hold?
Our results also have a few implications. For example, modeling a dependent exchangeable binary sequence is subtle. We have shown that consistent estimation can be impossible even under a simple one-parameter model. Regarding binary network modeling, it might not be desirable to assume dependency among all the edges. Furthermore, converting a weighted network to a binary one may discard substantial information that is needed for parameter estimation.
The authors are grateful to Professor Anthony C. Davison and Professor Zhiliang Ying for their insightful comments which greatly improved the scope and presentation of this paper.
Appendix A Notations
Recall that $\Phi$ and $\phi$ are the cumulative distribution function and probability density function of a standard normal, respectively, and $\Phi^{-1}$ is the inverse function of $\Phi$. We will use the following notation extensively,
We also use to mean and are of the same order. The next section collects a few useful lemmas that will be applied in the later proofs.
Appendix B Useful lemmas
The following holds
(i) is strictly log-concave
(v) , as .
Part (i) can be easily verified. Parts (ii) and (iii) can be shown by using the tail probability , as . Part (iv) is due to the fact that the second derivative of is negative for any and goes to only when . Part (v) is taken from (2.6) in Wong (2001) with minor changes. ∎
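The tail estimate used for parts (ii) and (iii) is not shown in this text. Assuming it is the standard Gaussian tail asymptotic $1 - \Phi(x) \sim \phi(x)/x$ as $x \to \infty$, a quick numerical check of how fast the ratio approaches one can be sketched as follows. The function name is illustrative.

```python
import math

def mills_ratio_check(x):
    """Return x * (1 - Phi(x)) / phi(x), which tends to 1 as x grows."""
    upper_tail = 0.5 * math.erfc(x / math.sqrt(2.0))    # 1 - Phi(x)
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return x * upper_tail / density

for x in (2.0, 5.0, 10.0, 20.0):
    print(x, mills_ratio_check(x))
```

The ratio stays below one and increases towards it, consistent with the classical bounds $\phi(x)\,x/(1+x^2) \le 1 - \Phi(x) \le \phi(x)/x$ for $x > 0$.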
Consider model (1), as ,
The proof is a direct adaptation of the calculations of the integral (2.1) in Wong (2001). We hence do not repeat the arguments. ∎
Consider model (1), as ,
From Lemma 2 part (v), it is straightforward to confirm that
So there exists such that for . Hence it suffices to show is bounded. From the model formulation (3), it is easy to see that
We thus focus on bounding . For notational simplicity, we denote
Also, for large ,
where we have used Hoeffding’s inequality in , and the inequality for large in . Hence we obtain . The proof will be completed if we can show . This is true because when is large enough. ∎
Appendix C Proof of Proposition 1
It is straightforward to verify that when . Hence we can readily have the upper bound, if
Regarding the lower bound, if there exists an absolute constant such that
where is by a Taylor expansion and is due to Lemma 2 part (ii). Lemma 1 from the main text implies that , almost surely. Hence, almost surely as ,
Appendix D Proof of Proposition 2
We first restrict our analysis to the case . We have
We focus on the first integral above. Lemma 2 part (i) says that is strictly concave. Hence is increasing in . By a change of variable we obtain
Denote . Then yielding
Based on the Taylor expansions
with , the following holds
Next we bound the second term on the right-hand side of (8). With a few calculations we get
According to Lemma 2 part (ii) and (iv), there exist absolute constants such that for any
Then using the expansions we had about and in (9), we can obtain
where is a positive constant depending only on . To bound the other term in (10), we need to take the Taylor expansions to a higher order,
Plugging the above expansions together with into the numerator of the second term in (10) and using in the denominator, it is not hard to obtain