Parameter estimation and learning from truncated samples is an important and challenging problem in Statistics. The goal is to estimate the parameters of the true distribution based only on samples that fall within a (possibly small) subset of the distribution’s support.
Sample truncation occurs naturally in a variety of settings in science, engineering, economics, business and social sciences. Typical examples include selection bias in epidemiology and medical studies, and anecdotal “paradoxes” in damage and injury analysis explained by survivor bias. Statistical estimation from truncated samples goes back to at least Galton1897, who analyzed truncated samples corresponding to speeds of American trotting horses, and includes classical results on the use of the moments method (PearsonLee1908; Lee1914) and the maximum likelihood method (fisher31) for estimating a univariate Gaussian distribution from truncated samples (see also DGTZ18 for a detailed discussion on the history and the significance of statistical estimation from truncated samples).
In the last few years, there has been an increasing interest in computationally and statistically efficient algorithms for learning multivariate Gaussian distributions from truncated samples (when the truncation set is known (DGTZ18) or unknown (KTZ19)) and for training linear regression models based on truncated (or censored) data (DGTZ19). In addition to the elegant and powerful application of Stochastic Gradient Descent to optimizing a seemingly unknown maximum likelihood function from truncated samples, a significant contribution of (DGTZ18; KTZ19; DGTZ19) concerns necessary conditions for efficient statistical estimation of multivariate Gaussian or regression models from truncated samples. More recently, NP19 showed how to use Expectation-Maximization for learning mixtures of two Gaussian distributions from truncated samples.
Despite the strong results above on efficient learning from truncated samples for continuous settings, we are not aware of any previous work on learning discrete models from truncated samples. We note that certain elements of the prior approaches to inference from truncated data are inherently continuous, and it is not clear to what extent (and under which conditions) they can be adapted to a discrete setting. For example, statistical estimation from truncated samples in a discrete setting should deal with a situation where the truncation removes virtually all randomness from certain directions, something that cannot be the result of nontrivial truncations in a continuous setting.
Motivated by this gap in the relevant literature, we investigate efficient parameter estimation of discrete models from truncated samples. We start with the fundamental setting of a Boolean product distribution on the -dimensional hypercube truncated by a set , which is accessible through membership queries. The marginal of in each direction is an independent Bernoulli distribution with parameter . Our goal is to compute an estimation of the parameter vector of such that , with probability of at least , with time and sample complexity polynomial in , and . We note that such an estimation (or an estimation of the logit parameters of similar accuracy) implies an estimation of the true distribution within total variation distance .
Significantly departing from the maximum likelihood estimation approach of DGTZ18; KTZ19; DGTZ19, we introduce a natural notion of fatness of the truncation set , under which samples from the truncated distribution reveal enough information about the true distribution . Roughly speaking, a truncated Boolean product distribution is -fat in some direction of the Boolean hypercube, if for an probability mass of the truncated samples, the neighboring sample with its -th coordinate flipped is also in . Therefore, with probability , conditional on the remaining coordinates, the -th coordinate of a sample is distributed as the marginal of the true distribution in direction . So, if the truncated distribution is -fat in all directions (e.g., the halfspace of all vectors with norm at most is a fat subset of the Boolean hypercube), a sample from is quite likely to reveal significant information about the true distribution . Building on this intuition, we show how samples from the true distribution can be generated from few truncated samples (see also Algorithm 1):
Informal Theorem 1.
With an expected number of samples from the -fat truncation of a Boolean product distribution , we can generate a sample distributed as in .
We show (Lemma 3) that fatness is also a necessary condition for Theorem 1. A stunning consequence of Theorem 1 is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, sparse recovery, uniformity or identity testing, differentially private uniformity testing) that can be performed efficiently for a Boolean product distribution , can also be performed using truncated samples from , at the expense of a factor increase in time and sample complexity. In Section 3, we obtain, as simple corollaries of Theorem 1, that the statistical tasks described in ADK15; DKS17; CDKS17; CKM+19 for Boolean product distributions can be performed using only truncated samples!
To further demonstrate the power and the wide applicability of our approach, we extend the notion of fatness to the richer and more complex setting of ranking distributions on alternatives. In Section 3.5, we show how to implement efficient statistical inference of Mallows models using samples from a fat truncated Mallows distribution (see Theorem 11).
Natural and powerful though it is, fatness is far from necessary for efficient parameter estimation from truncated samples. Seeking a deeper understanding of the challenges of learning discrete models from truncated samples, we identify, in Section 4, three natural conditions that we show to be necessary for efficient parameter estimation in our setting:
- Assumption 1: The support of the distribution on should be rich enough, in the sense that its truncation should assign positive probability to a and other vectors that remain linearly independent after we subtract from them.
- Assumption 2: is accessible through a membership oracle that reveals whether , for any in the -dimensional hypercube.
- Assumption 3: The truncation of by leaves enough randomness in all directions. More precisely, we require that in any direction , any two samples from the truncated distribution have sufficiently different projections on , with non-negligible probability.
Assumption 2 ensures that the learning algorithm has enough information about and is also required in the continuous setting. Without oracle access to , for any Boolean product distribution , we can construct a (possibly exponentially large) truncation set such that sampling from the truncated distribution appears identical to sampling from the uniform distribution, until the first duplicate sample appears (our construction is similar to (DGTZ18, Lemma 12)).
Similarly to DGTZ18, Assumption 2 is complemented by the additional natural requirement that the true distribution should assign non-negligible probability mass to the truncation set (Assumption 4). The reason is that the only way for a parameter estimation algorithm to evaluate the quality of its current estimation is by generating samples in and comparing them with samples from . Assumptions 2 and 4 ensure that this can be performed efficiently.
Assumptions 1 and 3 are specific to the discrete setting of the Boolean hypercube. Assumption 1 requires that we should be able to normalize the truncation set , by subtracting a vector , so that its dimension remains . If this is true, we can recover the parameters of a Boolean product distribution from truncated samples by solving a linear system with equations and unknowns, which we obtain after normalization. We prove, in Lemma 12, that Assumption 1 is both sufficient and necessary for parameter recovery from truncated samples in our setting.
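The linear system mentioned above can be illustrated with a minimal sketch, under a stated reading that is our assumption rather than the paper's exact construction. If the product distribution is written in exponential-family form with logit parameters z and the all-zeros vector belongs to the normalized truncation set, then the ratio between the truncated probability of a point x and that of the zero vector equals exp(x·z), since the normalizing constants cancel. Any d linearly independent points of the set then yield d equations in d unknowns. The set `S`, the parameters `p_true` and all function names below are hypothetical, and the sketch uses exact truncated probabilities rather than estimates from samples.

```python
import math


def truncated_pmf(p, S):
    """Exact pmf of a Boolean product distribution with parameters p, conditioned on S."""
    def mass(x):
        return math.prod(pi if xi == 1 else 1.0 - pi for pi, xi in zip(p, x))
    total = sum(mass(x) for x in S)
    return {x: mass(x) / total for x in S}


def solve_linear_system(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small square system A z = b."""
    n = len(A)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("the chosen points are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def recover_parameters(pmf_S, points):
    """Solve x . z = log(pmf_S[x] / pmf_S[0]) for z, then map logits back to parameters."""
    d = len(points)
    zero = (0,) * d
    rhs = [math.log(pmf_S[x] / pmf_S[zero]) for x in points]
    z = solve_linear_system(points, rhs)
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


# Hypothetical example: no point of S has a neighbour (one flipped coordinate) in S.
p_true = [0.3, 0.6, 0.8]
S = [(0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
pmf_S = truncated_pmf(p_true, S)
p_rec = recover_parameters(pmf_S, S[1:])
```

With truncated probabilities estimated from samples instead of computed exactly, the conditioning of the system matters, which is the role the text assigns to Assumption 3.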
Assumption 3 is a stronger version of Assumption 1 and is necessary for efficient parameter estimation from truncated samples in the Boolean hypercube. It essentially requires that with sufficiently high probability, any set of polynomially many samples from can be normalized, subtracting a vector , so that includes a well-conditioned matrix, after normalization.
Beyond showing that these assumptions are necessary for efficient identifiability, we show that they are also sufficient and provide a computationally efficient algorithm for learning Boolean product distributions. Our algorithm is based on a careful adaptation of the approach of DGTZ18, which uses Stochastic Gradient Descent on the negative log-likelihood. While the analysis consists of the same conceptual steps as that of DGTZ18, it requires dealing with a number of technical details that arise due to discreteness. One technical contribution of our work is using the necessary assumptions for identifiability to establish strong convexity of the negative log-likelihood in a small ball around the true parameters (see Lemma LABEL:lem:str-conv and Lemma 25 in Appendix C). Our main result is that:
Our work develops novel techniques for truncated statistics for discrete distributions. As mentioned above, a large number of recent works deal with inference from truncated data from a Gaussian distribution (DGTZ18; KTZ19; DGTZ19) or from mixtures of Gaussians (NP19), but, to the best of our knowledge, no prior work deals with discrete distributions. An additional feature of our work compared to those results is that our methods are not limited to parameter estimation but enable any statistical task to be performed on truncated datasets by providing a sampler for the true underlying distribution. While this requires a mildly stronger than necessary but natural assumption on the truncation set, we show that the more complex SGD-based methods developed in prior work can also be applied in the discrete settings we consider.
The field of robust statistics is also closely related to our work, since it likewise deals with biased datasets and aims to identify the distribution that generated the data. Truncation can be seen as an adversary erasing samples outside a certain set. Recently, there has been a lot of theoretical work on computationally efficient robust estimation of high-dimensional distributions in the presence of arbitrary corruptions to a small fraction of the samples, allowing for both deletions and additions of samples (DKK+16b; CSV17; LRV16; DKK+17; DKK+18; hopkins2019hard). In particular, the work of DKK+16b deals with the problem of learning binary product distributions.
Another line of related work concerns learning from positive examples. The work of de2014learning considers a setting where samples are obtained from the uniform distribution over the hypercube truncated on a set . However, their goal is somewhat orthogonal to ours. It aims to accurately learn the set while the distribution is already known. In contrast, in our setting the truncation set is known and the goal is to learn the distribution. More recently, (canonne2020learning) extend these results to learning the truncation set with truncated samples from continuous distributions.
Another related line of work within learning theory aims to learn discrete distributions through conditional samples. In the conditional sampling model, introduced concurrently by ChakrabortyFGM13; ChakrabortyFGM16 and CanonneRS14; CanonneRS15, the goal is again to learn an underlying discrete distribution through conditional/truncated samples, but the learner can change the truncation set on demand. This is known to be a more powerful model for distribution learning and testing than standard sampling (Canonne15b; FalahatgarJOPS15; AcharyaCK15b; BhattacharyyaC18; AcharyaCK15a; GouleakisTZ17; KamathT19; canonne2019random).
We use lowercase bold letters to denote -dimensional vectors. We let denote the norm and denote the norm of a vector . We let and . denotes the -dimensional Boolean hypercube.
For any vector , is the vector obtained from by removing the -th coordinate and is the vector obtained from by replacing by . Similarly, given a set , we let be the projection of to . For any and any coordinate , we let denote with its -th coordinate flipped.
For any , we let denote the Bernoulli distribution with parameter . For any , denotes the probability of value under . The Bernoulli distribution is an exponential family (see footnote 1), where the natural parameter, denoted , is the logit of the parameter (see footnote 2). The inverse parameter mapping is . Also, the base measure is , the sufficient statistic is the identity mapping and the log-partition function with respect to is .
Footnote 1: The exponential family with sufficient statistics , carrier measure and natural parameters is the family of distributions , where the probability distribution has density .
Footnote 2: The base of the logarithm function used throughout the paper is insignificant.
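As a small illustration of the parameterization just described, the sketch below writes out the standard exponential-family form of a Bernoulli distribution: the natural parameter is the logit of the parameter, the inverse mapping is the logistic function, and the log-partition function is log(1 + e^z). The natural logarithm and the function names are our choices for illustration.

```python
import math


def logit(p):
    """Natural parameter z of a Bernoulli distribution with parameter p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Inverse parameter mapping: recovers p from the natural parameter z."""
    return 1.0 / (1.0 + math.exp(-z))


def log_partition(z):
    """Log-partition function of the Bernoulli family: log(1 + e^z)."""
    return math.log1p(math.exp(z))


def bernoulli_pmf(x, z):
    """Probability of x in {0, 1}: base measure 1, sufficient statistic x."""
    return math.exp(x * z - log_partition(z))


z = logit(0.3)
prob_one = bernoulli_pmf(1, z)
prob_zero = bernoulli_pmf(0, z)
```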
Boolean Product Distribution.
We mostly focus on a fundamental family of Boolean product distributions on the -dimensional hypercube . A Boolean product distribution with parameter vector , usually denoted by , is the product of independent Bernoulli distributions, i.e., . The Boolean product distribution can be expressed in the form of an exponential family as follows:
where is the natural parameter vector with for each .
In the following, we always let (or or , when we want to emphasize the parameter vector or the natural parameter vector ) denote a Boolean product distribution. We denote (or simply , when is clear from the context) the vector of natural parameters of . We let and (or simply , when or are clear from the context) denote the probability of under . Given a subset of the hypercube, the probability mass assigned to by a distribution , usually denoted (or simply , when is clear from the context), .
Truncated Boolean Product Distribution.
Given a Boolean product distribution , we define the truncated Boolean product distribution , for any fixed . has , for all , and , otherwise. We often refer to as the truncation of (by ) and to as the truncation set.
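The definition above can be made concrete by brute-force enumeration in small dimension. The sketch below is illustrative only and assumes the usual reading of truncation: the truncated distribution equals the original probability divided by the mass of the truncation set on points inside the set, and zero elsewhere. The membership oracle `in_S`, the parameters and the function names are hypothetical.

```python
import itertools
import math


def product_pmf(p, x):
    """Probability of the Boolean vector x under independent Bernoulli(p_i) coordinates."""
    return math.prod(pi if xi == 1 else 1.0 - pi for pi, xi in zip(p, x))


def truncate(p, in_S):
    """Return the truncated pmf over the whole hypercube and the mass of the set."""
    cube = list(itertools.product((0, 1), repeat=len(p)))
    mass_S = sum(product_pmf(p, x) for x in cube if in_S(x))
    if mass_S == 0.0:
        raise ValueError("the truncation set has zero probability mass")
    pmf = {x: (product_pmf(p, x) / mass_S if in_S(x) else 0.0) for x in cube}
    return pmf, mass_S


# Hypothetical example: keep only vectors with at most one coordinate equal to 1.
p = [0.5, 0.25, 0.75]
in_S = lambda x: sum(x) <= 1
pmf_S, mass_S = truncate(p, in_S)
```

Enumeration takes time exponential in the dimension, so this is only a way to check small examples, not a learning procedure.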
It is sometimes convenient (especially when we discuss Assumptions 1 and 3 in Section 4) to refer to some fixed element of . We observe that by swapping with (and with ) in certain directions, we can normalize so that and . In the following, we always assume, without loss of generality, that is normalized so that and .
Notions of Distance between Distributions.
Let be two probability measures in the discrete probability space .
The total variation distance between and , denoted , is defined as .
The Kullback–Leibler divergence (or simply, KL divergence), denoted , is defined as .
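For reference, the two distances can be computed directly for distributions on a small finite set. The sketch below uses the standard definitions (half the sum of absolute differences for the total variation distance, and the expected log-ratio for the KL divergence, here with the natural logarithm); representing distributions as dictionaries and the function names are our choices.

```python
import math


def total_variation(P, Q):
    """Total variation distance: half the L1 distance between the two pmfs."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)


def kl_divergence(P, Q):
    """KL divergence of P from Q; infinite if P puts mass where Q has none."""
    total = 0.0
    for x, px in P.items():
        if px == 0.0:
            continue
        qx = Q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.75}
tv = total_variation(P, Q)
kl = kl_divergence(P, Q)
```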
The following proposition summarizes some standard upper bounds on the total variation distance and the KL divergence of two Boolean product distributions. The proof of Proposition 1 can be found in Appendix A.
Let and be two Boolean product distributions, and let and be the vectors of their natural parameters. Then:
Identifiability and Learnability.
A Boolean product distribution is identifiable from its truncation , if given , for all , we can recover the parameter vector .
A Boolean product distribution is efficiently learnable from its truncation , if for any , we can compute an estimation of the parameter vector (or an estimation of the natural parameter vector ) of such that (or ), with probability at least , with time and sample complexity polynomial in , and using truncated samples from . By Proposition 1, an upper bound on the distance between and (or between and ) translates into an upper bound on the total variation distance between the true distribution and (or ). In this work, we identify sufficient and necessary conditions for efficient learnability of Boolean product distributions from truncated samples.
3 Boolean Product Distributions Truncated by Fat Sets
In this section, we discuss fatness of the truncation set, a strong sufficient (and in a certain sense, necessary) condition, under which we can generate samples from a Boolean product distribution using samples from its truncation (and access to through a membership oracle).
A truncated Boolean product distribution is -fat in coordinate , for some , if . A truncated Boolean product distribution is -fat, for some , if is -fat in every coordinate .
If is fat, it happens often that a sample has both . Then, conditional on the remaining coordinates , the -th coordinate of is distributed as . We next focus on truncated Boolean product distributions that are -fat.
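To make the notion concrete, the following sketch computes fatness by enumeration in small dimension. It assumes the reading suggested by the informal description in the introduction: the fatness in a coordinate is the probability, under the truncated distribution, that a sample remains in the truncation set after that coordinate is flipped. The oracle `in_S` and the function names are hypothetical.

```python
import itertools
import math


def flip(x, i):
    """Return x with its i-th coordinate flipped."""
    return x[:i] + (1 - x[i],) + x[i + 1:]


def fatness(p, in_S):
    """Per-coordinate fatness of the truncation of a product distribution by in_S."""
    d = len(p)
    members = [x for x in itertools.product((0, 1), repeat=d) if in_S(x)]
    weight = {
        x: math.prod(pi if xi == 1 else 1.0 - pi for pi, xi in zip(p, x))
        for x in members
    }
    total = sum(weight.values())
    return [
        sum(w for x, w in weight.items() if in_S(flip(x, i))) / total
        for i in range(d)
    ]


# A set closed under switching 1s to 0s: vectors with at most one coordinate equal to 1.
fat_example = fatness([0.5, 0.5], lambda x: sum(x) <= 1)

# Even-parity vectors: flipping any single coordinate always leaves the set.
thin_example = fatness([0.3, 0.6, 0.8], lambda x: sum(x) % 2 == 0)
```

The second example has zero fatness in every coordinate, so a sampler that relies on flipped neighbours would never make progress on it.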
There are several natural classes of truncation subsets that give rise to fat truncated product distributions. E.g., for each , the halfspace results in an -fat truncated distribution, if , for all . The same holds if is any downward closed (see footnote 3) subset of and , for all .
Footnote 3: A set is downward closed if for any and any with , in all directions , .
Fatness in coordinate is necessary if we want to distinguish between two truncated Boolean distributions based on their -th parameter only, when the remaining coordinates are correlated. Specifically, we can show that if is -fat in some coordinate , there exists a Boolean distribution with (and large enough) whose truncation by appears identical to . Therefore, if the other coordinates are arbitrarily correlated, it is impossible to distinguish between the two distributions based on their -th parameter alone. However, as we discuss in Section 4, if is rich enough, but not necessarily fat, we can recover the entire parameter vector of (see footnote 4).
Footnote 4: For a concrete example, where we can recover the entire parameter vector of a truncated Boolean product distribution , we consider , which is not fat in any coordinate, and let , for each . Then, setting , for each , we can recover , by solving the following linear system: , , . This is a special case of the more general identifiability condition discussed in Lemma 12.
Let , let be any subset of with , for all , and consider any . Then, for any Boolean distribution with , there exists a distribution such that .
We recall that denotes the projection of on . By hypothesis, and for each , either or , but never both. For each , we let:
For each , we let , so that is a probability distribution on . E.g., if for all , , we let
By definition, is a probability distribution on . Moreover, for all , , which implies the lemma. ∎
3.1 Sampling from a Boolean Product Distribution using Samples from its Fat Truncation
An interesting consequence of fatness is that we can efficiently generate samples from a Boolean product distribution using samples from any -fat truncation of The idea is described in Algorithm 1. Theorem 4 shows that for any sample drawn from and any such that , conditional on , is distributed as . So, we can generate a random sample by putting together such values. -fatness of the truncated distribution implies that the expected number of samples required to generate a is .
Let be the distribution of the samples generated by Algorithm 1. To prove that and are identical, we show that is a product distribution and that each , where is the parameter of in direction
We fix a direction . Let denote the projection of on . In Algorithm 1, takes the value of the -coordinate of a sample such that both . For each such sample , we have that:
Therefore, , which implies that . Since this holds for all such that both , is independent of the remaining coordinates and is distributed as . This concludes the proof of (i).
As for the sample complexity of Algorithm 1, we observe that since is -fat in each coordinate , each new sample covers any fixed coordinate (i.e., causes to become ) of with probability at least . Therefore, the probability that any fixed coordinate remains after Algorithm 1 draws samples from is at most . Setting and applying the union bound, we get that the probability that there is a coordinate of with value after samples from is at most . Therefore, the expected number of samples from before a random sample is returned by Algorithm 1 is at most
where the inequality follows from for ∎
3.2 Parameter Estimation and Learning in Total Variation Distance
Based on Algorithm 1, we can recover the parameters of any Boolean product distribution using samples from any fat truncation of .
Let be a Boolean product distribution and let be a truncation of . If is -fat in any fixed coordinate , then, for any , we can compute an estimation of the parameter of such that , with probability at least , using an expected number of samples from .
We modify Algorithm 1 to Algorithm 2, so that it generates random samples in coordinate only. As in Theorem 4.i, each of Algorithm 2 is an independent sample from . Since the truncated distribution is -fat, the expected number of samples from before is generated, is . We estimate from samples of Algorithm 2 using the empirical mean . A standard application of the Hoeffding bound (see footnote 5) shows that if , then , with probability at least . Hence, estimating with accuracy requires an expected number of samples from . ∎
Footnote 5: We use the following Hoeffding bound: Let be independent Bernoulli random variables, let and . Then, for any , .
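The single-coordinate procedure can be sketched as follows. This is not a transcription of Algorithm 2, whose listing is not shown here. It keeps drawing truncated samples until one is found whose neighbour with the chosen coordinate flipped is also in the truncation set, and returns that coordinate. Conditional on the remaining coordinates, such a coordinate follows the original Bernoulli marginal. The sample size uses the standard two-sided Hoeffding bound, n ≥ ln(2/δ) / (2ε²). The truncated sampler is simulated by rejection from the true distribution purely for the demonstration, and all names and constants are hypothetical.

```python
import math
import random


def sample_truncated(p, in_S, rng):
    """Simulate one truncated sample by rejection (demo only; a learner has no access to p)."""
    while True:
        x = tuple(1 if rng.random() < pi else 0 for pi in p)
        if in_S(x):
            return x


def flip(x, i):
    """Return x with its i-th coordinate flipped."""
    return x[:i] + (1 - x[i],) + x[i + 1:]


def sample_coordinate(i, draw_truncated, in_S):
    """Draw truncated samples until flipping coordinate i stays in S; return x_i."""
    while True:
        x = draw_truncated()
        if in_S(flip(x, i)):
            return x[i]


def hoeffding_sample_size(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_parameter(i, draw_truncated, in_S, eps, delta):
    """Empirical mean of independent single-coordinate samples."""
    n = hoeffding_sample_size(eps, delta)
    return sum(sample_coordinate(i, draw_truncated, in_S) for _ in range(n)) / n


rng = random.Random(0)
p_true = [0.3, 0.6, 0.8]
in_S = lambda x: sum(x) <= 2          # hypothetical truncation: drop the all-ones vector
draw = lambda: sample_truncated(p_true, in_S, rng)
p_hat = [estimate_parameter(i, draw, in_S, eps=0.05, delta=1e-6) for i in range(3)]
```

For comparison, the naive empirical mean of the first coordinate over truncated samples in this example converges to (0.3 − 0.144) / 0.856 ≈ 0.18 rather than 0.3, which is the bias the flipped-neighbour test removes.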
Let be a Boolean product distribution and be any -fat truncation of . Then, for any , we can compute an estimation such that , with probability at least , using an expected number of samples from .
3.3 Identity and Closeness Testing with Access to Truncated Samples
Theorem 4 implies that if we have sample access to an -fat truncation of a Boolean product distribution , we can pretend that we have sample access to the original distribution , at the expense of an increase in the sample complexity (from ) by a factor of . Therefore, we can extend virtually all known hypothesis testing and learning algorithms for Boolean product distributions to fat truncated Boolean product distributions.
For identity testing of Boolean product distributions, based on samples from fat truncated ones, we combine Algorithm 1 with the algorithm of (CDKS17, Sec. 4.1). Combining Theorem 4 with (CDKS17, Theorem 6), we obtain the following:
Corollary 7 (Identity Testing).
Let be a Boolean product distribution described by its parameters , and let be a Boolean product distribution for which we have sample access to its -fat truncation . For any , we can distinguish between and , with probability , using an expected number of samples from .
We can extend Corollary 7 to closeness testing of two Boolean product distributions, for which we only have sample access to their fat truncations. We combine Algorithm 1 with the algorithm of (CDKS17, Sec. 5.1). The following is an immediate consequence of Theorem 4 and (CDKS17, Theorem 9).
Corollary 8 (Closeness Testing).
Let , be two Boolean product distributions for which we have sample access to their -fat truncation and -fat truncation . For any , we can distinguish between and , with probability at least , using an expected number of samples from and .
3.4 Learning in Total Variation Distance
Using Algorithm 1, we can learn a Boolean product distribution , within in total variation distance, using samples from its fat truncation. The following uses a standard analysis of the sample complexity of learning a Boolean product distribution (see e.g., kamath2018privately).
Let be a Boolean product distribution and let be any -fat truncation of . Then, for any , we can compute a Boolean product distribution such that , with probability at least , using samples from .
We assume that and that for all , . Both are without loss of generality. The former can be enforced by flipping and . For the latter, we observe that there exists a distribution with that satisfies the assumption ( can be obtained from by adding uniform noise in each coordinate with probability , see also (CDKS17, Sec. 4.1)).
By Proposition 17, for any two Boolean product distributions and ,
Similarly to the proof of Corollary 9, we take samples from Algorithm 1 and estimate each parameter of as . Using the Chernoff bound in (kamath2018privately, Claim 5.16), we show that for all directions , . Drawing samples from Algorithm 1 and using (3), we get that . The sample complexity follows from the fact that each sample of Algorithm 1 requires an expected number of samples from the -fat truncation of . ∎
We can improve the sample complexity in Corollary 9, if the original distribution is sparse. We say that a Boolean product distribution is -sparse, for some and , if there is an index set , with , such that for all , . Namely, we know that of ’s parameters are equal to (but we do not know which of them). Then, we first apply Corollary 6 and estimate all parameters of within distance . We set each with to . Thus, we recover the index set . For the remaining parameters, we apply Corollary 9. The result is summarized by the following:
Let be a -sparse Boolean product distribution and let be any -fat truncation of . Then, for any , we can compute a Boolean product distribution such that , with probability at least , using samples from the truncated distribution .
3.5 Learning Ranking Distributions from Truncated Samples
An interesting application of Theorem 4 is parameter estimation of ranking distributions from truncated samples. For clarity, we next focus on Mallows distributions. Our techniques imply similar results for other well-known models of ranking distributions, such as Generalized Mallows distributions (FlignerV1986) and the models of Plackett; Luce, BradleyTerry and Babington.
Definition and Notation.
We start with some notation specific to this section. Let be the symmetric group over the finite set of items . Given a ranking , we let denote the position of item in . We say that precedes in , denoted by , if . The Kendall tau distance of two rankings and , denoted by , is the number of discordant item pairs in and . Formally,
The Mallows model (mallows1957non) is a family of ranking distributions parameterized by the central ranking and the spread parameter . Assuming the Kendall tau distance between rankings, the probability mass function is , where the normalization factor is . For a given Mallows distribution , we let denote the probability that item precedes item in a random sample from .
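The definitions above can be checked numerically on a handful of items. The sketch below computes the Kendall tau distance by counting discordant pairs and builds a Mallows distribution by enumeration. It assumes one common parameterization, in which the probability of a ranking is proportional to exp(−β · distance to the central ranking); the document's exact parameterization of the spread parameter may differ. It also compares the brute-force normalizer with the standard closed-form product for that parameterization. Function names are hypothetical.

```python
import itertools
import math


def kendall_tau(pi, sigma):
    """Number of item pairs ordered differently by the rankings pi and sigma."""
    pos_pi = {item: rank for rank, item in enumerate(pi)}
    pos_sigma = {item: rank for rank, item in enumerate(sigma)}
    return sum(
        1
        for a, b in itertools.combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


def mallows_pmf(center, beta):
    """Mallows pmf by enumeration, assuming weight exp(-beta * distance to center)."""
    weights = {
        perm: math.exp(-beta * kendall_tau(perm, center))
        for perm in itertools.permutations(center)
    }
    normalizer = sum(weights.values())
    return {perm: w / normalizer for perm, w in weights.items()}, normalizer


def closed_form_normalizer(m, beta):
    """Product formula for the normalizer over rankings of m items."""
    return math.prod(
        (1.0 - math.exp(-i * beta)) / (1.0 - math.exp(-beta)) for i in range(1, m + 1)
    )


def precedence_probability(pmf, a, b):
    """Probability that item a is ranked before item b."""
    return sum(q for perm, q in pmf.items() if perm.index(a) < perm.index(b))


center = (1, 2, 3, 4)
pmf, normalizer = mallows_pmf(center, beta=0.7)
p_12 = precedence_probability(pmf, 1, 2)
```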
Truncated Mallows Distributions.
We consider parameter estimation for a Mallows distribution with sample access to its truncation by a subset . Then, , for each , and , otherwise. Next, we generalize the notion of fatness to truncated ranking distributions and prove the equivalent of Theorem 5 and Corollary 6.
For a ranking , we let denote the ranking obtained from with the items and swapped. Formally, , for all items , and . We say that a truncated Mallows distribution is -fat for pair , if , for some . A truncated Mallows distribution is -fat, if is -fat for all pairs , and neighboring -fat, if is -fat for all pairs that occupy neighboring positions in the central ranking , i.e., for all pairs with .
Parameter Estimation and Learning of Mallows Distributions from Truncated Samples.
In Appendix B.1, we present Algorithm 4 that draws a sample from the truncated Mallows distribution and updates a vector with estimations of the probability that item precedes item in a sample from the true Mallows distribution . Thus, we can show (Appendix B.1) the following:
Let be a Mallows distribution with and , for some constant , and let be any neighboring -fat truncation of . Then,
For any , we can learn the central ranking , with probability at least , using an expected number of samples from .
Assuming that the central ranking is known, for any , we can compute an estimation of the spread parameter such that , with probability at least , using an expected number of samples from .
For any , we can compute a Mallows distribution so that , with probability at least , using an expected number of samples from .
4 Efficient Learnability from Truncated Samples: Necessary Conditions
We next discuss necessary conditions for identifiability and efficient learnability of a Boolean product distribution from truncated samples. For Assumption 1 and Lemma 12, we recall that we can assume without loss of generality that is normalized so that .
For the truncated Boolean product distribution , (after possible normalization) and there are linearly independent with , .
A Boolean product distribution on is identifiable from its truncation if and only if Assumption 1 holds.