This paper presents theory for Normalized Random Measures (NRMs), Normalized Generalized Gammas (NGGs), a particular kind of NRM, and Dependent Hierarchical NRMs which allow networks of dependent NRMs to be analysed. These have been used, for instance, for time-dependent topic modelling [CDB12].
Dependency models have recently become increasingly popular in machine learning because real data are typically correlated rather than independent. The pioneering work of MacEachern [Mac99, Mac00] treats the jumps and atoms as stochastic across dependent models. While there are many ways of constructing dependent nonparametric models, e.g., from a stick-breaking construction [GS09] or from a hierarchical construction [TJBB06], in this paper, following the idea of [LGF10], we construct dependent normalized random measures from the underlying Poisson processes of the corresponding completely random measures [Kin67]. This construction is intuitive and allows flexible control of the dependencies. A related construction in the statistical literature, by Lijoi et al. [A. 12], deals with modeling two groups of data.
In this paper, we first introduce in Section 2 some mathematical background on completely random measures (CRMs) and their construction from Poisson processes, and then introduce NRMs and NGGs. Slice sampling for posterior inference in NRMs is also introduced, using techniques from [GW11]. The dependency operations on Poisson processes, and on the corresponding CRMs and NRMs, are then introduced in Section 3 following the work of [Kin93, LGF10]. Posterior inference for the NGG is then developed in Section 4 based on the results of [JLP09]. We then give the dependency and composition results obtained when applying these operations to NRMs in Section 5. Proofs are given in the Appendix, Section A.
In this section we briefly introduce the background on Poisson processes, the corresponding completely random measures, dependency operations on these random measures, and normalized random measures.
Section 2.1 explains how to construct completely random measures from Poisson processes. Section 3.1 introduces operations on Poisson processes to construct dependent Poisson processes. Section 3.2 adapts these operations to the corresponding completely random measures (CRMs). Constructing normalized random measures (NRMs) from CRMs is discussed in Section 2.2 along with details of the Normalized Generalized Gamma (NGG), a particular kind of NRM for which the details have been worked out. A slice sampler for sampling an NRM is described in Section 2.3.
We first give an illustration of the basic construction of an NRM for a target domain. The Poisson process is used to create a countable (and usually infinite) set of points in a product space of the jump sizes with the target domain, as shown in the left of Figure 2. The distribution is then a discrete one on these points. It can be pictured by dropping lines from each point down to the target domain, and then normalizing all these lines so their sum is one. The resulting picture shows the set of weighted impulses that make up the constructed NRM on the target domain.
2.1 Constructing Completely Random Measures from Poisson processes
The general class of completely random measures (CRMs) [Kin67] admits a unique decomposition as the sum of three parts: a deterministic measure, a purely atomic measure with fixed atom locations, and a measure (which can be continuous or discrete) with random jumps and atoms. In this paper we restrict attention to the subclass of pure jump processes [FK72], which have the following form
where are called the jumps of the process, and
are a sequence of independent random variables drawn from a base measurable space (the space's $\sigma$-algebra is left implicit where clear; we sometimes use the space alone to denote the measurable space).
It can be shown that these kinds of CRMs can be constructed from Poisson processes with specific mean measures. We start with some definitions.
A random variable $X$ taking values in $\{0,1,2,\dots\}\cup\{\infty\}$
is said to have the Poisson distribution with mean $\lambda\in[0,\infty]$ if
$$P(X=n)=e^{-\lambda}\frac{\lambda^{n}}{n!},\qquad n=0,1,2,\dots$$
If $\lambda=\infty$, then $X=\infty$ almost surely, and in all cases $E[X]=\lambda$.
Let $S$ be a measure space with $\sigma$-algebra $\mathcal{S}$, and let $\nu$ be a measure on it. A Poisson process on $S$ is defined to be a random subset $\Pi\subseteq S$ such that, if $N(A)$ denotes the number of points of $\Pi$ in the measurable subset $A\subseteq S$, then
$N(A)$ is a random variable having the Poisson distribution with mean $\nu(A)$, and
whenever $A_1,\dots,A_n$ are in $\mathcal{S}$ and disjoint, the random variables $N(A_1),\dots,N(A_n)$ are independent.
The integer-valued random measure $N(\cdot)$ is called a Poisson random measure, and the Poisson process is denoted as $\Pi\sim\mathrm{PoissonP}(\nu)$, where $\nu$ is called the mean measure of the Poisson process.
In this paper, we define a random measure on the space to be a linear functional of the Poisson random measure, whose mean measure is defined on a product space:
The mean measure is called the Lévy measure of .
The general treatment of constructing random measures from Poisson random measures can be found in [Jam05]. Note that the random measure in construction (3) has the same form as Equation (1) because is composed of a countable number of points. It can be proven to be a completely random measure [Kin67] on , meaning that for arbitrary disjoint subsets of the measurable space, the random variables are independent.
For the completely random measure defined above to always be finite, it is necessary that the total jump mass be finite, and therefore for every positive threshold the mean measure above that threshold is finite [Kin93]. It follows that there will always be a finite number of points with jumps above any positive threshold. Therefore, in the corresponding bounded product space the mean measure is finite. So it is meaningful to sample those points by first drawing the count of points from a Poisson distribution with that (finite) mean, and then sampling the points according to the correspondingly normalized mean measure.
Without loss of generality, the Lévy measure of Equation (3) can be represented in factored form, where one factor is a measure on the jumps, with hyper-parameters if any, and another factor
is a probability measure on the base space, so it integrates to one; the remaining scalar factor is called the mass of the Lévy measure. Note that the total measure of the jump factor is not standardized in any way, so in principle some mass could also appear in it. The mass is used as a concentration parameter for the random measure.
A realization of the random measure can be constructed by sampling from the underlying Poisson process in a number of ways: either in rounds for decreasing bounds using the logic just given, or by explicitly sampling the jumps in decreasing order. The latter goes as follows [FK72]:
Lemma 1 (Sampling a CRM)
Sample a CRM with Lévy measure as follows.
Draw i.i.d. samples from the base measure .
Draw the corresponding weights for these i.i.d. samples in decreasing order, as follows:
Draw the largest jump
from the cumulative distribution function.
Draw the second largest jump from the cumulative distribution function .
The random measure can now be realized as .
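As a concrete illustration of sampling a CRM realization, the following sketch uses the bounded-jumps route described earlier (sampling all jumps above a threshold ε) rather than the decreasing-order route of Lemma 1. It assumes a Gamma CRM with Lévy density ρ(t) = M t⁻¹ e⁻ᵗ and a Uniform[0, 1] stand-in for the base measure; both choices are illustrative, not taken from the text.

```python
import numpy as np
from scipy.special import exp1   # E1(x) = integral_x^inf t^(-1) e^(-t) dt

def sample_gamma_crm(M=2.0, eps=1e-4, seed=0):
    """Sample all jumps above a threshold eps of a (hypothetical) Gamma CRM
    with Levy density rho(t) = M * t^(-1) * e^(-t); atom locations are drawn
    from Uniform[0, 1] as a stand-in base measure."""
    rng = np.random.default_rng(seed)
    e1_eps = exp1(eps)
    n = rng.poisson(M * e1_eps)        # number of jumps above eps is Poisson
    jumps = np.empty(n)
    for i in range(n):
        u = 1.0 - rng.random()         # uniform on (0, 1]
        hi = eps                       # invert survival S(t) = E1(t) / E1(eps)
        while exp1(hi) / e1_eps > u:   # expand bracket until S(hi) <= u
            hi = 2.0 * hi + 1.0
        lo = eps
        for _ in range(60):            # bisection keeps S(lo) >= u >= S(hi)
            mid = 0.5 * (lo + hi)
            if exp1(mid) / e1_eps > u:
                lo = mid
            else:
                hi = mid
        jumps[i] = 0.5 * (lo + hi)
    atoms = rng.random(n)              # i.i.d. draws from the base measure
    return jumps, atoms

jumps, atoms = sample_gamma_crm()
```

Normalizing the jump vector by its sum then gives a finite approximation to the corresponding NRM, which is how the truncation is typically used in practice.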
As a random variable is uniquely determined by its Laplace transform, the random measure is uniquely characterized by its Laplace functional through the Lévy–Khintchine representation of a Lévy process [Ç10]. That is, for any measurable function , we have
Now instead of dealing with itself, we deal with , which is called the Lévy measure of , whose role in generating the measure via a Poisson process was explained above.
In the case where the measure on the jumps does not depend on the data, the Lévy measure is called homogeneous; this is the case considered in this paper. In the homogeneous case, (4) simplifies to
Note the term inside the exponential plays an important role in subsequent theory, so it is given a name.
The Laplace exponent, denoted as for a CRM with parameters is given by
Note that, to guarantee the positivity of the jumps in the random measure, the Lévy measure should satisfy the usual integrability condition [Ç10], which leads to the following equations:
That is, the Laplace exponent being finite for finite positive arguments implies (or is a consequence of) the corresponding integral being finite.
There are thus four different ways to define or interpret a CRM: as a pure jump process as in Equation (1), via the Poisson process construction of Equation (3), via its Laplace functional in Equation (4), or via its Lévy measure and the corresponding Laplace exponent.
2.2 Normalized random measures
Based on (3), a normalized random measure on the base space is defined as follows. (In this paper, we use one symbol to denote a normalized random measure, and a distinct symbol to denote its unnormalized counterpart.)
The original idea of constructing random probabilities by normalizing completely random measures on the real line, namely increasing additive processes, can be found in [ELP03], where the result is termed a normalized random measure with independent increments (NRMI) and the existence of such random measures is proved. This idea is easily generalized from the real line to an arbitrary parameter space, e.g., the space of Dirichlet distributions in topic modeling. Also note that normalizing a random measure can be viewed as applying a transformation to a completely random measure, in this case the transformation that divides the measure by its total mass. A concise survey of other kinds of transformations can be found in [LP10].
Taking different Lévy measures in (4), we can obtain different NRMs. Our notation for the normalized random measure includes the total mass, which usually needs to be sampled in the model, the base probability measure, and the set of other hyper-parameters for the measure on the jumps, depending on the specific NRM. In this paper, we are interested in a class of NRMs called normalized generalized Gamma processes:
For ease of representation and sampling, we convert the NGG into a different form using the following lemma.
Let a normalised random measure be defined using Lévy density . Then scaling by yields an equivalent NRM up to a factor. That is, the normalised measure obtained using is equivalent to the normalised measure obtained using .
By this lemma, without loss of generality, we can instead represent the NGG by eliminating the parameter above.
The NGG with shape parameter , total mass (or concentration) parameter and base distribution , denoted , has Lévy density where
Note that, similar to the two-parameter Poisson–Dirichlet process [PY97], the normalized generalized Gamma process with a positive shape parameter can also produce power-law phenomena, making it different from the Dirichlet process and more suitable for modeling real data.
Proposition 1 ([Lmp07])
Let be the number of components induced by the NGG with parameter and mass or the Dirichlet process with total mass . Then for the NGG, almost surely, where is a strictly positive random variable parameterized by and . For the DP, .
Figure 2 demonstrates the power-law phenomena of the NGG compared to the Dirichlet process (DP). We sample it using the generalized Blackwell–MacQueen sampling scheme [JLP09]. Each data point to be sampled either chooses an existing cluster or creates a new cluster, eventually yielding some number of clusters over the data points sampled.
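Simulating the NGG side of Proposition 1 requires the predictive weights developed in Section 4, but the DP half (logarithmic growth of the number of clusters) can be checked with a short simulation of the DP's Chinese restaurant predictive rule. The parameter values below are arbitrary, chosen only for illustration.

```python
import numpy as np

def crp_num_clusters(n, M, rng):
    """Number of clusters after seating n customers in a Chinese restaurant
    process with concentration M (the predictive rule of the DP)."""
    counts = []                          # counts[k] = size of cluster k
    for i in range(n):
        p = np.array(counts + [M])       # join cluster k w.p. counts[k]/(i+M),
        k = rng.choice(len(p), p=p / p.sum())  # a new cluster w.p. M/(i+M)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(1)
M, n, reps = 5.0, 2000, 30
K_mean = float(np.mean([crp_num_clusters(n, M, rng) for _ in range(reps)]))
# Exact expectation for the DP: E[K_n] = M * sum_{i=0}^{n-1} 1/(M + i),
# which grows like M * log(n), matching the DP case of Proposition 1.
expected = float(M * np.sum(1.0 / (M + np.arange(n))))
```

For the NGG one would instead expect the cluster count to grow polynomially in the number of data points, which is the power-law behavior shown in the figure.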
Many familiar stochastic processes are special/limiting cases of normalized generalized Gamma processes, e.g., Dirichlet processes arise when . Normalized inverse-Gaussian processes (N-IG) arise when and . If , we get the -stable process, and if and depends on , we get the extended Gamma process.
For the NGG, the key formulas used subsequently are as follows:
where is the regularized upper incomplete Gamma function. Some mathematical libraries provide it for a negative first argument, or it can be evaluated using
using an upper incomplete Gamma function defined only for positive arguments.
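The evaluation just described can be sketched in code. The specific recursion used below, Γ(a, x) = (Γ(a+1, x) − xᵃe⁻ˣ)/a, is a standard identity assumed here (the formula in the text is elided); it extends the upper incomplete Gamma function to negative first arguments using only SciPy's positive-argument routine.

```python
import math
from scipy.special import gammaincc, exp1
from scipy.integrate import quad

def upper_gamma(a, x):
    """Upper incomplete Gamma, Gamma(a, x) = integral_x^inf t^(a-1) e^(-t) dt,
    extended to a <= 0 (for x > 0) via the recursion
    Gamma(a, x) = (Gamma(a + 1, x) - x**a * exp(-x)) / a."""
    if a > 0:
        return gammaincc(a, x) * math.gamma(a)  # positive-argument case
    if a == 0:
        return exp1(x)                          # Gamma(0, x) = E1(x)
    return (upper_gamma(a + 1.0, x) - x**a * math.exp(-x)) / a

# Check a negative first argument against direct numerical integration.
val = upper_gamma(-0.5, 1.0)
ref, _ = quad(lambda t: t**(-1.5) * math.exp(-t), 1.0, math.inf)
```

The recursion depth is only the integer part of |a|, so this is cheap for the moderate parameter values that arise in practice.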
Finally, because probabilities for an NRM necessarily have the total mass as divisor, and thus likelihoods of the NRM involve powers of it, a trick is widely used to eliminate these terms.
Consider the case where data points are observed. By introducing the auxiliary variable, called the latent relative mass, defined below, it follows that
Thus the power of the normaliser can be replaced by an exponential term in the jumps, which factorizes, at the expense of introducing the new latent variable. To the best of our knowledge, the idea of this latent variable originates from [Jam05] and is further explicitly studied in [JLP06, JLP09, GW11], among others.
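The identity behind this trick, reconstructed here from the surrounding description as Z⁻ᴺ = (1/Γ(N)) ∫₀^∞ uᴺ⁻¹ e⁻ᵘᶻ du with Z the normalizer and u the latent relative mass, can be checked numerically:

```python
import math
from scipy.integrate import quad

# Check of the identity behind the latent relative mass u:
#   Z^(-N) = (1 / Gamma(N)) * integral_0^inf u^(N-1) * exp(-u * Z) du,
# which trades the N-th power of the normalizer Z for an exponential in u.
Z, N = 2.5, 4
lhs = Z ** (-N)
rhs, _ = quad(lambda u: u ** (N - 1) * math.exp(-u * Z) / math.gamma(N),
              0.0, math.inf)
```

The exponential e⁻ᵘᶻ factorizes over the jumps making up Z, which is exactly what makes the slice sampler of the next section tractable.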
2.3 Slice sampling normalized random measure mixtures
Slice sampling an NRM has been discussed in several papers; here we follow the method of [GW11] and briefly introduce the ideas behind it. It deals with normalized random measure mixtures of the type
where the jumps are those of the corresponding CRM defined in (3), the components of the mixture model are drawn i.i.d. from a parameter space, an indicator variable denotes the component each observation belongs to, and a density function generates the data on each component. Given the observations, we introduce a slice latent variable for each one, so that we only consider those components whose jump sizes are larger than the corresponding slice values. Furthermore, the auxiliary variable (latent relative mass) is introduced to decouple the individual jumps from the infinite sum of jumps appearing in the denominators. For clarity, we list the notation and its description in Table 1. Based on [GW11], we have the following posterior lemma.
The posterior of the infinite mixture model (2.3) with the above auxiliary variables is proportional to
where the first term is an indicator function returning 1 if its argument is true and 0 otherwise, the middle terms are the densities of the corresponding variables, and the last is the distribution of the jumps larger than the threshold, derived from the underlying Poisson process (in fact, the number of such jumps is Poisson distributed, so they follow a compound Poisson process, and each jump has the correspondingly normalized density).
The expressions for the NGG needed to work with this lemma were given in the remark at the end of Section 2.2. Thus the integral term in Equation (3) can be turned into an expression involving incomplete Gamma functions.
|#components with jump sizes larger than a threshold|
|Components in the mixture model|
|Total mass of the random measure|
|Jump sizes of the random measure with all|
|Sum of the remaining jump sizes,|
|#data attached to each component|
|total number of data points|
|Variables indicating which component belongs to|
|Slice variable, uniformly distributed in an interval, for each data point|
|An auxiliary variable introduced to make the sampling feasible|
|Density function to generate data on component|
|Lévy measure of the random measure with decomposition considered in this paper|
First, we denote the parameter set; the sampling then proceeds as follows.
Sampling : From (3) we get
Sampling : Similarly
which can be sampled using rejection sampling from a proposal distribution , here
means a Gamma distribution with the indicated shape parameter and scale parameter.
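The exact conditional for the latent relative mass is model-specific and its form is elided above, so the following is only a generic sketch of rejection sampling from a Gamma proposal. The target density x·e⁻²ˣ and the bound used here are made up for illustration; they are chosen so the proposal matches the target shape exactly.

```python
import math
import numpy as np

def rejection_sample(log_target, shape, scale, log_M, rng):
    """Draw one sample from an unnormalized target density on (0, inf) by
    rejection sampling with a Gamma(shape, scale) proposal. Requires
    log_target(x) <= log_M + log_proposal(x) for all x > 0."""
    while True:
        x = rng.gamma(shape, scale)
        log_q = ((shape - 1.0) * math.log(x) - x / scale
                 - math.lgamma(shape) - shape * math.log(scale))
        if math.log(1.0 - rng.random()) < log_target(x) - log_M - log_q:
            return x

# Made-up target: x * exp(-2x). A Gamma(2, 1/2) proposal matches its shape
# exactly, so with log_M set to the target's log-normalizer every draw accepts.
rng = np.random.default_rng(0)
log_target = lambda x: math.log(x) - 2.0 * x
log_M = math.lgamma(2.0) + 2.0 * math.log(0.5)
xs = [rejection_sample(log_target, 2.0, 0.5, log_M, rng) for _ in range(5000)]
mean_x = sum(xs) / len(xs)   # Gamma(2, 1/2) has mean 1.0
```

In the actual sampler the bound log_M and the proposal parameters would be chosen from the specific conditional density of the latent relative mass.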
Sampling : The posterior of with prior density is
Sampling : Sampling for can be done separately for those associated with data points (fixed points) and for those that are not. Based on [JLP09], when integrating out in (3), the posterior of the jump with data attached () is proportional to
While for those without data attached (), based on [GW11], conditional on , the number of these jumps follows a Poisson distribution with mean
while their lengths have densities proportional to
Sampling : are uniformly distributed in the interval for each . After sampling the , is set to .
Sampling : The posterior of with prior is
is usually taken to be Gamma distributed, so the posterior of can be sampled conveniently.
This section introduces the dependency operations used. These are developed for Poisson processes, CRMs and NRMs.
3.1 Operations on Poisson processes
Given a set of Poisson processes , the superposition of these Poisson processes is defined as the union of the points in these Poisson processes:
Lemma 4 (Superposition Theorem)
Let be independent Poisson processes on with , then the superposition of these Poisson processes is still a Poisson process with .
Subsampling of a Poisson process with sampling rate is defined to be selecting the points of the Poisson process via independent Bernoulli trials with acceptance rate .
Lemma 5 (Subsampling Theorem)
Let be a Poisson process on the space and be a measurable function. If we independently draw for each with , and let for , then are independent Poisson processes on with and .
Point transition of a Poisson process on a space is defined as moving each point of the Poisson process independently to a new location following a probabilistic transition kernel. (In the following we use one symbol to denote the point transition operation, and another to denote the corresponding transition kernel.) The kernel is defined to be a function such that, for each point, it gives a probability measure on the space describing the distribution of where the point moves, and, for each measurable set, the resulting function of the point is integrable. With a slight abuse of notation, we use the kernel applied to a point to denote a sample from it in this paper; thus it is a stochastic function.
Lemma 6 (Transition Theorem)
Let be a Poisson process on space , a probability transition kernel, then
where can be considered as a transformation of measures over defined as for .
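These three operations, and their effect on the mean measure as stated in the Superposition, Subsampling, and Transition Theorems, can be illustrated on homogeneous Poisson processes on [0, 1]. The rates, the thinning probability, and the (uniform) transition kernel below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_process(rate, rng):
    """Homogeneous Poisson process on [0, 1]: Poisson-many uniform points."""
    return rng.random(rng.poisson(rate))

# Superposition: pooling two independent processes adds their mean measures.
# Subsampling: independent Bernoulli(q) thinning scales the mean measure by q.
# Point transition: moving each point by a kernel preserves the point count
# (here the kernel simply resamples a location uniformly on [0, 1]).
counts = []
for _ in range(4000):
    p1, p2 = poisson_process(3.0, rng), poisson_process(2.0, rng)
    merged = np.concatenate([p1, p2])              # superposed rate 3 + 2 = 5
    kept = merged[rng.random(len(merged)) < 0.4]   # thinned rate 0.4 * 5 = 2
    moved = rng.random(len(kept))                  # transition keeps the count
    counts.append(len(moved))
mean_count = float(np.mean(counts))                # should be near 2.0
```

The empirical mean count tracks the composed mean measure, q(r₁ + r₂), which is the composition behavior studied for NRMs in Section 5.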
3.2 Operations on random measures
3.2.1 Operations on CRMs
Given independent CRMs on , the superposition () is defined as:
Given a CRM and a measurable acceptance function, if we independently draw a Bernoulli variable for each atom with the corresponding acceptance probability, the subsampling of the CRM is defined as
Given a CRM on , the point transition of , is to draw atoms from a transformed base measure to yield a new random measure as
3.2.2 Operations on NRMs
The operations on NRMs can be naturally generalized from those on CRMs:
Given independent NRMs on , the superposition () is:
where the weights are as given, and the unnormalized random measures are those corresponding to the NRMs being superposed.
Given an NRM and a measurable acceptance function, if we independently draw a Bernoulli variable for each atom with the corresponding acceptance probability, the subsampling of the NRM is defined as
Given a NRM on , the point transition of , is to draw atoms from a transformed base measure to yield a new NRM as
The definitions are constructed so the following simple lemma holds.
Superposition, subsampling or point transition of NRMs is equivalent to superposition, subsampling or point transition of their underlying CRMs.
Thus one does not need to distinguish between whether these operations are on CRMs or NRMs.
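For the superposition case, this equivalence can be checked directly on finite jump sets: superposing CRMs and then normalizing coincides with mixing the already-normalized NRMs with weights proportional to the CRMs' total masses. The Gamma-distributed jumps below are a made-up stand-in for CRM draws.

```python
import numpy as np

rng = np.random.default_rng(42)
# Finite stand-in jump sets for two unnormalized measures (CRMs).
w1 = rng.gamma(1.0, 1.0, size=20)
w2 = rng.gamma(1.0, 1.0, size=30)

# Superpose the CRMs first, then normalize ...
direct = np.concatenate([w1, w2]) / (w1.sum() + w2.sum())

# ... which equals mixing the two already-normalized measures (NRMs)
# with weights proportional to the total masses of the underlying CRMs.
lam = w1.sum() / (w1.sum() + w2.sum())
mixture = np.concatenate([lam * w1 / w1.sum(), (1.0 - lam) * w2 / w2.sum()])
```

The two weight vectors agree exactly, which is the finite-dimensional shadow of the lemma above.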
4 Posteriors for the NGG
This section develops posteriors for the single NGG: a standard version, and a version conditioned on the latent relative mass. The second version is developed because, as shown, the first version requires computing a complex recursive function.
4.1 Simple Posterior
James et al. [JLP09] develop posterior analysis as follows. This theorem simplifies their results and specialises them to the NGG.
Theorem 2 (Posterior Analysis for the NGG)
Consider the .
For a data vector of length $N$ there are $K$ distinct values with counts $n_1,\dots,n_K$ respectively. The posterior marginal is given by
Moreover, the predictive posterior is given by:
where the weights, which sum to 1, are derived as
Note that an alternative definition of is
and various scaled versions of this integral are presented in the literature. Introducing a prior on and then marginalising out makes the term in disappear since the integral over can be carried inside the integral over .
Let and suppose then it follows that
For computation, the issue here will be computing the terms . Therefore we present some results for this.
Lemma 8 (Evaluating the terms)
Assume the terms defined as in Theorem 2. Then the following formulas hold:
where the upper incomplete gamma function appears, defined for the stated argument ranges. Moreover, for Equation (25), the relevant argument cannot be an integer.
Another recursion is needed when for some . Then
It can be seen that there are two different situations. When the count exceeds one for some component, one can recurse down on the corresponding argument; otherwise, one recurses down on the other. Moreover, the term is a strictly decreasing function of some of its arguments and an increasing function of the others. For computation, Equation (25) can be used to compute the initial terms, though it may not be usable in all regimes and may be unstable; thereafter, the recursion of Equation (26) can be applied.
The major issue with this posterior theory is that one needs to precompute the terms. While the Poisson–Dirichlet process has a similar style, its corresponding quantity is a generalised Stirling number depending only on the discount parameter [BH12]. The difference is that for the PDP we can tabulate these terms for a given discount parameter and still vary the concentration parameter (the mass here) easily. For the NGG, any tables of the terms would need to be recomputed with every change in the mass parameter. This might represent a significant computational burden.
4.2 Conditional Posterior
James et al. [JLP09] also develop conditional posterior analysis as follows. This theorem simplifies their results and specialises them to the NGG.
Theorem 3 (Conditional Posterior Analysis for the NGG)
Consider the NGG and the situation of Theorem 2. The conditional posterior marginal, conditioned on the auxiliary variable , is given by
Moreover, the predictive posterior is given by:
where the weights, which sum to 1, are derived as
The posterior for is given by:
A posterior distribution is also presented by James et al. as the major result, Theorem 1, of [JLP09]. We adapt it here to the NGG.
In the context of Theorem 3 the conditional posterior of the normalised random measure given data of length and latent relative mass is given by
Here, , and are jointly independent and , and are jointly independent.
Note in particular that the densities given for the two parts are not independent of each other. While an explicit density is not given for the remaining component, its expected value is easily computed via its Laplace transform.
Griffin et al. [GW11] present an alternative technique for obtaining the conditional posterior. The following is adapted from their main sampler after integrating out the slice variables.
Theorem 5 (Sampling Posterior)
Consider a bound which is sufficiently small that it is less than all the jumps associated with the observed data. For an NRM given as above, the number of jumps with value above the bound is a random variable, as are their values. The resultant posterior is as follows: