Information criteria for non-normalized models

Many statistical models are given in the form of non-normalized densities with an intractable normalization constant. Since maximum likelihood estimation is computationally intensive for these models, several estimation methods have been developed which do not require explicit computation of the normalization constant, such as noise contrastive estimation (NCE) and score matching. However, model selection methods for general non-normalized models have not been proposed so far. In this study, we develop information criteria for non-normalized models estimated by NCE or score matching. They are derived as approximately unbiased estimators of discrepancy measures for non-normalized models. Experimental results demonstrate that the proposed criteria enable selection of the appropriate non-normalized model in a data-driven manner. Extension to a finite mixture of non-normalized models is also discussed.


1 Introduction

Consider a parametric distribution

\[
p(x\mid\theta)=\frac{1}{Z(\theta)}\,\tilde{p}(x\mid\theta), \tag{1}
\]

where $\theta$ is an unknown parameter and $Z(\theta)=\int \tilde{p}(x\mid\theta)\,dx$ is the normalization constant. Many statistical models are defined in the form of non-normalized densities or probability functions $\tilde{p}(x\mid\theta)$ for which the calculation of $Z(\theta)$ is intractable: for instance, Markov random field models (Li, 2001), truncated Gaussian graphical models (Lin et al., 2016), and energy-based overcomplete independent component analysis models (Teh et al., 2004). Such models are often called non-normalized models or unnormalized models. Since maximum likelihood estimation is computationally intensive for non-normalized models, several estimation methods have been developed which avoid calculation of the normalization constant. These methods include pseudo-likelihood (Besag, 1974), Monte Carlo maximum likelihood (Geyer, 1994), contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), and noise contrastive estimation (Gutmann and Hyvärinen, 2010).

Among them, noise contrastive estimation (NCE) does not require Markov chain Monte Carlo and is also applicable to general non-normalized models for both continuous and discrete data. In NCE, the normalization constant is estimated together with the unknown parameter $\theta$ by discriminating between data and artificially generated noise, a formulation related to generative adversarial networks (Goodfellow et al., 2014). On the other hand, score matching is a computationally efficient method for continuous data which is based on a simple trick of integration by parts. The idea of score matching has been generalized to the theory of proper local scoring rules (Parry et al., 2012) and has also been applied to Bayesian model selection with improper priors (Dawid and Musio, 2015; Shao et al., 2019).

Although non-normalized models enable more flexible modeling of data-generating processes, versatile model selection methods for these models have not been proposed so far, to the best of our knowledge. In general, model selection is the task of selecting a statistical model from several candidates based on data (Burnham and Anderson, 2002; Claeskens and Hjort, 2008; Konishi and Kitagawa, 2008). By selecting an appropriate model in a data-driven manner, we obtain a better understanding of the underlying phenomena and also better prediction of future observations. Akaike (1974) established a unified approach to model selection from the viewpoint of information theory and entropy. Specifically, he proposed the Akaike Information Criterion (AIC), which quantifies the discrepancy between the true distribution and the estimated model in terms of the Kullback–Leibler divergence; the model with minimum AIC is selected as the best model. AIC is widely used in many areas and has been extended by several studies (Takeuchi, 1976; Konishi and Kitagawa, 1996; Kitagawa, 1997; Spiegelhalter et al., 2002). However, these existing information criteria assume that the model is normalized, and thus they are not applicable to non-normalized models.

In this study, we develop information criteria for non-normalized models estimated by NCE or score matching. For NCE, based on the observation that NCE is a projection with respect to a Bregman divergence (Gutmann and Hirayama, 2011), we derive the noise contrastive information criterion (NCIC) as an approximately unbiased estimator of the model discrepancy induced by this Bregman divergence. Note that AIC (Akaike, 1974) was derived as an approximately unbiased estimator of the Kullback–Leibler discrepancy. Similarly, for score matching, we develop the score matching information criterion (SMIC) as an approximately unbiased estimator of the model discrepancy induced by the Fisher divergence (Lyu, 2009). Thus, the model with the minimum NCIC or SMIC is selected as the best model. Experimental results show that these procedures successfully select the appropriate non-normalized model in a data-driven manner. Thus, this study increases the practicality of non-normalized models. We note that Ji and Seymour (1996) proposed model selection criteria for non-normalized models estimated by the pseudo-likelihood (Besag, 1974). Whereas their criteria are useful for discrete-valued data, our criteria are applicable to continuous-valued data, and NCIC is equally applicable to discrete-valued data.

This paper is organized as follows. In Sections 2 and 3, we briefly review noise contrastive estimation (NCE) and score matching, respectively. In Section 4, we review the Akaike information criterion (AIC). In Sections 5 and 6, we derive information criteria for non-normalized models estimated by NCE and score matching, respectively. In Section 7, we confirm the validity of NCIC and SMIC by numerical experiments. In Section 8, we discuss the extension of NCIC to non-normalized mixture models. In Section 9, we give concluding remarks.

2 Noise contrastive estimation (NCE)

In this section, we briefly review noise contrastive estimation (NCE), which is a general method for estimating non-normalized models. For more detail, see Gutmann and Hyvärinen (2012).

2.1 Procedure of NCE

In NCE, we rewrite the non-normalized model (1) as
\[
\log p(x\mid\theta,c)=\log\tilde{p}(x\mid\theta)+c, \tag{2}
\]
where $c=-\log Z(\theta)$. We regard $c$ as an additional parameter and estimate it together with $\theta$.

Suppose we have $N$ i.i.d. samples $x_1,\ldots,x_N$ from the non-normalized model (1). In addition to the data, we generate $M$ noise samples $y_1,\ldots,y_M$ from a noise distribution $n(\cdot)$. The noise distribution should be as close as possible to the true data distribution while having a tractable probability density function: for example, the normal distribution with the same mean and covariance as the data. Then, we estimate $(\theta,c)$ by discriminating between the data and the noise as accurately as possible:

\[
(\hat{\theta}_{\mathrm{NCE}},\hat{c}_{\mathrm{NCE}})=\operatorname*{argmin}_{\theta,c}\,\hat{d}_{\mathrm{NCE}}(\theta,c), \tag{3}
\]
where
\[
\hat{d}_{\mathrm{NCE}}(\theta,c)=-\frac{1}{N}\sum_{t=1}^{N}\log\frac{Np(x_t\mid\theta,c)}{Np(x_t\mid\theta,c)+Mn(x_t)}-\frac{1}{N}\sum_{t=1}^{M}\log\frac{Mn(y_t)}{Np(y_t\mid\theta,c)+Mn(y_t)}. \tag{4}
\]

The objective function $\hat{d}_{\mathrm{NCE}}$ is the negative log-likelihood of a logistic regression classifier. Note that $\int p(x\mid\hat{\theta}_{\mathrm{NCE}},\hat{c}_{\mathrm{NCE}})\,dx\neq 1$ in general, and so the model estimated by NCE is not exactly normalized for a finite sample. NCE has consistency and asymptotic normality under mild regularity conditions (Gutmann and Hyvärinen, 2012; Uehara et al., 2018).
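To make the procedure concrete, the following sketch (our illustration, not an experiment from the paper) runs NCE on the toy one-parameter model $\tilde{p}(x\mid\theta)=\exp(-\theta x^{2}/2)$, for which the true values are $\theta^{*}=1$ and $c^{*}=-\log Z(\theta^{*})=-\tfrac{1}{2}\log(2\pi)$. The sample sizes, the $N(0,4)$ noise distribution, and plain gradient descent on (4) are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = M = 2000
x = rng.normal(0.0, 1.0, N)        # data: true theta* = 1, c* = -0.5*log(2*pi)
y = rng.normal(0.0, 2.0, M)        # noise: N(0, 4), wider than the data, tractable density

def log_n(z):                      # log density of the noise distribution
    return -0.5 * (z / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

def logit(z, theta, c):            # G(z) = log(N p(z|theta,c)) - log(M n(z))
    return np.log(N) - 0.5 * theta * z**2 + c - np.log(M) - log_n(z)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta, c = 0.5, 0.0                # initial guess
for _ in range(3000):              # full-batch gradient descent on the objective (4)
    gx = sigmoid(-logit(x, theta, c))   # misclassification weight of each data point
    gy = sigmoid(logit(y, theta, c))    # misclassification weight of each noise point
    grad_theta = (np.sum(gx * 0.5 * x**2) - np.sum(gy * 0.5 * y**2)) / N
    grad_c = (np.sum(gy) - np.sum(gx)) / N
    theta -= 0.1 * grad_theta
    c -= 0.1 * grad_c
print(theta, c)                    # both should be close to (1, -0.919)
```

Because $\log p(x\mid\theta,c)$ is linear in $(\theta,c)$ here, the objective is a convex logistic loss, so gradient descent converges to the unique NCE estimate.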

2.2 Bregman divergence related to NCE

Gutmann and Hirayama (2011) pointed out that NCE can be interpreted as a projection with respect to a Bregman divergence. Specifically, consider the Bregman divergence between two nonnegative measures $q$ and $p$ defined as
\[
D_{\mathrm{NCE}}(q,p)=\int d_f\!\left(\frac{q(x)}{n(x)},\frac{p(x)}{n(x)}\right)n(x)\,dx, \tag{5}
\]
where $n$ is a probability density and
\[
d_f(a,b)=f(a)-f(b)-f'(b)(a-b),\qquad f(x)=x\log x-\left(\frac{M}{N}+x\right)\log\!\left(1+\frac{N}{M}x\right). \tag{6}
\]

This divergence is decomposed as
\[
D_{\mathrm{NCE}}(q,p)=g(q)+d_{\mathrm{NCE}}(q,p),
\]
where $g(q)$ is a quantity depending only on $q$ and
\[
d_{\mathrm{NCE}}(q,p)=-\int q(x)\log\frac{Np(x)}{Np(x)+Mn(x)}\,dx-\frac{M}{N}\int n(y)\log\frac{Mn(y)}{Np(y)+Mn(y)}\,dy. \tag{7}
\]

Then, the objective function of NCE in (4) satisfies
\[
\mathbb{E}_y[\hat{d}_{\mathrm{NCE}}(\theta,c)]=d_{\mathrm{NCE}}(\hat{q},p_{\theta,c}), \tag{8}
\]
where $\hat{q}$ is the empirical distribution of $x_1,\ldots,x_N$, $p_{\theta,c}(x)=p(x\mid\theta,c)$, and $\mathbb{E}_y$ denotes the expectation with respect to the noise samples $y_1,\ldots,y_M$. Thus, NCE is interpreted as minimizing the discrepancy between the empirical distribution $\hat{q}$ and the model distribution $p_{\theta,c}$. This is analogous to the maximum likelihood estimator being interpreted as minimizing the Kullback–Leibler discrepancy between the empirical distribution and the model distribution in (13). Uehara et al. (2018) showed that the function $f$ in (6) is optimal in terms of asymptotic variance.

3 Score matching

In this section, we briefly review the score matching estimator (Hyvärinen, 2005), which is a computationally efficient estimation method for non-normalized models of continuous data.

The score matching method is based on a divergence called the Fisher divergence (Lyu, 2009; Gutmann and Hirayama, 2011). For two probability distributions $q$ and $p$ on $\mathbb{R}^d$, the Fisher divergence is defined as
\[
D_{\mathrm{F}}(q,p)=\int\sum_{i=1}^{d}\left(\frac{\partial}{\partial x_i}\log q(x)-\frac{\partial}{\partial x_i}\log p(x)\right)^{2}q(x)\,dx.
\]

By using integration by parts, it can be rewritten as
\[
D_{\mathrm{F}}(q,p)=g(q)+d_{\mathrm{SM}}(q,p),
\]
where $g(q)$ is a quantity depending only on $q$ and
\[
d_{\mathrm{SM}}(q,p)=\int\left(2\sum_{i=1}^{d}\frac{\partial^{2}}{\partial x_i^{2}}\log p(x)+\sum_{i=1}^{d}\left(\frac{\partial}{\partial x_i}\log p(x)\right)^{2}\right)q(x)\,dx. \tag{9}
\]

Now, suppose we have i.i.d. samples $x_1,\ldots,x_N$ from an unknown distribution $q$ and fit the non-normalized model (1). Then, an unbiased estimator of $d_{\mathrm{SM}}(q,p_\theta)$ in (9) is obtained as
\[
\hat{d}_{\mathrm{SM}}(\theta)=\frac{1}{N}\sum_{t=1}^{N}\rho_{\mathrm{SM}}(x_t,\theta),
\]
where
\[
\rho_{\mathrm{SM}}(x,\theta)=2\sum_{i=1}^{d}\frac{\partial^{2}}{\partial x_i^{2}}\log\tilde{p}(x\mid\theta)+\sum_{i=1}^{d}\left(\frac{\partial}{\partial x_i}\log\tilde{p}(x\mid\theta)\right)^{2}.
\]
Importantly, we do not need $Z(\theta)$ for computing $\hat{d}_{\mathrm{SM}}(\theta)$. Thus, the score matching estimator is defined as
\[
\hat{\theta}_{\mathrm{SM}}=\operatorname*{argmin}_{\theta}\,\hat{d}_{\mathrm{SM}}(\theta).
\]

This estimator has consistency and asymptotic normality under mild regularity conditions (Hyvärinen, 2005).

Hyvärinen (2007) extended score matching to non-normalized models on $\mathbb{R}_{+}^{d}$ by considering the divergence
\[
D_{\mathrm{F}+}(q,p)=\int_{\mathbb{R}_{+}^{d}}\sum_{i=1}^{d}\left(x_i\frac{\partial}{\partial x_i}\log q(x)-x_i\frac{\partial}{\partial x_i}\log p(x)\right)^{2}q(x)\,dx.
\]
Through a similar argument to the original score matching, the score matching estimator for non-negative data is defined as
\[
\hat{\theta}_{\mathrm{SM}+}=\operatorname*{argmin}_{\theta}\,\hat{d}_{\mathrm{SM}+}(\theta),
\]
where
\[
\hat{d}_{\mathrm{SM}+}(\theta)=\frac{1}{N}\sum_{t=1}^{N}\rho_{\mathrm{SM}+}(x_t,\theta).
\]

For exponential families, the objective functions of the score matching estimators reduce to quadratic forms (Hyvärinen, 2007; Forbes and Lauritzen, 2015). Specifically, for an exponential family
\[
p(x\mid\theta)=h(x)\exp\left(\sum_{k=1}^{m}\theta_k T_k(x)-\psi(\theta)\right)
\]
on $\mathbb{R}^{d}$ or $\mathbb{R}_{+}^{d}$, the function $\rho_{\mathrm{SM}}$ or $\rho_{\mathrm{SM}+}$ is given by a quadratic form
\[
\frac{1}{2}\theta^{\top}\Gamma(x)\theta+g(x)^{\top}\theta+c(x). \tag{10}
\]
For the exact forms of $\Gamma$, $g$, and $c$, see Lin et al. (2016). Therefore, the score matching estimator is obtained by solving the following linear equation:
\[
\left(\sum_{t=1}^{N}\Gamma(x_t)\right)\hat{\theta}+\sum_{t=1}^{N}g(x_t)=0.
\]
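To make the reduction concrete, consider (as an illustrative example of ours, not one from the paper) the one-parameter family $\tilde{p}(x\mid\theta)=\exp(-\theta x^{2}/2)$ on $\mathbb{R}$. Here $\rho_{\mathrm{SM}}(x,\theta)=\theta^{2}x^{2}-2\theta$, so in the notation of (10) we can take $\Gamma(x)=2x^{2}$ and $g(x)=-2$, and the linear equation gives $\hat{\theta}=N/\sum_t x_t^{2}$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5000)      # samples from the family with true precision theta* = 1

Gamma = 2.0 * x**2                  # Gamma(x): rho_SM(x, theta) = (1/2) theta * 2x^2 * theta - 2 theta
g = -2.0 * np.ones_like(x)          # g(x) = -2 for every sample
theta_hat = -g.sum() / Gamma.sum()  # solve (sum_t Gamma(x_t)) theta + sum_t g(x_t) = 0
print(theta_hat)                    # close to 1
```

In higher-dimensional exponential families the same computation becomes a small linear solve with $\sum_t\Gamma(x_t)$ as the coefficient matrix.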

4 Akaike information criterion (AIC)

In this section, we briefly review the theory of Akaike information criterion. For more detail, see Burnham and Anderson (2002) and Konishi and Kitagawa (2008).

Suppose we have independent and identically distributed (i.i.d.) samples $x^N=(x_1,\ldots,x_N)$ from an unknown distribution $q$. Based on them, we predict a future observation $z$ from $q$ by using a predictive distribution. For this aim, we assume a parametric distribution $p(z\mid\theta)$ with an unknown parameter $\theta\in\mathbb{R}^{k}$ and estimate $\theta$ from $x^N$ by the maximum likelihood estimator, defined as
\[
\hat{\theta}_{\mathrm{MLE}}(x^N)=\operatorname*{argmax}_{\theta}\sum_{t=1}^{N}\log p(x_t\mid\theta).
\]

By plugging in the maximum likelihood estimate, a predictive distribution $p(z\mid\hat{\theta}_{\mathrm{MLE}}(x^N))$ is obtained. Then, the difference between the true distribution and the predictive distribution is evaluated by the Kullback–Leibler divergence
\[
D_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))=\int q(z)\log\frac{q(z)}{p(z\mid\hat{\theta}_{\mathrm{MLE}}(x^N))}\,dz.
\]
This Kullback–Leibler divergence is decomposed as
\[
D_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))=\mathbb{E}_z[\log q(z)]+d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N)), \tag{11}
\]
where $\mathbb{E}_z$ denotes the expectation with respect to $q(z)$ and
\[
d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))=-\mathbb{E}_z[\log p(z\mid\hat{\theta}_{\mathrm{MLE}}(x^N))]
\]
is the Kullback–Leibler discrepancy from the true distribution to the predictive distribution. Here, the first term in (11) does not depend on the model. Thus, information criteria are derived as approximately unbiased estimators of the expected Kullback–Leibler discrepancy $\mathbb{E}_x[d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))]$, where $\mathbb{E}_x$ denotes the expectation with respect to $x^N$.

Let $\hat{q}$ be the empirical distribution of $x_1,\ldots,x_N$. Then, the quantity
\[
d_{\mathrm{KL}}(\hat{q},\hat{\theta}_{\mathrm{MLE}}(x^N))=-\frac{1}{N}\sum_{t=1}^{N}\log p(x_t\mid\hat{\theta}_{\mathrm{MLE}}(x^N)) \tag{12}
\]
can be considered as an estimator of $d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))$. However, this simple estimator has negative bias, because the maximum likelihood estimate is defined to minimize $d_{\mathrm{KL}}(\hat{q},\theta)$:
\[
\hat{\theta}_{\mathrm{MLE}}(x^N)=\operatorname*{argmin}_{\theta}\,d_{\mathrm{KL}}(\hat{q},\theta). \tag{13}
\]
Therefore, information criteria are derived by correcting this bias.
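The negative bias of the plug-in estimator (12) is easy to see in a small simulation (an illustrative experiment of ours, not from the paper): for a normal location model with known variance, the in-sample negative log-likelihood is systematically smaller than the out-of-sample one, by roughly $k/N$ on average.

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 50, 2000
gaps = []
for _ in range(reps):
    train = rng.normal(0.0, 1.0, N)
    test = rng.normal(0.0, 1.0, N)
    mu = train.mean()              # MLE of the mean (variance known and fixed at 1)
    nll = lambda z: 0.5 * np.mean((z - mu) ** 2) + 0.5 * np.log(2.0 * np.pi)
    gaps.append(nll(test) - nll(train))   # out-of-sample minus in-sample discrepancy
print(np.mean(gaps))               # theory predicts about k/N = 1/50 = 0.02
```

Averaged over replications, the gap matches the first-order bias that AIC and TIC correct.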

Consider the asymptotics $N\to\infty$. Then, as shown in Burnham and Anderson (2002),
\[
\mathbb{E}_x[d_{\mathrm{KL}}(\hat{q},\hat{\theta}_{\mathrm{MLE}}(x^N))]-\mathbb{E}_x[d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))]=-\frac{1}{N}\operatorname{tr}(I(\theta^*)J(\theta^*)^{-1})+O_p(N^{-2}), \tag{14}
\]
where
\[
\theta^*=\operatorname*{argmin}_{\theta}\,d_{\mathrm{KL}}(q,\theta)
\]
and the matrices $I(\theta)$ and $J(\theta)$ are defined as
\[
I_{ij}(\theta)=\mathbb{E}_z\!\left[\frac{\partial}{\partial\theta_i}\log p(z\mid\theta)\frac{\partial}{\partial\theta_j}\log p(z\mid\theta)\right],\qquad
J_{ij}(\theta)=-\mathbb{E}_z\!\left[\frac{\partial^{2}}{\partial\theta_i\partial\theta_j}\log p(z\mid\theta)\right].
\]

Based on (12) and (14), the Takeuchi Information Criterion (TIC; Takeuchi, 1976) is defined as
\[
\mathrm{TIC}=-2\sum_{t=1}^{N}\log p(x_t\mid\hat{\theta}_{\mathrm{MLE}}(x^N))+2\operatorname{tr}(\hat{I}\hat{J}^{-1}), \tag{15}
\]
where the matrices $\hat{I}$ and $\hat{J}$ are given by
\[
\hat{I}_{ij}=\frac{1}{N}\sum_{t=1}^{N}\left.\frac{\partial}{\partial\theta_i}\log p(x_t\mid\theta)\frac{\partial}{\partial\theta_j}\log p(x_t\mid\theta)\right|_{\theta=\hat{\theta}_{\mathrm{MLE}}(x^N)},\qquad
\hat{J}_{ij}=-\frac{1}{N}\sum_{t=1}^{N}\left.\frac{\partial^{2}}{\partial\theta_i\partial\theta_j}\log p(x_t\mid\theta)\right|_{\theta=\hat{\theta}_{\mathrm{MLE}}(x^N)}.
\]
TIC is an approximately unbiased estimator of the expected Kullback–Leibler discrepancy:
\[
\mathbb{E}_x[\mathrm{TIC}]=2N\,\mathbb{E}_x[d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))]+O_p(N^{-1/2}).
\]

Assume that the model includes the true distribution: $q=p(\cdot\mid\theta^*)$ for some $\theta^*$. Then, both $I(\theta^*)$ and $J(\theta^*)$ in (14) coincide with the Fisher information matrix, and so $\operatorname{tr}(I(\theta^*)J(\theta^*)^{-1})=k$. Based on this, the Akaike Information Criterion (AIC; Akaike, 1974) is defined as
\[
\mathrm{AIC}=-2\sum_{t=1}^{N}\log p(x_t\mid\hat{\theta}_{\mathrm{MLE}}(x^N))+2k. \tag{16}
\]
AIC is an approximately unbiased estimator of the expected Kullback–Leibler discrepancy:
\[
\mathbb{E}_x[\mathrm{AIC}]=2N\,\mathbb{E}_x[d_{\mathrm{KL}}(q,\hat{\theta}_{\mathrm{MLE}}(x^N))]+O_p(N^{-1}).
\]
Thus, information criteria enable us to compare the goodness of fit of statistical models: among several candidate models, the model with the minimum information criterion is considered to be closest to the true data-generating process. In practice, TIC often suffers from instability caused by estimation errors in $\hat{I}$ and $\hat{J}$, and so the use of AIC is recommended regardless of whether the model is well-specified or not (see Burnham and Anderson, 2002, Section 2.3).
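As a minimal usage sketch (an illustration of ours, not from the paper), the following compares AIC (16) for two candidate normal models of the same data, a zero-mean model with no free parameter and a free-mean model with $k=1$, and selects the smaller value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, 200)      # data with true mean 0.5 and unit variance

def aic(loglik, k):                # AIC = -2 log-likelihood + 2 (number of parameters)
    return -2.0 * loglik + 2 * k

const = -0.5 * len(x) * np.log(2.0 * np.pi)
ll_zero = const - 0.5 * np.sum(x**2)                  # candidate 1: N(0, 1), k = 0
ll_free = const - 0.5 * np.sum((x - x.mean()) ** 2)   # candidate 2: N(mu, 1), MLE mu, k = 1
aic_zero, aic_free = aic(ll_zero, 0), aic(ll_free, 1)
print(aic_free < aic_zero)         # the free-mean model should be selected here
```

The extra parameter is worth its penalty of $2$ whenever it improves $2\times$ the log-likelihood by more, which is clearly the case for this data.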

5 Information criteria for NCE (NCIC)

In this section, we derive information criteria for NCE, which we call the Noise Contrastive Information Criterion (NCIC).

5.1 Bias calculation

To derive the bias correction terms in NCIC, we prepare some lemmata.

Suppose we have i.i.d. samples $x_1,\ldots,x_N$ from an unknown distribution $q$ and estimate a non-normalized model (2) by using NCE. Here, the true distribution $q$ may not be contained in the assumed non-normalized model.

For convenience, we denote $\xi=(\theta,c)$, $p_\xi(x)=p(x\mid\theta,c)$, and $\hat{\xi}_{\mathrm{NCE}}=(\hat{\theta}_{\mathrm{NCE}},\hat{c}_{\mathrm{NCE}})$. Also, we define
\[
\xi^*=\operatorname*{argmin}_{\xi}\,d_{\mathrm{NCE}}(q,p_\xi)=(\theta^*,c^*), \tag{17}
\]
and write $p^*=p_{\xi^*}$ and $\hat{p}=p_{\hat{\xi}_{\mathrm{NCE}}}$. Note that $p^*=q$ when the model includes the true distribution.

Rigorous treatment of the asymptotic theory for NCE requires the concept of stratified sampling (Wooldridge, 2001; Uehara et al., 2018). Namely, there are two strata: data (size $N$) and noise (size $M$). Correspondingly, we define
\[
\rho_d(x,\xi)=-\log\frac{Np(x\mid\xi)}{Np(x\mid\xi)+Mn(x)}, \tag{18}
\]
\[
\rho_n(y,\xi)=-\log\frac{Mn(y)}{Np(y\mid\xi)+Mn(y)}. \tag{19}
\]
Then, the objective function of NCE is represented as
\[
\hat{d}_{\mathrm{NCE}}(\xi)=\frac{1}{N}\sum_{t=1}^{N}\rho_d(x_t,\xi)+\frac{1}{N}\sum_{t=1}^{M}\rho_n(y_t,\xi).
\]
Following Gutmann and Hyvärinen (2012), we consider the asymptotics under stratified sampling where $N\to\infty$ and $M\to\infty$ with the ratio $M/N$ fixed. We denote the expectation with respect to $x_1,\ldots,x_N$ and $y_1,\ldots,y_M$ by $\mathbb{E}_{x,y}$.

As with (12) in Section 4, the quantity $\hat{d}_{\mathrm{NCE}}(\hat{\xi}_{\mathrm{NCE}})$ has negative bias as an estimator of $d_{\mathrm{NCE}}(q,\hat{p})$. The bias is calculated as follows. Here, $\nabla_\xi$ represents the gradient with respect to $\xi$, $\nabla_\xi^{2}$ represents the Hessian with respect to $\xi$, and $\mathbb{E}_q$ and $\mathrm{Cov}_q$ ($\mathbb{E}_n$ and $\mathrm{Cov}_n$) denote the expectation and covariance matrix with respect to $q$ ($n$).

Lemma 1.
\[
\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\hat{\xi}_{\mathrm{NCE}})]-\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]=-\frac{1}{N}\operatorname{tr}(I(\xi^*)J(\xi^*)^{-1})+o_p(N^{-1}), \tag{20}
\]
where the matrices $I(\xi)$ and $J(\xi)$ are defined as
\[
I(\xi)=\frac{N}{N+M}\mathrm{Cov}_q[\nabla_\xi\rho_d(z,\xi)]+\frac{M}{N+M}\mathrm{Cov}_n[\nabla_\xi\rho_n(z,\xi)],
\]
\[
J(\xi)=\frac{N}{N+M}\mathbb{E}_q[\nabla_\xi^{2}\rho_d(z,\xi)]+\frac{M}{N+M}\mathbb{E}_n[\nabla_\xi^{2}\rho_n(z,\xi)].
\]
Proof.

From Theorem 3.2 of Wooldridge (2001), the asymptotic distribution of NCE is
\[
\sqrt{N}(\hat{\xi}-\xi^*)\to N\!\left(0,\,J(\xi^*)^{-1}I(\xi^*)J(\xi^*)^{-1}\right).
\]
The left-hand side of (20) is decomposed as
\[
\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\hat{\xi})]-\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]=D_1+D_2+D_3,
\]
where
\[
D_1=\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\hat{\xi})]-\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\xi^*)],
\]
\[
D_2=\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\xi^*)]-\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,p^*)],
\]
\[
D_3=\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,p^*)]-\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})].
\]
From (4) and (7), we obtain $D_2=0$.

Since $\nabla_\xi d_{\mathrm{NCE}}(q,p_\xi)=0$ at $\xi=\xi^*$ and $J(\xi^*)$ is the Hessian of $d_{\mathrm{NCE}}(q,p_\xi)$ at $\xi=\xi^*$,
\[
d_{\mathrm{NCE}}(q,\hat{p})=d_{\mathrm{NCE}}(q,p^*)+\frac{1}{2}(\hat{\xi}-\xi^*)^{\top}J(\xi^*)(\hat{\xi}-\xi^*)+o_p(N^{-1}).
\]
Therefore,
\[
D_3=-\frac{1}{2}\mathbb{E}_{x,y}[(\hat{\xi}-\xi^*)^{\top}J(\xi^*)(\hat{\xi}-\xi^*)]+o_p(N^{-1})=-\frac{1}{2N}\operatorname{tr}(I(\xi^*)J(\xi^*)^{-1})+o_p(N^{-1}).
\]
Similarly, since $\nabla_\xi\hat{d}_{\mathrm{NCE}}(\xi)=0$ at $\xi=\hat{\xi}$ and $\nabla_\xi^{2}\hat{d}_{\mathrm{NCE}}(\hat{\xi})$ converges in probability to $J(\xi^*)$,
\[
D_1=-\frac{1}{2}\mathbb{E}_{x,y}[(\hat{\xi}-\xi^*)^{\top}\nabla_\xi^{2}\hat{d}_{\mathrm{NCE}}(\hat{\xi})(\hat{\xi}-\xi^*)]+o_p(N^{-1})=-\frac{1}{2N}\operatorname{tr}(I(\xi^*)J(\xi^*)^{-1})+o_p(N^{-1}).
\]
Hence, we obtain (20). ∎

When the model includes the true distribution (well-specified case), the bias takes a simpler form. Let
\[
b(z)=\frac{p^*(z)\,n(z)}{r(z)^{2}},
\]
where
\[
r(z)=\frac{N}{N+M}p^*(z)+\frac{M}{N+M}n(z) \tag{21}
\]
is a mixture distribution of $p^*$ and $n$.

Lemma 2.

Assume that the model includes the true distribution: $q=p^*$. Then,
\[
\mathbb{E}_{x,y}[\hat{d}_{\mathrm{NCE}}(\hat{\xi}_{\mathrm{NCE}})]-\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]=-\frac{1}{N}\left(m-\mathbb{E}_r[b(z)]\right)+o_p(N^{-1}), \tag{22}
\]
where $m$ is the dimension of $\xi$ and $\mathbb{E}_r$ denotes the expectation with respect to $r$ in (21).

Proof.

Let $s(z\mid\xi)=\nabla_\xi\log p(z\mid\xi)$ and let $j_m(\xi)$ be the $m$-th column vector of $J(\xi)$, which corresponds to the parameter $c$.

By straightforward calculation,
\[
J(\xi^*)=\frac{NM}{(N+M)^{2}}\int r(z)\,b(z)\,s(z\mid\xi^*)s(z\mid\xi^*)^{\top}dz,
\]
\[
I(\xi^*)=J(\xi^*)-\frac{(N+M)^{2}}{NM}\,j_m(\xi^*)j_m(\xi^*)^{\top}.
\]
Thus,
\[
\operatorname{tr}(I(\xi^*)J(\xi^*)^{-1})=m-\frac{(N+M)^{2}}{NM}\,j_m(\xi^*)^{\top}J(\xi^*)^{-1}j_m(\xi^*)=m-\mathbb{E}_r[b(z)]. \tag{23}
\]
Substituting (23) into (20), we obtain (22). ∎

Gutmann and Hyvärinen (2012) pointed out that NCE converges to the maximum likelihood estimator as $M/N\to\infty$. In this setting, $r$ converges to $n$ and so $\mathbb{E}_r[b(z)]$ goes to one. As a result, the coefficient of the leading term on the right-hand side of (22) goes to $m-1$, which is equal to the dimension of the parameter $\theta$.

Mattheou et al. (2009) derived an information criterion based on the density power divergence via a similar bias calculation. In comparison, the bias term here takes a more complicated form because NCE estimates not only the parameter $\theta$ but also the normalization constant.

5.2 Noise Contrastive Information Criterion (NCIC)

Now, we derive NCIC by using the bias calculation in the previous subsection.

Based on (18) and (19), let
\[
\overline{\nabla_\xi\rho_d}=\frac{1}{N}\sum_{t=1}^{N}\nabla_\xi\rho_d(x_t,\hat{\xi}),\qquad
\overline{\nabla_\xi\rho_n}=\frac{1}{M}\sum_{t=1}^{M}\nabla_\xi\rho_n(y_t,\hat{\xi}),
\]
and define the matrices $\hat{I}$ and $\hat{J}$ by
\[
\hat{I}=\frac{1}{N+M}\left(\sum_{t=1}^{N}\left(\nabla_\xi\rho_d(x_t,\hat{\xi})-\overline{\nabla_\xi\rho_d}\right)\left(\nabla_\xi\rho_d(x_t,\hat{\xi})-\overline{\nabla_\xi\rho_d}\right)^{\top}+\sum_{t=1}^{M}\left(\nabla_\xi\rho_n(y_t,\hat{\xi})-\overline{\nabla_\xi\rho_n}\right)\left(\nabla_\xi\rho_n(y_t,\hat{\xi})-\overline{\nabla_\xi\rho_n}\right)^{\top}\right),
\]
\[
\hat{J}=\frac{1}{N+M}\left(\sum_{t=1}^{N}\nabla_\xi^{2}\rho_d(x_t,\hat{\xi})+\sum_{t=1}^{M}\nabla_\xi^{2}\rho_n(y_t,\hat{\xi})\right).
\]

From the discussion in Section 3.3 of Wooldridge (2001), $\hat{I}$ and $\hat{J}$ are consistent estimators of $I(\xi^*)$ and $J(\xi^*)$, respectively. Thus, Lemma 1 leads to an information criterion for NCE as follows.

Theorem 1.

The quantity
\[
\mathrm{NCIC}_1=N\hat{d}_{\mathrm{NCE}}(\hat{\xi}_{\mathrm{NCE}})+\operatorname{tr}(\hat{I}\hat{J}^{-1}) \tag{24}
\]
is an approximately unbiased estimator of $N\,\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]$:
\[
\mathbb{E}_{x,y}[\mathrm{NCIC}_1]=N\,\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]+o_p(1).
\]
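For the toy one-parameter Gaussian model $\tilde{p}(x\mid\theta)=\exp(-\theta x^{2}/2)$ used earlier (an illustration of ours, not an experiment from the paper), all ingredients of (24) are available analytically: since $\log p(x\mid\xi)$ is linear in $\xi=(\theta,c)$, the per-sample gradients and Hessians of $\rho_d$ and $\rho_n$ reduce to sigmoid-weighted outer products of the score $s(z)=(-z^{2}/2,\,1)^{\top}$. The sketch below fits the model by NCE and then evaluates $\mathrm{NCIC}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = M = 2000
x = rng.normal(0.0, 1.0, N)                 # data: true theta* = 1 (well-specified)
y = rng.normal(0.0, 2.0, M)                 # noise: N(0, 4)

def log_n(z):
    return -0.5 * (z / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

def logit(z, theta, c):                     # G(z) = log(N p) - log(M n)
    return np.log(N) - 0.5 * theta * z**2 + c - np.log(M) - log_n(z)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta, c = 0.5, 0.0                         # fit by gradient descent on (4)
for _ in range(3000):
    gx = sigmoid(-logit(x, theta, c))
    gy = sigmoid(logit(y, theta, c))
    theta -= 0.1 * (np.sum(gx * 0.5 * x**2) - np.sum(gy * 0.5 * y**2)) / N
    c -= 0.1 * (np.sum(gy) - np.sum(gx)) / N

def s(z):                                   # score of log p wrt xi; log p is linear in xi
    return np.stack([-0.5 * z**2, np.ones_like(z)])

Gx, Gy = logit(x, theta, c), logit(y, theta, c)
grad_d = -sigmoid(-Gx) * s(x)               # per-sample gradients of rho_d at xi-hat
grad_n = sigmoid(Gy) * s(y)                 # per-sample gradients of rho_n at xi-hat
cd = grad_d - grad_d.mean(axis=1, keepdims=True)
cn = grad_n - grad_n.mean(axis=1, keepdims=True)
I_hat = (cd @ cd.T + cn @ cn.T) / (N + M)
w_x = sigmoid(Gx) * sigmoid(-Gx)            # Hessian weight: rho'' = sigma(G) sigma(-G) s s^T
w_y = sigmoid(Gy) * sigmoid(-Gy)
J_hat = ((w_x * s(x)) @ s(x).T + (w_y * s(y)) @ s(y).T) / (N + M)
d_hat = (np.sum(np.logaddexp(0.0, -Gx)) + np.sum(np.logaddexp(0.0, Gy))) / N
tr_term = np.trace(I_hat @ np.linalg.inv(J_hat))
ncic1 = N * d_hat + tr_term
print(ncic1, tr_term)
```

In this well-specified setting, Lemma 2 predicts that the correction term lies between $m-1=1$ and $m=2$, which the computed trace should roughly reproduce.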

When the model includes the true distribution, we can simplify the information criterion by using Lemma 2. Let
\[
\hat{b}(z)=\frac{\hat{p}(z)\,n(z)}{\hat{r}(z)^{2}}, \tag{25}
\]
where
\[
\hat{r}(z)=\frac{N}{N+M}\hat{p}(z)+\frac{M}{N+M}n(z).
\]
Theorem 2.

Assume that the model includes the true distribution: $q=p^*$. Then, the quantity
\[
\mathrm{NCIC}_2=N\hat{d}_{\mathrm{NCE}}(\hat{\xi}_{\mathrm{NCE}})+m-\frac{1}{N+M}\left(\sum_{t=1}^{N}\hat{b}(x_t)+\sum_{t=1}^{M}\hat{b}(y_t)\right) \tag{26}
\]
is an approximately unbiased estimator of $N\,\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]$:
\[
\mathbb{E}_{x,y}[\mathrm{NCIC}_2]=N\,\mathbb{E}_{x,y}[d_{\mathrm{NCE}}(q,\hat{p})]+o_p(1).
\]

By minimizing NCIC, we can select an appropriate model from among non-normalized models (2) estimated by NCE. $\mathrm{NCIC}_1$ in (24) and $\mathrm{NCIC}_2$ in (26) are viewed as analogues of TIC (15) and AIC (16) for non-normalized models, respectively. As will be shown in Section 7.1, $\mathrm{NCIC}_2$ has much smaller variance than $\mathrm{NCIC}_1$; it is also easier to compute. Therefore, the use of $\mathrm{NCIC}_2$ is recommended when the model is not considered to be badly mis-specified. This situation is quite similar to that of TIC and AIC (see Burnham and Anderson, 2002, Section 2.3).

6 Information criteria for score matching (SMIC)

In this section, we derive an information criterion for score matching, which we call the Score Matching Information Criterion (SMIC). For convenience, we focus on the original score matching estimator in the following. Analogous results for the score matching estimator for non-negative data are obtained by replacing $\hat{d}_{\mathrm{SM}}$ and $\rho_{\mathrm{SM}}$ with $\hat{d}_{\mathrm{SM}+}$ and $\rho_{\mathrm{SM}+}$, respectively.

Suppose we have i.i.d. samples $x_1,\ldots,x_N$ from an unknown distribution $q$ and fit a non-normalized model (1) on $\mathbb{R}^{d}$ by score matching. Here, the true distribution $q$ may not be contained in the assumed non-normalized model. For convenience, we denote $\hat{p}=p_{\hat{\theta}_{\mathrm{SM}}}$. Also, we define
\[
\theta^*=\operatorname*{argmin}_{\theta}\,d_{\mathrm{SM}}(q,p_\theta),
\]
and write $p^*=p_{\theta^*}$. Note that $p^*=q$ when the model includes the true distribution. We consider the asymptotics $N\to\infty$.

As with (12) in Section 4, the quantity $\hat{d}_{\mathrm{SM}}(\hat{\theta}_{\mathrm{SM}})$ has negative bias as an estimator of $d_{\mathrm{SM}}(q,\hat{p})$. By using a similar argument to Lemma 1, the bias is calculated as follows. Here, $\nabla_\theta$ and $\nabla_\theta^{2}$ denote the gradient and the Hessian with respect to $\theta$, respectively, and $\mathbb{E}_q$ and $\mathrm{Cov}_q$ denote the expectation and covariance matrix with respect to $q$.

Lemma 3.
\[
\mathbb{E}_x[\hat{d}_{\mathrm{SM}}(\hat{\theta}_{\mathrm{SM}})]-\mathbb{E}_x[d_{\mathrm{SM}}(q,\hat{p})]=-\frac{1}{N}\operatorname{tr}(I(\theta^*)J(\theta^*)^{-1})+o_p(N^{-1}),
\]
where the matrices $I(\theta)$ and $J(\theta)$ are defined as
\[
I(\theta)=\mathrm{Cov}_q[\nabla_\theta\rho_{\mathrm{SM}}(z,\theta)],\qquad J(\theta)=\mathbb{E}_q[\nabla_\theta^{2}\rho_{\mathrm{SM}}(z,\theta)].
\]

Let
\[
\hat{I}=\frac{1}{N}\sum_{t=1}^{N}\left.\nabla_\theta\rho_{\mathrm{SM}}(x_t,\theta)\nabla_\theta\rho_{\mathrm{SM}}(x_t,\theta)^{\top}\right|_{\theta=\hat{\theta}},\qquad
\hat{J}=\frac{1}{N}\sum_{t=1}^{N}\left.\nabla_\theta^{2}\rho_{\mathrm{SM}}(x_t,\theta)\right|_{\theta=\hat{\theta}}.
\]
Since $\hat{I}$ and $\hat{J}$ are consistent estimators of $I(\theta^*)$ and $J(\theta^*)$, Lemma 3 leads to an information criterion for score matching as follows.

Theorem 3.

The quantity
\[
\mathrm{SMIC}=N\hat{d}_{\mathrm{SM}}(\hat{\theta}_{\mathrm{SM}})+\operatorname{tr}(\hat{I}\hat{J}^{-1}) \tag{27}
\]
is an approximately unbiased estimator of $N\,\mathbb{E}_x[d_{\mathrm{SM}}(q,\hat{p})]$:
\[
\mathbb{E}_x[\mathrm{SMIC}]=N\,\mathbb{E}_x[d_{\mathrm{SM}}(q,\hat{p})]+o_p(1).
\]
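For the toy family $\tilde{p}(x\mid\theta)=\exp(-\theta x^{2}/2)$ (an illustrative example of ours, not from the paper), $\rho_{\mathrm{SM}}(x,\theta)=\theta^{2}x^{2}-2\theta$, so every quantity in (27) is available in closed form: $\hat{\theta}_{\mathrm{SM}}=1/\overline{x^{2}}$, and $\hat{I}$ and $\hat{J}$ are scalar sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000
x = rng.normal(0.0, 1.0, N)                    # data: true precision theta* = 1

m2 = np.mean(x**2)
theta_hat = 1.0 / m2                           # score matching estimate: minimizer of d_hat
d_hat = theta_hat**2 * m2 - 2.0 * theta_hat    # (1/N) sum_t rho_SM(x_t, theta_hat)

grad = 2.0 * theta_hat * x**2 - 2.0            # per-sample d rho_SM / d theta at theta_hat
I_hat = np.mean(grad**2)                       # scalar version of I-hat
J_hat = np.mean(2.0 * x**2)                    # d^2 rho_SM / d theta^2 = 2 x^2
smic = N * d_hat + I_hat / J_hat
print(theta_hat, I_hat / J_hat)
```

Note that the estimated correction $\hat{I}/\hat{J}$ comes out noticeably larger than the parameter dimension $k=1$ here, reflecting that the SMIC penalty is a TIC-style trace in the Fisher-divergence geometry rather than a simple parameter count.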