In brain oncology, it is routine to evaluate the progression or remission of the disease based on differences between a pre-treatment and a post-treatment Positron Emission Tomography (PET) three-dimensional scan (Valk et al., 2003). Using markers such as injected F-18-fluorodeoxyglucose (FDG), these images measure the activity of glucose metabolism in human brains, which is normalized by dose and patient weight to standard uptake value (SUV) units.
A standard practice in this setting is to conduct analyses based on differences between scalar summaries of the two PET scans (Young et al., 1999; Wahl et al., 2009), such as the maximum SUV, calculated within user-defined regions of interest (ROIs) (Zasadny and Wahl, 1993; Lee et al., 2009; Takeda et al., 2011). However, due to variations in scanner settings, neurological activity or pharmacological drug effects, the scans may exhibit background differences even without significant progression or remission of the disease itself (Soret et al., 2007; Bai et al., 2013; Soffientini et al., 2016). The background effect may even be spatially heterogeneous, varying according to tissue type (Guo et al., 2014; Qin et al., 2017). Moreover, lesion changes may be missed if those lesions are not included in the user-defined ROIs. For these reasons, Guo et al. (2014) proposed a new approach comparing the two PET scans voxelwise.
As a motivating example that will be detailed in Section 5, Figure 1 shows data from the phantom experiment in Qin et al. (2017) simulating pre- and post-treatment scans with a tumor lesion. A direct voxelwise difference between the two scans shows a global non-homogeneous background change while failing to detect changes in the lesion (Figure 1, Row 1 and Column 3). This observation suggests that background adjustment is necessary in voxelwise comparisons to reduce confounding by tissue-dependent changes not related to the disease, in order to isolate localized differences that are relevant to assess the disease status.
|Scan 1||Scan 2||Difference|
Statistical models have been intensively used in quantitative analysis for PET imaging to provide automated and objective assessment (Leahy and Qi, 2000; Borghammer et al., 2009; O’Sullivan et al., 2014). Gaussian mixture models (GMM) are one of most popular approaches (Zhang et al., 1994; Guo et al., 2014; Soffientini et al., 2016), commonly assuming that voxel intensities belong to a mixture model with three components representing gray matter (GM), white matter (WM) and cerebro-spinal fluid (CSF). This is standard, for example, in the widely used software Statistical Parametric Mapping or SPM (Ashburner and Friston, 2005; Ashburner, 2012).
Modeling the voxel intensities in the two scans as a bivariate mixture model with GM, WM and CSF components, the background adjustment procedure in Guo et al. (2014) consisted of standardizing a multivariate Gaussian mixture model at each voxel. Large localized changes, presumably representing tumors, were then detected by performing a statistical test at each voxel with respect to an empirical null distribution. Following up on that work, Qin et al. (2017)
made a heuristic observation that the distribution of the standardized difference scoresacross voxels was close to standard normal, supporting the use of the standard normal as the null distribution for the voxelwise tests. From a pure frequentist perspective, however, the p-value for each voxel should be computed from the distribution of the standardized score at that voxel. Qin et al. (2017) offered no investigation of whether and why the distribution of the standardized scores at each voxel, whose standardization is imperfect and different, should be standard normal.
Motivated by the problem above, this paper addresses the question of whether the distribution of the standardized score at each voxel is close to standard normal, considering a range of challenges from realistic PET image settings with lesions present, spatial heterogeneity, and tissue dependent background changes. This comparison is most important at the tails of the distribution (one-sided or two-sided), where inference usually occurs.
In its simplest form, the question can be formulated more abstractly as follows. Suppose we have observations from a multivariate Gaussian mixture model, among which there are some outliers we wish to detect. As an illustration, Figure 2(a) shows a bivariate mixture model with some outliers that do not belong to any of the classes but also do not constitute enough data to fit a third class. Is it possible to standardize the mixture so that the distribution will be centered around the origin with identity covariance? If so, the standardized mixture represents a null distribution against which the outlying observations can be detected using a statistical test, as illustrated in Figure 2(b). But will the tails of the distribution be close enough to standard normal so that p-values for detection are valid?
There is a rich literature on the estimation and application of finite mixture models (McLachlan and Peel, 2000), leading to numerous research areas such as the expectation-maximization (EM) algorithm (Dempster et al., 1977; Redner and Walker, 1984; Gupta and Chen, 2011), the estimation of unknown components (Richardson and Green, 1997; Vlassis and Likas, 1999; Stephens, 2000; Lo et al., 2001; Figueiredo and Jain, 2002), and mixture of general distributions other than Gaussian (Peel and McLachlan, 2000; Dasgupta et al., 2005; Hanson, 2006; Lin et al., 2007; Venturini et al., 2008), only to name a few. However, there is limited work on inference problems such as hypothesis testing in a mixture model context.
In the first half of this paper, we study the standardization of a Gaussian mixture model systematically in various but simple ways. We show that, surprisingly, the tail distribution of the standardized scores is favorably close to standard normal in a wide range of scenarios while being conservative at the tails, making it suitable for statistical inference. Compared to the standardization method for background adjustment in Guo et al. (2014) and Qin et al. (2017), we consider several variations using both soft and hard assignment of the observations to latent classes. In the data application in Figure 1, the analysis based on the model-based standardized differences proposed in this paper is successful (Row 2 and Column 3) as the background difference is now randomly distributed around zero and the lesion change is clearly visible; see Section 5 for more details. The distributions of the corresponding standardized scores are evaluated here theoretically, numerically and via simulations. Theoretically, it is shown that the standardized scores are indeed close to standard normal under a variety of extreme parameter settings. In non-extreme parameter settings, it is shown numerically that the soft assignment methods lead to conservative tail probabilities, making them valid for hypothesis testing purposes. It is also shown that the tail probabilities are not very sensitive to the class probabilities, which is an advantage as these are hard to estimate in practice.
The second half of the paper addresses a practical challenge in analyzing PET images, namely, the standardization of Gaussian mixture models in spatially heterogeneous data. Brain images typically require using spatial Gaussian mixtures models, where the class membership probability varies spatially. Prior knowledge of the class membership probability at each voxel is available as templates from brain atlases (for example, SPM (Ashburner and Friston, 2005; Ashburner, 2012)). However, voxels that exhibit large changes in intensity such as lesions, which are precisely the ones we want to detect, are outliers with respect to the background and do not conform to the standard brain tissue templates. A robust estimation algorithm is thus needed to properly estimate the background without being affected by these outliers.
Our proposed spatial GMM fitting procedure extends the standard spatial GMM in two ways. First, we use the probability maps produced from the baseline scan (Scan 1) by the SPM software, but we update the probability maps to include both scans 1 and 2 as bivariate measurements. Our model thus extends the univariate model of Ashburner and Friston (2005) implemented in SPM to multivariate cases, allowing multiple scans for the same subject. The usage of bivariate GMM adjusts the background of multiple scans simultaneously, considering correlations between scans from the same subject.
Furthermore, to estimate the mixture model parameters and probability maps, we propose a robust EM algorithm. The proposed algorithm uses a robust -estimation step for updating the mean and covariance parameters, which is essential for the application since there are outliers such as tumor lesions present in the observation. The proposed robust EM algorithm is more objective and reliable than the estimation approaches in Guo et al. (2014) and Qin et al. (2017). In Guo et al. (2014), the background parameters were estimated from the data in manually selected healthy slices, but this was shown in Qin et al. (2017) to be unstable. Qin et al. (2017) used the entire brain volume but iterated the background estimation with the detection of outliers in an alternate fashion. Instead, our robust M-step is directly incorporated into the EM algorithm, and is based on statistical theory of robust estimation which is shown in our simulations to give remarkably accurate results. It is worth mentioning that another option for robust estimation may be to use a GMM with extra components for the outliers. However, a prior template map would not be available for the outliers because lesions do not occur at predictable spatial locations. Moreover, prior templates are obtained from healthy subjects who have no lesions present at all.
Both univariate and bivariate simulations are used to evaluate the tail probabilities after standardization of the Gaussian mixture when all the model parameters are estimated via the proposed robust EM algorithm. We show that the proposed robust EM algorithm is accurate without outliers but robust when outliers are present, and show that the obtained standardized scores from background adjustment methods with soft assignment have tail probabilities that are either very close to or more conservative than standard normal. In addition, we observe that background adjustment methods based on estimated parameters surprisingly have slightly better performances than the one with true parameters.
For brain images, the proposed approach can be applied directly by practitioners since the calculated scores and resulting -values provide immediate reference for inferences about the change in disease status. We emphasize that although this paper is motivated by PET image analysis in oncology, the concepts of standardization of multivariate Gaussian mixtures and robust EM algorithm are generic, and can be used in other contexts. The Matlab toolbox RB-SGMM-BA is available online in Matlab Central to implement the proposed methods. Supplementary materials contain all proofs and additional simulation studies.
2 Standardization of multivariate Gaussian mixtures
Let be a random
-dimensional vector () generated by a Gaussian mixture model (GMM) with components:
where each and , and
is the probability density function of a-dimensional Gaussian variable with mean and covariance matrix i.e. . Let the allocation variable mark the class from which is generated, and let , where is the indicator function. Then we can represent the model (2.1) using the latent variables ’s as and , or equivalently using the indicators and :
2.1 Standardization methods
. Then the posterior probabilitythat belongs to the th class is given by
according to the Bayes’ theorem. The posterior probabilityis often referred to in the literature as the membership weight or responsibility of class for . The following two methods are commonly used in practice to recover the latent labels ’s:
Hard assignment: assign if ; otherwise, .
Soft assignment: assign .
Based on the membership weights, we wish to adjust the mean and covariance of each observation with the hope that its resulting distribution will be close to a multivariate standard normal. To achieve this, consider the following representation of model (2.1). Let , where is the identity matrix. Conditional on , the observation is multivariate normal with mean and covariance
, which is identically distributed as the random variable
where is the principal square root matrix of a positive definite matrix such that is positive definite and . Since the are 0-1 indicators, we can rewrite (2.4) as If the latent labels
’s were known, this equation would provide an exact linear transformation to a multivariate standard normal distribution. Since the latent labels’s are not known, we use the following plug-in transformation via the estimated labels :
This will be our transformation of choice in Section 2.2 to Section 2.5. In the same spirit, one may also consider to calculate the combination of covariance matrices first and then invert, leading to the transformation
Other similar expressions are also possible. In particular, Guo et al. (2014) proposed to use the marginal covariance of , i.e.,
The three transformations (2.5), (2.6) and (2.7) yield exact multivariate standard normal if the latent labels are replaced by the true values . They are equivalent if hard assignment is used in the estimation of latent labels, but not so if soft assignment is used. For this reason, whenever we need to distinguish between soft and hard assignment in what follows, we use the generic notations and respectively for the three transformations with the soft assignment, and use for the transformation with the hard assignment.
The goal of the rest of this section is to determine parameter scenarios under which the distributions of the standardized scores , and are close to multivariate standard normal. We focus on the transformation because it is the easiest to analyze theoretically. This is sufficient because, as we shall see later in the simulations, the other two transformations and perform similarly.
Throughout this section, we consider the situation for simplicity, although the same ideas apply to a higher number of classes.
2.2 Reparametrization of the standardized scores
When , the standardized score in equation (2.5) can be written as
where and . Let the ratio between likelihoods of the two components be and let . Then the posterior probabilities in (2.3) become
Noting that , i.e. , we have that
For hard assignment, according to equation (2.10), and . We thus can represent as follows: if ; if ; and if i.e.
In contrast, the soft assignment estimates the labels according to the posterior probabilities, i.e. and .
The full model depends on the parameter space of dimension . However, the transformation in (2.5) only depends on the full parameter space through a lower dimensional subset of dimension . This appealing property does not generally hold for the other two transformations and in (2.6) and (2.7). For example, since and are not always simultaneously diagonalizable, the term in the transformation cannot decompose into the simpler form as it is for .
2.3 Approximate normality
While our goal is to compare the tail probabilities between the standardized score and a standard multivariate normal , it can be seen that the former approximates the latter over the entire domain under several extreme parameter scenarios.
Let . The set trivially leads to perfect standardization so that , as the mixture model becomes one single component. This holds for any values of the parameters inside , so identifiability is not important in this case. The following theorem states that approximate normality also holds when the parameters are either close to or far from the set . Let denote the Euclidean norm in the case of a vector or the Frobenius norm in the case of a matrix.
For both hard and soft assignments, if are bounded, then we have the following results:
1) For fixed , assume there is a sequence of parameters such that or , then .
2) For fixed , assume that there is a sequence of parameters such as and is bounded, then .
3) If there is a sequence of parameters such as and , then .
4) If there is a sequence of parameters and a vector such as and where , then .
In Theorem 2.1 (1), we can obtain a.s. convergence for if goes to 0 or 1 fast, for instance, if by the Borel–-Cantelli lemma.
Theorem 2.1 states that the standardized score will be approximately standard multivariate normal (with convergence in probability or almost surely) if one of the following occurs in a limiting sense: (1) one of the mixture components is dominant; (2) the mean vectors of the two components are well separated relative to the covariance matrices; (3) the mean vectors and covariance matrices of the two components are close to each other; (4) the mean vectors and covariance matrices of the two components are close to each other along a particular direction, in which case normality is obtained on the corresponding contrast along that direction.
Note that it is a special case when the two covariance matrices are the same, i.e. . In this case, we have and the results in Theorem 2.1 still hold.
2.4 Explicit distribution in the case of hard assignment
To further understand the behavior of the standardized scores in scenarios other than those considered in Theorem 2.1, it is helpful to have an explicit formula for the distribution of . This is also useful in the numerical evaluations in Section 2.5 below.
Let denote the CDF of a random variable , be the CDF of standard normal and be the distribution function of a -dimensional standard normal , i.e., for a Borel set . Then Theorem 2.5 gives the exact CDF of a contrast on .
Define the two maps and as and . For a given vector such that and any , we have
where , and is the inverse map of .
The set in Theorem 2.5 involves the quadratic form in if . When , we actually can solve this quadratic equation explicitly as follows. Without loss of generality, we consider the case when .
Assume that the dimension = 1, and . Let and . Then the set , where
The situation when
means that the two variances are the same. The definitions ofin (2.16) and (2.17) can actually be obtained by applying L’Hospital’s Rule to the expressions in (2.15) when (limit from above). This suggests a continuous behavior of the set when the variances of different components change from the heterogeneous to homogeneous.
For simplicity of notation, we now drop the argument when referring to . The following Theorem 2.8 gives the explicit formula for the CDF of , which is obtained by combining Theorem 2.5 and Lemma 2.6.
Assume that the dimension = 1, and . We have
for any , where , , and the operator and are the min and max operator, i.e. for any , and
2.5 Numerical evaluation of tail probabilities
From Theorem 2.1 we learn that the scenarios in which the standardized score may be far from standard multivariate normal are those where the mixture components are not close to each other nor far from each other. In this section, we study some of these scenarios numerically. We focus on tail probabilities, which are most important for statistical testing.
Let the vector be the contrast of interest and let be the size of the test. If the decision threshold is set as , according to the standard normal distribution, then we are interested in whether the true size of the test is below or above . The true size of a two-sided hypothesis test is . Let be the ratio between the true size of the test and the size based on the standard normal distribution: means that the size is exact; means that the test is conservative; means that the test is invalid.
In this section we study the relative size for various combinations of model parameters . We may also write to emphasize the dependence of this ratio on the parameters . As a reduction on the set of parameters to be evaluated, the following Lemma 2.9 states that the relative size is symmetric with respect to the sign of ; thus changes in need only be evaluated in one direction.
For any , we have
A theoretical expression for for soft assignment is difficult to obtain. Instead, we use numerical evaluation to investigate the tail probabilities via Monte Carlo simulation. For simplicity, we focus on the bivariate case . Similar to the data analysis, we use as the contrast of interest, measuring the normalized difference between the two coordinates of . Since the distribution of depends only on the parameters , we assume without loss of generality and when generating the data . We use the following parameter settings to investigate the tail probabilities:
In the above setting, the parameter controls the correlation of the bivariate normal in the first class, while and control how the two classes differ from each other.
Figure 3 illustrate schematically the density contours of the two cases when . In Case 1, the vector is orthogonal to the contrast vector , while in Case 2 it is not.
Figures 4 and 5 below show the relative size for in the form of heatmaps when , and and vary continuously. Note that because of Lemma 2.9, it is sufficient to consider . In these figures, purple is ideal, indicating a relative size of about 1. Blue indicates a conservative relative size smaller than 1, while red indicates an invalid relative size greater than 1. Black indicates a relative size greater than 2, which may be considered unacceptable.
To see how these parameter combinations fall into the parameter settings in Theorem 2.1, we compute
As predicted by Theorem 2.1, the relative sizes are all close to 1 for extreme values of close to 0 or 1 in all subplots (satisfying conditions in Theorem 2.1 (1)), large with respect to (satisfying conditions in (2) as is large), and and (satisfying conditions in (4) as and ).
When comparing hard and soft assignment, Figure 4 shows that, in Case 1, hard assignment leads to unacceptable relative sizes regardless of the mixture proportion , especially when is small. Soft assignment, however, corrects this and leads to conservative relative sizes in all the cases shown. A similar pattern is observed in Figure 5 for Case 2, except that soft assignment leads to relative sizes slightly greater than 1 for some parameter combinations. More combination of parameters studied, including evaluation of one-sided tail probabilities, lead to similar observations and thus are not included due to space limitations.
3 Spatial and robust fitting of Gaussian mixture models
To fit the PET data as a Gaussian mixture model requires two particular features. One is to incorporate a spatial component into the model, so that the prior probabilities vary according to a spatial pattern defined by the tissue types. These spatial patterns can be incorporated into the EM algorithm via pre-defined spatial templates. The second feature is to make the M step in the EM algorithm robust to outliers so that estimation of the background is not affected by tumor voxels. The tumor voxels cannot be modeled as a separate component in the mixture such as in because they are relatively few in volume and because they do not correspond to any particular spatial pattern known a-priori.
3.1 Spatial Gaussian mixture models
Most existing approaches about spatial GMM are based on the Markov random field on such as the method of iterated conditional modes in (Besag, 1986), discussions in Chapter 13 of McLachlan and Peel (2000), and many other developments along this line (Sanjay-Gopal and Hebert, 1998a; Thanh Minh Nguyen and Wu, 2012; Vehtari and Ojanen, 2012). However, in our applications template maps of the class membership probability are available from brain atlases (Ashburner, 2012), which makes it more sensible to adopt an approach that utilizes the existing spatial templates.
Let the observations be , where is a -dimensional vector () at the th location and is the total number of voxels. For 2-dimensional or higher dimensional images where the location index has more than one direction, we can vectorize the location to have this single index . We assume that the ’s are independently generated by a GMM in (2.1) but with a spatial mixture probabilities:
where for any and . We refer to model (3.1) as a Spatial Gaussian Mixture Model (SGMM).
The traditional Gaussian Mixture Model (GMM) assumes that for all . In contrast, SGMM considers the spatial information by using different mixture probabilities at each location. In order to incorporate spatial information, we assume that the probability maps are generated by some given prior probability maps :
with the identifiability constraint , which is also used by Ashburner and Friston (2005). The probability maps are often referred to as templates. Spatial templates provide a natural way to incorporate prior knowledge from previous studies about the probability of each location belonging to each class. In the case of the brain, the templates represent reference probability maps for brain tissue types like GM, WM and CSF. Such templates, constructed from segmentations of images of healthy subjects, are available within the software SPM. Other ways to introduce the probability map are possible but may have more parameters to estimate, such as the one based on Markov random field (Sanjay-Gopal and Hebert, 1998b; Chen et al., 2001; Soffientini et al., 2016)
The usage of templates also reduces the dimension of the parameters in the model, which otherwise would be nearly proportional to the number of voxels . Under the constraint (3.2), the parameters of model (2.2) become , with dimension . For and , the dimension is 17. It can be seen again that the traditional GMM is a special case of SGMM by letting for all in (3.2), i.e. the traditional GMM uses constant template maps.
Given all parameters, the standardized scores at each location can be obtained using the methods introduced in Section 2. In Section 3.2, we propose a robust expectation-maximization (EM) algorithm to obtain maximum likelihood estimators (MLE) in model (3.1). The tail probabilities and performances of the robust EM algorithm are investigated through simulation in Section 4.
3.2 Robust EM algorithm
The EM algorithm introduced by Dempster et al. (1977) and its invariants are popular to obtain the maximum likelihood estimators (MLE) in a mixture model; see McLachlan and Krishnan (2008) for a comprehensive treatment. Since the model of SGMM (2.1) uses spatial mixture probabilities via the parameterization (3.2), the conventional EM does not apply directly and modifications are needed to make the EM algorithm work, which results in a generalized EM algorithm. Furthermore, voxels that exhibit large changes in intensity such as lesions, which are precisely the ones we want to detect, are outliers with respect to the background. A robust estimation procedure is thus necessary to properly estimate the background without being affected by these outliers. As robust estimators, -estimators in mixture models have been well developed in the literature (Campbell, 1984; McLachlan and Basford, 1988; Maronna et al., 2006). The basic idea is to reduce the weights of abnormal observations while keeping close to full weights for the others. We shall use -estimates in the M-step of an EM algorithm to achieve robust fitting of SGMM.
Let , then the log-likelihood function is given by
If the latent labels are observed, the joint log-likelihood is
Given an intermediate estimate , we first calculate the conditional expectation of in equation (3.4) denoted as (the E-step), and then maximize this conditional expectation with robustness (the robust M-step). For the robustness step, we use the Mahalanobis distance between observations and the estimated mean vectors and covariance matrices to determine whether an observation is abnormal or not. Given the distance, observations are weighted according to the weight function where is Huber’s -function (Huber, 1964; Maronna, 1976) with a tuning constant depending on the dimension . We next present the detailed generalized EM algorithm.
E-step. Conditional Expectation on :
Update the mean vector:
Update the covariance matrix:
The algorithm is terminated when the relative change of the log-likelihood function in (3.3) with respect to the previous iteration becomes smaller than a given tolerance or a maximum number of iterations is reached. The tuning parameter in the robust M-step depends on the proportion of contaminated data in the observation. We use where is the th quantile of the distribution as used in Devlin et al. (1981). In the provided toolbox RB-SGMM-BA, the default values for the tolerance, maximum number of iterations and are , respectively.
Remark 3.1 (Comparison with the the conventional EM).
The EM algorithm in the conventional GMM without spatial templates uses and in the E-step and does not update . The M-step without robustness may update the mean vector and covariance matrix by
which can be viewed as a special case of the proposed robust EM algorithm where the weight function for all .
Remark 3.2 (Derivation of the robust EM algorithm).
We just need to derive the update of as the E-step follows the conventional EM but has voxel-varying and , and the M-step is an application of -estimation. The update of is obtained by solving