1 Introduction
The development of empirical Bayesian methods for large-scale hypothesis testing addresses important practical challenges of identifying distributional changes while controlling the false discovery rate. Often, these methods are applied relatively late in the data-analysis pipeline, after p-values, test statistics, or other summary statistics are computed for each testing unit. Essentially, the analyst performs univariate testing
en masse, with the final unit-specific scores and discoveries dependent upon the choice of empirical Bayesian methodology. Empirical Bayes tools account for the collective properties of the univariate statistics and thereby gain their advantage (e.g., Storey (2003), Efron (2010), Stephens (2017)). These methods may be underpowered in some applied problems when the underlying effects are relatively weak. Motivated by tasks in neuroscience and brain imaging, we describe an empirical-Bayesian approach that operates earlier in the data-analysis pipeline and that leverages regularities achieved through constraining the dimension of the parameter space. Our approach is restricted to data sets in which the variables constitute nodes of a known, undirected graph, which we use to guide regularization. We report simulation and empirical studies with structural magnetic resonance imaging to demonstrate the striking operating characteristics of the new methodology. We conjecture that power is gained for graph-associated data by moving upstream in the data-reduction process and by recognizing low-complexity parameter states.

The following toy problem illustrates, in a highly simplified setting, the phenomenon we leverage for improved power. Suppose we have two sampling conditions and two variables measured in each condition, say x1 and x2 in the first condition and y1 and y2 in the second. We aim to test the null hypothesis H0 that x1 and y1 have the same expected value; say H0: θ1 = θ2, where θ1 = E(x1) and θ2 = E(y1). Conditional upon the target values θ1 and θ2 and the nuisance mean values ν1 = E(x2) and ν2 = E(y2), the four observations are mutually independent, with normal distributions and some constant, known variance σ². We further imagine that these four variables are part of a larger system, throughout which the distinct expected values themselves fluctuate, say according to a standard normal distribution. Within this structure, a test of H0 may be based upon the local false discovery rate lfdr1 = P(H0 | x1, y1), where we are mixing discretely over null (with probability π0) and non-null cases. Notice in this setting the across-system variation in expected values may be handled analytically and integrated out; thus, in the predictive distribution, each observation is marginally a mean-0 normal with variance 1 + σ², and under H0 the pair (x1, y1) is bivariate normal with these margins and with correlation 1/(1 + σ²) between x1 and y1. In considering data x2 and y2 on the second variable, it may be useful to suppose that the expected values here are no different from their counterparts on the first variable. We say the variables are blocked if both ν1 = θ1 and ν2 = θ2, and we consider this a discrete possibility that occurs with probability ρ throughout the system, independently of the null state. In the absence of blocking there is no information in x2 and y2 that could inform the test of H0 (considering the independence assumptions). In the presence of blocking, however, data on these second variables are highly relevant. Treating blocking as a random variable across the system, we would score H0 using the local false discovery rate lfdr2 = P(H0 | x1, y1, x2, y2), which requires consideration of a 4-variate normal and joint discrete mixing over the blocking and null states for full evaluation. Fig. 1 shows the result of simulating a system of variable pairs, where the marginal null frequency is π0 and the blocking rate ρ varies over three possibilities. Shown is the false discovery rate of the list formed by ranking instances by either lfdr1 or lfdr2. The finding in this toy problem is that power for detecting differences between θ1 and θ2 increases by accounting for the blocking, since the list of discovered non-null cases by lfdr2 is larger for a given false discovery rate than the list constructed using lfdr1. In other words, when the dimension of the parameter space is constrained by blocking, more data become relevant to the test of H0 and power increases.

Making a practical tool from the blocking observation requires that a number of modeling and computational issues be resolved. Others have recognized the potential, and have designed computationally intensive Bayesian approaches based on Markov chain Monte Carlo (
Do and others (2005), Dahl and Newton (2007), Dahl and others (2008), Kim and others (2009)). We seek simpler methodology that may be more readily adapted in various applications. In many contexts data may be organized at nodes of an undirected graph, which provides a basis for generalizing the concept of blocking using special graph-respecting partitions. Having replicate observations per group is a basic aspect of the data structure, but we must also account for statistical dependence among variables for effective methodology. In the proposed mixture formulation we avoid the product-partition assumption, which entails independencies that greatly simplify computations but at the expense of model validity and robustness; we gain numerical efficiency and avoid Markov chain Monte Carlo through a graph-localization of the mixture computations. The resulting tool we call GraphMM, for graph-based mixture model. It is deployed as a freely available open-source R package at https://github.com/tienv/GraphMM/. We investigate its properties using a variety of synthetic-data scenarios, and we also apply it to identify statistically significant changes in brain structure associated with the onset of mild cognitive impairment.
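As a numerical companion to the toy problem of this section, the two local false discovery rates can be computed in closed form once the system-level means are integrated out. The following is a minimal Python sketch (the GraphMM package itself is in R); the parameter values σ² = 1 and the function names lfdr1/lfdr2 are illustrative assumptions, not part of the package:

```python
import numpy as np

SIGMA2 = 1.0  # known observation variance (assumed value for illustration)

def mvn_pdf(d, C):
    """Density of a mean-zero multivariate normal N(0, C) at point d."""
    d = np.asarray(d, dtype=float)
    k = d.size
    q = d @ np.linalg.inv(C) @ d
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** k * np.linalg.det(C))

def lfdr1(x1, y1, pi0):
    """P(H0 | x1, y1): system-level means integrated out, so each observation
    is marginally N(0, 1 + SIGMA2); under H0 the pair shares one mean (cov 1)."""
    v = 1.0 + SIGMA2
    f0 = mvn_pdf([x1, y1], np.array([[v, 1.0], [1.0, v]]))  # null: theta1 = theta2
    f1 = mvn_pdf([x1, y1], np.array([[v, 0.0], [0.0, v]]))  # non-null: independent means
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)

def lfdr2(x1, y1, x2, y2, pi0, rho):
    """P(H0 | all four observations), mixing over the null and blocking states."""
    v = 1.0 + SIGMA2
    d = [x1, y1, x2, y2]
    def dens(null, block):
        C = np.eye(4) * v
        if null:                      # theta2 = theta1: x1 and y1 share a mean
            C[0, 1] = C[1, 0] = 1.0
        if block:                     # nu1 = theta1 and nu2 = theta2
            C[0, 2] = C[2, 0] = 1.0
            C[1, 3] = C[3, 1] = 1.0
            if null:                  # then all four observations share one mean
                C[0, 3] = C[3, 0] = C[1, 2] = C[2, 1] = C[2, 3] = C[3, 2] = 1.0
        return mvn_pdf(d, C)
    num = pi0 * (rho * dens(True, True) + (1 - rho) * dens(True, False))
    den = num + (1 - pi0) * (rho * dens(False, True) + (1 - rho) * dens(False, False))
    return num / den
```

With ρ = 0 the second pair carries no information and lfdr2 reduces exactly to lfdr1; with ρ large and a coherent shift across both pairs, lfdr2 falls well below lfdr1, which is the power gain illustrated in Fig. 1.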
2 Methods
2.1 Data structure and inference problem
Let G = (V, E) denote a simple, connected, undirected graph with vertex set V and edge set E, and consider partitions of V, such as P = {B1, …, BK}; that is, the blocks Bk (also called clusters) constitute nonempty disjoint subsets of V whose union is V. In the application in Section 3.2, vertices correspond to voxels at which brain-image data are measured, edges connect spatially neighboring voxels, and the partition conveys a dimension-reducing constraint. The framework is quite general and includes, for example, interesting problems from genomics and molecular biology. Recall that for any subset B ⊆ V, the induced subgraph is G[B] = (B, E[B]), where E[B] contains all edges (u, v) ∈ E for which u ∈ B and v ∈ B. For use in constraining a parameter space, we introduce the following property: we say that P respects G, or that P is graph-respecting, if for every block Bk ∈ P the induced graph G[Bk] is connected. Fig. 2 presents a simple illustration.
It happens that any graph-respecting partition may be encoded with a vector of binary edge variables, one variable per edge of G. Vertices joined by an edge and lying in the same block have edge variable equal to 1, and those in different blocks have edge variable equal to 0. For general graphs not every binary edge vector corresponds to a graph-respecting partition; however, if G is a tree then the graph-respecting partitions are in one-to-one correspondence with the binary vectors of length |E|. In case the graph is complete, the set of possible graph-respecting partitions is the same as the set of all partitions of V; i.e., the graph provides no reduction in the complexity of the set of partitions. It becomes relevant to statistical modeling that the set of graph-respecting partitions, though large, is substantially smaller than the set of all partitions when the graph itself is less complex. For example, there are 21147 partitions of 9 objects (the 9th Bell number), but if these 9 objects are arranged as vertices of a regular 3 × 3 lattice graph, then there are only 1434 graph-respecting partitions. In certain modeling settings, such as with Dirichlet-process mixture models, latent partitions allow for modeling parameter heterogeneity. We use the graph-respecting property to regularize the otherwise unwieldy set of such partitions.

In our setting, the graph G serves as a known object that provides structure to a data set being analyzed for the purpose of a two-group comparison. This is in contrast, for example, to graphical-modeling settings where the possibly unknown graph holds the dependency patterns of the joint distribution. We write the two-group data as x1, …, xn1 and y1, …, yn2, where each xi = (xi,v : v ∈ V) and yj = (yj,v : v ∈ V) record measurements across the graph. Here n1 and n2 denote the numbers of replicate samples in the two groups. In Section 3.2, for example, i indexes the brain of a normal control subject and j indexes the brain of a subject with mild cognitive impairment. We treat these across-graph samples as identically distributed within group and mutually independent over i and j, owing to the two-group, unpaired experimental design.

Our methodology tests for changes between the two groups in the expected-value vectors μ1 = E(xi) and μ2 = E(yj). Specifically, we aim to test, for any vertex v,
(2.1)  H0,v : μ1(v) = μ2(v)  versus  HA,v : μ1(v) ≠ μ2(v).
We seek to gain statistical power over contemporary testing procedures by imposing a dimension constraint on the expected values. Although it is not required to be known or even estimated, we suppose there exists a graph-respecting partition P that constrains the expected values:

(2.2)  μ1(v) = θ1,k and μ2(v) = θ2,k for all v ∈ Bk, k = 1, …, K.

All vertices in block Bk have a common mean in the first group, say θ1,k, and a common mean θ2,k in the second group. The contrast on test, then, is θ2,k − θ1,k; together with P, the binary vector z holding the indicators zk = 1{θ1,k ≠ θ2,k} is equivalent to knowing whether or not H0,v in (2.1) is true for each vertex v. When data are consistent with a partition in which the number of blocks K is small compared to the number of vertices, then it may be possible to leverage this reduced parameter-space complexity for the benefit of hypothesis-testing power.
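The partition counts quoted above (21147 partitions of 9 objects, of which 1434 are graph-respecting on the 3 × 3 lattice) can be verified by brute-force enumeration; the following is a minimal Python sketch (not part of the GraphMM package), enumerating all set partitions and filtering by block connectivity:

```python
def set_partitions(items):
    """Recursively generate all set partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # place `first` into each existing block in turn
        for i, block in enumerate(smaller):
            yield smaller[:i] + [block + [first]] + smaller[i + 1:]
        # or give `first` its own new block
        yield [[first]] + smaller

def is_connected(block, edges):
    """Check that the subgraph induced by `block` is connected (simple BFS)."""
    block = set(block)
    start = min(block)
    seen, frontier = {start}, [start]
    while frontier:
        u = frontier.pop()
        for a, b in edges:
            if a == u and b in block and b not in seen:
                seen.add(b); frontier.append(b)
            if b == u and a in block and a not in seen:
                seen.add(a); frontier.append(a)
    return seen == block

# 3 x 3 lattice: vertices 0..8, edges between horizontal/vertical neighbors
verts = list(range(9))
edges = [(v, v + 1) for v in verts if v % 3 != 2] + \
        [(v, v + 3) for v in verts if v < 6]

all_parts = list(set_partitions(verts))
gr_parts = [p for p in all_parts if all(is_connected(b, edges) for b in p)]
# Bell(9) = 21147; the text reports 1434 graph-respecting partitions
print(len(all_parts), len(gr_parts))
```

The same filter applied to a complete graph on 9 vertices would retain all 21147 partitions, consistent with the remark that a complete graph provides no reduction.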
2.2 Graphbased Mixture Model
2.2.1 Discrete mixing.
We adopt an empirical Bayes, mixture-based testing approach, which requires that for each vertex v we compute a local false discovery rate:

(2.3)  lfdr(v) = P(H0,v | data) = Σ(P,z) 1{ the block of P containing v is unchanged under z } P(P, z | data).

Our list of discovered (non-null) vertices is Dt = {v : lfdr(v) ≤ t} for some threshold t. Conditional on the data, the expected rate of type-I errors within Dt is dominated by the threshold t (Efron (2007), Newton and others (2004)). The sum in (2.3) is over the finite set of pairs (P, z) of partitions and block-change indicator vectors. This set is intractably large for even moderate-sized graphs. We have experimented with Markov chain Monte Carlo for general graphs, but present here exact computations in the context of very small graphs. Specifically, for each vertex in the original graph we consider a small local subgraph in which that vertex is one of the central vertices, and we simply deploy GraphMM on this local graph. Further details are in Supplementary Material.
By summing in (2.3) we perform marginal posterior inference, and thus have a mechanism for borrowing strength among vertices. Of course, by Bayes’s rule, P(P, z | data) is proportional to the product of the prior mass P(P, z) and the prior predictive density of the data given (P, z), and both need to be specified to compute inference summaries. Various modeling approaches present themselves. For example, we could reduce the data per vertex to a test statistic (e.g., a t-statistic) and model the predictive density nonparametrically, as in the R package locfdr (see Efron (2010)). Alternatively, we could reduce the data per vertex less severely, retaining effect estimates and estimated standard errors, as in adaptive shrinkage (Stephens (2017)). In either case we would need to retain information about statistical dependencies between vertices; numerical experiments show that badly misspecifying this dependence leads to an inflated false discovery rate. The approach reported here takes an explicit parametric-model formulation for the predictive distribution of data given the discrete state (P, z). This restricts the sampling model to be Gaussian, but allows general covariance among vertices and is not reliant on the product-partition assumption commonly used in partition-based models (Barry and Hartigan (1992)).

In the present work we use a simple specification for the prior P(P, z), in which z encodes independent and identically distributed block-specific Bernoulli indicators of a block shift. In numerical experiments we use univariate empirical-Bayes techniques to estimate the prior hyperparameters (Supplementary Material).
2.2.2 Predictive density given discrete structure.
We take a multivariate Gaussian sampling model: given mean vectors μ1 and μ2 and covariance matrices Σ1 and Σ2, the group-1 samples are i.i.d. N(μ1, Σ1) and the group-2 samples are i.i.d. N(μ2, Σ2) over the vertices of the graph. We do not constrain the covariance matrices Σ1 and Σ2, though we place conjugate inverse Wishart prior distributions on them. Our predictive densities are conditional on the blocking and change patterns in the discrete structure; in general there is no simple conjugate reduction, owing to the less-than-full dimension of free parameters in μ1 and μ2. On these free block-level mean parameters we further specify independent Gaussian prior forms. Hyperparameters in GraphMM include scalar prior means and variances, the inverse Wishart degrees of freedom df, and scale matrices, which we estimate from data across the whole graph following the empirical Bayes approach (for details see Supplementary Material).
Model (2.2.2) specifies the joint density of the data and all parameters. For the purpose of hypothesis testing, we need to marginalize most variables, since H0,v is equivalent to θ1,k = θ2,k for the block Bk of the partition P that contains v, and local false discovery rates require marginal posterior probabilities. Integrating out the inverse Wishart distributions over the covariance matrices is possible analytically. We find:
(2.4) 
where the constituent terms are sample summaries from the two groups. In the above, |·| denotes the matrix determinant, x̄ and ȳ are the group sample means, and Sx and Sy are the corresponding sample covariance matrices. We use a Laplace approximation to numerically integrate over the freely varying means in order to obtain the marginal predictive density; the explicit formula is presented in Supplementary Material.
Notably, by not constraining the covariance matrices Σ1 and Σ2, the GraphMM model does not adopt a product-partition form. Under that form, the predictive density would factor over blocks in the graph-respecting partition, which would lead to simpler computations. We found in preliminary numerical experiments that various data sets are not consistent with this simplified dependence pattern, and we therefore propose the general form here. In working on relatively small local graphs, the computations remain straightforward in this case.
2.3 Datadriven simulations
Our primary evaluation of GraphMM
is through an extensive set of simulations.
As we have been motivated by a brain-science problem (Section 3.2), we design these simulations to have empirical characteristics matching those of our primary data set (Section 3.2).
This data set comes from the Alzheimer’s Disease Neuroimaging Initiative 2
(ADNI2), and provides a template for the simulation.
Briefly, we consider brain MRI data from a group of normal control subjects (group 1)
and a second group of subjects suffering from latestage mild cognitive impairment
(late MCI), a precursor to Alzheimer’s disease.
The simulationguiding data involve a single coronal slice with voxels.
Fig. 3 illustrates the general framework for all simulation scenarios; further
details are in Algorithm 2, Supplementary Material.
We use greedy clustering (Collins, 2014) on the empirical mean profiles to generate blocks for
the synthetic expected values, and we adjust the empirical covariances by adding diagonal weight to assure
invertibility. Three synthetic data sets are simulated in each scenario.
The first three scenarios address the issue of block size; the next two investigate the role of
the distribution of condition effects. To assess robustness,
we also consider parameter settings where partitions are not
graph-respecting, and condition effects are not uniform over blocks. We also deploy two permutation experiments; the first uses sample-label permutation to confirm control of the false discovery rate, and the second uses voxel permutation to confirm that sensitivity drops when we disrupt the spatially coordinated signal.
When applying GraphMM to each synthetic data set, we estimate hyperparameters for all distributional components and consider as discoveries the vertices with lfdr(v) ≤ t for various thresholds t. We call the mean of lfdr(v) over the discovery list the controlled FDR, as this is the conditional expected rate of type-I errors on the list, given data (and computable from data). We know the null status in each synthetic case, and so we also compute the empirical FDR as the corresponding rate counting the latent null indicators; likewise, the true positive rate counts the non-null indicators. We compare GraphMM to several contemporary testing methods, including the Benjamini-Hochberg correction (BH adj; Benjamini and Hochberg (1995)), the local FDR (locfdr; Efron (2007)), and the q-value (qvalue; Storey (2003)), all applied to voxel-specific t-tests. We also compare results to adaptive shrinkage (Stephens (2017)), both the local FDR statistic (ash_lfdr) and the q-value (ash_qval). These methods all work on summaries of voxel-specific tests; summaries may be p-values (for BH and qvalue), t-statistics (for locfdr), or effect estimates and estimated standard errors (for ASH). In any case, none of the methods leverages the graphical nature of the data in the way done by GraphMM.
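The three summaries just described (controlled FDR, empirical FDR, true positive rate) are simple functions of the lfdr values and, in simulation, the latent null indicators. A minimal sketch follows; the function name and inputs are illustrative, not from the GraphMM package:

```python
import numpy as np

def fdr_summaries(lfdr, is_null, t):
    """Summaries of a discovery list formed by thresholding local FDRs.

    lfdr:    array of per-vertex local false discovery rates
    is_null: boolean array of latent null indicators (known only in simulation)
    t:       lfdr threshold defining the discovery list D_t = {v : lfdr(v) <= t}
    """
    disc = lfdr <= t
    n_disc = disc.sum()
    if n_disc == 0:
        return {"controlled_FDR": 0.0, "empirical_FDR": 0.0, "TPR": 0.0}
    controlled = lfdr[disc].mean()                  # computable from data alone
    empirical = (disc & is_null).sum() / n_disc     # uses the latent truth
    tpr = (disc & ~is_null).sum() / max((~is_null).sum(), 1)
    return {"controlled_FDR": controlled, "empirical_FDR": empirical, "TPR": tpr}
```

For example, with lfdr values (0.01, 0.2, 0.04, 0.9), null indicators (False, True, False, True), and t = 0.05, the list contains the two non-null vertices: the controlled FDR is 0.025, the empirical FDR is 0, and the true positive rate is 1.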
3 Results
3.1 Operating characteristics of GraphMM
We first perform a sanity check to confirm that statistical efficiency gains may arise by regularizing the expected values through the graph-respecting assumption. In a predictive simulation on the 3 × 3 lattice, we generate synthetic Gaussian data as follows: expected values are guided by a generative graph-respecting partition (drawn from a prior); block-specific means are realized as i.i.d. Gaussian variables; and the 9 data elements deviate from these means by realizations of i.i.d. Gaussian variables. Each simulated data set is one instance of data from a graph-respecting generative setting. We take each such simulated data set and work out what two different analysts would surmise about the generative partition. Analyst A knows that the expected values follow some partition structure. Analyst B knows, in addition, that the expected values follow a graph-respecting partition. Each analyst computes a posterior distribution over the set of partitions; each posterior distribution is concentrated at some level around the generative partition. A simple measure of the concentration is through the induced distribution of the Adjusted Rand Index, which measures the similarity between two partitions. For any level of similarity, each analyst has a posterior probability that a sampled partition is at least that similar to the generative one. Figure 4 compares the analysts by the average of these posterior similarity distributions over simulated data sets. It reveals that by enforcing regularity on the prior distribution over partitions (i.e., by enforcing the graph-respecting property), we tend to place greater posterior probability mass near the generative partition. In applications where the graph conveys the structure of expected values, the graph-respecting assumption may usefully regularize the local FDR computations to benefit sensitivity.
Next we address operating characteristics of the GraphMM
methodology itself, aiming
to find FDR-controlled lists of non-null vertices. Synthetic data sets mimic the structure
and empirical characteristics of the brain MRI study data described in Section 3.2.
The first three syntheticdata scenarios consider a single MRI brain slice
measured on replicates from two conditions, with characteristics approximately matching
the characteristics of observed data (Table 2). These scenarios vary the
underlying size distribution of blocks, but follow the GraphMM
model in
having graphrespecting partitions of the underlying signal, blocklevel shifts between
conditions, and multivariate Gaussian errors.
The left panel of Figure 5 shows that all methods on test are able to control
the false discovery rate. All methods display sensitivity for the signals, though GraphMM
demonstrates superior power in the first two cases where blocks extend beyond the individual
voxel. The high sensitivity in Scenario 2 may reflect that the prior distribution of block sizes
used in the local GraphMM
more closely matches the generative situation. Notably, even
when this blocksize distribution is not aligned with the GraphMM
prior, we do not see
an inflation of the false discovery rate.
Scenarios 4 and 5 are similar to the first three; however, they explore different forms of signal between the two groups. Both have an average block size of 4 voxels, but in one case the changed block effects are fewer and relatively strong, and in the other they are more frequent and relatively weak (Table 3).
In both regimes, GraphMM
retains its control of FDR and exhibits good sensitivity compared to other methods (Fig. 6).
GraphMM
is designed for the case where partition blocks are graph respecting and the
changes between conditions affect entire blocks. Our next numerical experiment checks the
robustness of GraphMM
when this partition/change structure is violated.
Fig. 7 shows that GraphMM
continues to control FDR and also
retains a sensitivity advantage even when its underlying model is not fully correct.
To further assess the properties of GraphMM
, we performed several permutation experiments.
Both started with the data set from Section 3.2. In the first, we simply permuted the sample
labels of the 148 control subjects and 123 late MCI subjects, repeating for ten permuted sets.
On each permuted set we applied various methods to detect differences. All discoveries are false
discoveries in this null case. The left panel of Figure 8 shows that GraphMM
and other methods are correctly recognizing the apparent signals as being consistent with the null hypothesis.
The second permutation experiment retains the samplegrouping information, but permutes
the voxels within the brain slice on test. This permutation disrupts both spatial
measurement dependencies and any spatial structure in the signal. Since GraphMM
is
leveraging spatiallycoherent patterns in signal, we expect it to produce fewer statistically
significant findings in this voxelpermutation case. The right panel of Fig. 8
shows this dampening of signal, as expected, in the empirical cdf of the computed lfdr values.
3.2 Analysis of Magnetic Resonance Images from the ADNI dataset
The data set used in this section comes from the Alzheimer’s Disease Neuroimaging Initiative-II (ADNI2). (The ADNI project was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations. The overarching goal of the ADNI study is to detect Alzheimer’s Disease (AD) at the earliest possible stage, to identify ways to track disease progression with biomarkers, and to support advances in AD intervention, prevention, and treatment. ADNI is the result of the efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada.) Our goal here was twofold:

evaluate the sensitivity of our proposal in identifying grouplevel differences between participants corresponding to different disease stages in a real scientific analysis task, and

assess, via scientific interpretation of the results, the extent to which the findings from our analysis are corroborated by known results in the literature on aging and Alzheimer’s disease (AD).
The latter subgoal is feasible because one could utilize standard preprocessing steps, described shortly, on our cohort of brain images and identify the brain regions pertinent to the clinical condition. Since the spatial location of these gray matter anatomical regions can be identified, we can evaluate whether the results are meaningful from the scientific perspective.
Dataset and preprocessing steps used. For the experimental analysis presented in this section, we use gray matter tissue probability maps derived from the co-registered T1-weighted magnetic resonance imaging (MRI) data downloaded from ADNI, processed using the steps provided in the commonly used voxel-based morphometry (VBM) toolbox in the Statistical Parametric Mapping software (SPM, http://www.fil.ion.ucl.ac.uk/spm). Note that prior to registration to a common template or atlas (which provides a common coordinate system for conducting the analysis), standard artifact removal and other corrections were performed, consistent with existing literature (Ithapu and others (2015)). Our dataset included brain images of 148 healthy control subjects (cognitively normal, abbreviated CN) and 123 subjects with late mild cognitive impairment (LMCI). We also prefilter voxels with very low marginal standard deviation (Bourgon and others, 2010), retaining the remaining voxels for analysis. We then apply a rank-based inverse normal transformation in order to make the data approximately normal.

Analysis task and baselines. Our goal is to detect or identify regions (i.e., voxels) where the distribution of gray matter intensities is significantly different across the clinical conditions, i.e., healthy controls (CN) and late-stage mild cognitive impairment (LMCI).
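A rank-based inverse normal transformation of the kind mentioned above can be written in a few lines. The sketch below uses the common Blom-type offset c = 3/8, which is an assumption on our part, as the text does not specify the offset convention; ties are broken by position for simplicity:

```python
from statistics import NormalDist

def rank_inverse_normal(x, c=3 / 8):
    """Map data to approximate normality via ranks:
    x_i -> Phi^{-1}((r_i - c) / (n - 2c + 1)), where r_i is the rank of x_i.
    The Blom offset c = 3/8 is an assumed convention."""
    n = len(x)
    # ranks 1..n in order of value (ties broken by position, for simplicity)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]
```

The transformation is monotone, so it preserves the ordering of the voxel intensities while forcing the transformed values onto symmetric normal quantiles.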
To keep the computational burden manageable, instead of processing the entire
3D image volume at once, we applied GraphMM
(as well as a number of baseline methods)
to 2D image slices in the coronal direction.
Our baseline methods include the Statistical nonParametric Mapping toolbox for Matlab (SnPM), a popular image-analysis method used by various groups, and the q-value with adaptive shrinkage using the R package ashr, which represents an advanced voxel-specific empirical-Bayes method.
Summary of main experimental findings. Fig. 9 shows a representative example output of
our analysis for a montage of 4 coronal slices extracted from the 3D image volume.
The color bar (red to yellow), for each method presented, is a surrogate for the strength of a score describing the group-level difference: for SnPM, the color is scaled based on adjusted p-values; for the q-value method, it is scaled based on q-values; whereas for GraphMM, the color is scaled based on local false discovery rates.
It is clear that while the regions reported as significantly different between CN and LMCI have some overlap
between the different methods, GraphMM
is able to identify many more significantly different voxels compared to baseline methods,
at various FDR thresholds (Fig. 11).
A closer inspection of one case is informative (Fig.10).
The voxel highlighted there is not found to be different between CN and LMCI according to SnPM (adjusted p-value = 0.578) or the q-value method (q-value = 0.138). But in the results provided by GraphMM, the local FDR at this voxel is small enough for it to be declared a discovery. It is clear that part of the reason behind the increased sensitivity is that GraphMM leverages the consistent pattern of shifts among neighboring voxels.
Clusters. Another statistical measure often reported in the neuroimaging literature is the size of significant clusters, where clusters refer to spatially connected sets of voxels that are significantly altered by the sampling condition (e.g., clinical condition). The rationale here is that stray voxels reported as significantly different are more likely to be artifacts, relative to a group of anatomically clustered voxels. In the literature, results are often reported for clusters with size greater than or equal to 20.
In the setting of graph-associated data, a significant cluster forms a connected component of the 3-dimensional lattice graph associated with the data. Fig. 12 shows a bar plot of the sizes of significant clusters. Here, we see that GraphMM performs favorably relative to the baseline methods and consistently reports larger clusters.
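Given a binary map of significant voxels, the cluster sizes referred to here are just connected-component sizes on the lattice. A minimal 2D sketch follows (4-neighbor connectivity and the function name are illustrative assumptions; the analysis above uses the 3D lattice):

```python
def cluster_sizes(sig, min_size=20):
    """Sizes of connected components of True cells in a 2D boolean grid,
    using 4-neighbor (lattice) connectivity; keep clusters with size >= min_size."""
    rows, cols = len(sig), len(sig[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if sig[r][c] and not seen[r][c]:
                size, stack = 0, [(r, c)]
                seen[r][c] = True
                while stack:                      # depth-first flood fill
                    i, j = stack.pop()
                    size += 1
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and sig[ni][nj] and not seen[ni][nj]):
                            seen[ni][nj] = True
                            stack.append((ni, nj))
                sizes.append(size)
    return sorted(s for s in sizes if s >= min_size)
```

The 3D version differs only in adding the two extra lattice directions; the size-20 filter mirrors the reporting convention mentioned above.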
Scientific interpretation of the results.
To better interpret the statistical results, we use the Matlab package xjview to link anatomical information to the voxels reported as significantly different by the various methods. The xjview package maps the significant clusters from the analysis to the brain template or atlas (i.e., common coordinate system) using automated anatomical labeling information (Tzourio-Mazoyer and others, 2002). This is helpful because one can then easily assign anatomically meaningful names to the voxel clusters that are significantly different.
The results are summarized in Table 1, where we report the brain regions associated with significant findings from GraphMM as well as SnPM. Specifically, Column 3 of the table lists, for both methods, the number of significant voxels in the corresponding regions. This provides additional evidence to assess whether our analysis indeed revealed regions known to be associated with AD; we can answer this question by inspecting the neurological functions of the identified brain regions shown in Column 4. Here, we show only the top 15 brain regions containing the most significant voxels.
From Table 1, we see that GraphMM
discovers all the brain regions found by SnPM, with many more significant voxels in
each region and more pronounced evidence of statistical significance. The only exception is the hippocampus,
where both methods identify a large number of voxels but GraphMM
finds fewer significant voxels than SnPM.
In addition, there are regions revealed to be significant by GraphMM
but not by SnPM, including the precentral gyrus,
middle frontal gyrus, inferior frontal gyrus opercular, insular, anterior cingulate, and supramarginal gyrus, which are
relevant in the aging and AD literature.
GraphMM
consolidates known alterations between CN and LMCI and reveals potentially important new findings.
No.  Brain region  # Voxels (GraphMM / SnPM)  Neurological function 

1  Hippocampus  1411 1646  Receives and consolidates new memory about experienced events, allowing for establishment of longterm memories. [Cohen and Eichenbaum (1993)] 
2  Parahippocampal Gyrus  1008 410  Involved in episodic memory and visuospatial processing [Aminoff and others (2013)]. 
3  Amygdala  710 365  Plays an essential role in the processing and memorizing of emotional reactions [Knafo (2012)] 
4  Temporal Gyrus (superior, middle and inferior)  2031 287  Involved in various cognitive processes, including language and semantic memory processing (middle) as well as visual perception (inferior) and sound processing (superior) [Onitsuka and others (2004), Bigler and others (2007)] 
5  Putamen  793 12  Linked to various types of motor behaviors, including motor planning, learning, and execution.[Marchand and others (2008)]. 
6  Fusiform Gyrus  735 308  Influence various neurological phenomena including face perception, object recognition, and reading [Weiner and Zilles (6 03)]. 
7  Temporal Pole (superior, middle)  882 74  Involved with multimodal analysis, especially in social and emotional processing. [Plotzker and others (2007)]. 
8  Precentral Gyrus  829 0  Consists of primary motor area, controlling body’s movements. [Graziano and others (2002)]. 
9  Middle Frontal Gyrus  635 0  Plays essential role in attentional reorienting. [Japee and others (2015)]. 
10  Inferior Frontal Gyrus Opercular  573 0  Linked to language processing and speech production. [Greenlee and others (2007)]. 
11  Calcarine  381 22  Where the primary visual cortex is concentrated, processes visual information. [Meadows (2011)]. 
12  Caudate  437 46  Plays essential roles in motor processes and a variety of executive, goaldirected behaviours [Grahn and others (2008)]. 
13  Insular  275 0  Involved in consciousness, emotion and the regulation of the body’s homeostasis [Gogolla (2017)]. 
14  Anterior Cingulate  260 0  Plays a major role in mediating cognitive influences on emotion. [Stevens and others (2011)]. 
15  Supramarginal Gyrus  225 0  Linked to phonological processing and emotional responses. [Hartwigsen and others (2010)]. 
4 Discussion
Mass univariate testing remains the dominant approach to detecting statistically significant changes in comparative brain-imaging studies (e.g., Groppe and others (2011)). Here, a classical testing procedure, like the test of a contrast in a regression model, is applied in parallel over all testing units (voxels), leading to a large number of univariate test statistics and p-values. Subsequently, significant voxels are identified through some filter, such as the Benjamini-Hochberg (BH) procedure, that aims to control the false discovery rate. The approach is often very effective and has supported numerous important applied studies of brain function. In structural magnetic resonance imaging studies of Alzheimer’s disease progression, however, such mass univariate testing has failed in some cases to reveal subtle structural changes between phenotypically distinct patient populations. The underlying problem is limited statistical power for relatively small effects, even with possibly hundreds of subjects per group. Empirical Bayes procedures improve power somewhat over BH by recognizing properties of the collection of tests. In case the data are associated with a graph and are consistent with a relatively low dimension of parameter states, we have shown one way to further enhance the empirical Bayes procedures.
Empirical Bayesian procedures trace their beneficial operating characteristics to “information sharing”, or, equivalently, “borrowing strength”. In isolation, the parameters governing data at a given testing unit are inferred locally from data at that unit, and having limited data at each unit limits any inferences we can make there. But by treating the unit-level parameters – considered over many units – as draws from some common, system-level population of parameters, Bayesian theory provides a formal approach to linking information between units (e.g., Bernardo and Smith (1994)). Relatively elaborate information sharing is provided by highly flexible Bayesian models. For example, the Dirichlet process mixture (DPM) model engenders a clustering of the inference units, with units in the same block if (and only if) they share the same parameter values. The DPM model has been effective at representing heterogeneity in a system of parameters (e.g., Muller and Quintana (2004)) and at improving sensitivity in large-scale testing (e.g., Dahl and Newton (2007), Dahl and others (2008)). These benefits come at a high computational cost, since in principle the posterior summaries require averaging over all partitions of the units (e.g., Blei and Jordan (2006)). There are also modeling costs: DPMs usually have a product-partition form in which the likelihood function factors as a product over blocks of the partition (Hartigan (1990)). In applications such as brain imaging, we observe that independence between blocks is violated in a way that may inflate the actual false discovery rate above a target value.
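The cost of averaging over all partitions grows explosively: the number of partitions of n units is the Bell number B_n, which can be computed with the Bell triangle. A small illustration of the growth (an aside, not part of the methodology):

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] via the Bell triangle."""
    row = [1]
    bells = [1]  # B_0
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

print(bell_numbers(15))  # B_15 already exceeds 10^9
```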
The application of primary focus in our research is structural magnetic resonance imaging (sMRI) data arising in studies of brain structure and function. sMRI provides information describing the shape, size, and integrity of brain structures. Briefly, the contrast and brightness of structural magnetic resonance images are determined by the characteristics of brain tissues. Depending on the type of sMRI sequence (e.g., T1-weighted, T2-weighted, proton density-weighted), different aspects of brain tissues are emphasized. We are particularly interested in T1-weighted sMRI, which provides good contrast between gray matter tissue (darker gray) and white matter tissue (lighter gray), while the cerebrospinal fluid is void of signal (black). Because brain function depends to some extent on the integrity of brain structure, sMRI has become an integral part of the clinical assessment of patients with suspected Alzheimer's disease (Vemuri and Jack (2010)). By studying structural magnetic resonance images, we are able to make inferences about gray matter atrophy in the brain, which has been shown to be one of the classic symptoms of Alzheimer's disease (AD) in the neuroscience literature (e.g., Frisoni and others (2002), Scheltens and others (2002), Moller and others (2013)). Essentially, conclusions regarding the atrophy of gray matter may come from detecting changes in gray matter volume across different stages of Alzheimer's degenerative process. Furthermore, identifying subtle changes at early stages could help with early detection of gray matter atrophy, which ultimately facilitates AD diagnosis, interventions, and treatments.
The crux of methodological research on large-scale testing in neuroimaging has been how to find thresholds on voxel-wise test statistics that control a specified false positive rate while maintaining testing power. The two approaches most popular among neuroimaging researchers are familywise error control using random field theory (e.g., Worsley and others (2004)) and false discovery rate control using the Benjamini-Hochberg procedure (e.g., Genovese and others (2002)). The former is based on additional assumptions about the spatial smoothness of the MRI signal. There have been criticisms, e.g., in Eklund and others (2016), that these smoothness assumptions are usually not satisfied by the data, which can lead to alarmingly high false-positive rates. The voxel-wise test statistic can be either parametric (implemented in the SPM Matlab package; Penny and others (2007)) or nonparametric (implemented in the Statistical Nonparametric Mapping (SnPM) Matlab toolbox; Nichols (2001)). A review of available methods for large-scale testing in neuroimaging can be found in Nichols (2012) and the references therein. Tansey and others (2018) presented an FDR tool that processes unit-specific test statistics so as to spatially smooth the estimated prior proportions. As the clinical questions of interest move towards identifying early signs of AD, the changes in average brain profiles between conditions invariably become more subtle and increasingly hard to detect; the result is that very few voxels or brain regions may be detected as significantly different by standard methods. The methodology we study in this thesis aims to increase the sensitivity of large-scale tests for neuroimaging and related data.
Graphs provide powerful tools for representing diverse patterns of interaction between entities of a system and have been widely used for modeling data arising in various fields of science (e.g., Gross and Yellen (2004)). In the present work, vertices of the graph correspond to variables in a data set, and the undirected edges convey relational information about the connected variables arising from the context of the data set, such as temporal, functional, spatial, or anatomical information. The graphs we consider constitute an auxiliary part of the observed data. To be clear, these graphs may or may not have anything to do with undirected graphical representations of the dependence in a joint distribution (e.g., Lauritzen (1996)), as in the graphical models literature. For us, the graph serves to constrain patterns in the expected values of measurements. By limiting changes in expected values over the graph, we aim to capture low complexity of the system. An alternative way to model low complexity is through “smoothed/band-limited” signals (e.g., Ortega and others (2018), Chen and others (2016)). Comparisons between the approaches are warranted.
We have advanced the idea of latent graph-respecting partitions that constrain expected values to a low-dimensional space. The partition is paired with a vector of block-specific change indicators to convey the discrete part of the modeling specification. We used a uniform distribution over graph-respecting partitions in our numerical experiments, and have also considered more generally the distribution found by conditioning a product partition model (PPM) to be graph-respecting. In either case, two vertices that are nearby on the graph are more likely to share expected values, in contrast to the exchangeability inherent in most partition models. Graph restriction greatly reduces the space of partitions; we simply enumerated all such partitions in our proposed graph-local computations and thereby avoided MCMC over partition space. When the generative situation is similarly graph-restricted, we expect improved statistical properties; but we also showed that false discovery rates are controlled even if the generative situation is not graph-respecting. Special cases of graph-restricted partitions have been studied by others. When G is a lattice graph, the model induces a spatial random partition distribution, which is the topic of study in Page and Quintana (2016). When G is a decomposable graph, the random partition distribution proposed here and the one introduced in Caron and Doucet (2009) are similar in the sense that both are supported only on the set of graph-respecting partitions. For a general graph G, the distance-dependent Chinese restaurant process (ddCRP) introduced in Blei and Frazier (2011) can induce a partition distribution that is also restricted to graph-respecting partitions, though it differs from the distributions used here. When G is a complete graph there is no restriction and all partitions have positive mass. When G is a line graph the graph-respecting partition model matches Barry and Hartigan (1992) for change-point analysis.
Acknowledgements
This research was supported in part by NIH grant 1U54AI117924 to the University of Wisconsin Center for Predictive Computational Phenotyping. The authors were also supported in part by NIH R01 AG040396 and NSF CAREER award RI 1252725.
References
 Aminoff and others (2013) Aminoff, Elissa, Kveraga, Kestutis and Bar, Moshe. (2013, 07). The role of the parahippocampal cortex in cognition. Trends in cognitive sciences 17.
 Barry and Hartigan (1992) Barry, Daniel and Hartigan, J. A. (1992, 03). Product partition models for change point problems. Ann. Statist. 20(1), 260–279.
 Benjamini and Hochberg (1995) Benjamini, Yoav and Hochberg, Yosef. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300.
 Bernardo and Smith (1994) Bernardo, J. M. and Smith, A. F. (1994). Bayesian Theory, Wiley series in probability and mathematical statistics. John Wiley & Sons, Ltd.
 Bigler and others (2007) Bigler, Erin D., Mortensen, Sherstin, Neeley, E. Shannon, Ozonoff, Sally, Krasny, Lori, Johnson, Michael, Lu, Jeffrey, Provencal, Sherri L., McMahon, William and Lainhart, Janet E. (2007). Superior temporal gyrus, language function, and autism. Developmental Neuropsychology 31(2), 217–238. PMID: 17488217.
 Blei and Frazier (2011) Blei, David M. and Frazier, Peter I. (2011, November). Distance dependent Chinese restaurant processes. J. Mach. Learn. Res. 12, 2461–2488.
 Blei and Jordan (2006) Blei, David M. and Jordan, Michael I. (2006, 03). Variational inference for Dirichlet process mixtures. Bayesian Anal. 1(1), 121–143.
 Bourgon and others (2010) Bourgon, Richard, Gentleman, Robert and Huber, Wolfgang. (2010, 05). Independent filtering increases detection power for highthroughput experiments. Proceedings of the National Academy of Sciences of the United States of America 107, 9546–51.
 Caron and Doucet (2009) Caron, François and Doucet, Arnaud. (2009). Bayesian nonparametric models on decomposable graphs. In: Proceedings of the 22Nd International Conference on Neural Information Processing Systems, NIPS’09. USA: Curran Associates Inc. pp. 225–233.
 Chen and others (2016) Chen, S., Varma, R., Singh, A. and Kovacevic, J. (2016). Signal recovery on graphs: Fundamental limits of sampling strategies. IEEE Transactions on Signal and Information Processing over Networks 2(4), 539–554.
 Cohen and Eichenbaum (1993) Cohen, N. J. and Eichenbaum, H. (1993). Memory, amnesia, and the hippocampal system. MIT Press.
 Collins (2014) Collins, Maxwell. (2014). Project page http://pages.cs.wisc.edu/~mcollins/software/greedyclustering.html.
 Dahl and others (2008) Dahl, David B, Mo, Qianxing and Vannucci, Marina. (2008). Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model. Statistical Modelling 8(1), 23–39.
 Dahl and Newton (2007) Dahl, David B and Newton, Michael A. (2007). Multiple hypothesis testing by clustering treatment effects. Journal of the American Statistical Association 102(478), 517–526.
 Do and others (2005) Do, Kim Anh, Müller, Peter and Tang, Feng. (2005). A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society. Series C: Applied Statistics 54(3), 627–644.
 Efron (2007) Efron, Bradley. (2007, 08). Size, power and false discovery rates. Ann. Statist. 35(4), 1351–1377.
 Efron (2010) Efron, Bradley. (2010). LargeScale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Institute of Mathematical Statistics Monographs. Cambridge University Press.
 Eklund and others (2016) Eklund, Anders, Nichols, Thomas E. and Knutsson, Hans. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences 113(28), 7900–7905.
 Frisoni and others (2002) Frisoni, Giovanni, Testa, C, Zorzan, A, Sabattoli, F, Beltramello, A, Soininen, Hilkka and Laakso, Mikko. (2002, 12). Detection of grey matter loss in mild Alzheimer's disease with voxel-based morphometry. Journal of Neurology, Neurosurgery, and Psychiatry 73, 657–64.
 Genovese and others (2002) Genovese, Christopher R., Lazar, Nicole A. and Nichols, Thomas. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage 15(4), 870 – 878.
 Gogolla (2017) Gogolla, Nadine. (2017). The insular cortex. Current Biology 27(12), R580 – R586.
 Grahn and others (2008) Grahn, Jessica A., Parkinson, John A. and Owen, Adrian M. (2008). The cognitive functions of the caudate nucleus. Progress in Neurobiology 86(3), 141 – 155.
 Graziano and others (2002) Graziano, Michael S.A, Taylor, Charlotte S.R and Moore, Tirin. (2002). Complex movements evoked by microstimulation of precentral cortex. Neuron 34(5), 841 – 851.
 Greenlee and others (2007) Greenlee, Jeremy D.W., Oya, Hiroyuki, Kawasaki, Hiroto, Volkov, Igor O., Severson III, Meryl A., Howard III, Matthew A. and Brugge, John F. (2007). Functional connections within the human inferior frontal gyrus. Journal of Comparative Neurology 503(4), 550–559.
 Groppe and others (2011) Groppe, David M., Urbach, Thomas P. and Kutas, Marta. (2011). Mass univariate analysis of event-related brain potentials/fields I: A critical tutorial review. Psychophysiology 48(12), 1711–1725.
 Gross and Yellen (2004) Gross, J.L. and Yellen, J. (2004). Handbook of Graph Theory, Discrete Mathematics and Its Applications. CRC Press.
 Hartigan (1990) Hartigan, J. A. (1990). Partition models. Communications in Statistics – Theory and Methods 19(8), 2745–2756.
 Hartwigsen and others (2010) Hartwigsen, Gesa, Baumgaertner, Annette, Price, Cathy J, Koehnke, Maria, Ulmer, Stephan and Siebner, Hartwig R. (2010, September). Phonological decisions require both the left and right supramarginal gyri. Proceedings of the National Academy of Sciences of the United States of America 107(38), 16494—16499.
 Ithapu and others (2015) Ithapu, Vamsi K, Singh, Vikas, Okonkwo, Ozioma C, Chappell, Richard J, Dowling, N Maritza, Johnson, Sterling C, Alzheimer's Disease Neuroimaging Initiative and others. (2015). Imaging-based enrichment criteria using deep learning algorithms for efficient clinical trials in mild cognitive impairment. Alzheimer's & Dementia 11(12), 1489–1499.
 Japee and others (2015) Japee, Shruti, Holiday, Kelsey, Satyshur, Maureen D., Mukai, Ikuko and Ungerleider, Leslie G. (2015). A role of right middle frontal gyrus in reorienting of attention: a case study. Frontiers in Systems Neuroscience 9, 23.
 Kim and others (2009) Kim, Sinae, Dahl, David B. and Vannucci, Marina. (2009). Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models. Bayesian Analysis 4(4), 707–732.
 Knafo (2012) Knafo, Shira. (2012, 12). Amygdala in Alzheimer’s Disease.
 Lauritzen (1996) Lauritzen, S.L. (1996). Graphical Models, Oxford Statistical Science Series. Clarendon Press.
 Marchand and others (2008) Marchand, William, Lee, James, W Thatcher, John, W Hsu, Edward, Rashkin, Esther, Suchy, Yana, Chelune, Gordon, Starr, Jennifer and Steadman Barbera, Sharon. (2008, 06). Putamen coactivation during motor task execution. Neuroreport 19, 957–60.
 Meadows (2011) Meadows, MaryEllen. (2011). Calcarine Cortex. New York, NY: Springer New York, pp. 472–472.
 Moller and others (2013) Moller, Christiane, Vrenken, Hugo, Jiskoot, Lize, Versteeg, Adriaan, Barkhof, Frederik, Scheltens, Philip and van der Flier, Wiesje M. (2013). Different patterns of gray matter atrophy in early- and late-onset Alzheimer's disease. Neurobiology of Aging 34(8), 2014–2022.
 Muller and Quintana (2004) Muller, Peter and Quintana, Fernando A. (2004, 02). Nonparametric Bayesian data analysis. Statistical Science 19(1), 95–110.
 Newton and others (2004) Newton, Michael A., Noueiry, Amine O, Sarkar, Deepayan and Ahlquist, Paul. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 2, 155–76.
 Nichols (2001) Nichols, Tom. (2001). Statistical nonparametric mapping: a toolbox for SPM.
 Nichols (2012) Nichols, Thomas E. (2012). Multiple testing corrections, nonparametric methods, and random field theory. NeuroImage 62(2), 811 – 815. 20 YEARS OF fMRI.
 Onitsuka and others (2004) Onitsuka, Toshiaki, Shenton, Martha E., Salisbury, Dean F., Dickey, Chandlee C., Kasai, Kiyoto, Toner, Sarah K., Frumin, Melissa, Kikinis, Ron, Jolesz, Ferenc A. and McCarley, Robert W. (2004). Middle and inferior temporal gyrus gray matter volume abnormalities in chronic schizophrenia: An MRI study. American Journal of Psychiatry 161(9), 1603–1611. PMID: 15337650.
 Ortega and others (2018) Ortega, Antonio, Frossard, Pascal, Kovacevic, Jelena, Moura, Jose and Vandergheynst, Pierre. (2018, 05). Graph signal processing: Overview, challenges, and applications. 106, 808–828.
 Page and Quintana (2016) Page, Garritt L. and Quintana, Fernando A. (2016, 03). Spatial product partition models. Bayesian Anal. 11(1), 265–298.
 Penny and others (2007) Penny, William, Friston, Karl, Ashburner, John, Kiebel, Stefan J. and Nichols, Thomas E. (2007). Statistical Parametric Mapping: The Analysis of Functional Brain Images.
 Plotzker and others (2007) Plotzker, Alan, Olson, Ingrid R. and Ezzyat, Youssef. (2007, 03). The Enigmatic temporal pole: a review of findings on social and emotional processing. Brain 130(7), 1718–1731.
 Scheltens and others (2002) Scheltens, Philip, Fox, Nick, Barkhof, Frederik and Carli, Charles De. (2002). Structural magnetic resonance imaging in the practical assessment of dementia: beyond exclusion. The Lancet Neurology 1(1), 13 – 21.
 Stephens (2017) Stephens, Matthew. (2017). False discovery rates: a new deal. Biostatistics 18(2), 275–294.
 Stevens and others (2011) Stevens, Francis L., Hurley, Robin A. and Taber, Katherine H. (2011). Anterior cingulate cortex: Unique role in cognition and emotion. The Journal of Neuropsychiatry and Clinical Neurosciences 23(2), 121–125. PMID: 21677237.
 Storey (2003) Storey, John D. (2003, 12). The positive false discovery rate: a bayesian interpretation and the q value. Ann. Statist. 31(6), 2013–2035.
 Tansey and others (2018) Tansey, Wesley, Koyejo, Oluwasanmi, Poldrack, Russell A. and Scott, James G. (2018). False discovery rate smoothing. Journal of the American Statistical Association 113(523), 1156–1171.
 Tzourio-Mazoyer and others (2002) Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B. and Joliot, M. (2002). Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15(1), 273–289.
 Vemuri and Jack (2010) Vemuri, Prashanthi and Jack, Clifford R. (2010, Aug). Role of structural MRI in Alzheimer's disease. Alzheimer's Research & Therapy 2(4), 23.
 Weiner and Zilles (2016) Weiner, Kevin S. and Zilles, Karl. (2016). The anatomical and functional specialization of the fusiform gyrus. Neuropsychologia 83, 48–62.
 Worsley and others (2004) Worsley, Keith J., Taylor, Jonathan E., Tomaiuolo, Francesco and Lerch, Jason. (2004). Unified univariate and multivariate random field theory. NeuroImage 23, S189 – S195. Mathematics in Brain Imaging.
Supplementary Material
Local approximation
Our model allows a general correlation structure, so it is impractical to apply it to the entire graph due to high dimensionality. Instead, we derive a divide-and-conquer heuristic, called local approximation, to overcome the computational obstacle; it can also take advantage of parallel computing. For each node v, consider a neighborhood N_v of node v, where N_v is chosen to be a connected subgraph of the graph G. GraphMM is then applied to model the data in this neighborhood. The node-specific posterior probabilities (2.3) are approximated by
(4.1) 
The local approximation procedure is illustrated by Fig. 13.
For the results in Sections 3.1 and 3.2, we applied local approximation to a neighborhood of each node, as illustrated in Fig. 13. In this case we can enumerate all the graph-respecting partitions (we also devised a data-augmentation sampling scheme that makes use of spanning trees within the input graph; not shown). We are then able to enumerate all the (partition, change indicator) pairs and compute the exact posterior distribution.
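To illustrate why graph restriction makes full enumeration feasible, the following brute-force sketch (hypothetical helper names, not the GraphMM implementation) lists the graph-respecting partitions of a small 2x2 lattice: a partition qualifies if every block induces a connected subgraph. Of the Bell(4) = 15 partitions of four nodes, only those respecting the lattice survive.

```python
def all_partitions(items):
    """Yield every set partition of `items` (Bell-number many)."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        # Put `first` into each existing block, or into its own new block
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

def induces_connected(block, edges):
    """True if the subgraph induced by `block` is connected (breadth-first search)."""
    block = set(block)
    start = next(iter(block))
    seen, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for a, b in edges:
            for u, w in ((a, b), (b, a)):
                if u == v and w in block and w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return seen == block

def graph_respecting_partitions(nodes, edges):
    """Keep only partitions whose blocks all induce connected subgraphs."""
    return [p for p in all_partitions(list(nodes))
            if all(induces_connected(b, edges) for b in p)]

# A 2x2 lattice graph: nodes 1..4, edges along rows and columns
lattice = [(1, 2), (3, 4), (1, 3), (2, 4)]
parts = graph_respecting_partitions([1, 2, 3, 4], lattice)
print(len(parts), "graph-respecting partitions out of 15 in total")
```

On larger neighborhoods the same filtering idea applies, though smarter enumeration than generate-and-test is needed as the node count grows.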
Simulation study details
Step 6:
- Scenario 1: use the greedy clustering method [Collins (2014)]; the partition is adjusted to respect the lattice graph; there are 1313 blocks with an average block size of 3.9.
- Scenario 2: same as Scenario 1.
- Scenario 3: each node is its own block; there are 5236 blocks, each of size 1.
Step 7:
- Scenario 1: 50 blocks with sizes from 12 to 14 are chosen to be changed blocks; the average changed-block size is 13.6; 14.3% of the nodes are changed.
- Scenario 2: 300 blocks with sizes from 2 to 5 are chosen to be changed blocks; the average changed-block size is 2.6; 14.9% of the nodes are changed.
- Scenario 3: 15% of the nodes are chosen to be changed nodes.
Steps 8 and 9:
- Scenario 1: block average of the MRI data in group 1; block average of the MRI data in group 2; max(); min(); for changed blocks: Uniform().
- Scenario 2: same as Scenario 1.
- Scenario 3: block average of the MRI data in group 1; sample block standard deviation in group 1; mean(); mean(); Normal(); block average of the MRI data in group 2; max(); min(); for changed blocks: Beta(2, 2).
Step 6:
- Simulation 4: use the greedy clustering method [Collins (2014)]; the partition is adjusted to respect the lattice graph; there are 1313 blocks with an average block size of 3.9.
- Simulation 5: same as Simulation 4.
Step 7:
- Simulation 4: block average of the MRI data in group 1; block average of the MRI data in group 2; an increasing function of and belongs in (0, 1); Bernoulli(); 16.4% of the nodes are changed.
- Simulation 5: similar to Simulation 4, except that 50.3% of the nodes are changed.
Steps 8 and 9:
- Simulation 4: if the block is a changed block: ; if the block is not a changed block: .
- Simulation 5: same as Simulation 4.
The graph associated with the data is a lattice graph representing spatial dependence, in which the vertices are the voxels and the edges connect neighboring voxels. The GraphMM analysis involves estimating several hyperparameters: the prior null probability, the prior mean and standard deviation of the block means in group 1, the prior mean and standard deviation of the difference in block means between the two groups, and the prior covariance matrices for groups 1 and 2. Different strategies for estimating the hyperparameters have been considered:

 Estimating the prior null probability: We experimented with both the qvalue and ashr packages to obtain the estimate. Package qvalue produces a conservative estimate without any assumption on the distribution of effects, so it is a safe and conservative choice in general settings. Package ashr, on the other hand, provides a conservative estimate under the assumption that the distribution of effects is unimodal. This unimodal assumption is discussed intensively in Stephens (2017) and has been argued to be both plausible and beneficial in many contexts. Furthermore, in our graph-based mixture model (2.2), the distribution of effects is assumed to be a mixture of a point mass at 0 and a normal distribution, which satisfies the unimodal assumption. Using the ashr package therefore meshes well with our GraphMM framework. The estimate is based on the whole dataset and is computed prior to the local approximation procedure. In the reported computations we used the ashr package.
 Estimating other hyperparameters: We consider three approaches: global, local, and mixed estimation. With global estimation, the hyperparameters are estimated from the whole dataset and computed prior to the local approximation procedure. With local estimation, hyperparameters are estimated within each neighborhood, during the local approximation procedure. With mixed estimation, all hyperparameters are estimated locally except for the two covariance matrices, which are estimated globally. These approaches (local, mixed, global) provide increasingly conservative estimates, in that order. In the following simulations and application, we present results using mixed estimation.
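qvalue and ashr are R packages; as a sketch of the idea behind a conservative null-proportion estimate, the following implements a simple Storey-type estimator in Python (a simplification for illustration, not what qvalue or ashr compute in full): p-values above a cutoff λ are mostly nulls, and null p-values are uniform, so the count above λ calibrates the null proportion.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the null proportion pi0.
    Since null p-values are Uniform(0,1), #{p > lam} ≈ pi0 * m * (1 - lam)."""
    p = np.asarray(pvals, dtype=float)
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # cap at 1: pi0 is a proportion

rng = np.random.default_rng(1)
# 80% nulls (uniform p-values), 20% signals (p-values concentrated near 0)
p = np.concatenate([rng.uniform(size=800), rng.beta(0.5, 10, size=200)])
print(storey_pi0(p))
```

In practice the choice of λ trades bias against variance, and qvalue smooths over a grid of λ values rather than fixing one.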
Computing marginal likelihood
We derive the marginal likelihood using a Laplace approximation. Consider the notation of Section 2.2. Let be the number of blocks corresponding to the partition and be the number of changed blocks, which means
Denote the ordered indices of the changed blocks as . We reparametrize the model to remove the clustering constraints on the means
Then, the free parameters are and the marginal likelihood function can be written as
where
Applying Laplace's approximation, we get
(4.2)  
In the next step, we derive explicit formulas for the gradient and Hessian matrix of . Let be the allocation matrix of size , where the entry is if and only if node belongs to block . Let be a matrix such that the th column of has value at position and value at the other positions. Then we can relate the mean vectors to the new parameters as follows
We use the following notation.
The formula for the gradient of is
where is a vector of ones of size .
Next, the formula for the Hessian matrix of is
where is the identity matrix of size .
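As a generic illustration of the device used above (not the analytic gradient and Hessian derivation of this section), a Laplace approximation to a log marginal likelihood can be sketched with a numerical optimizer and a finite-difference Hessian; the names below are assumptions for the example:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_integral(log_f, theta0):
    """Laplace approximation to log ∫ exp(log_f(theta)) dtheta:
    find the mode of log_f, then use the curvature at the mode."""
    res = minimize(lambda t: -log_f(t), theta0, method="BFGS")
    mode, d = res.x, res.x.size
    # Numeric Hessian of log_f at the mode (central differences)
    eps = 1e-5
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (log_f(mode + e_i + e_j) - log_f(mode + e_i - e_j)
                       - log_f(mode - e_i + e_j) + log_f(mode - e_i - e_j)) / (4 * eps**2)
    H = -H  # Hessian of the negative log-integrand (positive definite at a maximum)
    sign, logdet = np.linalg.slogdet(H)
    return log_f(mode) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

# Sanity check: a standard normal density integrates to exactly 1
log_f = lambda t: -0.5 * np.sum(t**2) - 0.5 * t.size * np.log(2 * np.pi)
print(laplace_log_integral(log_f, np.array([1.0])))  # ≈ 0.0 (log of 1)
```

The analytic gradient and Hessian formulas derived above serve exactly this role in GraphMM, avoiding the cost and error of finite differences.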