A typical functional magnetic resonance imaging (fMRI) neuroscience experiment consists of scanning a subject that is doing specific tasks while their brain activation is being measured. A sequence of samples in the form of brain volumes is acquired over the duration of the experiment, with a new volume obtained every time of repetition (TR) interval. This results in a few thousands samples of up to a million voxels (volume elements) each. The number of samples that can be gathered from a single subject is limited by time humans can spend in the scanner, so experiments typically span multiple subjects. As more and larger multi-subject datasets are collected and disseminated , there is a need for new large-scale analysis methods that can leverage multi-subject data.
The spatial and temporal smoothness of fMRI data [2, 3] implies that information is inherently low-dimensional compared to the number of voxels and the number of samples acquired. Factor analysis is a family of methods that can leverage this characteristic and compute the statistically shared information across samples . These methods rely on the assumption that a small number of unobserved latent variables, or factors, statistically explain the data. In other words, data can be described as a linear combination of these factors plus an additional error (or noise component). Factor analysis methods have demonstrated to be practical for neuroscience for interpreting brain activity , predicting cognitive states [6, 7], and finding interactions across brain regions . However, applying factor analysis to large multi-subject datasets presents computational challenges. The first is the sheer size of the data; a theoretical 1024-subject dataset collected using high-resolution scanners and two hours of scanning per subject is 17 TB. Furthermore, the algorithms used to find latent factors are typically very computationally demanding.
We consider two recent factor analysis techniques designed specifically for multi-subject neuroimaging data: the Shared Response Model (SRM)  and Hierarchical Topographic Factor Analysis (HTFA) . SRM provides a means of aligning the neural activity of subjects through a shared low dimensional response assuming that all subjects have been presented with the same stimuli. The model also can separate multi-subject data into subject-specific neural activity, and shared activity common among all subjects, yielding state-of-the-art predictive results. HTFA provides a means of analyzing functional connectivity between regions of activation in the neuroimaging data, which can either be unique to each subject or follow a global template. HTFA is also useful in interpreting the dynamics of brain activity. We show an example visualization created with our optimized HTFA implementation from real data in Figure 1.
The algorithm currently used to fit SRM requires a matrix inversion equal to the size of the overall number of voxels for all subjects, which would be an 11-million by 11-million matrix inversion for a 1024-subject high-resolution dataset. For HTFA, the core computation is the repeated solving of non-linear least squares on a Jacobian matrix, the size of which depends on the size of a subsampled single subject data and scales quadratically with the number of factors. As such, it is currently prohibitive to apply these models to datasets with hundreds of thousands of voxels and hundreds of subjects. In this paper, we address these challenges with (1) linear algebra transformations which reduce computational complexity for both applications, (2) algorithmic improvements to further reduce communication and speed up convergence, and (3) performance optimizations to increase efficiency for distributed and parallel execution.
Our algorithmic optimizations alone achieve a speedup for SRM and a speedup for HTFA on a single node. Parallel performance optimizations on a single-node provide an additional and speedup for SRM and HTFA, respectively. We demonstrate the scaling of our implementations on various real and synthetic datasets. For strong scaling on a real 40-subject dataset, our distributed implementations of SRM and HTFA achieve a 5.5 and speedup, respectively, on 20 machines compared to single-node execution. We also weak scale our distributed implementations with a synthetically-generated 1024-subject dataset, simulating two hours worth of high-resolution scanning per subject. We are able to process the entire 17 TB dataset, computing 10 SRM iterations in less than 3 minutes on 16,384 cores, and 1 HTFA iteration in 17 minutes on 32,768 cores, including the time to read the data from a distributed filesystem.
To summarize, our contributions are the following:
We present a new distributed algorithm for SRM with reduced asymptotic complexity and demonstrate a single-node speedup of 1812 from algorithmic improvements, code optimization, and parallel tuning.
We describe algorithmic and implementation improvements for HTFA and demonstrate a single-node speedup of 99 from algorithmic improvements, code optimization, and parallel tuning.
We demonstrate strong scaling for our implementations up to 5.5 on 20 nodes for real datasets, and successful weak scaling on a 1024-subject synthetic high-resolution dataset, up to 16,384 cores for SRM and 32,768 cores for HTFA.
This work is organized as follows. In Section II, we describe SRM, analyze its limitations, and propose optimizations to achieve a scalable version. We treat HTFA in a similar manner in Section III. We describe our parallel and distributed implementations of SRM and HTFA in Section IV. In Section V, we describe our experimental setup for real and synthetic datasets. The results are presented in Section VI. We discuss a series of mathematical and implementation optimizations in Section VII and conclude with Section VIII.
Ii Shared Response Model
In this section, we describe a recent multi-subject factor analysis method, the Shared Response Model (SRM), and the algorithm used to approximate a solution. SRM has been shown to work well in cases when predictive performance is the primary concern. We describe mathematical transformations and algorithmic optimizations made to the published method which improve the performance of fitting SRM models by reducing asymptotic complexity. We also introduce a distributed algorithm with optimizations which reduce communication between nodes.
Ii-a Model Overview
A challenge of working with multi-subject datasets is that subjects have distinct anatomical and functional structure and that makes direct aggregation of subject data infeasible. A partial solution to this problem is to apply standard anatomical alignment methods [8, 9, 10], but this does not ensure that functional brain topographies align [11, 12, 13]. Recently, functional alignment techniques have appeared that address the misalignment of functional topographies between subjects [14, 15, 16, 17, 18, 12, 19, 20, 7]. SRM  is the state-of-the-art functional alignment method. Given subjects’ volumes synchronously acquired at time steps, SRM maps functional topographies from every subject to a shared response on a low-dimensional feature space, thus learning an individual mapping for each subject and a shared response across subjects. The model assumes that a volume with voxels sampled at time for subject
is the outcome of a random vector:
where is the mapping from the -dimensional shared space to subject volume space, is the shared response for all subject volumes at time , is the subject specific mean, and is a realization of a noise vector. The model is completed with the assumptions that with and that the mappings are orthogonal, i.e., . The orthogonality constraint allows a simple mapping of samples from subject to subject by projecting the sample to the feature space and then to the other subject’s voxel space, i.e., .
Ii-B Constrained EM for Model Approximation
The authors in 
propose a constrained Expectation-Maximization (EM) strategy to approximate the transforms and the shared responses of the samples. In the E-step, we compute the sufficient statistics of each sample’s shared response:
where with rows representing all the transformations vertically stacked, is a block diagonal matrix with diagonal . The vector represents the demeaned samples at time vertically stacked for all subjects with
. Note that the estimated shared response is obtained as.
Once we obtain and
in the E-step, we compute the M-step that updates the hyperparameters, , and as follows:
This method for approximating SRM iteratively computes the E-step and M-step until some stopping criterion is met.
To understand the model’s computational limitations we first note that the model size depends on samples (TRs), subjects and their voxels, and the number of features . The latter value determines the cost of some computations and the memory needed to store the mappings and the shared responses . The assumption that the shared information has a low rank dictates that should be small. It was shown that the best accuracy is obtained when the parameter is set to a value ranging from tenths to hundreds of features . The importance of being small will be evident below.
The EM algorithm described in Equations (2)–(6) has several advantages. Equations (4) and (5) in the M-step can be computed in parallel across subjects. Equation (6) is a summation over small matrices of size by , which also does not pose any challenge. In contrast, the E-step requires inversion of , a matrix of size by elements depending on the number of subjects and the numbers of voxels per subject . Computing and storing and its inverse is very challenging and sometimes may be nonviable. Because is symmetric, Cholesky factorization is the most stable method for computing its inverse. While requires memory, the Cholesky factorization requires additional memory and has a runtime complexity. For instance, when subjects with voxels each, the size of the matrix would be 500,000 by 500,000 and would require 20 GB for storing only, even before computing the Cholesky factorization.
Ii-C Reducing the Inversion
As discussed in Section II-B, the runtime and memory bottleneck is given by the inversion of the big matrix . We derive an analytical alternative formula to avoid computing , and hence reducing the computational runtime and memory usage. First, we apply the matrix inversion lemma to the first two terms in Equation (3):
While Equation (3) requires inverting a matrix, Equation (7) supplants that by computing three much smaller inversions. The matrices and have a very small size, by , and their inversions can be computed fast with a Cholesky factorization on one machine. The matrix is a diagonal matrix and its inversion is computed by inverting its diagonal elements. Based on the orthogonality property of , i.e., , we note that
The result above requires the same three inversions needed in Equation (7). Therefore, the simplified computations in the E-step require only to invert dense matrices of by elements or a diagonal matrix with different values. Hence, the memory complexity is reduced from to , much less than originally needed to invert . It is worth noting that the required memory for these computations depends only on the number of features and is independent of the number of aggregated voxels in the data (where ).
Ii-D Reducing Communication for Distributed Computation
Recall that computing the M-step is embarrassingly parallel assuming data is distributed by subject. However, the E-step requires centralized computation of the shared responses. Therefore, we study next the communications needed for a distributed version of the EM algorithm. Plugging Equation (9) into Equation (2) and rewriting it to partition by subject samples, we obtain:
This equation allows us to identify what needs to be transferred to compute the shared responses in the E-step. To compute (10), we should transfer each to a master process and reduce the summation . Each transferred matrix (size ) is relatively small. The resulting (a matrix) should be broadcast to all processes for the M-step computation. The matrices from (3) in the E-step are used for updating in (6) and computing the trace for in (5). Although broadcasting of such matrices is not limiting, we suggest avoiding it by updating in the master process by noting that
and that . Therefore, we first compute in the master process and then broadcast its trace. This reduces the amount of communication from to . The distributed EM algorithm for SRM is summarized in Algorithm 1.
Iii Hierarchical Topographic Factor Analysis
In this section, we discuss Hierarchical Topographic Factor Analysis (HTFA), an interpretable model that can be used by neuroscientists analyze and visualize brain networks. As we did with the SRM, we apply both mathematical transformations and algorithmic optimizations to improve the performance of HTFA compared to the baseline published algorithm. We also present a scalable distributed implementation.
Iii-a Model Overview
A cognitive state can be interpreted as a network of nodes and edges, representing regions of activity in the brain and their correlated activation. Topographic factor analysis (TFA) 
discovers these activity regions and their connectivity pattern through time. This method is based on a Bayesian model that describes an fMRI volume as a linear combination of factors, which are assumed to be Gaussian distributions. Each such distribution depicts a three-dimensional sphere representing an activity region. For a given series of brain volumes, TFA uses variational inference to estimate the centers and widths of the factors and the linear combination weights.
HTFA  is a multi-subject extension of TFA that further assumes that all subjects exhibit a global template of activity regions. Thus, every subject is described as a small perturbation of this global template. Let represent subject ’s data as a matrix with fMRI samples of voxels each and vectorized in ’s rows. Then, each subject is approximated with a factor analysis model
where is an error term, are the weights of the factors in . The factor (row) in
represents a “sphere” or normal distribution with centerand width located relatively to the positions of the voxels in the volume.
HTFA defines the local factors in as perturbations of the factors of a global template in . Therefore, the factor centers for all subjects are obtained from a normal distribution with mean and covariance . The mean represents the center of the global factor, while determines the distribution of the possible distance between the global and the local center of the factor. Similarly, the widths for all subjects are drawn from a normal distribution with mean , the width of the global
factor, and variance. The model assumes that and are constants and the same for all factors. On top of the global parameters and , the model defines Gaussian priors for each, respectively, and . In addition, the columns of the weight matrices are modeled with a distribution and the elements in the noise term are assumed to be independent with a distribution. The associated graphical model is shown in Figure 2.
Iii-B Approximating the MAP Estimator
Based on this hierarchical model, the maximum a-posteriori probability (MAP) estimator can find the centers and widths of the distributions. However, the exact estimator is hard to compute. Hence, Manning et al. suggested a black box variational inference technique to achieve an approximate estimate. The method consists of a global and a local step that iteratively update the parameters. The global step updates the parameters of the distributions in the global template. The local step updates for each subject the weight matrices , the local centers and widths . To update the parameters of the factors in , the local step solves
where is a subsampling coefficient defined below. Each row of
is a factor obtained by evaluating the radial basis function (RBF) with center atand width
in positions for all the voxels in the three-dimensional voxel space of the brain.
The objective function in (13) is non-linear because of the definition of the factors in . Therefore, Manning et al.  propose to decompose the objective function as a sum of subfunctions and compute a solution using an unconstrained non-linear least squares (NLLS) solver, implemented with a trust-region reflective method . This is an iterative method that computes the Jacobian matrix of partial derivatives at each iteration. Hence, the size of the Jacobian matrix depends on the number of variables and the number of subfunctions defined in the problem. In the case of HTFA, there are variables: factors with a three dimensional center and a scalar width . The first term of the objective function containing the Frobenius norm can be described as a summation over subfunctions, each given by a squared element of the matrix . The second and third terms in (13) can be partitioned in subfunctions each. Summing up, there are functions, and the size of the Jacobian matrix is by . Although the Jacobian matrix is computed per subject and does not depend on the number of subjects , it still can be prohibitively large. For example, solving the problem for a subject with voxels, samples and for factors, this matrix will occupy 300 GB of memory. Manning et al.  mitigate this limitation by randomly subsampling the rows (TRs) and columns (voxels) from the data matrix , reducing it to samples by voxels ( and ). This requires adding the sampling coefficient to compensate between the weights of the first and other terms in (13).
After the subject centers and widths are updated via (13), we update the weights matrix for each subject by solving
) admits a closed-form solution of the form of ridge regression, i.e.,. Due to the subsampling of the data matrix employed to update the factors of subject , only a subset of the rows in are updated. The weights and local parameters are updated solving (13) and (15), alternately, until a convergence criterion is met.
Then, the hyperparameters of the global template priors are updated given the local estimates and under the assumption that the posterior has a conjugate prior with normal distribution. This yields the following formulas for the updating the hyperparameters, , , and :
where , . The above parameters , , , and are initialized to , , , and , respectively. The parameters , and are computed during initialization and remain constant through the iterations.
The local update step that solves (13) and (15) is embarrassingly parallel across subjects. Nevertheless, the NLLS solver is a runtime bottleneck because of its dependence on the big Jacobian matrix for the solver computations and because the problem is solved several times in each local update step. As the centers and widths of the local factors change when solving (13), the factor matrix is re-computed whenever the cost function in (13) is evaluated inside each iteration of the NLLS solver, adding to the runtime. Despite subsampling the data matrix , in terms of memory, the Jacobian matrix size dictates the complexity of the algorithm. In the following section, we describe optimizations targeting mainly the local updates step which (1) reduce the size of the Jacobian matrix, (2) use problem-specific assumptions to simplify the problem, (3) optimize the factors calculation, and (4) reduce the number of matrix inversions.
Iii-C Optimizations to the MAP estimator
The computation complexity of NLLS solver makes it a direct target for optimization. We recall that the size of Jacobian matrix depends on the number of variables and the number of subfunctions. It is possible to partition the Frobenius norm term in (13) with a different subfunction granularity to reduce the number of subfunctions and hence, the size of the Jacobian matrix. However, the runtime does not necessarily reduce, as recomputing the Jacobian matrix in internal iterations of the NLLS solver requires the same number of operations independent of the granularity. Instead, we consider an alternative approach. We partition the variables into two blocks: (a) the centers and (b) the widths . We fix the values of the variables in block (b) and solve (13) only for the block (a). Then, we do the opposite and fix the variables in block (a) and solve for those in block (b). Although this block partitioning requires us to solve two subproblems with an NLLS solver, the computation complexity of the Jacobian matrix is smaller. The reasons are that the number of variables is reduced to the block size and subfunctions can be dropped as they are constant when fixing any of the blocks. In addition, this has the potential to reduce the number of steps to converge in the local updates, resembling a block coordinate descent approach.
The original algorithmic implementation utilizes an unconstrained NLLS solver. Since the model assumes that the location of each factor’s center is within the human brain, we add constraints to (13): the centers are bound to be within the brain volume and the value of widths are at least 4% of the brain diameter and at most 180% of it; these two parameters can be tuned for different datasets. This modification to the problem has the potential to speed-up the solution by restricting the feasible region and maintaining the centers within the brain area in intermediate solutions of the NLLS solver. The unconstrained version may yield negative or zero width values in some iterations, leading to positive exponents in the RBF evaluation and hence to large values that might cause numerical overflow issues in the NLLS solver. Moreover, the contrained implementation may yield better solutions by avoiding factors of small sizes that represent a single noisy voxel. A solution to the constrained version of (13) can be obtained using a constrained NLLS. We opt for a trust-region reflective solver.
During the local factor and weight updates, each factor is evaluated to build the updated matrix explicitly. For this purpose, the value of the function is calculated for all the voxel locations . In particular, represents the positions of the voxels in a three-dimensional grid. Because of the bounded size of the brain and the spatial contiguity of the voxels, the number of unique elements in each dimension () for all locations is much lower than the number of voxels. In addition, the Euclidean distance term in the exponent of the RBF can be expressed as . Therefore, we can reduce the number of subtractions and multiplications (i.e., the squares) for computing by caching the values for each coordinate , , or in a look up table. Assuming that the voxel locations are inside a cube of size , the effect of caching reduces from floating point subtractions and multiplications to for each factor.
where , and . In this manner, we reduce the number of inversions per factor in the global update step. The optimized version of the method is summarized in Algorithm 2.
Iv Parallel and Distributed Implementation
Our parallel implementations, D-SRM and HTFA D-MAP, both follow a similar map-reduce structure, which is shown in Figure 3. Both algorithms start by fitting local models in parallel for each subject (Step 2), then combining the local models with a gather or reduction (Step 3), and fitting a global model for all subjects (Step 4). The global model is then broadcast (Step 5) and the local models are re-fit using the updated global information. This process continues in an iterative manner for both algorithms. The columns on the right of the figure show a summary of the main computations, at each step, for both D-SRM and HTFA D-MAP. The following subsections provide further details of our parallel and distributed implementation and specific code optimizations.
Iv-a Single-node Implementation
Our baseline SRM and HTFA implementations are written in Python and use the NumPy and SciPy libraries for matrix representations and operations. We use the Anaconda distribution of Python, which recently added support for using the Intel Math Kernel Library (MKL) to accelerate NumPy and SciPy in time-consuming math operations such as matrix multiplication and matrix inversion. Our new optimized implementations continue to use Python for compatibility with other neuroscience tools which in turn use Python for productivity.
Some operations in NumPy and SciPy, such as matrix multiplication, run very efficiently on all cores automatically using Anaconda’s MKL-backed Numpy. However, we found that certain operations were running serially. These serial functions became the bottleneck in certain parallel configurations, so we wrote small OpenMP-parallelized NumPy extension modules in C++ to speed them up. The specific functions we wrote are:
For HTFA we found that memory usage was increasing after every iteration, so we force garbage collection at the end of each local step using the Python gc module.
Iv-B Distributed Implementation
We use MPI to enable multi-node parallelism. Subjects are distributed across MPI processes and we rely on parallelism within matrix operations to leverage multiple cores within a process. Communication between subjects maps directly to optimized MPI collective operations such as Reduce, Gather, and Bcast. We use these collectives exclusively for inter-subject communication. We found that having multiple MPI processes (i.e., multiple subjects) per node was using cores more efficiently than having fewer MPI processes and more shared-memory parallelism within a subject. As a result, we generally pack more subjects onto nodes until memory limitations arise.
In order to enable parallel I/O from distributed filesystems, we store each subject’s data in different files using NumPy’s standard binary format. With this approach, each MPI process loads its subject’s data in parallel with other processes.
V Experimental Setup
In this section, we describe the experimental setup, the real fMRI datasets we used for experiments, as well as the algorithm we designed to generate large-scale synthetic fMRI data based on the real fMRI datasets.
We run our experiments on the Cori Phase I machine at the National Energy Research Scientific Computing Center (NERSC) . Cori Phase I is a Cray XC40111Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance, currently with 1630 compute nodes and a Cray Aries interconnect providing 5.6 TB global bandwidth. Each compute node contains two 16-core Intel Xeon222Intel and Xeon are trademarks of Intel corporation in the U.S. and/or other countries. E5-2698 v3 CPUs running at 2.3 GHz, and 128 GB DDR4 2133 MHz memory. Cori Phase I also has a distributed Lustre filesystem with 130 router nodes and 248 I/O servers, capable of providing 744 GB/s in bandwidth. We use the Anaconda distribution of Python version 2.7.11 with the following packages: MKL 11.3 Update 1, SciPy 0.17.0, and NumPy 1.10.4. We use mpi4py version 2.0.0, compiled with Cray-provided wrappers.
For HTFA, we sample voxels and TRs as described in Section III-B using random sampling with replacement. To select the number of subsamples, we use the mean between a percent of the data (25% for voxels, 10% for TRs) and a maximum number of samples (3000 for voxels, 300 for TRs). To estimate D-SRM performance in Gflop/s, we count the number of floating point operations used by the four main computations in D-SRM. For each subject and iteration, there are two matrix multiplies, one matrix multiply, and one economy SVD operation. For a lower bound estimate, we assume that the SVD performs the same number of floating point operations as Householder QR ().
We consider the following real neuroimaging datasets:
raider: subjects received two different stimuli while in the fMRI scanner: the film “Raiders of the Lost Ark” (110 min) and a set of still images (7 categories, 8 runs) .
forrest: subjects listened to an audio version of the film “Forrest Gump” (120 min) .
greeneyes: subjects listened to an ambiguous story (15 min) .
An overview of the datasets is shown in Table I. All datasets have the same number of voxels per subject, except raider. The datasets were stored in either float64 or float32 format, but all data is cast to float64 before processing. All data is stored in row-major order.
We create a large synthetic dataset to test the weak scalability of D-SRM and HTFA D-MAP. We start from a real dataset and randomly permute the temporal order of the data for each voxel. To preserve the spatial structures of the brain, we spatially partition the voxels and use the same permutation for all voxels in a partition.
The precise approach we use for synthetic data generation is described in Algorithm 3. Each partition is filled with data from a different randomly-chosen seed subject. Next, the data in each partition is permuted across TRs, using the same permutation within each partition. We use forrest as seed dataset to generate a new 1024-subject synthetic dataset. We split each brain volume using -voxel partitions. The resulting forrest-1024 dataset is 17 TB in float64.
We examine the performance improvements of our optimized algorithms compared to the baseline versions. We measure the impact of the number of factors on performance. We test strong scaling using real datasets and weak scaling using the synthetic dataset. Finally, we present scientific results obtained using our optimizations.
Vi-a Improvements compared to baseline implementations
We demonstrate the performance advantage of our our improved SRM code using raider and our improved HTFA code using 20 subjects from greeneyes. The results are shown in Tables II and Table III for a single node, and . ”Baseline HTFA” and ”Baseline SRM” are our own implementations of the currently published algorithms. The ”Baseline + Algorithmic Opt.” performance includes only linear algebra transformations and algorithmic improvements, which are described in Sections II and III. These optimizations achieved and speedups for SRM and HTFA respectively. Finally, D-SRM and HTFA D-MAP include code optimizations, such as custom parallel NumPy extensions, and across-subject MPI parallelism (20 MPI processes and 3 threads per process). These additional optimizations added and speedups for D-SRM and HTFA D-MAP, respectively.
|Baseline SRM + Algorithmic Opt.||3.95||671|
|Baseline HTFA + Algorithmic Opt.||361.34||42|
Vi-B Impact of number of factors on performance
Figure 4 shows the effect of the number of latent factors on runtime. For D-SRM, the runtime is linear in for certain operations, such as two of the three matrix multiplications, and quadratic for other operations such as the SVD. For HTFA D-MAP, the runtime is linear in for the global template updates. The computation of the Jacobian matrix depends on , the number of voxels and the number of TRs. Overall, for both algorithms we observed a super linear increase of runtime as grows and a relatively large increase for HTFA D-MAP. For the remainder of the paper we fix to be 60, which was chosen based on prior neuroscience experiments using these algorithms [6, 4].
Vi-C Strong scaling
Figures 5 and 6 show strong scaling performance on greeneyes. We measure compute time of 10 iterations on setups ranging from 1 node with 32 cores to 40 nodes with 1280 cores in total. Note that the number of MPI processes we use must divide the number of subjects in the real dataset, 40.
For both D-SRM and HTFA D-MAP, we found that parallelizing across subjects using MPI was generally more effective in utilizing cores than relying on OpenMP parallelism to saturate the cores. This is only a consideration for datasets in which multiple subjects’ data can fit in one node’s memory. We experimented with different numbers of processes and threads and chose those that gave good performance. For example, on a single node, we use 20 processes with 3 threads per process (using Hyper-Threading), just like in the baseline comparison. For D-SRM, this configuration achieved 225 double precision Gflop/s out of a theoretical peak of 1177.6 Gflop/s. For both applications, we saw improved performance from 1 to 20 nodes, however the scaling slowed significantly after 5 nodes. We attribute this behavior to (1) matrix operations having a tall-skinny shape and therefore not scaling as well as more compute-bound square matrix operations, and (2) increasing parallel overheads for larger node counts on the relatively small greeneyes dataset. Note that performance drops when using 1 process per node with 32 threads instead of 2 processes with 16 threads, because of parallel overhead and NUMA effects.
Vi-D Weak scaling
We test D-SRM and HTFA D-MAP weak scaling on forrest-1024. For D-SRM, we processes two subjects per node, and for HTFA D-MAP we process one subject per node, which is driven by memory limitations. The results in Figure 7 show that compute time per subject (excluding I/O) rises by only for D-SRM from 1 node to 512 nodes and for HTFA D-MAP from 1 node to 1024 nodes. Disk I/O takes a significantly larger portion of total time for D-SRM compared to HTFA D-MAP, because D-SRM spends relatively less time computing than HTFA D-MAP while having to load the same amount of data.
The computation and memory requirements of multi-subject analysis techniques, such as SRM and HTFA, are largely driven by the dimensionality of the data. Namely, neuroimaging data tends to have extremely high dimensionality per sample (e.g., 600,000 voxels), compared to relatively few samples (e.g., 3,500 TRs). This extreme skew in the data dimension makes linear algebra transformations and algorithmic optimizations potentially very profitable, because we have the potential to change the computational complexity from something that depends on the number of voxels, to something that depends only on the number of samples or latent factors. The matrix inversion lemma does this, and we applied it to both SRM and HTFA. The shape of the data also means that our optimized implementations mainly operated on so-called ”tall-skinny” matrices, where the number of rows is much larger than the number of columns. This had a direct impact on our performance and our ability to scale to multiple cores within a node.
The computational structure of the multi-subject analysis techniques we studied was very amenable to multi-node parallelism. This also follows directly from the nature of the models we are fitting. Both models allow for a global (across-subject) representation of neural activity, and a local (per-subject) configuration. The algorithms to fit the models will alternate between fitting local models with the global model fixed, and vice versa. This translates directly to map-reduce style parallelism. In the case of HTFA and SRM, fitting the global model was relatively inexpensive compared to fitting local models, which helped scalability. However, other multi-subject model-fitting algorithms may not have this feature.
We have introduced highly-parallel codes for two applications of factor analysis on multi-subject neuroimaging datasets, SRM and HTFA. We demonstrated good single-node performance and scaling on a state-of-the-art supercomputer, up to 32,768 cores. This makes the discovery of latent factors present in large subject populations practical, allowing subject-specific and shared responses to stimuli to be readily identified and interpreted. To achieve these results, we created highly-optimized implementations of the factor analysis algorithms starting from validated baseline implementations. For D-SRM, we achieved a speedup over the baseline code by deriving an alternate formula for the large matrix inversion in the critical E-step of the EM algorithm, reducing distributed communications during the M-step through changes to hyperparameter computation, an d explicit node-level parallelization with MPI. For HTFA D-MAP, we achieved 99 speedup over the baseline on a medium size dataset by splitting the variables of the non-linear least squares problem to reduce the size of the Jacobian matrix and converge faster, using cache-based table lookups for distance calculations, reducing the number of matrix inversions, and explicit garbage collection. Our experience with these two applications allowed us to identify several key performance engineering recommendations that apply to factor analysis of multi-subject neuroimaging data in general.
We expect that our work will enable neuroscientists to develop population-based denoising techniques, design studies to better understand similarities and differences in cognitive function within populations, and perform more sophisticated studies of human interaction. To this end, we are releasing our code as open-source software on GitHub to allow the use our optimized versions of these algorithms on a supercomputer while retaining the programming benefits of Python.
-  “Announcing the hcp 900 subjects data release,” http://humanconnectome.org/about/pressroom/project-news/announcing-the-hcp-900-subjects-data-release/, accessed: 2016-04-04.
-  S. Huettel, A. Song, and G. McCarthy, Functional Magnetic Resonance Imaging. Freeman, 2009. [Online]. Available: https://books.google.com/books?id=BNhMPgAACAAJ
-  H. P. O. de Beeck, “Against hyperacuity in brain reading: Spatial smoothing does not hurt multivariate fmri analyses?” NeuroImage, vol. 49, no. 3, pp. 1943 – 1948, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S105381190900202X
-  J. R. Manning, R. Ranganath, K. A. Norman, and D. M. Blei, “Topographic factor analysis: a bayesian model for inferring brain networks from neural data,” PLoS One, vol. 9, no. 5, p. e94914, 2014.
-  S. J. Gershman, D. M. Blei, F. Pereira, and K. A. Norman, “A topographic latent source model for fmri data,” NeuroImage, vol. 57, no. 1, pp. 89 – 100, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1053811911004411
-  P.-H. C. Chen, J. Chen, Y. Yeshurun, U. Hasson, J. Haxby, and P. J. Ramadge, “A reduced-dimension fmri shared response model,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 460–468. [Online]. Available: http://papers.nips.cc/paper/5855-a-reduced-dimension-fmri-shared-response-model.pdf
R. Rustamov and L. Guibas, “Hyperalignment of multi-subject fMRI data by
synchronized projections,” in
NIPS Machine Learning and Interpretation in Neuroimaging workshop, 2013.
-  J. Talairach and P. Tournoux, Co-planar stereotaxic atlas of the human brain. 3-Dimensional proportional system: an approach to cerebral imaging. Thieme, 1988.
-  B. Fischl, M. I. Sereno, R. Tootell, and A. M. Dale, “High-resolution intersubject averaging and a coordinate system for the cortical surface,” Human brain mapping, vol. 8, no. 4, pp. 272–284, 1999.
-  J. Mazziotta, A. Toga et al., “A probabilistic atlas and reference system for the human brain,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 356, no. 1412, pp. 1293–1322, 2001.
-  M. Brett, I. S. Johnsrude, and A. M. Owen, “The problem of functional localization in the human brain,” Nat Rev Neurosci, vol. 3, no. 3, pp. 243–249, 03 2002.
-  B. R. Conroy, B. D. Singer, J. V. Haxby, and P. J. Ramadge, “fMRI-based inter-subject cortical alignment using functional connectivity,” in Advances in Neural Information Processing Systems, 2009.
-  B. R. Conroy, B. D. Singer, J. S. Guntupalli, P. J. Ramadge, and J. V. Haxby, “Inter-subject alignment of human cortical anatomy using functional connectivity,” NeuroImage, 2013.
V. Calhoun, T. Adali, G. Pearlson, and J. Pekar, “A method for making group inferences from functional MRI data using independent component analysis,”Human brain mapping, vol. 14, no. 3, pp. 140–151, 2001.
-  O. Friman, J. Cedefamn et al., “Detection of neural activity in functional MRI using canonical correlation analysis,” Magnetic Resonance in Medicine, vol. 45, no. 2, pp. 323–330, 2001.
-  J.-H. Lee, T.-W. Lee, F. A. Jolesz, and S.-S. Yoo, “Independent vector analysis (IVA): multivariate approach for fMRI group study,” Neuroimage, vol. 40, no. 1, pp. 86–109, 2008.
-  T. Adali, M. Anderson, and G.-S. Fu, “Diversity in independent component and vector analyses: Identifiability, algorithms, and applications in medical imaging,” Signal Processing Magazine, IEEE, 2014.
-  Y.-O. Li, T. Adali, W. Wang, and V. D. Calhoun, “Joint blind source separation by multiset canonical correlation analysis,” Signal Processing, IEEE Transactions on, vol. 57, no. 10, pp. 3918–3929, 2009.
-  J. V. Haxby, J. S. Guntupalli et al., “A common, high-dimensional model of the representational space in human ventral temporal cortex,” Neuron, vol. 72, no. 2, pp. 404–416, 2011.
-  A. Lorbert and P. J. Ramadge, “Kernel hyperalignment,” in Adv. in Neural Inform. Proc. Systems, 2012.
-  D. B. R. A. P. Dempster, N. M. Laird, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [Online]. Available: http://www.jstor.org/stable/2984875
-  J. R. Manning, R. Ranganath, W. Keung, N. B. Turk-Browne, J. D. Cohen, K. A. Norman, and D. M. Blei, “Hierarchical topographic factor analysis,” in Pattern Recognition in Neuroimaging, 2014 International Workshop on, June 2014, pp. 1–4.
-  T. Coleman and Y. Li., “An interior, trust region approach for nonlinear minimization subject to bounds,” SIAM Journal on Optimization, vol. 6, p. 418–445, 1996.
-  “Cori configuration,” http://www.nersc.gov/users/computational-systems/cori/configuration/, accessed: 2016-04-03.
-  M. Hanke, F. J. Baumgartner, P. Ibe, F. R. Kaule, S. Pollmann, O. Speck, W. Zinke, and J. Stadler, “A high-resolution 7-tesla fmri dataset from complex natural stimulation with an audio movie,” Scientific Data, Jan 2014.
-  Y. Yeshurun, S. Swanson, J. Chen, E. Simony, C. Honey, P. C. Lazaridi, and U. Hasson., “How does the brain represent different ways of understanding the same story?” Society for Neuroscience Abstracts, 2014.