I Introduction
A typical functional magnetic resonance imaging (fMRI) neuroscience experiment consists of scanning a subject that is doing specific tasks while their brain activation is being measured. A sequence of samples in the form of brain volumes is acquired over the duration of the experiment, with a new volume obtained every time of repetition (TR) interval. This results in a few thousands samples of up to a million voxels (volume elements) each. The number of samples that can be gathered from a single subject is limited by time humans can spend in the scanner, so experiments typically span multiple subjects. As more and larger multisubject datasets are collected and disseminated [1], there is a need for new largescale analysis methods that can leverage multisubject data.
The spatial and temporal smoothness of fMRI data [2, 3] implies that information is inherently lowdimensional compared to the number of voxels and the number of samples acquired. Factor analysis is a family of methods that can leverage this characteristic and compute the statistically shared information across samples [4]. These methods rely on the assumption that a small number of unobserved latent variables, or factors, statistically explain the data. In other words, data can be described as a linear combination of these factors plus an additional error (or noise component). Factor analysis methods have demonstrated to be practical for neuroscience for interpreting brain activity [5], predicting cognitive states [6, 7], and finding interactions across brain regions [4]. However, applying factor analysis to large multisubject datasets presents computational challenges. The first is the sheer size of the data; a theoretical 1024subject dataset collected using highresolution scanners and two hours of scanning per subject is 17 TB. Furthermore, the algorithms used to find latent factors are typically very computationally demanding.
We consider two recent factor analysis techniques designed specifically for multisubject neuroimaging data: the Shared Response Model (SRM) [6] and Hierarchical Topographic Factor Analysis (HTFA) [4]. SRM provides a means of aligning the neural activity of subjects through a shared low dimensional response assuming that all subjects have been presented with the same stimuli. The model also can separate multisubject data into subjectspecific neural activity, and shared activity common among all subjects, yielding stateoftheart predictive results. HTFA provides a means of analyzing functional connectivity between regions of activation in the neuroimaging data, which can either be unique to each subject or follow a global template. HTFA is also useful in interpreting the dynamics of brain activity. We show an example visualization created with our optimized HTFA implementation from real data in Figure 1.
The algorithm currently used to fit SRM requires a matrix inversion equal to the size of the overall number of voxels for all subjects, which would be an 11million by 11million matrix inversion for a 1024subject highresolution dataset. For HTFA, the core computation is the repeated solving of nonlinear least squares on a Jacobian matrix, the size of which depends on the size of a subsampled single subject data and scales quadratically with the number of factors. As such, it is currently prohibitive to apply these models to datasets with hundreds of thousands of voxels and hundreds of subjects. In this paper, we address these challenges with (1) linear algebra transformations which reduce computational complexity for both applications, (2) algorithmic improvements to further reduce communication and speed up convergence, and (3) performance optimizations to increase efficiency for distributed and parallel execution.
Our algorithmic optimizations alone achieve a speedup for SRM and a speedup for HTFA on a single node. Parallel performance optimizations on a singlenode provide an additional and speedup for SRM and HTFA, respectively. We demonstrate the scaling of our implementations on various real and synthetic datasets. For strong scaling on a real 40subject dataset, our distributed implementations of SRM and HTFA achieve a 5.5 and speedup, respectively, on 20 machines compared to singlenode execution. We also weak scale our distributed implementations with a syntheticallygenerated 1024subject dataset, simulating two hours worth of highresolution scanning per subject. We are able to process the entire 17 TB dataset, computing 10 SRM iterations in less than 3 minutes on 16,384 cores, and 1 HTFA iteration in 17 minutes on 32,768 cores, including the time to read the data from a distributed filesystem.
To summarize, our contributions are the following:

We present a new distributed algorithm for SRM with reduced asymptotic complexity and demonstrate a singlenode speedup of 1812 from algorithmic improvements, code optimization, and parallel tuning.

We describe algorithmic and implementation improvements for HTFA and demonstrate a singlenode speedup of 99 from algorithmic improvements, code optimization, and parallel tuning.

We demonstrate strong scaling for our implementations up to 5.5 on 20 nodes for real datasets, and successful weak scaling on a 1024subject synthetic highresolution dataset, up to 16,384 cores for SRM and 32,768 cores for HTFA.
This work is organized as follows. In Section II, we describe SRM, analyze its limitations, and propose optimizations to achieve a scalable version. We treat HTFA in a similar manner in Section III. We describe our parallel and distributed implementations of SRM and HTFA in Section IV. In Section V, we describe our experimental setup for real and synthetic datasets. The results are presented in Section VI. We discuss a series of mathematical and implementation optimizations in Section VII and conclude with Section VIII.
Ii Shared Response Model
In this section, we describe a recent multisubject factor analysis method, the Shared Response Model (SRM), and the algorithm used to approximate a solution. SRM has been shown to work well in cases when predictive performance is the primary concern. We describe mathematical transformations and algorithmic optimizations made to the published method which improve the performance of fitting SRM models by reducing asymptotic complexity. We also introduce a distributed algorithm with optimizations which reduce communication between nodes.
Iia Model Overview
A challenge of working with multisubject datasets is that subjects have distinct anatomical and functional structure and that makes direct aggregation of subject data infeasible. A partial solution to this problem is to apply standard anatomical alignment methods [8, 9, 10], but this does not ensure that functional brain topographies align [11, 12, 13]. Recently, functional alignment techniques have appeared that address the misalignment of functional topographies between subjects [14, 15, 16, 17, 18, 12, 19, 20, 7]. SRM [6] is the stateoftheart functional alignment method. Given subjects’ volumes synchronously acquired at time steps, SRM maps functional topographies from every subject to a shared response on a lowdimensional feature space, thus learning an individual mapping for each subject and a shared response across subjects. The model assumes that a volume with voxels sampled at time for subject
is the outcome of a random vector:
(1) 
where is the mapping from the dimensional shared space to subject volume space, is the shared response for all subject volumes at time , is the subject specific mean, and is a realization of a noise vector. The model is completed with the assumptions that with and that the mappings are orthogonal, i.e., . The orthogonality constraint allows a simple mapping of samples from subject to subject by projecting the sample to the feature space and then to the other subject’s voxel space, i.e., .
IiB Constrained EM for Model Approximation
The authors in [6]
propose a constrained ExpectationMaximization (EM) strategy
[21] to approximate the transforms and the shared responses of the samples. In the Estep, we compute the sufficient statistics of each sample’s shared response:(2)  
(3)  
where with rows representing all the transformations vertically stacked, is a block diagonal matrix with diagonal . The vector represents the demeaned samples at time vertically stacked for all subjects with
. Note that the estimated shared response is obtained as
.Once we obtain and
in the Estep, we compute the Mstep that updates the hyperparameters
, , and as follows:(4)  
(5)  
(6) 
This method for approximating SRM iteratively computes the Estep and Mstep until some stopping criterion is met.
To understand the model’s computational limitations we first note that the model size depends on samples (TRs), subjects and their voxels, and the number of features . The latter value determines the cost of some computations and the memory needed to store the mappings and the shared responses . The assumption that the shared information has a low rank dictates that should be small. It was shown that the best accuracy is obtained when the parameter is set to a value ranging from tenths to hundreds of features [6]. The importance of being small will be evident below.
The EM algorithm described in Equations (2)–(6) has several advantages. Equations (4) and (5) in the Mstep can be computed in parallel across subjects. Equation (6) is a summation over small matrices of size by , which also does not pose any challenge. In contrast, the Estep requires inversion of , a matrix of size by elements depending on the number of subjects and the numbers of voxels per subject . Computing and storing and its inverse is very challenging and sometimes may be nonviable. Because is symmetric, Cholesky factorization is the most stable method for computing its inverse. While requires memory, the Cholesky factorization requires additional memory and has a runtime complexity. For instance, when subjects with voxels each, the size of the matrix would be 500,000 by 500,000 and would require 20 GB for storing only, even before computing the Cholesky factorization.
IiC Reducing the Inversion
As discussed in Section IIB, the runtime and memory bottleneck is given by the inversion of the big matrix . We derive an analytical alternative formula to avoid computing , and hence reducing the computational runtime and memory usage. First, we apply the matrix inversion lemma to the first two terms in Equation (3):
(7) 
While Equation (3) requires inverting a matrix, Equation (7) supplants that by computing three much smaller inversions. The matrices and have a very small size, by , and their inversions can be computed fast with a Cholesky factorization on one machine. The matrix is a diagonal matrix and its inversion is computed by inverting its diagonal elements. Based on the orthogonality property of , i.e., , we note that
(8) 
where . Equation (8) helps in further reducing the number of matrix multiplications for computing (3).
Now, we derive an alternative formulation for in Equation (2), by applying the matrix inversion lemma as in (7) and some additional linear algebra steps:
(9) 
The result above requires the same three inversions needed in Equation (7). Therefore, the simplified computations in the Estep require only to invert dense matrices of by elements or a diagonal matrix with different values. Hence, the memory complexity is reduced from to , much less than originally needed to invert . It is worth noting that the required memory for these computations depends only on the number of features and is independent of the number of aggregated voxels in the data (where ).
IiD Reducing Communication for Distributed Computation
Recall that computing the Mstep is embarrassingly parallel assuming data is distributed by subject. However, the Estep requires centralized computation of the shared responses. Therefore, we study next the communications needed for a distributed version of the EM algorithm. Plugging Equation (9) into Equation (2) and rewriting it to partition by subject samples, we obtain:
(10) 
This equation allows us to identify what needs to be transferred to compute the shared responses in the Estep. To compute (10), we should transfer each to a master process and reduce the summation . Each transferred matrix (size ) is relatively small. The resulting (a matrix) should be broadcast to all processes for the Mstep computation. The matrices from (3) in the Estep are used for updating in (6) and computing the trace for in (5). Although broadcasting of such matrices is not limiting, we suggest avoiding it by updating in the master process by noting that
(11) 
and that . Therefore, we first compute in the master process and then broadcast its trace. This reduces the amount of communication from to . The distributed EM algorithm for SRM is summarized in Algorithm 1.
Iii Hierarchical Topographic Factor Analysis
In this section, we discuss Hierarchical Topographic Factor Analysis (HTFA), an interpretable model that can be used by neuroscientists analyze and visualize brain networks. As we did with the SRM, we apply both mathematical transformations and algorithmic optimizations to improve the performance of HTFA compared to the baseline published algorithm. We also present a scalable distributed implementation.
Iiia Model Overview
A cognitive state can be interpreted as a network of nodes and edges, representing regions of activity in the brain and their correlated activation. Topographic factor analysis (TFA) [4]
discovers these activity regions and their connectivity pattern through time. This method is based on a Bayesian model that describes an fMRI volume as a linear combination of factors, which are assumed to be Gaussian distributions. Each such distribution depicts a threedimensional sphere representing an activity region. For a given series of brain volumes, TFA uses variational inference to estimate the centers and widths of the factors and the linear combination weights.
HTFA [22] is a multisubject extension of TFA that further assumes that all subjects exhibit a global template of activity regions. Thus, every subject is described as a small perturbation of this global template. Let represent subject ’s data as a matrix with fMRI samples of voxels each and vectorized in ’s rows. Then, each subject is approximated with a factor analysis model
(12) 
where is an error term, are the weights of the factors in . The factor (row) in
represents a “sphere” or normal distribution with center
and width located relatively to the positions of the voxels in the volume.HTFA defines the local factors in as perturbations of the factors of a global template in . Therefore, the factor centers for all subjects are obtained from a normal distribution with mean and covariance . The mean represents the center of the global factor, while determines the distribution of the possible distance between the global and the local center of the factor. Similarly, the widths for all subjects are drawn from a normal distribution with mean , the width of the global
factor, and variance
. The model assumes that and are constants and the same for all factors. On top of the global parameters and , the model defines Gaussian priors for each, respectively, and . In addition, the columns of the weight matrices are modeled with a distribution and the elements in the noise term are assumed to be independent with a distribution. The associated graphical model is shown in Figure 2.IiiB Approximating the MAP Estimator
Based on this hierarchical model, the maximum aposteriori probability (MAP) estimator can find the centers and widths of the distributions. However, the exact estimator is hard to compute. Hence, Manning et al.
[4] suggested a black box variational inference technique to achieve an approximate estimate. The method consists of a global and a local step that iteratively update the parameters. The global step updates the parameters of the distributions in the global template. The local step updates for each subject the weight matrices , the local centers and widths . To update the parameters of the factors in , the local step solves(13)  
where is a subsampling coefficient defined below. Each row of
is a factor obtained by evaluating the radial basis function (RBF) with center at
and width(14) 
in positions for all the voxels in the threedimensional voxel space of the brain.
The objective function in (13) is nonlinear because of the definition of the factors in . Therefore, Manning et al. [4] propose to decompose the objective function as a sum of subfunctions and compute a solution using an unconstrained nonlinear least squares (NLLS) solver, implemented with a trustregion reflective method [23]. This is an iterative method that computes the Jacobian matrix of partial derivatives at each iteration. Hence, the size of the Jacobian matrix depends on the number of variables and the number of subfunctions defined in the problem. In the case of HTFA, there are variables: factors with a three dimensional center and a scalar width . The first term of the objective function containing the Frobenius norm can be described as a summation over subfunctions, each given by a squared element of the matrix . The second and third terms in (13) can be partitioned in subfunctions each. Summing up, there are functions, and the size of the Jacobian matrix is by . Although the Jacobian matrix is computed per subject and does not depend on the number of subjects , it still can be prohibitively large. For example, solving the problem for a subject with voxels, samples and for factors, this matrix will occupy 300 GB of memory. Manning et al. [4] mitigate this limitation by randomly subsampling the rows (TRs) and columns (voxels) from the data matrix , reducing it to samples by voxels ( and ). This requires adding the sampling coefficient to compensate between the weights of the first and other terms in (13).
After the subject centers and widths are updated via (13), we update the weights matrix for each subject by solving
(15) 
Problem (15
) admits a closedform solution of the form of ridge regression, i.e.,
. Due to the subsampling of the data matrix employed to update the factors of subject , only a subset of the rows in are updated. The weights and local parameters are updated solving (13) and (15), alternately, until a convergence criterion is met.Then, the hyperparameters of the global template priors are updated given the local estimates and under the assumption that the posterior has a conjugate prior with normal distribution. This yields the following formulas for the updating the hyperparameters
, , , and :(16)  
(17)  
(18)  
(19) 
where , . The above parameters , , , and are initialized to , , , and , respectively. The parameters , and are computed during initialization and remain constant through the iterations.
The local update step that solves (13) and (15) is embarrassingly parallel across subjects. Nevertheless, the NLLS solver is a runtime bottleneck because of its dependence on the big Jacobian matrix for the solver computations and because the problem is solved several times in each local update step. As the centers and widths of the local factors change when solving (13), the factor matrix is recomputed whenever the cost function in (13) is evaluated inside each iteration of the NLLS solver, adding to the runtime. Despite subsampling the data matrix , in terms of memory, the Jacobian matrix size dictates the complexity of the algorithm. In the following section, we describe optimizations targeting mainly the local updates step which (1) reduce the size of the Jacobian matrix, (2) use problemspecific assumptions to simplify the problem, (3) optimize the factors calculation, and (4) reduce the number of matrix inversions.
IiiC Optimizations to the MAP estimator
The computation complexity of NLLS solver makes it a direct target for optimization. We recall that the size of Jacobian matrix depends on the number of variables and the number of subfunctions. It is possible to partition the Frobenius norm term in (13) with a different subfunction granularity to reduce the number of subfunctions and hence, the size of the Jacobian matrix. However, the runtime does not necessarily reduce, as recomputing the Jacobian matrix in internal iterations of the NLLS solver requires the same number of operations independent of the granularity. Instead, we consider an alternative approach. We partition the variables into two blocks: (a) the centers and (b) the widths . We fix the values of the variables in block (b) and solve (13) only for the block (a). Then, we do the opposite and fix the variables in block (a) and solve for those in block (b). Although this block partitioning requires us to solve two subproblems with an NLLS solver, the computation complexity of the Jacobian matrix is smaller. The reasons are that the number of variables is reduced to the block size and subfunctions can be dropped as they are constant when fixing any of the blocks. In addition, this has the potential to reduce the number of steps to converge in the local updates, resembling a block coordinate descent approach.
The original algorithmic implementation utilizes an unconstrained NLLS solver. Since the model assumes that the location of each factor’s center is within the human brain, we add constraints to (13): the centers are bound to be within the brain volume and the value of widths are at least 4% of the brain diameter and at most 180% of it; these two parameters can be tuned for different datasets. This modification to the problem has the potential to speedup the solution by restricting the feasible region and maintaining the centers within the brain area in intermediate solutions of the NLLS solver. The unconstrained version may yield negative or zero width values in some iterations, leading to positive exponents in the RBF evaluation and hence to large values that might cause numerical overflow issues in the NLLS solver. Moreover, the contrained implementation may yield better solutions by avoiding factors of small sizes that represent a single noisy voxel. A solution to the constrained version of (13) can be obtained using a constrained NLLS. We opt for a trustregion reflective solver.
During the local factor and weight updates, each factor is evaluated to build the updated matrix explicitly. For this purpose, the value of the function is calculated for all the voxel locations . In particular, represents the positions of the voxels in a threedimensional grid. Because of the bounded size of the brain and the spatial contiguity of the voxels, the number of unique elements in each dimension () for all locations is much lower than the number of voxels. In addition, the Euclidean distance term in the exponent of the RBF can be expressed as . Therefore, we can reduce the number of subtractions and multiplications (i.e., the squares) for computing by caching the values for each coordinate , , or in a look up table. Assuming that the voxel locations are inside a cube of size , the effect of caching reduces from floating point subtractions and multiplications to for each factor.
The global posterior updates in Equations (16)–(19) require three matrix inversions per factor. Applying the matrix inversion lemma on these equations we obtain
(20)  
(21)  
(22)  
(23) 
where , and . In this manner, we reduce the number of inversions per factor in the global update step. The optimized version of the method is summarized in Algorithm 2.
Iv Parallel and Distributed Implementation
Our parallel implementations, DSRM and HTFA DMAP, both follow a similar mapreduce structure, which is shown in Figure 3. Both algorithms start by fitting local models in parallel for each subject (Step 2), then combining the local models with a gather or reduction (Step 3), and fitting a global model for all subjects (Step 4). The global model is then broadcast (Step 5) and the local models are refit using the updated global information. This process continues in an iterative manner for both algorithms. The columns on the right of the figure show a summary of the main computations, at each step, for both DSRM and HTFA DMAP. The following subsections provide further details of our parallel and distributed implementation and specific code optimizations.
Iva Singlenode Implementation
Our baseline SRM and HTFA implementations are written in Python and use the NumPy and SciPy libraries for matrix representations and operations. We use the Anaconda distribution of Python, which recently added support for using the Intel Math Kernel Library (MKL) to accelerate NumPy and SciPy in timeconsuming math operations such as matrix multiplication and matrix inversion. Our new optimized implementations continue to use Python for compatibility with other neuroscience tools which in turn use Python for productivity.
Some operations in NumPy and SciPy, such as matrix multiplication, run very efficiently on all cores automatically using Anaconda’s MKLbacked Numpy. However, we found that certain operations were running serially. These serial functions became the bottleneck in certain parallel configurations, so we wrote small OpenMPparallelized NumPy extension modules in C++ to speed them up. The specific functions we wrote are:
For HTFA we found that memory usage was increasing after every iteration, so we force garbage collection at the end of each local step using the Python gc module.
IvB Distributed Implementation
We use MPI to enable multinode parallelism. Subjects are distributed across MPI processes and we rely on parallelism within matrix operations to leverage multiple cores within a process. Communication between subjects maps directly to optimized MPI collective operations such as Reduce, Gather, and Bcast. We use these collectives exclusively for intersubject communication. We found that having multiple MPI processes (i.e., multiple subjects) per node was using cores more efficiently than having fewer MPI processes and more sharedmemory parallelism within a subject. As a result, we generally pack more subjects onto nodes until memory limitations arise.
In order to enable parallel I/O from distributed filesystems, we store each subject’s data in different files using NumPy’s standard binary format. With this approach, each MPI process loads its subject’s data in parallel with other processes.
V Experimental Setup
In this section, we describe the experimental setup, the real fMRI datasets we used for experiments, as well as the algorithm we designed to generate largescale synthetic fMRI data based on the real fMRI datasets.
Va Configuration
We run our experiments on the Cori Phase I machine at the National Energy Research Scientific Computing Center (NERSC) [24]. Cori Phase I is a Cray XC40^{1}^{1}1Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance, currently with 1630 compute nodes and a Cray Aries interconnect providing 5.6 TB global bandwidth. Each compute node contains two 16core Intel Xeon^{2}^{2}2Intel and Xeon are trademarks of Intel corporation in the U.S. and/or other countries. E52698 v3 CPUs running at 2.3 GHz, and 128 GB DDR4 2133 MHz memory. Cori Phase I also has a distributed Lustre filesystem with 130 router nodes and 248 I/O servers, capable of providing 744 GB/s in bandwidth. We use the Anaconda distribution of Python version 2.7.11 with the following packages: MKL 11.3 Update 1, SciPy 0.17.0, and NumPy 1.10.4. We use mpi4py version 2.0.0, compiled with Crayprovided wrappers.
For HTFA, we sample voxels and TRs as described in Section IIIB using random sampling with replacement. To select the number of subsamples, we use the mean between a percent of the data (25% for voxels, 10% for TRs) and a maximum number of samples (3000 for voxels, 300 for TRs). To estimate DSRM performance in Gflop/s, we count the number of floating point operations used by the four main computations in DSRM. For each subject and iteration, there are two matrix multiplies, one matrix multiply, and one economy SVD operation. For a lower bound estimate, we assume that the SVD performs the same number of floating point operations as Householder QR ().
VB Datasets
We consider the following real neuroimaging datasets:

raider: subjects received two different stimuli while in the fMRI scanner: the film “Raiders of the Lost Ark” (110 min) and a set of still images (7 categories, 8 runs) [25].

forrest: subjects listened to an audio version of the film “Forrest Gump” (120 min) [25].

greeneyes: subjects listened to an ambiguous story (15 min) [26].
An overview of the datasets is shown in Table I. All datasets have the same number of voxels per subject, except raider. The datasets were stored in either float64 or float32 format, but all data is cast to float64 before processing. All data is stored in rowmajor order.
Dataset  Subjects  Voxels  TRs  Size 

raider  10  3,000  2201  658 MB 
greeneyes  40  42,855  475  3.1 GB 
forrest  18  629,620  3535  299 GB 
We create a large synthetic dataset to test the weak scalability of DSRM and HTFA DMAP. We start from a real dataset and randomly permute the temporal order of the data for each voxel. To preserve the spatial structures of the brain, we spatially partition the voxels and use the same permutation for all voxels in a partition.
The precise approach we use for synthetic data generation is described in Algorithm 3. Each partition is filled with data from a different randomlychosen seed subject. Next, the data in each partition is permuted across TRs, using the same permutation within each partition. We use forrest as seed dataset to generate a new 1024subject synthetic dataset. We split each brain volume using voxel partitions. The resulting forrest1024 dataset is 17 TB in float64.
Vi Results
We examine the performance improvements of our optimized algorithms compared to the baseline versions. We measure the impact of the number of factors on performance. We test strong scaling using real datasets and weak scaling using the synthetic dataset. Finally, we present scientific results obtained using our optimizations.
Via Improvements compared to baseline implementations
We demonstrate the performance advantage of our our improved SRM code using raider and our improved HTFA code using 20 subjects from greeneyes. The results are shown in Tables II and Table III for a single node, and . ”Baseline HTFA” and ”Baseline SRM” are our own implementations of the currently published algorithms. The ”Baseline + Algorithmic Opt.” performance includes only linear algebra transformations and algorithmic improvements, which are described in Sections II and III. These optimizations achieved and speedups for SRM and HTFA respectively. Finally, DSRM and HTFA DMAP include code optimizations, such as custom parallel NumPy extensions, and acrosssubject MPI parallelism (20 MPI processes and 3 threads per process). These additional optimizations added and speedups for DSRM and HTFA DMAP, respectively.
Algorithm  Runtime (s)  Speedup 

Baseline SRM  2655.25   
Baseline SRM + Algorithmic Opt.  3.95  671 
DSRM  1.46  1812 
Algorithm  Runtime (s)  Speedup 

Baseline HTFA  15162.96   
Baseline HTFA + Algorithmic Opt.  361.34  42 
HTFA DMAP  153.84  99 
ViB Impact of number of factors on performance
Figure 4 shows the effect of the number of latent factors on runtime. For DSRM, the runtime is linear in for certain operations, such as two of the three matrix multiplications, and quadratic for other operations such as the SVD. For HTFA DMAP, the runtime is linear in for the global template updates. The computation of the Jacobian matrix depends on , the number of voxels and the number of TRs. Overall, for both algorithms we observed a super linear increase of runtime as grows and a relatively large increase for HTFA DMAP. For the remainder of the paper we fix to be 60, which was chosen based on prior neuroscience experiments using these algorithms [6, 4].
ViC Strong scaling
Figures 5 and 6 show strong scaling performance on greeneyes. We measure compute time of 10 iterations on setups ranging from 1 node with 32 cores to 40 nodes with 1280 cores in total. Note that the number of MPI processes we use must divide the number of subjects in the real dataset, 40.
For both DSRM and HTFA DMAP, we found that parallelizing across subjects using MPI was generally more effective in utilizing cores than relying on OpenMP parallelism to saturate the cores. This is only a consideration for datasets in which multiple subjects’ data can fit in one node’s memory. We experimented with different numbers of processes and threads and chose those that gave good performance. For example, on a single node, we use 20 processes with 3 threads per process (using HyperThreading), just like in the baseline comparison. For DSRM, this configuration achieved 225 double precision Gflop/s out of a theoretical peak of 1177.6 Gflop/s. For both applications, we saw improved performance from 1 to 20 nodes, however the scaling slowed significantly after 5 nodes. We attribute this behavior to (1) matrix operations having a tallskinny shape and therefore not scaling as well as more computebound square matrix operations, and (2) increasing parallel overheads for larger node counts on the relatively small greeneyes dataset. Note that performance drops when using 1 process per node with 32 threads instead of 2 processes with 16 threads, because of parallel overhead and NUMA effects.
ViD Weak scaling
We test DSRM and HTFA DMAP weak scaling on forrest1024. For DSRM, we processes two subjects per node, and for HTFA DMAP we process one subject per node, which is driven by memory limitations. The results in Figure 7 show that compute time per subject (excluding I/O) rises by only for DSRM from 1 node to 512 nodes and for HTFA DMAP from 1 node to 1024 nodes. Disk I/O takes a significantly larger portion of total time for DSRM compared to HTFA DMAP, because DSRM spends relatively less time computing than HTFA DMAP while having to load the same amount of data.
Vii Discussion
The computation and memory requirements of multisubject analysis techniques, such as SRM and HTFA, are largely driven by the dimensionality of the data. Namely, neuroimaging data tends to have extremely high dimensionality per sample (e.g., 600,000 voxels), compared to relatively few samples (e.g., 3,500 TRs). This extreme skew in the data dimension makes linear algebra transformations and algorithmic optimizations potentially very profitable, because we have the potential to change the computational complexity from something that depends on the number of voxels, to something that depends only on the number of samples or latent factors. The matrix inversion lemma does this, and we applied it to both SRM and HTFA. The shape of the data also means that our optimized implementations mainly operated on socalled ”tallskinny” matrices, where the number of rows is much larger than the number of columns. This had a direct impact on our performance and our ability to scale to multiple cores within a node.
The computational structure of the multisubject analysis techniques we studied was very amenable to multinode parallelism. This also follows directly from the nature of the models we are fitting. Both models allow for a global (acrosssubject) representation of neural activity, and a local (persubject) configuration. The algorithms to fit the models will alternate between fitting local models with the global model fixed, and vice versa. This translates directly to mapreduce style parallelism. In the case of HTFA and SRM, fitting the global model was relatively inexpensive compared to fitting local models, which helped scalability. However, other multisubject modelfitting algorithms may not have this feature.
Viii Conclusion
We have introduced highlyparallel codes for two applications of factor analysis on multisubject neuroimaging datasets, SRM and HTFA. We demonstrated good singlenode performance and scaling on a stateoftheart supercomputer, up to 32,768 cores. This makes the discovery of latent factors present in large subject populations practical, allowing subjectspecific and shared responses to stimuli to be readily identified and interpreted. To achieve these results, we created highlyoptimized implementations of the factor analysis algorithms starting from validated baseline implementations. For DSRM, we achieved a speedup over the baseline code by deriving an alternate formula for the large matrix inversion in the critical Estep of the EM algorithm, reducing distributed communications during the Mstep through changes to hyperparameter computation, an d explicit nodelevel parallelization with MPI. For HTFA DMAP, we achieved 99 speedup over the baseline on a medium size dataset by splitting the variables of the nonlinear least squares problem to reduce the size of the Jacobian matrix and converge faster, using cachebased table lookups for distance calculations, reducing the number of matrix inversions, and explicit garbage collection. Our experience with these two applications allowed us to identify several key performance engineering recommendations that apply to factor analysis of multisubject neuroimaging data in general.
We expect that our work will enable neuroscientists to develop populationbased denoising techniques, design studies to better understand similarities and differences in cognitive function within populations, and perform more sophisticated studies of human interaction. To this end, we are releasing our code as opensource software on GitHub to allow the use our optimized versions of these algorithms on a supercomputer while retaining the programming benefits of Python.
References
 [1] “Announcing the hcp 900 subjects data release,” http://humanconnectome.org/about/pressroom/projectnews/announcingthehcp900subjectsdatarelease/, accessed: 20160404.
 [2] S. Huettel, A. Song, and G. McCarthy, Functional Magnetic Resonance Imaging. Freeman, 2009. [Online]. Available: https://books.google.com/books?id=BNhMPgAACAAJ
 [3] H. P. O. de Beeck, “Against hyperacuity in brain reading: Spatial smoothing does not hurt multivariate fmri analyses?” NeuroImage, vol. 49, no. 3, pp. 1943 – 1948, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S105381190900202X
 [4] J. R. Manning, R. Ranganath, K. A. Norman, and D. M. Blei, “Topographic factor analysis: a bayesian model for inferring brain networks from neural data,” PLoS One, vol. 9, no. 5, p. e94914, 2014.
 [5] S. J. Gershman, D. M. Blei, F. Pereira, and K. A. Norman, “A topographic latent source model for fmri data,” NeuroImage, vol. 57, no. 1, pp. 89 – 100, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1053811911004411
 [6] P.H. C. Chen, J. Chen, Y. Yeshurun, U. Hasson, J. Haxby, and P. J. Ramadge, “A reduceddimension fmri shared response model,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 460–468. [Online]. Available: http://papers.nips.cc/paper/5855areduceddimensionfmrisharedresponsemodel.pdf

[7]
R. Rustamov and L. Guibas, “Hyperalignment of multisubject fMRI data by
synchronized projections,” in
NIPS Machine Learning and Interpretation in Neuroimaging workshop
, 2013.  [8] J. Talairach and P. Tournoux, Coplanar stereotaxic atlas of the human brain. 3Dimensional proportional system: an approach to cerebral imaging. Thieme, 1988.
 [9] B. Fischl, M. I. Sereno, R. Tootell, and A. M. Dale, “Highresolution intersubject averaging and a coordinate system for the cortical surface,” Human brain mapping, vol. 8, no. 4, pp. 272–284, 1999.
 [10] J. Mazziotta, A. Toga et al., “A probabilistic atlas and reference system for the human brain,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 356, no. 1412, pp. 1293–1322, 2001.
 [11] M. Brett, I. S. Johnsrude, and A. M. Owen, “The problem of functional localization in the human brain,” Nat Rev Neurosci, vol. 3, no. 3, pp. 243–249, 03 2002.
 [12] B. R. Conroy, B. D. Singer, J. V. Haxby, and P. J. Ramadge, “fMRIbased intersubject cortical alignment using functional connectivity,” in Advances in Neural Information Processing Systems, 2009.
 [13] B. R. Conroy, B. D. Singer, J. S. Guntupalli, P. J. Ramadge, and J. V. Haxby, “Intersubject alignment of human cortical anatomy using functional connectivity,” NeuroImage, 2013.

[14]
V. Calhoun, T. Adali, G. Pearlson, and J. Pekar, “A method for making group inferences from functional MRI data using independent component analysis,”
Human brain mapping, vol. 14, no. 3, pp. 140–151, 2001.  [15] O. Friman, J. Cedefamn et al., “Detection of neural activity in functional MRI using canonical correlation analysis,” Magnetic Resonance in Medicine, vol. 45, no. 2, pp. 323–330, 2001.
 [16] J.H. Lee, T.W. Lee, F. A. Jolesz, and S.S. Yoo, “Independent vector analysis (IVA): multivariate approach for fMRI group study,” Neuroimage, vol. 40, no. 1, pp. 86–109, 2008.
 [17] T. Adali, M. Anderson, and G.S. Fu, “Diversity in independent component and vector analyses: Identifiability, algorithms, and applications in medical imaging,” Signal Processing Magazine, IEEE, 2014.
 [18] Y.O. Li, T. Adali, W. Wang, and V. D. Calhoun, “Joint blind source separation by multiset canonical correlation analysis,” Signal Processing, IEEE Transactions on, vol. 57, no. 10, pp. 3918–3929, 2009.
 [19] J. V. Haxby, J. S. Guntupalli et al., “A common, highdimensional model of the representational space in human ventral temporal cortex,” Neuron, vol. 72, no. 2, pp. 404–416, 2011.
 [20] A. Lorbert and P. J. Ramadge, “Kernel hyperalignment,” in Adv. in Neural Inform. Proc. Systems, 2012.
 [21] D. B. R. A. P. Dempster, N. M. Laird, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977. [Online]. Available: http://www.jstor.org/stable/2984875
 [22] J. R. Manning, R. Ranganath, W. Keung, N. B. TurkBrowne, J. D. Cohen, K. A. Norman, and D. M. Blei, “Hierarchical topographic factor analysis,” in Pattern Recognition in Neuroimaging, 2014 International Workshop on, June 2014, pp. 1–4.
 [23] T. Coleman and Y. Li., “An interior, trust region approach for nonlinear minimization subject to bounds,” SIAM Journal on Optimization, vol. 6, p. 418–445, 1996.
 [24] “Cori configuration,” http://www.nersc.gov/users/computationalsystems/cori/configuration/, accessed: 20160403.
 [25] M. Hanke, F. J. Baumgartner, P. Ibe, F. R. Kaule, S. Pollmann, O. Speck, W. Zinke, and J. Stadler, “A highresolution 7tesla fmri dataset from complex natural stimulation with an audio movie,” Scientific Data, Jan 2014.
 [26] Y. Yeshurun, S. Swanson, J. Chen, E. Simony, C. Honey, P. C. Lazaridi, and U. Hasson., “How does the brain represent different ways of understanding the same story?” Society for Neuroscience Abstracts, 2014.
Comments
There are no comments yet.