1 Introduction
2 Related Work
Our work can be considered as an example of comparative visualization [9]. In this respect, the most closely related works are the Tender glyph, which explicitly encodes differences between two symmetric positive definite tensors [10], and recent followup works that have extended this encoding and combined it with complementary visualizations to enable visualization of tensor field ensembles [11], as well as a visual comparison of groups of tensor fields at multiple levels of detail [12]
. Unlike this existing approach, which focuses on a visual assessment, our goal is to closely integrate visual with quantitative statistical analysis. This represents a significant extension of our own previous work on visualizing tensor normal distributions
[13], which did not include any mechanisms for group comparison. Our integrated approach will be motivated in more detail in Section 3 and leads to new and different design choices.Seen in a wider context, our work contributes to addressing an urgent need for visual analytics technology that helps us to handle the increasing volume, variety, and velocity of image data that is used in medical and biological studies. Other recent examples include a series of works for interactive visualization and segmentation proofreading in connectomics research [14], tools that support navigation of parameter spaces for automated image analysis [15, 16, 17], a tool by Klemm et al., which supports hypothesis generation in epidemiology by integrating imagebased with nonimage data [18], and one by Hermann et al., which allows biologists to explore differences in anatomical shapes based on interactively defined groups of 3D images [19].
Our proposed tool makes use of several stateoftheart techniques for diffusion tensor analysis, including for spatial normalization [20, 21], interpretable reparametrization [22], and cluster enhancement of the resulting statistical maps [23]. Details on these will be provided as needed throughout the paper.
3 Finding Group Differences in Tensor Fields as a Visual Analytics Problem
Our current work arose out of a previous collaboration between some of the computer science and clinical coauthors, which resulted in a new statistical and machine learning technique
[24], and in findings that have been reported in clinical journals, e.g., [25]. Indepth discussions about the statistical analysis of tensor fields led us to conclude that there is a need for a visual analytics approach that tightly integrates formal quantitative with visual analysis. We will provide some theoretical background and intuition in Section 3.1, followed by a requirements analysis in Section 3.2.3.1 Background on Multivariate Hypothesis Testing
A direct visual encoding that supports the comparison of groups of tensor fields has been proposed very recently [12]. When applied to clinical data, such a strategy will show at least small differences everywhere: Due to noise and other random effects, even two very similar groups of healthy adults are very unlikely to be exactly the same. A benefit of integrating visual and statistical analysis, which is an important goal of our system, is that it allows us to focus on differences that are likely to be due to the studied disease.
Intuitively, statistical hypothesis tests decide on the significance of specific differences by comparing their magnitude to the amount of variance in the data. This is illustrated in Figure 2: The red regions in the left part, in which statistically significant differences aligned with FA have been detected, have a clear correlation with regions of strong group difference, which is color coded in the central part. However, there are regions with a strong group difference, some of them highlighted with blue arrows, that are not significant in a statistical sense. This happens where the variance, color coded in the right part of the figure, is so large that even large group differences might have arisen due to chance.
Most clinical studies perform hypothesis testing on only one or two scalar values derived from the tensor field, such as Fractional Anisotropy (FA) or Mean Diffusivity (MD). To illustrate the need for multivariate hypothesis tests, Figure 3 presents two synthetic tensor fields. It is obvious from their visualization using superquadric glyphs [26] that the two fields differ at all locations, with respect to all six degrees of freedom in symmetric tensors: Reduced overall magnitude, as measured by reduced Frobenius norm, i.e., norm of eigenvalues (1); reduced degree of directional dependence, as measured by reduced FA (2); change from a more linear towards a more planar type of anisotropy as measured by reduced tensor mode, while keeping FA constant (3), as well as rotations around three orthogonal axes (4–6).
In this example, the widely used univariate hypothesis test based on FA would be blind to all differences except those in region 2. In contrast to this, a multivariate test that accounts for all available information in the tensor should be able to detect all different types of variation. Accordingly, multivariate testing is often applied in the hope that it will be able to detect more and larger regions of significant differences. For example, as a key result of their proposed multivariate Cramér test, Whitcher et al. report that they observed a 169% increase “in the volume of a significant cluster compared to the univariate FA test” [6].
Disappointingly, in our own experiments on clinical data, we often observed a decrease, rather than an increase of the overall number of significant voxels when replacing the widely used univariate tests with its multivariate counterpart, Hotelling’s test. An example is shown in Figure 4. The tests will be explained in more detail in Section 4.2.1. For now, we observe on the left that clear and extended clusters (red) have been detected when looking for changes along one specific axis in tensor space, namely changes in Fractional Anisotropy (FA). One might hope that a test that accounts for the full tensor information would highlight the same regions, plus others. Unfortunately, it is clear from the result on the right that this is not the case. We note that even the 169% increase in volume reported in [6] pertains to a single selected cluster; the authors do not report whether, and by how much, the overall volume of all clusters in their data increased.
To better understand why, depending on the data characteristics, a multivariate test may or may not be more powerful than a simple univariate test, Figure 5
presents two toy examples. In the first one, the blue squares are sampled from a multivariate Gaussian distribution whose center is at the top right compared to the center of the distribution from which the red circles are taken. Variance along both axes is so high that, with a univariate
test, neither horizontal nor vertical distance alone is sufficient to detect a significant difference ( and , respectively). In contrast, by accounting for the differences in both dimensions simultaneously, a Hotelling test achieves a clearly significant result ().In the second example, the center of the blue squares is to the right compared to the red circles, but at the same height. Here, a test along the horizontal axis shows a significant difference (), but the Hotelling test fails to produce a significant result (). The reason is that, in this case, including the second dimension adds pure noise, which reduces the power of the test. Of course, if we increase the number of samples sufficiently, also the multivariate test will indicate a significant difference. However, due to the fact that many clinical studies face difficulties in finding a large number of patients that can be included, and due to the high cost of brain imaging, we have to carefully pick our null hypothesis in a way that the resulting test will be sensitive enough even for a very limited number of samples.
When comparing tensor fields, the number of possible null hypotheses is immense, especially if we consider not just individual tensor properties, but also null hypotheses that combine several degrees of freedom, such as “The two groups do not differ with respect to the amount or type of anisotropy” (which requires a 2D multivariate test) or “The two groups do not differ with respect to orientation” (which requires a multivariate test in 3D orientation space). Since the choice of sensible null hypotheses should be guided both by the data and by the domain knowledge of the medical expert, we consider visual analytics as the only viable approach.
3.2 Requirements Analysis
Based on our initial experiments with multivariate hypothesis testing and discussions with our clinical coauthors, we identified the following six requirements that our visual analytics system should meet:

The system should implement stateoftheart techniques for statistical hypothesis testing. It should make it easy to steer them, and to explore and interpret their results. The need for a close integration of statistical methods results from the fact that publishing findings from group studies in clinical journals requires a formal and quantitative data analysis.

The system should offer the twodimensional slice views that clinical researchers are used to and that are frequently found in clinical publications.

The system should also provide threedimensional views when it helps assess threedimensional anatomical structures, e.g., via fiber tracking.

To facilitate interpretation of the results from multivariate hypothesis tests, the system should include visualizations that reveal which tensor properties differed most substantially between groups, and whether different properties were correlated.

In case of findings that might be surprising and difficult to explain, the system should support investigating the potential role of residual misalignment from subject normalization.

The system should support the direct comparison of the results from different null hypotheses, to judge the extent to which the corresponding regions might spatially overlap.
4 Visual Analytics of Tensor Field Group Differences: A Practical Framework
4.1 Preprocessing
In order to compare all tensor fields on a pervoxel basis, we first have to bring them into spatial correspondence by nonlinear registration to a common template. As part of this process, tensors were rotated according to the incurred changes of their frame of reference. We used the publicly available Diffusion Tensor Imaging ToolKit (DTITK) [20, 21] to achieve this.
Our visual analytics framework was implemented in C++ with Qt for the user interface, Teem (teem.sf.net) for standard tensor visualization including fiber tracking, and OpenGL for 3D graphics. Data preprocessing, such as transforming all tensors into the IGRT basis (Section 4.2.1), or computation of means and covariances of each subgroup, was performed using inhouse Python scripts.
4.2 Steerable Statistical Hypothesis Testing
As an important aspect of requirement R1, the user should be able to steer statistical hypothesis tests on tensor fields in a meaningful way. We achieve this by combining the multivariate Hotelling test with a specific reparameterization of the tensor field [22]. Both techniques have been proposed previously, but combining them to achieve steerable hypothesis testing is one of our contributions.
4.2.1 Multivariate Testing with Meaningful Projections
The most common way to identify group differences in imaging studies is mass univariate testing. This amounts to running a large number of statistical tests, each accounting only for a single value at a single location of the brain. Spatially mapping locations where the null hypothesis was rejected highlights regions in which the groups differ in a statistically significant manner [3].
Tensors are intrinsically multivariate, reflecting not just a single property (such as amount of anisotropy), but also the overall amount of diffusion, type of anisotropy, and preferred diffusion directions. Our framework accounts for this by testing more complex null hypotheses, stating that the mean tensors in the two groups are the same with respect to multiple or even all their properties. The Hotelling test provides the corresponding extension of the widely used test [27]. Its statistics is defined as
(1) 
where ,
are group mean vectors and
is a pooled estimate of the covariance matrix, , where and are the number of subjects and sample covariance for group , respectively. To apply this test to symmetric diffusion tensors with coefficients , we embed them isometrically into by setting(2) 
As we demonstrated in Section 3.1, blindly applying a multivariate test to all available degrees of freedom can lead to a loss of sensitivity due to a swamping with noise. To avoid this, our visual analytics framework allows the analyst to interactively formulate null hypotheses that relate to a meaningful subset of the information contained in the tensor fields, and that deliberately exclude other aspects. Unfortunately, the components in Eq. (2) are not suitable for this task, since they depend on the chosen frame of reference, and lack an intuitive interpretation.
We address this by expressing tensors in the socalled IGRT (“invariant gradient and rotation tangent”) basis, a local orthonormal basis that can be constructed from the gradients of tensor invariants such a tensor norm and Fractional Anisotropy, and from the tangent vectors corresponding to infinitesimal rotations [22]. In particular, after rotating difference vectors between individual tensor values and the grand mean into the IGRT basis constructed at , we can interpret their six rotated coefficients as changes in tensor value related to

Changes in Frobenius norm ()

Changes in the amount of anisotropy, as measured by

Changes in the type of anisotropy, as measured by

Rotations around the three eigenvectors
We allow the analyst to build interpretable null hypotheses by interactively selecting arbitrary combinations of these six degrees of freedom, which are illustrated in Figure 3. Two properties of the IGRT basis are particularly relevant to its use in multivariate hypothesis testing, which goes beyond its previous use in visualization [13]:
First, coefficients in the rotated difference vectors denote differences in tensor value (associated with changes in norm, FA, etc.), and thus all have the physical units of diffusivity. This means that they are measurements on a common scale, which is useful for combining them into a single multivariate test. However, it also implies a – somewhat subtle – mathematical difference between our framework, which tests FArelated tensor changes, and the more widely used practice of testing precomputed FA values.
Second, Kindlmann et al. [22] describe two variants of the IGRT basis: The one that was used in [13] is derived from a cylindrical coordinate system, whose invariants are written as and include tensor trace, which is proportional to the widely used Mean Diffusivity (MD). In this work, we use the second one, which is derived from a spherical coordinate system, written as , and includes the widely used Fractional Anisotropy (FA). Unfortunately, the gradients of MD and FA are non orthogonal, making it impossible to combine both into a common orthonormal IGRT frame. We have chosen the one that includes FA rather than MD to facilitate a direct comparison with a previous work that has tested FA on the same data [25].
4.2.2 Cluster Enhancement
Requirement R1 states that our statistical tests should conform to the state of the art in neuroimage analysis. This requires addressing the problem of multiple comparisons.
When designing a statistical hypothesis test, we have to set its type I error rate
. In our context, a type I error amounts to reporting a group difference in a region where there appears to be one in the sample – e.g., due to noise or due to the specific choice of subjects – even though there is none in the underlying populations. In neuroimaging, it is widely accepted to use .Performing a statistical test in a huge number of voxels greatly increases the familywise error,
i.e., the probability that the null hypothesis will be falsely rejected in at least one voxel. This could be addressed by reducing the
of each individual test, a strategy known as Bonferroni correction. However, this approach is overly conservative when nearby voxels are correlated, as it is typically the case in practice, and leads to an unnecessarily drastic reduction of that no longer allows us to detect any of the true group differences.A common way to correct for familywise errors while preserving a useful amount of sensitivity is to rely on the size of contiguous regions in which an effect is observed, since large regions are less likely to occur due to chance [28]. Our framework implements thresholdfree cluster enhancement (TFCE) [23], a stateoftheart variant of this idea. TFCE automatically tries out a wide range of possible thresholds to form contiguous regions, then sums up the volumes of the clusters that each voxel belongs to at all these thresholds, so that voxels that belong to larger neighborhoods receive greater values without the need to select any specific cluster forming treshold.
Usually, TFCE would be used within the framework of a permutation based statistical test [29]
, which involves computing the test statistic and its TFCE transformation for a large number of randomly permuted group labels. Based on this procedure, one can determine a value at which the TFCE map should be thresholded to yield a statistical test with a given familywise error rate. Unfortunately, permutation based testing can take many hours on a standard workstation, and is thus much too timeconsuming for use in an interactive visual analytics framework.
Within an exploratory analysis, which is the main task of our tool, it is anyway quite common to look at results not just for a single setting of , but to explore different “levels of significance”. Therefore, we allow the analyst to threshold TFCE maps at arbitrary values, guided by a cumulative histogram that visualizes how the number of superthreshold voxels depends on the threshold value. Once the analyst has found a set of interesting clusters in an interactive session and would like to determine their exact familywise error corrected values, he or she can run the permutation tests offline. An important benefit of our framework is that it allows him or her to get a quick impression of the results for many potential hypotheses, and to easily discard those that are obviously fruitless.
4.3 Spatial Views
After specifying and running a hypothesis test, the analyst would like to investigate the anatomical location, spatial extent, and shape of the regions in which a group difference was detected. According to requirement R2, axisaligned slice views play a prominent role in our user interface and in many of our results. Our slices follow the radiological convention, i.e., the patient’s right hemisphere is shown on the left hand side of each picture. Labels for left and right are included as a reminder of this (e.g., see Figure 6).
Statistical analysis involves a spatial warping of the involved subjects to a common reference space. By default, anatomical context for the detected regions is provided by superimposing them on averaged anatomical MR images that have been transformed into the same reference space. On demand, a more exact assessment of the affected tract can be made by displaying an XYZRGB color encoding of the principal diffusion direction [30].
In agreement with requirement R3, our system also supports seeding a streamlinebased tractography algorithm [31] in an affected region, and provides a threedimensional view of the result, with orthogonal slices as optional anatomical context. In order not to bias the tractography towards any of the involved subjects, we run it on an average of all coregistered tensor fields. The average is taken in LogEuclidean space [32], which is known to minimize blurring [33]. We found that this strategy preserves the characteristic shapes of the major bundles. For example, parts of the forceps minor, anterior thalamic radiation, inferior frontooccipital fasciculus, and uncinate fasciculus are wellrecognizable in Figure 7.
4.4 Table View
In neuroimaging, contiguous sets of voxels in which a significant difference was detected are referred to as clusters. Quickly and objectively assessing the number and volume of all detected clusters is part of requirement R1, and we support it by displaying this quantitative information in a table, sorted by cluster size. This table can be seen in the immediate neighborhood of the slice views in Figure 1. It also includes spatial information (center of gravity of each cluster) and is linked to the slices, so that clicking on a table row moves the slices to the respective cluster center.
To highlight and better distinguish clusters of particular interest in the spatial views, the user may assign a color to them in the table view. In this mode, unselected clusters are shown in gray, to make them less visually salient than the more relevant ones, while still distinguishing them from the rest of the brain and the background. An example can be seen in the inset pictures in Figure 6.
4.5 Scatter Plot Matrix
The spatial views indicate where a userdefined null hypothesis has been rejected, but not for which reason, which is the main concern of requirement R4. For example, when viewing the results of the null hypothesis “The two groups do not differ with respect to any tensor property,” the analyst will want to drill down into the exact nature of the detected group differences, which requires viewing all local properties of all involved subjects.
In tensor field visualization, glyphs are commonly used to show all available information at a given location [34], and our system offers a glyph view that will be discussed in the next subsection. Even though glyphs have been proposed that simultaneously visualize variations in tensor scale, shape, and orientation [11], these glyphs are based on aggregating some of the available information. In particular, they do not encode correlations between different tensor properties, which is part of requirement R4.
To convey the available information in full detail, and in a way that enables the investigation of correlations, our system employs a scatter plot matrix, whose six dimensions reflect the IGRT embedding that was introduced in Section 4.2.1, so that they remain interpretable. As shown in Figure 6, 15 scatter plots visualize the correlations between any pair of IGRT axes. Each subject is shown as a point, and colored according to its group (here: patients red, healthy controls blue). For example, the top left scatter plot in Figure 6 shows a strong negative correlation between changes in Frobenius norm and Fractional Anisotropy. When the mouse pointer hovers over a plot, the exact Pearson correlation coefficient is displayed as a tooltip.
On the diagonal of the scatter plot matrix, we use boxplots to show the distributions along the respective axis in the two groups. This immediately reveals which axes exhibit a strong group difference (in Figure 6, norm and FA), and for which others the groups overlap. Here, tooltips contain the results of a univariate test, answering the question if the respective dimension alone would have been sufficient for statistical significance.
The scatter plot matrix can be used in two ways: The first is simply to browse the data, by clicking on any voxel in a spatial view. The scatter plot matrix then displays the local tensor data from all subjects. The second is for a posthoc analysis of spatial clusters which were previously highlighted by a hypothesis test. In this case, the scatter plot matrix reflects the data from all subjects after averging over the respective cluster. Special care has to be taken when averaging values from axes 4–6, which correspond to the rotation tangents. Since they are derived from eigenvectors, which lack an intrinsic orientation, their signs could flip randomly from one voxel to the next. Therefore, we average the absolute values in these cases. The results still reflect the amount, but no longer the direction of rotations.
4.6 Small Multiple Glyph View
Even though our clinical coauthors were quickly convinced that the scatter plot matrix excels at showing correlations, they pointed out that it remains rather abstract. Especially for requirement R5, i.e., the assessment of residual misalignments or cases of failed image registration, they still preferred to see a more traditional tensor glyph visualization.
Our system provides a small multiple glyph view, shown in Figure 8, which contains one tensor glyph per subject. In contrast to recently proposed glyphs that summarize groups of tensor fields [11], this immediately reveals the identity of potential outliers. This is important so that the analyst can reattempt a failed registration with modified parameters or, if no reasonable registration can be achieved, exclude the affected subject from the study.
Each subwindow is labeled with the respective subject ID, and all camera parameters are coupled, so that glyphs can easily be compared from different viewing directions. Figure 8 shows a voxel near the boundary between white matter and the corticospinal fluid. The considerable variation in tensor norm and amount of anisotropy that is evident from the glyphs is a telltale sign of an imperfect registration. In this case, the glyph view led the analyst to disregard a statistical difference that had been detected in this voxel.
4.7 Comparative View
To address requirement R6, our framework supports the comparison of results from different tests, thresholds, or levels of data smoothing. After running a test, optionally with cluster enhancement, one can save the resulting binary mask to a file. Up to three different test results can then be loaded into an overlay map, shown in Figure 9.
It is a difficult task to visually encode the overlap of different classes in a manner that is intuitive and easy to interpret. After trying out several alternatives, we chose an encoding that has been proposed in a very different application context, namely, in the “splatterplot” approach to overcoming overdraw in scatter plots [35]. This encoding combines specific rules for color blending and modulation with contouring.
Colors that encode regions of overlap are obtained in two steps: First, the colors that represent the overlapping classes are averaged in the CIE Lab color space to obtain a color that has approximately the same perceptual distance to all involved classes. Second, in order to more clearly indicate the presence of multiple classes, lightness and chroma of the mixed color are attenuated in the LCH color space, with an increasing effect for an increasing number of overlapping classes. The exact equations can be found in [35], and the result can be observed in Figure 9. To further facilitate interpretation of the overlay map, our user interface displays a Venn diagram as a color legend.
Despite the advanced color blending and additional visual cues from contours, which we draw in slightly darker shades of the respective color, we came across cases in which the underlying maps were so complex that trying to understand the exact relationship of all three from a single image remained challenging. For these cases, our interface allows the user to temporarily hide some of the classes in order to build a better understanding in an interactive and iterative manner.
The exact number of voxels included in each mask, and the number of voxels in the overlapping regions, are displayed as a tooltip of the Venn diagram. We also update the table view to reflect the connected components of the regions where the results of all three tests overlap. Specific use cases, and interpretation of Figure 9, are discussed in Sections 5.3 and 5.4.
5 Results And User Feedback
We designed, implemented, iteratively refined and extended the presented system over a twoyear period during which we accounted for feedback from our clinical coauthors. Feedback from a case study is reported in Section 5.1. In the remainder of this section, we report specific results that have been achieved with our system. We focused on 56 diffusion tensor fields from a clinical study of systemic lupus erythematosus (SLE), for which we had previously reported results from a standard analysis in a medical journal paper that also provides detailed information about the data acquisition, clinical parameters, and criteria for inclusion [25]. Among the 56 subjects, 38 suffered from the disease, 19 of them with neuropsychiatric symptoms (NPSLE), 19 without (nonNPSLE). The remaining 18 subjects were in a healthy control group.
5.1 Feedback From Case Study
We had a twohour inperson meeting in which we provided a live demo of the system to one of our collaborators (TSW) and went through several examples together, discussing both the design of our software and the clinical implications of our findings.
Our collaborator particularly liked the fact that our system tightly integrates different techniques for statistical and quantitative data analysis with visualization, which is not achieved in the standard software packages available to him, and which encourages data exploration and hypothesis generation. He considered this helpful for DTI analysis, particularly highlighting the ability to easily assess correlations between different tensor properties, the option to directly seed tractography from regions in which significant differences were found, and the fact that the system displays quantitative results on demand, e.g., scores and values as tooltips in the SPLOM. He also emphasized that he would like to see similar systems for other modalities, such as functional MRI or Voxel Based Morphometry.
In preparation for the meeting, we reorganized our original user interface by placing controls that we considered to be less crucial to the main workflow into an optional “advanced” view, as indicated in Figure 1. Our collaborator found that, despite this effort, a certain level of complexity remained, but he thought it was commensurate with the complexity of the analysis task and told us that a certain amount of training is also needed to become familiar with the standard software packages in the field. He found the structure of the interface logical, and he asked for a few changes that we made in the final version, e.g., by adding the glyph view described in Section 4.6.
5.2 Hypothesis Generation on Clinical Data
As it was explained in Section 4.2.1, our framework tests changes in tensor values associated with certain tensor invariants or rotations. This makes a univariate test of Fractional Anisotropy (FA) in our tool similar, but not equivalent to traditional hypothesis tests, which are based on changes in (precomputed) FA. Therefore, comparing the results to those from a traditional analysis can serve as a valuable initial validation. Figure 10 (a) shows the result from comparing healthy controls to NPSLE patients in our framework, based on FArelated changes and thresholdfree cluster enhancement. The detected regions are remarkably similar to those from our previous analysis of the same data with standard methods [25]. Seeding tractography in them also reveals most of the fibers that have been reported in a previous tractographybased study [36] and have been ranked highly in terms of their predictiveness [24], including the genu of the corpus callosum, left and right inferior frontooccipital fasciculus, and left uncinate fasciculus (see also Figure 7).
The remaining subfigures of Figure 10 show the results from other univariate hypothesis tests that are enabled by our framework, and for which no similar results have been reported previously. Subfigure (b) shows regions in which the Frobenius norm of the diffusion tensors is increased in the patients, indicating increased overall diffusivity. The largest cluster (yellow arrows) is immediately adjacent to, and partly overlapping with, the previously detected changes in FA. Only a small part of the second largest cluster is visible in Figure 10 (b) (blue arrows). It is directly adjacent to a ventricle, and further inspection with the glyph view indicates that this second cluster might be caused by a difference in ventricle sizes between the groups, which might have been compensated only partially by the registration [37].
Subfigure (c) shows a cluster in which tensor mode is decreased in the patients, which means that diffusion is restricted more to a plane than a line, and which might indicate a larger degree of fiber spread in the patients. A tractography result seeded in this cluster indicates that the difference extends over parts of the forceps major and inferior frontooccipital fasciculus.
The only result we observed when testing for differences in orientation was a cluster in the corona radiata of the right hemisphere, where tensors are systematically twisted (rotated around their principal eigenvector) in the patient group compared to the controls. It is unclear whether this is an artifact of the specific sample or the registration algorithm, or if it corresponds to a true difference between the populations, especially given that the cluster is small.
In summary, our framework was successfully used to produce new hypotheses that should be investigated further in future clinical studies.
5.3 Exploring Results of Multivariate Tests
All experiments in the previous subsection used univariate hypothesis tests. To illustrate use of the scatter plot matrix to investigate the results from a multivariate test, we looked for any differences in tensor shape (i.e., in the subspace spanned by norm, FA, and mode) between the healthy controls and the nonNPSLE patients, again using thresholdfree cluster enhancement (TFCE).
As it can be seen in Figure 6, this results in several mediumsized clusters. Three of them, shown in red, green, and purple in the inset slice views, exhibit significantly reduced FA ( in the largest, red cluster), and increased Frobenius norm () in the patient population, with a strong correlation between both measures (). Again, these might be due to differences in ventricle size. In the blue cluster, FA is strongly decreased in nonNPSLE patients (), but norm is less affected (), and the correlation is much weaker (). Decreased FA in the same region was previously reported based on standard methods [25]. In addition, our framework allows us to observe that this difference goes along with a reduction in mode ().
One question that the comparative view from Section 4.7 allows us to answer is how the results from two univariate tests, e.g., ones that compare NPSLE patients to healthy controls based on FA or tensor mode, differ from a single multivariate test that combines both. Figure 9 shows the corresponding overlay.
As it was mentioned in Section 4.2.2, our software does not convert the results from TFCE to values, since this would require time consuming permutation testing. In order to still compare univariate and multivariate tests in a meaningful manner, Figure 9 was created using the Hotelling test without TFCE. In this case, the null distribution has a parametric form [27], which allows us to compute values within interactive runtimes. To compensate for the fact that testing FA and mode separately amounts to performing twice as many tests as a single combined test, we made the thresholds for the univariate tests twice as restrictive ( vs. ).
In this example, the comparative view revealed that the volume highlighted by the multivariate test () mostly agrees with the union of the two univariate tests ( for FA and for mode), with almost no overlap between FA and mode. This leads us to the hypothesis that testing mode in future studies of SLE will result in new findings that complement the ones from FA.
5.4 Visualizing the Effect of Data Smoothing
There has been some controversy about whether to smooth diffusion tensor data in preparation for statistical analysis. On one hand, spatial smoothing helps to compensate for anatomical misalignment that may remain after registration, it smoothes out noise that might otherwise lead to false positive detections, it makes the data distribution more Gaussian and, when the bandwidth is matched to the spatial scale of regions of difference, it can boost statistical power. On the other hand, the ideal spatial scale is usually not known in advance, and there is no principled approach for selecting the bandwidth parameter, which might have a substantial effect on the results [38]. Moreover, it has been argued that, since smoothing amounts to artificially decreasing the image resolution, it is counterproductive for diffusion MRI in particular, which specifically measures diffusion in order to overcome the limits of image resolution and to infer tissue parameters at a microscale [37].
For these reasons, TractBased Spatial Statistics (TBSS), a widely used method for statistical hypothesis testing of diffusion tensor fields, avoids spatial smoothing and instead corrects for residual misalignment by projecting FA values onto a socalled white matter skeleton, a medial surface representation of the major fiber tracts [37]. Unfortunately, current version of TBSS cannot deal with the full tensor information. Therefore, recent studies that have analyzed the full tensor have still employed smoothing [8].
We have used the comparative view from Section 4.7 to explore the effects of data smoothing, as shown in Figure 11. In this experiment, we compared the healthy and NPSLE populations with respect to the full tensor information. As in [8], we compare the spatial regions that result from different processing pipelines by showing the “most significant” voxels, rather than setting some a priori threshold.
The top row of Figure 11 compares results on the original data (red) to data that has been smoothed with bandwidth voxels (green) and voxels (blue), respectively. Many isolated voxels and small regions are shown in red, indicating that they are removed by smoothing, while many of the larger ones grow, as indicated by blue halos around them. Optionally, the analyst can focus on the dark brown regions, indicating differences that are considered significant irrespective of the amount of smoothing.
The bottom row shows results from the same experiment, but additionally uses thresholdfree cluster enhancement (TFCE). Even without any smoothing, TFCE eliminates many of the small clusters, and leads to larger connected ones. It also reduces the overall effect of smoothing, leading to a increase in the number of consensus voxels. Based on these results, we decided to omit smoothing, but use TFCE in most experiments reported above.
6 Conclusion
Diffusion tensor MRI is widely used for studying how different brain diseases affect white matter microstructure. The multivariate nature of the resulting tensor fields makes it a challenge to comprehensively explore group differences. In this paper, we have argued to approach this task using visual analytics, based on the observation that judging which of the differences are likely to be related to the respective disease requires combining the analyst’s prior anatomical knowledge with complex statistical computation.
We presented a tool that builds on the formalism of statistical hypothesis testing and provides a suitable visual interface to specify different null hypotheses and to explore the spatial extent of the resulting significant regions, based on the assumption that larger connected clusters are more likely to be relevant to the analyst. By linking them to threedimensional visualizations such as fiber tractography, it becomes easy to identify the affected bundles. We also provide support for a posteriori investigation of the results from multivariate tests, allowing us to identify which specific attributes of the tensor fields exhibit the strongest differences, and how they are correlated. Finally, overlay views make it easy to compare the result from different tests, thresholds, or levels of smoothing. Throughout this pipeline, we closely integrate visual with quantitative analysis.
In terms of the specific results shown on data from a clinical study on systemic lupus erythematosus, a logical next step will be to verify, on data from an independent cohort, the new hypotheses that we arrived at by using our tool, in particular the hypotheses that the disease also goes along with changes in overall diffusivity, as well as with changes in tensor mode in regions that are disjoint from the ones in which Fractional Anisotropy changes.
We note that, even though the vast majority of hypothesis testing in clinical studies that use diffusion MRI is still based on the diffusion tensor model, this model has long been known to be insufficient for modeling multiple fiber directions within the same voxel, and highangular resolution diffusion imaging is now considered to be a stateoftheart alternative for fiber tractography [39]
. More recently, socalled multishell models, such as diffusional kurtosis
[40] or NODDI [41], which provide even more detailed information, have started to enter clinical research, and only very few visualization techniques are currently available for them [42]. This is a very natural direction for future extensions of our framework.References
 [1] P. J. Basser and C. Pierpaoli, “Microstructural and physiological features of tissues elucidated by quantitativediffusiontensor MRI,” Journal of Magnetic Resonance, Series B, vol. 111, pp. 209–219, 1996.
 [2] M. Cercignani, “Strategies for patientcontrol comparison of diffusion MR data,” in Diffusion MRI, D. Jones, Ed. Oxford University Press, 2010, pp. 485–499.
 [3] L. J. O’Donnell and T. Schultz, “Statistical and machine learning methods for neuroimaging: examples, challenges, and extensions to diffusion imaging data,” in Visualization and Processing of Higher Order Descriptors for MultiValued Data. Springer, 2015, pp. 299–319.

[4]
P. J. Basser and S. Pajevic, “A normal distribution for tensorvalued random variables: applications to diffusion tensor mri,”
Medical Imaging, IEEE Transactions on, vol. 22, no. 7, pp. 785–794, 2003. 
[5]
L. Baringhaus and C. Franz, “On a new multivariate twosample test,”
Journal of multivariate analysis
, vol. 88, no. 1, pp. 190–206, 2004.  [6] B. Whitcher, J. J. Wisco, N. Hadjikhani, and D. S. Tuch, “Statistical group comparison of diffusion tensors via multivariate hypothesis testing,” Magnetic Resonance in Medicine, vol. 57, no. 6, pp. 1065–1074, 2007.
 [7] A. Schwartzman, R. F. Dougherty, and J. E. Taylor, “Group comparison of eigenvalues and eigenvectors of diffusion tensors,” Journal of the American Statistical Association, vol. 105, no. 490, pp. 588–599, 2010.
 [8] A. Bouchon, V. Noblet, F. Heitz, J. Lamy, F. Blanc, and J.P. Armspach, “Which is the most appropriate strategy for conducting multivariate voxelbased group studies on diffusion tensors?” NeuroImage, vol. 142, pp. 99–112, 2016.
 [9] H.G. Pagendarm and F. H. Post, “Comparative visualization – approaches and examples,” in Visualization in Scientific Computing, M. Göbel, H. Müller, and B. Urban, Eds. Springer, 1995.
 [10] C. Zhang, T. Schultz, K. Lawonn, E. Eisemann, and A. Vilanova, “Glyphbased comparative visualization for diffusion tensor fields,” Visualization and Computer Graphics, IEEE Transactions on, vol. 22, no. 1, pp. 797–806, 2016.
 [11] C. Zhang, M. Caan, T. Höllt, E. Eisemann, and A. Vilanova, “Overview + detail visualization for ensembles of diffusion tensors,” Computer Graphics Forum, vol. 36, no. 3, pp. 121–132, 2017.
 [12] C. Zhang, T. Höllt, M. Caan, E. Eisemann, and A. Vilanova, “Comparative visualization for diffusion tensor imaging group study at multiple levels of detail,” in EG Workshop on Visual Computing for Biology and Medicine, S. Bruckner, A. Hennemuth, and B. Kainz, Eds., 2017, pp. 53–62.
 [13] A. Abbasloo, V. Wiens, M. Hermann, and T. Schultz, “Visualizing tensor normal distributions at multiple levels of detail,” Visualization and Computer Graphics, IEEE Transactions on, vol. 22, no. 1, pp. 975–984, 2016.
 [14] H. Pfister, V. Kaynig, C. P. Botha, S. Bruckner, V. J. Dercksen, H.C. Hege, and J. B. T. M. Roerdink, “Visualization in connectomics,” in Scientific Visualization – Uncertainty, Multifield, Biomedical, and Scalable Visualization, ser. Mathematics and Visualization. Springer, 2014, pp. 221–245.
 [15] A. J. Pretorius, M.A. P. Bray, A. E. Carpenter, and R. A. Ruddle, “Visualization of parameter space for image analysis,” IEEE Trans. on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2402–2411, 2011.
 [16] T. TorsneyWeir, A. Saad, T. Möller, H.C. Hege, B. Weber, and J.M. Verbavatz, “Tuner: Principled parameter finding for image segmentation algorithms using visual response surface exploration,” IEEE Trans. on Visualization and Computer Graphics, vol. 17, no. 12, pp. 1892–1901, 2011.

[17]
T. Schultz and G. Kindlmann, “Openbox spectral clustering: Applications to medical image analysis,”
IEEE Trans. on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2100–2108, 2013.  [18] P. Klemm, S. OeltzeJafra, K. Lawonn, K. Hegenscheid, H. Volzke, and B. Preim, “Interactive visual analysis of imagecentric cohort study data,” IEEE Trans. on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1673–1682, 2014.
 [19] M. Hermann, A. C. Schunke, T. Schultz, and R. Klein, “Accurate interactive visualization of large deformations and variability in biomedical image ensembles,” IEEE Trans. on Visualization and Computer Graphics, vol. 22, no. 1, pp. 708–717, 2016.
 [20] H. Zhang, B. B. Avants, P. A. Yushkevich, J. H. Woo, S. Wang, L. F. McCluskey, L. B. Elman, E. R. Melhem, and J. C. Gee, “Highdimensional spatial normalization of diffusion tensor images improves the detection of white matter differences: an example study using amyotrophic lateral sclerosis,” IEEE transactions on medical imaging, vol. 26, no. 11, pp. 1585–1597, 2007.
 [21] H. Zhang, P. A. Yushkevich, D. Rueckert, and J. C. Gee, “Unbiased white matter atlas construction using diffusion tensor images,” in International Conference on Medical Image Computing and ComputerAssisted Intervention. Springer, 2007, pp. 211–218.
 [22] G. Kindlmann, D. B. Ennis, R. T. Whitaker, and C.F. Westin, “Diffusion tensor analysis with invariant gradients and rotation tangents,” Medical Imaging, IEEE Transactions on, vol. 26, no. 11, pp. 1483–1499, 2007.
 [23] S. M. Smith and T. E. Nichols, “Thresholdfree cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference,” Neuroimage, vol. 44, no. 1, pp. 83–98, 2009.
 [24] M. Khatami, T. SchmidtWilcke, P. Sundgren, A. Abbasloo, B. Schölkopf, and T. Schultz, “BundleMAP: anatomically localized classification, regression, and statistical analysis in diffusion MRI,” Pattern Recognition, vol. 63, pp. 593–600, 2017.
 [25] T. SchmidtWilcke, P. Cagnoli, P. Wang, T. Schultz, A. Lotz, W. J. Mccune, and P. C. Sundgren, “Diminished white matter integrity in patients with systemic lupus erythematosus,” NeuroImage: Clinical, vol. 5, pp. 291–297, 2014.
 [26] G. Kindlmann, “Superquadric tensor glyphs,” in EG/IEEE Symposium on Visualization (SymVis), 2004, pp. 147–154.
 [27] M. S. Srivastava, Methods of Multivariate Statistics. Wiley, 2002.
 [28] T. Nichols and S. Hayasaka, “Controlling the familywise error rate in functional neuroimaging: a comparative review,” Statistical methods in medical research, vol. 12, no. 5, pp. 419–446, 2003.
 [29] T. E. Nichols and A. P. Holmes, “Nonparametric permutation tests for functional neuroimaging: a primer with examples,” Human brain mapping, vol. 15, no. 1, pp. 1–25, 2002.
 [30] S. Pajevic and C. Pierpaoli, “Color schemes to represent the orientation of anisotropic tissues from diffusion tensor data: application to white matter fiber tract mapping in the human brain,” Magnetic Resonance in Medicine, vol. 42, no. 3, pp. 526–540, 1999.
 [31] P. J. Basser, S. Pajevic, C. Pierpaoli, J. Duda, and A. Aldroubi, “In vivo fiber tractography using DTMRI data,” Magnetic Resonance in Medicine, vol. 44, pp. 625–632, 2000.
 [32] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Logeuclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic Resonance in Medicine, vol. 56, no. 2, pp. 411–421, 2006.
 [33] S. Keihaninejad, H. Zhang, N. S. Ryan, I. B. Malone, M. Modat, M. J. Cardoso, D. M. Cash, N. C. Fox, and S. Ourselin, “An unbiased longitudinal analysis framework for tracking white matter changes using diffusion tensor imaging with application to alzheimer’s disease,” NeuroImage, vol. 72, pp. 153–163, 2013.
 [34] T. Schultz and G. Kindlmann, “Superquadric glyphs for symmetric secondorder tensors,” IEEE Trans. on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1595–1604, 2010.
 [35] A. Mayorga and M. Gleicher, “Splatterplots: Overcoming overdraw in scatter plots,” IEEE Trans. on Visualization and Computer Graphics, vol. 19, no. 9, pp. 1526–38, 2013.
 [36] R. K. Shastri, G. V. Shah, P. Wang, P. Cagnoli, T. SchmidtWilcke, J. McCune, R. Harris, and P. C. Sundgren, “MR diffusion tractography to identify and characterize microstructural white matter tract changes in systemic lupus erythematosus patients,” Academic Radiology, vol. 23, no. 11, pp. 1431–1440, 2016.
 [37] S. M. Smith, M. Jenkinson, H. JohansenBerg, D. Rueckert, T. E. Nichols, C. E. Mackay, K. E. Watkins, O. Ciccarelli, M. Z. Cader, P. M. Matthews et al., “Tractbased spatial statistics: voxelwise analysis of multisubject diffusion data,” Neuroimage, vol. 31, no. 4, pp. 1487–1505, 2006.
 [38] D. K. Jones, M. R. Symms, M. Cercignani, and R. J. Howard, “The effect of filter size on VBM analyses of DTMRI data,” NeuroImage, vol. 26, no. 2, pp. 546–554, 2005.
 [39] B. Jeurissen, M. Descoteaux, S. Mori, and A. Leemans, “Diffusion MRI fiber tractography of the brain,” NMR in Biomedicine, 2017, early view. DOI 10.1002/nbm.3785.
 [40] J. H. Jensen and J. A. Helpern, “MRI quantification of nongaussian water diffusion by kurtosis analysis.” NMR in Biomedicine, vol. 23, no. 7, pp. 698–710, 2010.
 [41] H. Zhang, T. Schneider, C. A. WheelerKingshott, and D. C. Alexander, “NODDI: practical in vivo neurite orientation dispersion and density imaging of the human brain,” NeuroImage, vol. 61, no. 4, pp. 1000–1016, 2012.
 [42] S. Bista, J. Zhuo, R. P. Gullapalli, and A. Varshney, “Visualization of brain microstructure through spherical harmonics illumination of high fidelity spatioangular fields,” IEEE Trans. on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2516–2525, 2014.
Comments
There are no comments yet.