Introduction
Functional MRI recordings enable the investigation of activation patterns that characterize the working brain. The main goal is to detect whether the neural pattern of a region of interest correlates with a cognitive task, such as object category identification. Such investigations are usually focused on a specific cognitive modality, e.g., visual perception, auditory perception, visual imagery, or auditory imagery. A qualitative discrimination task can be designed to extract relevant information from patterns activated by two (or more) stimulus categories (such as body and car) in one of these cognitive modalities.
Identification of activation patterns that are shared across modalities has been the subject of numerous neurocognitive studies, with such modalities as mental calculation [1], sensory/motor stimulation [2] or word and picture viewing [3], to name a few. The most common approach is to cast the problem of identifying common activation patterns as a supervised learning, or brain decoding, problem [4, 5]. Nastase and colleagues [9] point out that successful classification in this setting allows one to conclude that neural patterns elicited by relevant cognitive factors in one modality generalize to the patterns in the other modality.
In other words, a low misclassification error on a modality different from the one used to train the classifier provides empirical evidence that a given region of interest is involved in the cognitive task encoded in both modalities. In the literature, such an approach is referred to as crossmodal decoding analysis (Cmda). Statistical significance of its result can be assessed using a test on the accuracy obtained by the classifier [6, 1], or by a permutation test based on computing the null distribution of this statistic [2, 7, 8]. However, Cmda suffers from a number of practical issues. As we discuss in section Methods, the accuracy of a decoding model is often low in a crossmodal setting, which most probably implies an inflated rate of Type II errors (failures to reject a false null hypothesis).
Neuroscientific investigations into the activity patterns in fMRI data can be formulated as "confirmatory" or "exploratory" analyses. Confirmatory analysis is centered on a pre-established region of interest (ROI). The Cmda method presented above is often used for the confirmatory approach, when it is run on data coming from a predefined ROI (or a set of predefined ROIs). Exploratory analysis aims at localizing areas containing information about the presented stimuli. Here, Cmda is employed in conjunction with the Searchlight technique to explore the spatial structure of crossmodal activations [9]. The outcome of the Searchlight procedure is a set of maps in which each voxel is assigned a quantitative measure of the information that the voxel contains about the stimulus. The maps are obtained by applying decoding sphere by sphere to the time series extracted from the sphere voxels, and the most commonly used information measure to produce Searchlight maps is classification accuracy. Maps are first calculated individually for each subject and then pooled together for group analysis, where the significance of the obtained values is typically established by running voxelwise tests against chance level. Classification accuracy and such tests have been subject to numerous criticisms with regard to their role in Searchlight analysis [23]. Besides, the use of Searchlight for crossmodal analysis faces interpretation challenges introduced by asymmetries both in accuracies and in p-values when training and testing on data coming from two different cognitive modalities [9].
In this work we develop a permutation test for the investigation of shared patterns across modalities that we denote the crossmodal permutation test (Cmpt). This test builds on a long tradition of randomization inference in the statistics literature, which can be traced back to the first half of the 20th century [10, 11, 12]. Permutation tests have recently seen renewed interest in neuroimaging [13, 14, 15, 16] thanks to their minimal distributional assumptions and the availability of cheap computational resources. We provide empirical evidence on synthetic datasets that this method reduces Type II errors (failures to reject a false null hypothesis) while keeping Type I errors (incorrect rejections of a true null hypothesis) comparable to those of Cmda. Our results highlight particular advantages of Cmpt in the small-sample/high-dimensional regime, a setting of practical importance in neuroimaging studies. Next, we compare Cmpt and Cmda on an fMRI study of three cognitive modalities: visual attention, imagery and perception. We conduct a confirmatory analysis comparing the performance of Cmpt and Cmda when identifying the presence of shared patterns within a functionally defined region of interest (ROI). Finally, we present the results of an exploratory analysis with Searchlight that uses the proposed Cmpt test for information-based mapping to explore the presence of common patterns between different modalities at the whole-brain level. The use of Cmpt allows us to overcome major methodological drawbacks that have been pointed out for Searchlight in the literature [23].
Methods
Crossmodal permutation test (Cmpt)
In this section we describe a statistical test for crossmodal activation pattern analysis that we denote Cmpt. We formulate the problem of assessing crossmodal activation as a hypothesis testing problem and propose an inference procedure for this test based on a permutation schema.
Setting.
We assume that the experimental task consists of two modalities (e.g., auditory and visual perception) and that each image in the dataset containing an activation pattern has an associated condition or class (e.g., two stimulus categories such as human body and car). In total, we observe activation patterns for each modality and for each of the conditions in the experiment, where each activation image is a mean image (averaged over the trials of the experiment) representative of one condition, and the goal is to decide whether there is a common condition effect across the different modalities. Let us formalize this in the language of statistical hypothesis testing. Consider the set of pairs (x_i, y_i), i = 1, ..., n, sampled iid from some unknown probability distribution P(X, Y), where x_i (resp. y_i) are the activation patterns corresponding to the first (resp. second) modality, and where the experimental paradigm is designed such that x_i and y_i are associated with the same condition (but different modality). Since the image pairs belong to the same condition, as long as there is a condition effect shared across modalities, the sequences (x_i) and (y_i) cannot be independent. We hence formulate the null hypothesis (which we want to reject) that both sequences are independent and so their joint probability distribution factorizes over their marginals:

(1) H0: P(X, Y) = P(X) P(Y)
Test statistic.
Given the set of image pairs X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) described in the previous paragraph, let I_1 (resp. I_2) be the set of indices for the first (resp. second) category. We define x̄_1 (resp. x̄_2) as the average of the activation patterns in X with index in I_1 (resp. I_2). ȳ_1 (resp. ȳ_2) are defined in a similar way as the average of the activation patterns in Y with index in I_1 (resp. I_2). Note that the index set is computed from the images of X (and not Y) in both cases. This asymmetry will be useful when designing the permutation scheme. Consider also that we have access to a similarity measure between images that we denote by s(·, ·). For simplicity we will initially suppose that this measure is the Pearson correlation coefficient, although we will see later that it can be generalized to any similarity measure between images.
We now have all the necessary ingredients to present the test statistic that we propose to distinguish the null hypothesis from the alternative. When s is the Pearson correlation, this test statistic has values in [−4, 4] and is defined as

(2) T = s(x̄_1, ȳ_1) + s(x̄_2, ȳ_2) − s(x̄_1, ȳ_2) − s(x̄_2, ȳ_1)
At first, its form might seem strange. Let us give two intuitions on the form of this test statistic:

As a difference of similarities. The test statistic can be split into a difference of two terms. The first term is the sum of similarities for images from the same condition (and different modalities), while the second term is the sum of similarities for images of different conditions (and different modalities). Hence, large values of the test statistic are achieved whenever the within-condition similarity is larger than the between-condition similarity, bringing evidence for the existence of a condition-specific activation shared across modalities.

As a singularity test. If we compute all pairwise similarities between the images x̄_1, x̄_2, ȳ_1 and ȳ_2 (across modalities), we obtain 4 scalars that can be arranged in a 2-by-2 matrix as follows:

(3) S = [ s(x̄_1, ȳ_1)  s(x̄_1, ȳ_2) ; s(x̄_2, ȳ_1)  s(x̄_2, ȳ_2) ]

Under the null hypothesis, the samples X and Y are independent, and so ȳ_1 and ȳ_2 are exchangeable (recall that the indexing was derived from the images in X). Whenever the columns of S are equal, the matrix above becomes colinear. A standard way to test for colinearity is through the determinant. Computing the determinant of the matrix above we obtain our test statistic (modulo the normalizing factor).
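The difference-of-similarities form of the statistic can be sketched as follows (a minimal illustration, assuming flattened voxel patterns and Pearson correlation; the function names are ours, not the paper's):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two flattened activation images."""
    return np.corrcoef(a, b)[0, 1]

def cmpt_statistic(x1, x2, y1, y2, similarity=pearson):
    """Cmpt test statistic: within-condition similarity minus
    between-condition similarity (up to a normalizing factor).
    x1, x2 (resp. y1, y2) are the condition-averaged patterns of
    the first (resp. second) modality."""
    return (similarity(x1, y1) + similarity(x2, y2)
            - similarity(x1, y2) - similarity(x2, y1))
```

When the two modalities carry the same condition-specific pattern, the two within-condition terms dominate and the statistic is large and positive; swapping the condition labels of one modality flips its sign.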
Statistical inference.
We will estimate the distribution of this test statistic under the null hypothesis from the sample by repeatedly computing the test statistic over a permuted version of the initial sample, a technique often known as a permutation or randomization test. For this to be valid, it is necessary to identify the quantities that we wish to permute and to verify that, under the null hypothesis, all permutations yield the same sample distribution [17]. Consider the sequence (y_σ(1), ..., y_σ(n)) which results from a random reordering σ of the activation images in Y, and the sequence of pairs (x_i, y_σ(i)). Under the null hypothesis, since the probability distribution factorizes over its marginals, the permuted sequence of pairs is distributed as P(X) P(Y), where P(Y) is the distribution of Y. Now, by the iid assumption made previously (which is commonplace in the context of permutation testing), this distribution is invariant to permutations, and so the condition above is verified. After computing the permuted test statistic for a large number of random permutations (typically around 10000), the significance of this test, i.e., the probability of observing a test statistic equal to or as large as the one obtained, can be computed as
(4) p = #{b ∈ {1, ..., B} : T(b) ≥ T} / B,

where T is the test statistic computed on the original sample, T(b) is the statistic computed on the b-th permuted sample, and B is the number of permutations.
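The permutation schema above can be sketched as follows (a minimal illustration, assuming a pairs-by-voxels data layout and the Pearson-correlation statistic; array shapes and names are our assumptions):

```python
import numpy as np

def cmpt_pvalue(X, Y, labels, n_perm=10000, seed=0):
    """One-tailed permutation p-value for the Cmpt statistic.
    X, Y: (n_pairs, n_voxels) arrays of paired images from the two
    modalities; labels: condition (0 or 1) of each pair, derived
    from the images in X. Only the images of Y are permuted."""
    rng = np.random.default_rng(seed)

    def statistic(Y_cur):
        x1, x2 = X[labels == 0].mean(0), X[labels == 1].mean(0)
        y1, y2 = Y_cur[labels == 0].mean(0), Y_cur[labels == 1].mean(0)
        s = lambda a, b: np.corrcoef(a, b)[0, 1]
        return s(x1, y1) + s(x2, y2) - s(x1, y2) - s(x2, y1)

    t_obs = statistic(Y)
    perm_stats = [statistic(Y[rng.permutation(len(Y))])
                  for _ in range(n_perm)]
    # fraction of permuted statistics at least as large as the observed one
    return np.mean(np.array(perm_stats) >= t_obs)
```

Note that permuting Y while keeping the index sets derived from X is exactly the asymmetry introduced when defining the averages.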
Extensions.
This test extends naturally to the setting of group analysis. In this case, the test statistic (2) can be taken as the sum over all subjects of the subject-specific test statistic. Ideally, the same permutation should be used across subjects to obtain each value of the permuted test statistic [18]. It is theoretically possible to perform a two-tailed test using this test statistic: a large negative value of the test statistic would also bring evidence to reject the null hypothesis of independence. However, since the neuroscientific interpretation of such negative values is not useful for our practical purposes, we only use the one-tailed test in this paper.
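The group-level extension can be sketched as follows (a sketch under the assumptions that all subjects have the same number of image pairs and that `statistic(X, Y, labels)` is the subject-level test statistic; the names are illustrative):

```python
import numpy as np

def group_cmpt_pvalue(statistic, subjects, n_perm=10000, seed=0):
    """Group-level Cmpt sketch: the group statistic is the sum of the
    per-subject statistics, and the *same* permutation of pair indices
    is applied to every subject on each iteration."""
    rng = np.random.default_rng(seed)
    t_obs = sum(statistic(X, Y, lab) for X, Y, lab in subjects)
    n_pairs = len(subjects[0][1])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n_pairs)  # one permutation shared by all subjects
        t_perm = sum(statistic(X, Y[perm], lab) for X, Y, lab in subjects)
        count += t_perm >= t_obs
    return count / n_perm
```

Sharing the permutation across subjects keeps the group null distribution consistent with the per-subject exchangeability argument.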
For simplicity, we have considered the Pearson correlation coefficient as the similarity measure, but the method remains valid with any other similarity measure. The Pearson correlation, being a measure of linear correlation, works best when the effect is (close to) linear, but other, more complex similarities can be used, such as a (negative) Mahalanobis [19] or Wasserstein [20] distance.
Relationship with crossmodal decoding analysis (Cmda).
Cmda can be regarded within the same hypothesis-testing framework outlined before, but with a different test statistic. In Cmda, the test statistic is the accuracy of a classifier on images from one modality after it has been trained on images from the other modality.
Since both Cmda and Cmpt follow the same permutation-test approach to computing significance, both rely implicitly on a label-exchangeability assumption about the data-generating process. As we have seen in the previous subsection, a sufficient condition for this is to assume that the data we observe are sampled iid. Note that this iid assumption is on the pairs from different modalities and on the experimental paradigm, but not on the decoding train/test split, which divides the data by modality and is obviously not iid. Although this is a much weaker assumption than the distributional assumptions made by traditional parametric methods, it is important to keep in mind that permutation tests are not fully assumption-free and at the bare minimum require exchangeability of the observations.
A practical difference between the two approaches is that the crossmodal permutation test is symmetric with respect to modalities while brain decoding is not. That is, Cmpt yields the same p-value regardless of the order in which the different modalities are labeled. This is not true for Cmda, where two possible tests can be performed (train on the first modality and test on the second, or vice versa), and both can (and typically do) yield different p-values.
Datasets
Synthetic datasets
We construct a synthetic dataset according to a model in which the signal is a superposition of a modality-specific effect (m), a condition-specific effect (c) and Gaussian noise (ε):

x = a·m + b·c + ε,

where a and b are scalars that regulate the amount of modality-specific and condition-specific signal in the image, respectively. We generated a total of 20 different images according to this model, considering two different modality-specific signals and two different condition-specific signals, all of them randomly generated from a Gaussian distribution. We generated 3 versions of this dataset: one with 10 voxels, one with 100 and another with 1000 voxels.
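The generative model above can be sketched as follows (a sketch of our understanding: 2 modalities × 2 conditions with 5 repetitions per cell to reach 20 images; the per-cell count and function name are assumptions):

```python
import numpy as np

def make_synthetic_dataset(n_voxels, a, b, n_rep=5, seed=0):
    """Generate images according to x = a*m + b*c + noise, for
    2 modality-specific signals m and 2 condition-specific signals c,
    all drawn from a Gaussian distribution (20 images with n_rep=5)."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=(2, n_voxels))  # modality-specific signals
    c = rng.normal(size=(2, n_voxels))  # condition-specific signals
    images, modalities, conditions = [], [], []
    for i in range(2):          # modality index
        for j in range(2):      # condition index
            for _ in range(n_rep):
                noise = rng.normal(size=n_voxels)
                images.append(a * m[i] + b * c[j] + noise)
                modalities.append(i)
                conditions.append(j)
    return np.array(images), np.array(modalities), np.array(conditions)
```

Setting b = 0 removes the condition-specific signal, the regime in which neither test should reject the null hypothesis.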
fMRI dataset
We performed an empirical analysis of data coming from a neurocognitive study of visual attention. fMRI data were collected to investigate object categorization during preparatory activity in a visual search experiment, designed in a manner similar to the one illustrated in [21].
Participants.
24 participants (8 male, mean age 27.1, st. dev. 4.3 years) were recruited for the study. Before starting the experiment, all participants signed a form confirming their informed consent to participate in the experimental study. After the experiment, they received monetary compensation. Each participant was instructed in advance about undergoing 2 experimental sessions (S1, S2) on two different days. The data on all three modalities in question (perception, imagery and visual search) were acquired in the same session, S1. Out of all 24 participants, 22 completed a significant part (6/22) or the whole (16/22) of S1. During both S1 and S2, participants were also given other tasks, which we do not report here. Of the 16 participants who completed the whole of S1, only nine completed 4 runs of both the perception/imagery task and the visual search task (8 functional runs in total); the others failed to reach 4 runs in at least one of the task types. For the analysis we therefore use the data of the 9 participants with a total of 8 functional runs each. The tasks are explained in the next section.
Stimuli.
Two distinctive stimulus categories were presented to the participants throughout the tasks: people (whole-body images) and cars. Participants were instructed to deal with these categories in three different ways. In the perception modality, they had to attend to 8 presentations of 16-second-long blocks of different instances of the same category (people, here depicted by whole-body figures with no face, or cars), interspersed with 16-second-long fixation periods. Participants were equipped with a two-button box and were requested to perform a one-back task, i.e., to press a specific button whenever they detected the same image repeated twice in a row. In the imagery modality, participants were instructed to close their eyes and mentally visualize instances of the category indicated by the letter cue shown at the beginning of the trial. They had to press the response button whenever they achieved a mental image that was sufficiently detailed, and then switch to mental visualization of another instance of the same category. At the end of the 16-second block, an auditory cue told the participants to open their eyes and go on with the experiment. Perception and imagery blocks were randomly presented within the same functional run. In the visual search modality, participants were briefly (450 ms) shown images representing natural scenes (e.g., crowded places, urban landscapes, etc.). They were instructed through a visual cue (a letter) to look for instances of one of the two categories within the scene. After scene presentation, they had 1.6 s to attend to the presentation of a mask and to give a positive or negative response by pressing a button. Visual search preparatory periods, occurring between the presentation of the cue and the presentation of the scene, had different lengths: 2, 4, 6, 8 or 10 seconds. Participants knew in advance neither the lengths of the preparatory periods nor their order of presentation throughout the task, which was random.
Participants also had to perform a block-designed task in which we alternated the presentation of images of intact and scrambled everyday objects, in order to functionally define an object-selective region of interest (ROI) localized in the temporal-occipital cortex (Fig. 1). For perception and imagery the overall number of trials was 32 per modality (8 trials in each of 4 runs). Each visual search run consisted of 40 trials, with 8 trials for each delay duration. The total number of visual search trials is thus 160 (40 in each of 4 runs), while for each duration there are 32 trials (8 in each of 4 runs) in the dataset. All experimental procedures had been approved by the Ethical Committee of the University of Trento and were carried out in accordance with applicable guidelines and regulations on safety and ethics.
Data acquisition.
Images were acquired with a 4T Bruker (https://www.bruker.com/) scanner. For each participant, both experimental sessions started with the acquisition of a structural scan using a 3D T1-weighted Magnetization Prepared RApid Gradient Echo (MPRAGE) sequence (TR/TE = 2700/4.18 ms, voxel size = 1 mm isotropic, sagittal slices). The perception/imagery and visual search tasks were performed during session 1 while acquiring, respectively, 177 and 195 functional scans with the following parameters: TR/TE = 2000/33 ms, voxel size = 3x3x3 mm, 1 mm slice spacing, matrix = 64x64, 34 axial slices covering the entire brain. The same acquisition parameters were used for the 165 scans acquired during the functional localizer task in session 2.
Preprocessing.
For data preprocessing, FSL tools were used along with in-house Python code. In all functional runs, the 5 initial volumes were discarded as dummy volumes. The skull was removed from both functional and structural images to extract the brain. Functional images were subsequently corrected for slice timing and motion artifacts. Transformation of the functional images to standard space was carried out in the following sequence: first, structural scans were coregistered to the mean functional scan of each experimental run; structural-in-functional-space images were then coregistered to standard (MNI) space to finally compute the affine parameters. To extract task-related effects from the functional localizer data, beta maps for both localizer conditions (intact vs. scrambled objects) were computed with linear regression and then fed into a contrast analysis (intact vs. scrambled). This analysis resulted in a ROI located in the bilateral temporal-occipital cortex. We selected one cluster of 625 voxels from each hemisphere, ending up with a bilateral ROI of 1250 voxels overall. We applied the ROI mask to the functional data coming from the perception/imagery and visual search tasks and obtained matrices containing the time series of the ROI voxels. For the Cmpt analysis, 1250x16 matrices of perception/imagery data and 1250x40 matrices of visual search data were considered per run.
Results
For the rest of the paper we will refer to the rejection of the null hypothesis at the traditional significance level of 0.05 without explicitly mentioning this number.
Experiments on synthetic data
In Figure 2 we plot the resulting p-value after performing both Cmpt and Cmda on the synthetic dataset described in section Methods, for varying magnitudes of the condition-specific effect and different image sizes. In the case of decoding, this p-value was computed as described in [27]. When the magnitude of the condition-specific effect is zero, the dataset contains no condition-specific signal and so the test is not expected to produce a statistically significant result. Indeed, the average p-value of Cmpt is around 0.5. Note that, because of the discreteness of the test statistic (test-set accuracy), the average p-value need not converge towards 0.5 as the effect goes to zero in the case of Cmda. As the magnitude of the effect increases, the method that yields a lower p-value has greater statistical power, because it is able to reject the null hypothesis with greater probability. We can see in the figure that, in general, Cmda p-values are higher, which translates into a lower probability of rejecting the null hypothesis under this approach and hence a higher Type II error.
In Figure 3 (top row) we can see that in the absence of signal, the distribution of p-values generated with Cmpt (over 6000 repetitions) is relatively flat, showing that the false positive rate (Type I error) at a given significance level is at its expected value. In the bottom row of the same figure, we can see the same experiment for Cmda. In this case, because of the discreteness of the test statistic, the distribution is not completely flat.
From the simulation results (Figure 2) we see that the average p-values yielded by Cmpt are always below those of Cmda. This implies that smaller effects can be detected and, hence, that Cmpt has a higher sensitivity than Cmda. Furthermore, this effect is replicated across images with different numbers of voxels, highlighting the benefits of Cmpt in the high-dimensional setting, which is of great practical importance in neuroimaging.
Comparison of Cmda and Cmpt on fMRI data
In this section we assess the agreement or disagreement between Cmda and Cmpt in detecting shared activation patterns. The similarity of activation maps for the discrimination of the body vs. car categories was computed for the following pairs of modalities: perception and imagery, imagery and visual search. The visual search modality was investigated in more detail by first considering all durations of the preparation period put together and then analysing the different delays (2, 4, 6, 8 and 10 s) separately. This analysis was meant to emphasize the issue of small sample size typical of neuroscientific data. The comparison between Cmda and Cmpt took into account additional elements such as the choice of ROI and the type of encoding of the activation maps.
The ROI chosen for the analysis was the object-selective cortex (OSC) map shown in Fig. 1. We analysed separately the performance of the methods for the left part of the ROI, the right part of the ROI and the whole ROI. Encoding of the activation maps was of two types: raw BOLD and beta maps. For the raw BOLD encoding, we selected the volumes corresponding to the peak of the hemodynamic response function (HRF, as rendered by SPM software, www.fil.ion.ucl.ac.uk/spm/doc/) convolved with the boxcar function representing the experimental manipulation. One volume was selected per trial, and for Cmpt the volumes were averaged to produce a single representative volume per subject per condition per modality. For the beta encoding, beta maps were calculated trialwise using linear regression; for Cmpt these maps were likewise averaged over trials to produce a single beta map per subject per condition per modality.
Cmda was performed by training a regularized logistic regression classifier on the trials of one of the modalities. The regularization parameter was selected according to a nested cross-validation scheme (leave-one-run-out). The accuracy was then estimated on a test set from the other modality. The training and test process was replicated for each subject. Then, the p-value was computed for the group using the permutation scheme described in [2]. The resulting p-values are reported in Table 1. Cmpt group analysis was carried out as described in section Crossmodal permutation test (Cmpt). The similarity measure between activation maps that we used is the Pearson correlation, both for the raw BOLD volume and beta maps encodings. The significance of the proposed test statistic was computed with a permutation scheme estimating the null distribution. The resulting p-values are reported in Table 1.
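A single subject's Cmda step could be sketched with scikit-learn along these lines (a sketch only: the choice of L2 penalty, the grid of regularization strengths and all names are our assumptions, not the exact pipeline used in the study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut

def cmda_accuracy(X_train, y_train, runs_train, X_test, y_test):
    """Cmda statistic for one subject: fit a regularized logistic
    regression on the trials of one modality, selecting the
    regularization strength by nested leave-one-run-out
    cross-validation, then score on the other modality's trials."""
    splits = list(LeaveOneGroupOut().split(X_train, y_train, runs_train))
    clf = LogisticRegressionCV(Cs=10, cv=splits, penalty="l2",
                               max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```

The returned accuracy is then pooled over subjects and fed to the group-level permutation scheme.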
In Table 1 we report only one result for the comparison between Cmda and Cmpt related to raw BOLD volume encoding: the crossmodal analysis between perception and imagery. In this case neither method detects a meaningful shared activation pattern. Beta maps encoding, on the other hand, seems to be a more efficient representation. Chen and colleagues [22] demonstrate that using beta values is a way to factor out the intrinsic variability of the BOLD signal throughout the brain and, specifically, within a single area. In our case, this means that beta values are more representative of the effect size than raw BOLD signal changes during task-on periods. For this reason, in the presentation of results we focus only on results obtained with data encoded as beta maps.
The results in Table 1 confirm our expectations about the presence of common patterns between modalities in the object-selective cortex. At the same time, Cmpt appears to have higher statistical power and sensitivity in revealing these patterns. The results reported in Table 1 illustrate two main scenarios: either both Cmda and Cmpt show significant p-values, or only Cmpt does. In light of the simulation results we may argue that, since Cmda has a higher Type II error rate (Figure 2), in case of such disagreements the Cmpt result is more reliable. This argument is further supported by the additional empirical evidence that the false positive rate, or Type I error, is similar for the two tests, limiting the risk of the disagreement being biased by a more optimistic rejection of the null hypothesis.
Crossmodal analysis results for perception vs. imagery with beta maps encoding are in agreement between Cmda and Cmpt when the two hemispheres are considered individually, namely the left and right OSC respectively. When the analysis is extended to the joint ROI, the number of trials remains constant while the number of voxels doubles. In this case the classifier is affected by the higher dimensionality of the data, and Cmda does not succeed in rejecting the null hypothesis.
The empirical results also support the claim that Cmpt is more robust not only in the high-dimensional but also in the small-sample setting. Simulations show that the Type II error of Cmpt is below that of Cmda in the small-sample regime (Fig. 2). We find analogous behaviour in the crossmodal analysis of visual search vs. imagery. If we consider the cumulative trials of visual search, irrespective of the delays, Cmda rejects the null hypothesis. When we restrict the crossmodal analysis to single delays of the preparation period for visual search, the number of trials drops from 160 to 32. In this case Cmda fails to reject the null hypothesis while Cmpt still rejects it.
Cmpt results are in line with the view that we should expect the presence of shared activity patterns between perceived and imagined object categories. Cmpt analysis also confirms that the presence of these patterns can be expected in highlevel visual areas processing information about object categories. We are going to further elaborate on this point in the discussion section. On the other hand, Cmda results appear to be affected by the data sample size relative to the high dimensionality of data.
Exploratory data analysis with Cmpt
We ran a Searchlight analysis of the whole brain using Cmpt. First, we wanted to show whether and how inserting Cmpt as the elementary unit within the Searchlight framework could identify voxels that carry information related to common activation patterns for two different cognitive modalities. Next, we intended to compare the spatial profiles of the exploratory analysis with the ROI identified for the confirmatory analysis.
To construct group-level maps, we followed the procedure illustrated in section Crossmodal permutation test (Cmpt) at the ROI level, here considering the sphere centered on each voxel as a ROI: we first compute the "true" statistic and then proceed by using permutations. We started by computing single-participant Cmpt Searchlight maps, where each voxel was considered as the center of a sphere (r = 8) for which we calculated the T statistic. Then, we summed the single-participant T values, ending up with the true group-level statistic. Next, we created N = 10000 permutations of the session labels and subsequently constructed 10000 averaged beta maps for each of the conditions based on the permuted labels, that is, two maps per subject per modality per permutation. In this way, we made sure that the data coming from different participants were tested against the same permutations in a uniform way. We then applied the Cmpt procedure to these permuted maps by first computing an individual T statistic and then summing up the group values, that is, we constructed an ad-hoc null distribution. Finally, we counted how many times the permuted group statistic was at least as large as the "true" group statistic, and transformed the count into a fraction of the total number of permutations, obtaining a p-value. This p-value was assigned to the voxel at the center of the sphere. The procedure was repeated for each voxel within the gray-matter mask. We ended up with a Cmpt Searchlight map of p-values coming from a combination of permutation tests.
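The sphere-by-sphere skeleton of this procedure can be sketched as follows (a simplified sketch: `pvalue_fn` stands in for the group-level Cmpt computation, the sphere is restricted to in-mask voxels, and all names are illustrative):

```python
import numpy as np

def sphere_offsets(radius):
    """Integer voxel offsets inside a sphere of the given radius."""
    r = int(np.ceil(radius))
    grid = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1].reshape(3, -1).T
    return grid[(grid ** 2).sum(axis=1) <= radius ** 2]

def cmpt_searchlight(pvalue_fn, data, mask, radius=3):
    """Searchlight sketch: for every in-mask voxel, gather the data of
    the in-mask voxels in the surrounding sphere (one row per voxel)
    and assign the p-value returned by `pvalue_fn` to the central
    voxel. `data` is a 4-D (x, y, z, n_samples) array."""
    offsets = sphere_offsets(radius)
    pmap = np.ones(mask.shape)
    for center in np.argwhere(mask):
        vox = center + offsets
        inside = ((vox >= 0) & (vox < mask.shape)).all(axis=1)
        vox = vox[inside]
        vox = vox[mask[vox[:, 0], vox[:, 1], vox[:, 2]]]
        sphere = data[vox[:, 0], vox[:, 1], vox[:, 2], :]
        pmap[tuple(center)] = pvalue_fn(sphere)
    return pmap
```

In the actual analysis, `pvalue_fn` would run the group-level permutation scheme described above on the sphere's voxel patterns, reusing the same N permutations for every sphere.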
In Figures 4 and 5 we present the results of the exploratory analysis. Fig. 4 showcases the overlap between the whole OSC ROI and the informative voxels identified by Searchlight in the occipital-temporal cortex for the crossmodal pair perception vs. imagery only. In Fig. 5, we put together fragments of the maps for the pairs of cognitive processes where our confirmatory analysis yielded significant results, namely perception vs. imagery, visual search (delay 8 s) vs. imagery and visual search (all delays) vs. imagery (see Table 1). For merely illustrative purposes, the maps were thresholded at the conventional significance level of 0.05 (as we are not aiming at significant cluster identification, no correction for multiple comparisons was carried out). In the left column of Fig. 5 we show the overlaps between the ROI identified in the course of the group analysis (Fig. 1) and the portions of the map that signal the presence of information about common patterns between two cognitive processes, for a single slice (z = 16). The fact that the identified ROI contains a high proportion of informative voxels is further illustrated by the histograms in the right column of the same figure. These are histograms of the p-values of the voxels within the ROI. All three histograms have a skewed shape, signalling the presence of a rather large number of voxels with p-values under 0.05 in the ROI.
For comparison, we also ran Cmda Searchlight with the same sphere size (r=8) for the same modality pairs: perception vs. imagery, visual search (all delays) vs. imagery, visual search (delay 8) vs. imagery. The analysis was performed with Matlab 8.5.0, MathWorks, NatickMA, USA using inhouse code and Libsvm library (https://www.csie.ntu.edu.tw/ cjlin/libsvm/). Classifier used for producing the maps was an SVM classifier with a linear kernel as implemented in the Libsvm library. For each subject, two Searchlight maps were obtained for each modality pair, one where the classifier was trained on the Imagery data and tested on the other modality data, and one where the assignment of train  test data was reversed. Then, these two maps were averaged as suggested in[9]
yielding a single map per subject per modality pair. For the group analysis, a one-sample t-test against chance level (50%) was performed using the SPM software. The resulting group maps were thresholded at the significance level of 0.05 with a minimum cluster size of 10 voxels. We then compared the group maps to the OSC ROI selected for the confirmatory analysis. The results are presented in Table 2, which shows the percentages of voxels within the OSC ROI that were identified by the
Cmda Searchlight as informative about shared patterns between two modalities. The sizes of the intersections between the Searchlight maps and the ROI are given both as absolute numbers of voxels and as percentages.

The problem of asymmetry between classifier results when swapping train and test modalities in a crossmodal setting is well attested for the pair of perception vs. imagery: the accuracies obtained with a classifier trained on imagery data have been shown to be consistently higher than those obtained after training on perception data [28, 33]. To minimize the impact of this asymmetry in crossmodal investigations, it was suggested to average the maps resulting from the different train/test combinations for a pair of modalities [9]. However, the authors of that paper also showed that the divergence in accuracies obtained with different train/test combinations in their data was not significant.
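The direction-averaging and group-testing steps described above can be sketched as follows. This is a minimal illustration on synthetic accuracy maps; the array shapes, subject count and effect sizes are assumptions for the example, not the actual Matlab/SPM pipeline:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels = 16, 1000

# Per-subject Searchlight accuracy maps for the two train/test
# directions (train on imagery / test on the other modality, and
# vice versa); here simulated as noisy above-chance accuracies.
maps_im_to_other = rng.normal(0.55, 0.05, (n_subjects, n_voxels))
maps_other_to_im = rng.normal(0.52, 0.05, (n_subjects, n_voxels))

# Average the two directions, yielding one map per subject.
avg_maps = (maps_im_to_other + maps_other_to_im) / 2.0

# One-sample t-test against chance level (50%) at every voxel.
t_vals, p_vals = stats.ttest_1samp(avg_maps, popmean=0.5, axis=0)

# Keep voxels whose above-chance accuracy is significant at 0.05
# (one-sided: accuracy greater than chance).
significant = (p_vals / 2 < 0.05) & (t_vals > 0)
print(f"{int(significant.sum())} of {n_voxels} voxels above chance")
```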
Cmda Searchlight results in our dataset seem to be rather seriously affected by these accuracy asymmetries. First, for the pair of perception vs. imagery, Cmda Searchlight trained on imagery data identifies a high number of voxels within the OSC mask, in both right and left OSC. If trained on perception data, the Searchlight finds far fewer voxels within the same area, all of them in the right OSC. In the averaged mask, the number of surviving voxels is nearly 10 times lower than that identified by the Searchlight trained on imagery data (49 against 432). The same kind of asymmetry is even more prominent for the pair of imagery vs. visual search: the Searchlight trained on imagery data identifies voxels within the OSC ROI, while it does not identify any in the same area if trained on visual search data (neither all delays nor delay 8 s). This result raises the question of the extent to which the identified voxels actually belong to patterns genuinely shared between modalities, or whether we should rather speak of voxels in one modality that are informative about the patterns in the other modality. Our overall conclusion about Cmda Searchlight is that its use may be questionable in cases where notable asymmetry is expected, as with perception vs. imagery. For some modality pairs asymmetry does not seem to be a big issue, as is the case with the data used in [9], and for such data the use of Cmda Searchlight may be better justified.
Discussion
The patterns of brain activity that are shared between the cognitive processes of perception and imagery have been the subject of numerous studies. The question investigated was whether we can arrive at abstract, top-down object representations [28, 29] containing distinguishing features [30, 31] that have a common neural substrate for both viewed and imagined object categories [30]. To test for the presence of shared patterns, many studies used crossmodal decoding, namely multivariate pattern analysis with SVM classifiers [32, 28, 33]. Significant crossmodal classification accuracies were taken as evidence in favour of the presence of shared activity patterns. In some studies, correlation-based analysis was also performed to visualize and estimate the similarity between these patterns in terms of distance [32, 28]. What emerged from these studies was the view that visual imagery does indeed activate the same areas that contain information about visually perceived stimuli [34, 30], and that shared patterns for stimulus categories in these two processes can be established [35, 32, 28, 36, 33, 37]. The areas where these common representations were found include the ventral temporal pathway, the lateral occipital cortex [36, 32, 28, 33] and the extrastriate cortex [32, 35, 36, 29]. The question of shared patterns in early visual areas, such as V1, remains controversial [32, 28]. Horikawa and Kamitani [31] showed that the answer depends on the feature type: lower visual features had similar representations for perception and imagery in lower visual areas, while the same was true for higher visual features in higher visual areas. Cichy and colleagues [33] arrived at a similar conclusion about the subdivision of features: although they did not find significant accuracies for decoding object categories in lateral early visual cortex, they could identify shared representations of object locations in these areas.
Top-down attention patterns mediate attentional biases during perception and affect behavioural performance in attention-related tasks. In the case of visual attention, these patterns can be revealed in visual search experiments via activity in the category-related object-selective areas during preparatory delays [38]. Several studies attempted to demonstrate the high-level nature of the preparatory patterns through crossmodal analysis, mostly with visual perception as the other modality [38, 39, 21]. As object representations obtained during imagery tasks are thought to be closer to high-level top-down representations of objects in visual cortex [35, 30], the hypothesis naturally suggests itself that these patterns can also be expected to show up during visual search preparatory periods. We tried to shed light on this hypothesis using both Cmda and Cmpt on visual imagery and visual search data. Besides, we ran the crossmodal analysis separately for preparatory periods of varying length (between 2 and 10 seconds) to gain insight into the preparatory dynamics. We expected that only certain delays would turn out significant, confirming different hypotheses about these dynamics. For instance, if only the shorter delays (2-4 s) had been significant, that could be evidence in favour of the transitory, cue-related nature of the preparatory activity in the Object Selective Cortex. If, on the other hand, we had seen significant results for the longer delays, that could reveal that it takes time for this activity to build up. First, we see that both methods confirm the expectation that imagery patterns are more high-level than perception: neither method yielded significant results for the pairs of visual search vs. perception.
As for the presence of shared patterns between visual imagery and visual search, we are again faced with the limitations of Cmda as a method: its results can be significant and it can reveal the presence of shared patterns between preparatory periods and imagery, but this type of analysis needs a lot of data. Cmpt, on the other hand, can reveal shared patterns even with less data, as is the case with the 8-second delay. Further study is needed to uncover the temporal dynamics of the preparatory top-down patterns. We hypothesize that delays shorter than 8 s do not allow the preparatory activity to build up, while a delay of 10 s is too long, and the subject might be losing concentration after a certain period of time.
We placed Cmpt side by side with other standard data analysis techniques in order to examine whether this approach could be as informative as the others. We have shown that in confirmatory, top-down contexts Cmpt can yield better results than Cmda. However, it is necessary to mention one limitation of the method. One of the overarching questions in the study of visual imagery is identifying neural representations of categorical features in the form of brain activation maps [35, 30]. Despite being a more robust test for crossmodal analysis, Cmpt cannot provide insights into the location of the discriminative patterns within a ROI: it does not support the voxel-level sensitivity analysis needed to compute granular brain maps of the activations that are common between modalities. On the other hand, Cmda
(at least when linear classifiers are used) yields a vector of weights that can give some clues about the relevance of the input features. However,
Cmpt combined with the Searchlight technique can be a helpful method to locate brain regions that contain information about patterns common between modalities. We took advantage of one strong point of the Searchlight analysis, its "modular" nature: Searchlight may be thought of as a generic framework of data examination that can subsume various analysis techniques as elementary units. Searchlight is widely used in neuroimaging, although it is not without issues. Conducting a Searchlight analysis has several major advantages: first, it can be run on the whole brain, so no prior ROI selection is required. Next, it avoids the "curse of dimensionality" of full-brain classification by reducing the number of features used by the classifier at each point. Finally, it has proven quite successful in identifying subject-specific activation patterns
[23]. The maps produced with Searchlight are of the same nature as the maps obtained with the univariate GLM approach, but they are based on more fine-grained pattern identification from multiple voxels and better reflect the spatial properties of the BOLD signal (that is, adjacent voxels have similar activation patterns). However, the major criticisms of the Searchlight approach concern the use of classification accuracy as the information measure and of the t-test as the method to obtain group significance. As pointed out in [23], SVM classifiers can classify correctly even with just a few highly informative voxels, or when weakly informative voxels are numerous enough. Both of these behaviours can distort a map. In the first case, all searchlights overlapping one of the few informative voxels will be significant, so the number of informative voxels is overestimated. In the second case, the cause of distortion is "discontinuous information detection": groups of weakly informative voxels will be missed if their size is below a certain threshold, yet can be judged significant if a single voxel is added. This leads to an underestimation of the number of significant clusters simply because the number of weakly informative voxels does not reach a certain mass. The efficiency of using classifiers with Searchlight also depends strongly on the classifier parameters and the sphere size [23, 24]. In [24], a point is raised against the interpretability of classifier accuracy with neuroscientific data: unlike distance measures, its value depends on the properties of the dataset (the amount of training data and the kind of data used as test data) and not only on the presence of a particular effect in the data. Besides, the authors point out that capturing the interactions of several factors in a factorial experimental design cannot be cast as a classification task.
Addressing these methodological issues can thus significantly improve this valuable tool and make its results more scientifically rigorous.

Classification accuracy is not the only way to represent information content. In the original paper by Kriegeskorte et al. [25], the metric used was the Mahalanobis distance between the distributions corresponding to the stimulus categories. In [24] the authors build on a probabilistic model of the data, proposing a cross-validated multivariate ANOVA (MANOVA) as the measure of informational content. In [19], three different measures (classification accuracy, Euclidean/Mahalanobis distance, and Pearson correlation distance) are compared for reliability in the context of Searchlight analysis. There, it was shown that "continuous cross-validated distance estimators" such as the Euclidean/Mahalanobis distance or the Pearson correlation distance should be preferred for Searchlight because they are more interpretable from the neuroscientific viewpoint.
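To illustrate the appeal of continuous cross-validated distance estimators, here is a minimal sketch in the spirit of, but not identical to, the estimators discussed in [19] (the patterns are synthetic and the partition scheme is an assumption). The key property is that the condition difference estimated from one data partition is projected onto the independent difference from the other, so pure noise averages out to zero instead of inflating the distance, whereas classification accuracy is bounded below by chance:

```python
import numpy as np

rng = np.random.default_rng(2)
n_voxels = 200

# Two independent partitions (e.g. odd/even runs) of the activity
# patterns for two stimulus conditions within one searchlight sphere.
signal = rng.normal(0.0, 1.0, n_voxels)          # true condition difference
a1 = signal + rng.normal(0.0, 1.0, n_voxels)     # condition A, partition 1
a2 = signal + rng.normal(0.0, 1.0, n_voxels)     # condition A, partition 2
b1 = rng.normal(0.0, 1.0, n_voxels)              # condition B, partition 1
b2 = rng.normal(0.0, 1.0, n_voxels)              # condition B, partition 2

# Cross-validated (squared) Euclidean distance: the inner product of
# the two independently estimated condition differences. Its expected
# value is zero when the conditions do not actually differ.
d_cv = np.dot(a1 - b1, a2 - b2) / n_voxels
print(f"cross-validated distance estimate: {d_cv:.3f}")
```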
Another set of critical remarks concerns the use of t-tests for assessing significance at the group level. Certain properties of neuroscientific data make the use of t-tests questionable for this purpose: "particularly, the low number of observations and the nongaussianity of the probability distribution of accuracy. As a consequence, several assumptions of the t-statistic are not met, rendering the procedure invalid from a theoretical point of view" [26]. The t-test, however, is not the only option here. In [26] a nonparametric test for group significance and cluster inference was proposed, based on permutations and a bootstrapping procedure. Nastase and colleagues [9] also opt for permutation tests in the Searchlight context.
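For illustration, a simple nonparametric group test of this family is a sign-flipping permutation test on the subject-level effects. This is a generic sketch on synthetic values, not the specific procedure of [26]:

```python
import numpy as np

rng = np.random.default_rng(3)

# Subject-level effect estimates at one voxel (e.g. accuracy minus
# chance, or a correlation-based statistic), one value per subject.
effects = rng.normal(0.03, 0.05, 16)
observed = effects.mean()

# Under the null the effects are symmetric around zero, so randomly
# flipping their signs generates the null distribution of the mean
# without any assumption about its shape.
n_perm = 10000
signs = rng.choice([-1.0, 1.0], size=(n_perm, effects.size))
null_means = (signs * effects).mean(axis=1)

# One-sided p-value: how often a permuted mean reaches the observed
# one (the +1 terms make the estimate valid, never exactly zero).
p = (np.sum(null_means >= observed) + 1) / (n_perm + 1)
print(f"permutation p-value: {p:.4f}")
```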
Methodologically, using Cmpt in conjunction with the Searchlight technique for crossmodal pattern analysis has several advantages over the common Searchlight procedure: it relies neither on classification accuracy nor on t-tests and hence avoids the common methodological pitfalls. At the same time, we follow the suggestions in the literature about what is more appropriate for Searchlight. First, the test statistic proposed in equation 2, used as the measure of the information contained at each voxel, is based on Pearson correlation and is interpretable in terms of similarity. Second, group significance is tested nonparametrically with permutation tests that make no assumptions about the shape of the data distribution. We found that Cmpt integrated into Searchlight is also effective for exploring and, potentially, confirming what we observed using the top-down, ROI-based analysis, which suggests both the robustness and the efficiency of the Cmpt Searchlight in fMRI data analysis.
However, it is important to note that the confirmatory and exploratory analyses report different p-values. While it is possible to qualitatively compare the outcomes of these two analyses, naively putting their p-values side by side might be misleading. The Cmpt-ROI p-values come from an extended, functionally well-defined area including 625 or 1250 voxels, whereas the Cmpt-SL analysis spans the whole brain sphere by sphere, extracting results from spheres of about 200 voxels each. This means that the p-values coming from the confirmatory and exploratory analyses should not be compared on a purely quantitative level.
The question of shared patterns between various cognitive modalities is relevant not only for object categorization in visual processing; it is fundamental to the study of the interactions between top-down and bottom-up processing streams in the human brain in general. Further directions of study could include applying the Cmpt method and the Cmpt Searchlight technique to a wider range of cognitive modalities, such as auditory or linguistic ones [40, 41, 42]. Besides, we could investigate other areas that may share representations with imagery, for instance working memory areas [35]. Finally, the Cmpt method could be tried on other types of neuroimaging data, for example EEG motor imagery data for Brain-Computer Interfaces [43] or MEG data [44, 45].
Acknowledgements
The research was partially funded by the Autonomous Province of Trento, Call "Grandi Progetti 2012", project "Characterizing and improving brain mechanisms of attention - ATTEND", and by the Centro Internazionale per la Ricerca Matematica (CIRM). The authors gratefully acknowledge the contribution of Marius Peelen to the discussions of experiment design and data analysis, and thank Valentina Borghesani for valuable feedback on the manuscript.
Author contributions statement
E.K., V.I. and P.A. participated in designing the experiment. E.K. and V.I. implemented the design, conducted the experiment, acquired the data and preprocessed the data. All authors analysed the data. E.K., F.P. and E.O. implemented the data analysis pipelines. F.P. designed, implemented and conducted the experiments on simulated data. E.K., F.P., V.I. and P.A. wrote the manuscript.
Additional information
The authors declare no competing interests.
Data and code availability.
Processed data (ROI beta-maps, BOLD volumes and Searchlight maps) are available from a public repository on GitHub: https://github.com/elenakalinina/Code_CMPT_paper. The code used to generate and analyze the synthetic data is available from the same repository, along with the data analysis code and the Searchlight implementation. Sharing the whole dataset requires the consent of the third parties who participated in the experiment design. If this consent is obtained, the data could be made publicly available.
References
 [1] Knops, A., Thirion, B., Hubbard, E. M., Michel, V. & Dehaene, S. Recruitment of an area involved in eye movements during mental arithmetic. Science 324, 1583–1585 (2009).
 [2] Etzel, J. A., Gazzola, V. & Keysers, C. Testing simulation theory with cross-modal multivariate classification of fMRI data. PLoS ONE 3(11). URL http://dx.doi.org/10.1371/journal.pone.0003690. DOI 10.1371/journal.pone.0003690. (2008).
 [3] Shinkareva, S. V., Malave, V. L., Mason, R. A., Mitchell, T. M. & Just, M. A. Commonality of neural representations of words and pictures. NeuroImage 54, 2418–2425 (2011).
 [4] Haynes, J.-D. & Rees, G. Decoding mental states from brain activity in humans. Nature Reviews Neuroscience 7, 523–534 (2006).
 [5] Kaplan, J. T., Man, K. & Greening, S. G. Multivariate cross-classification: applying machine learning techniques to characterize abstraction in neural representations. Frontiers in Human Neuroscience 9, 151. URL http://journal.frontiersin.org/article/10.3389/fnhum.2015.00151. DOI 10.3389/fnhum.2015.00151. (2015).
 [6] Majerus, S. et al. Cross-modal decoding of neural patterns associated with working memory: evidence for attention-based accounts of working memory. Cerebral cortex 26, 166–179 (2016).
 [7] Vetter, P., Smith, F. W. & Muckli, L. Decoding sound and imagery content in early visual cortex. Current Biology 24, 1256–1262 (2014).
 [8] Kaiser, D., Azzalini, D. C. & Peelen, M. V. Shape-independent object category responses revealed by MEG and fMRI decoding. Journal of Neurophysiology 115(4), 2246–2250 (2016).
 [9] Nastase, S. A., Halchenko, Y. O., Davis, B. & Hasson, U. Cross-modal searchlight classification: methodological challenges and recommended solutions. In Pattern Recognition in NeuroImaging (PRNI), 2016 International Workshop on, 1–4. IEEE. URL https://ieeexplore.ieee.org/document/7552355. DOI 10.1109/PRNI.2016.7552355. (2016).
 [10] Fisher, R. A. The Design of Experiments. Oliver and Boyd, Edinburgh (1935).
 [11] Pitman, E. J. Significance tests which may be applied to samples from any populations. Supplement to the Journal of the Royal Statistical Society 4, 119–130 (1937).
 [12] Lehmann, E. L. & Stein, C. On the theory of some nonparametric hypotheses. The Annals of Mathematical Statistics 20(1), 28–45 (1949).
 [13] Nichols, T. E. & Holmes, A. P. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human brain mapping 15, 1–25 (2002).
 [14] Eklund, A., Nichols, T. E. & Knutsson, H. Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences 113(28), 7900–7905 (2016).
 [15] Woolrich, M. W., Beckmann, C. F., Nichols, T. E. & Smith, S. M. Statistical analysis of fMRI data. In fMRI Techniques and Protocols 41, 183–239. Springer-Verlag, New York. (2009).
 [16] Winkler, A. M., Ridgway, G. R., Douaud, G., Nichols, T. E. & Smith, S. M. Faster permutation inference in brain imaging. NeuroImage 141, 502 – 516 (2016).
 [17] Lehmann, E. L. & Romano, J. P. Testing Statistical Hypotheses. Springer-Verlag, New York. (2005).
 [18] Etzel, J. A. MVPA permutation schemes: permutation testing for the group level. In Pattern Recognition in NeuroImaging (PRNI), 2015 International Workshop on, 65–68. IEEE. URL https://ieeexplore.ieee.org/document/7270849. DOI 10.1109/prni.2015.29. (2015)
 [19] Walther, A. et al. Reliability of dissimilarity measures for multivoxel pattern analysis. NeuroImage 137, 188–200 (2016).
 [20] Gramfort, A., Peyré, G. & Cuturi, M. Fast optimal transport averaging of neuroimaging data. Preprint at URL http://arxiv.org/abs/1503.08596. (2015).
 [21] Peelen, M. V., Fei-Fei, L. & Kastner, S. Neural mechanisms of rapid natural scene categorization in human visual cortex. Nature 460, 94–97 (2009).
 [22] Chen, X., Pereira, F., Lee, W., Strother, S. & Mitchell, T. Exploring predictive and reproducible modeling with the singlesubject FIAC dataset. Human brain mapping 27, 452–461 (2006).
 [23] Etzel, J. A., Zacks, J. M. & Braver, T. S. Searchlight analysis: promise, pitfalls, and potential. NeuroImage 78, 261–269 (2013).
 [24] Allefeld, C. & Haynes, J.-D. Searchlight-based multivoxel pattern analysis of fMRI by cross-validated MANOVA. NeuroImage 89, 345–357 (2014).
 [25] Kriegeskorte, N., Goebel, R. & Bandettini, P. Informationbased functional brain mapping. Proceedings of the National Academy of Sciences of the United States of America 103, 3863–3868 (2006).
 [26] Stelzer, J., Chen, Y. & Turner, R. Statistical inference and multiple testing correction in classification-based multivoxel pattern analysis (MVPA): random permutations and cluster size control. NeuroImage 65, 69–82 (2013).
 [27] Ojala, M. & Garriga, G. C. Permutation tests for studying classifier performance. Journal of Machine Learning Research 11, 1833–1863 (2010).
 [28] Reddy, L., Tsuchiya, N. & Serre, T. Reading the mind’s eye: decoding category information during mental imagery. NeuroImage 50, 818–825 (2010).
 [29] Ishai, A. Seeing faces and objects with the “mind’s eye”. Archives italiennes de biologie 148, 1–9 (2010).
 [30] Roldan, S. M. Object recognition in mental representations: directions for exploring diagnostic features through visual mental imagery. Frontiers in Psychology 8 URL http://dx.doi.org/10.3389/fpsyg.2017.00833. DOI 10.3389/fpsyg.2017.00833. (2017).
 [31] Horikawa, T. & Kamitani, Y. Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications 8 URL http://dx.doi.org/10.1038/ncomms15037. DOI 10.1038/ncomms15037. (2017).
 [32] Lee, S.-H., Kravitz, D. J. & Baker, C. I. Disentangling visual imagery and perception of real-world objects. NeuroImage 59, 4064–4073 (2012).
 [33] Cichy, R. M., Heinzle, J. & Haynes, J.-D. Imagery and perception share cortical representations of content and location. Cerebral Cortex 22, 372–380 (2012).
 [34] Farah, M. J. The neural basis of mental imagery. Trends in neurosciences 12, 395–399 (1989).
 [35] Pearson, J., Naselaris, T., Holmes, E. A. & Kosslyn, S. M. Mental imagery: functional mechanisms and clinical applications. Trends in cognitive sciences 19, 590–602 (2015).
 [36] Stokes, M., Thompson, R., Cusack, R. & Duncan, J. Top-down activation of shape-specific population codes in visual cortex during mental imagery. The Journal of Neuroscience 29, 1565–1572 (2009).
 [37] Anderson, A. J., Bruni, E., Lopopolo, A., Poesio, M. & Baroni, M. Reading visually embodied meaning from the brain: visually grounded computational models decode visual-object mental imagery induced by written text. NeuroImage 120, 309–322 (2015).
 [38] Stokes, M., Thompson, R., Nobre, A. C. & Duncan, J. Shape-specific preparatory activity mediates attention to targets in human visual cortex. Proceedings of the National Academy of Sciences 106, 19569–19574 (2009).
 [39] Peelen, M. V. & Kastner, S. A neural basis for real-world visual search in human occipitotemporal cortex. Proceedings of the National Academy of Sciences of the United States of America 108, 12125–12130 (2011).
 [40] Simanova, I., Hagoort, P., Oostenveld, R. & van Gerven, M. A. Modality-independent decoding of semantic information from the human brain. Cerebral cortex 24, 426–434 (2014).
 [41] Simanova, I., Francken, J. C., de Lange, F. P. & Bekkering, H. Linguistic priors shape categorical perception. Language, Cognition and Neuroscience 31, 159–165 (2016).
 [42] Borghesani, V., Pedregosa, F., Eger, E., Buiatti, M. & Piazza, M. A perceptual-to-conceptual gradient of word coding along the ventral path. In Pattern Recognition in Neuroimaging, 2014 International Workshop on, 1–4. IEEE. URL https://ieeexplore.ieee.org/document/6858512. DOI 10.1109/prni.2014.6858512. (2014).
 [43] Choi, K. Electroencephalography (EEG)-based neurofeedback training for brain-computer interface (BCI). Experimental brain research 231, 351–365 (2013).
 [44] Dikker, S. & Pylkkänen, L. Predicting language: MEG evidence for lexical preactivation. Brain and language 127, 55–64 (2013).
 [45] Hirschfeld, G., Zwitserlood, P. & Dobel, C. Effects of language comprehension on visual processing: MEG dissociates early perceptual and late N400 effects. Brain and language 116, 91–96 (2011).