Computational reproducibility, the ability to recompute analyses over time and space [peng2011reproducible], has become a critical component of scientific methodology as many researchers acknowledge the existence of a reproducibility crisis [baker2016there]. Among other factors, infrastructural characteristics play an important role in making experiments reproducible. For instance, in neuroinformatics, our primary field of interest, studies have shown the effect of the operating system on computational results [gronenschild2012effects, glatard2015reproducibility]. However, conducting such reproducibility studies at scale is cumbersome due to the computational and storage requirements of analysis pipelines.
Neuroinformatics pipelines are generally iterated on data coming from 10 to 1,000 subjects, possibly with subtle input parameter variations to adjust specific data acquisition conditions. Subject data often capture anatomical and functional characteristics of their brain, for instance through Magnetic Resonance Imaging (MRI) or Electroencephalography (EEG). The number of input files associated with a subject may vary, and these files may also be of different sizes. Typical processing times range from 15 minutes to 15 hours per subject, with input ranging from 100 MB to 15 GB per subject, and outputs ranging from 1 GB to 500 GB.
The reproducibility of a given pipeline may vary across subjects, for instance due to different pipeline branches being executed depending on the input data content. As an example, in the public database released by the Human Connectome Project [van2013wu], subjects may have one or two anatomical images of each modality; when two images are present, they are aligned together and averaged. Acquisition artifacts, for instance due to motion, may also trigger corrections not otherwise required, and some branches of the pipeline may be executed only for specific acquisition parameters. These remarks are consistent with recent findings showing that variations in the results of a functional MRI analysis are dependent on the dataset being analyzed [bowring2018exploring]. To capture such variations, reproducibility evaluations need to be conducted on many subjects, which is unwieldy.
Our goal is to predict the outcome of reproducibility evaluations in a large population of subjects, from a reduced set of pipeline executions. More precisely, we aim at predicting whether a particular file produced by an analysis pipeline will be identical across execution conditions, or if it will contain reproducibility errors. We approach the problem from the point of view of collaborative filtering, inspired by works that applied this method outside of its initial application domain, for instance [feng2013efficient]. The main issue, and originality, of our problem lies in the fact that the training set cannot be arbitrarily sampled from the utility matrix defining the collaborative filtering problem. Instead, the training set has to respect time constraints imposed by the file creation order during pipeline executions.
This paper makes the following contributions:
We model reproducibility evaluations in data processing pipelines as a collaborative filtering problem.
We propose strategies to sample the training set under time constraints.
We evaluate and compare our sampling strategies on synthetic and real datasets.
The problem formulation, collaborative filtering technique, and proposed sampling strategies for the training set are presented in Section II. The datasets are described in Section III, and experimental results are in Section IV. Finally, we conclude on the best sampling method to use, and on the impact, limitations and generalizability of the results.
Ii-a Problem formulation
The pipeline to be evaluated is represented by a matrix of size , where the columns represent data coming from different subjects, and the rows represent the files produced by the pipeline. While subjects are not ordered, files are, for instance from their last modification time in a sequential execution, and we assume that this order is consistent across subjects. measures the reproducibility of file produced during the processing of subject in two conditions, for instance two different operating systems. In our experiments, we restrict to be boolean, but our methods can be applied for real values as well.
Our goal is to predict the test set of missing values of from a training set of known ones, where and . In contrast with the traditional collaborative filtering problem, we have control over the construction of the training set. For instance, we can choose to include only files produced by specific subjects, or the first files produced by the processing of every subject. Moreover, the construction of the training set is constrained by the order of the matrix rows, which is formalized as follows:
Our problem is the following:
Given a training ratio , find a subset of of size such that (1) and respect Equation 1, and (2) can be predicted from with high accuracy.
The sampling of the training set will be described in Section II-C. The collaborative filtering techniques used for the predictions are reported hereafter.
Ii-B Collaborative filtering
Collaborative filtering is a technique to predict unknown values of a matrix called “utility matrix” from the known ones. Traditionally, the matrix represents the ratings of items, represented in columns, by users, represented in rows. Ratings might be explicit, when users provide ratings through a dedicated system, or implicit, when users’ behaviors are analyzed to estimate their preferences. An overview of collaborative filtering is provided in[leskovec2014mining]. In our context, the utility matrix is the matrix described previously.
Several methods have been proposed to implement collaborative filtering. Item-item collaborative filtering [breese1998empirical, linden2003amazon] predicts the rating of item by user from the ratings of items similar to by user . Likewise, user-user collaborative filtering [breese1998empirical] predicts the rating of item by user from the ratings of item by users similar to . Both methods have been used extensively for e-commerce applications.
A third class of methods, which is the one that we will use, is based on the factorization of the utility matrix to estimate latent factors along which users and items are represented. This method is described in [koren2009matrix] and became famous as it contributed to winning the Netflix prize in 2009. To summarize, the method aims at finding and vectors of , where is the number of latent factors, such that:
where is the rating of item by user . In practice, the optimization involves a regularization term to avoid overfiting particular users or items. The method finds and that minimize the following objective, where is the regularization parameter and is the training set:
It is also common to include user biases and item biases in the optimization, defined as the average deviation of user and item to the global average . The problem is then to find and that minimize the following objective:
Stochastic gradient descent and alternating least squares (ALS) are often used as optimization techniques. In the remainder, we use ALS as implemented in Apache Spark’s MLLib version 2.3.0. We also round predictions to the nearest integer to obtain binary values.
Ii-C Sampling of the training set
As explained before, the training set in our problem cannot be constructed by unconstrained random sampling of the matrix. Instead, the training and test sets have to comply to Equation 1. To address this issue, we investigated the following sampling techniques, illustrated in Figure 1.
Ii-C1 Complete Columns (Figure 0(a))
The training set is sampled by randomly selecting complete columns in the utility matrix, that is, complete subject executions. The last selected column might be incomplete to meet the exact training ratio. This method respects the time constraints. It corresponds to a situation where the collaborative filtering method will predict the reproducibility of the subjects in the test set from the subjects in the training set.
Ii-C2 Complete Rows (Figure 0(b))
The training set is sampled by selecting complete rows in the utility matrix, that is, the first files produced by every execution. The last selected row might be incomplete to meet the exact training ratio. This method respects the time constraints. It corresponds to a situation where the processing of all the subjects is launched and interrupted before the execution is complete. The collaborative filtering method will then predict the reproducibility of the remaining files.
Ii-C3 Random Subjects – RS (Figure 0(c))
This method builds the training set by selecting the files from random subjects until the training ratio is reached. The file selected in a subject is the file with the lowest index in this subject that has not been already selected in the training set, which respects the time constraints.
Ii-C4 Random File Numbers (Uniform) – RFNU (Figure 0(d))
The number of files selected for every subject is randomly selected in a uniform distribution, where b is set to the total number of files and a is set according to training ratio as follows:
For , we ensure that the average number of selected files by subject is by sampling the number of files in every subject from
with probability, and setting it to 0 otherwise. For , this is ensured by the value of , which leads the average of U(a, b) to be .
Ii-C5 Random File Numbers (Triangular) – RFNT
The number of files selected for every subject is randomly selected in a triangular distribution as in Figure 2. The mean of the distribution is . We set c to and we set a and b with the following two approaches.
Largest a (RFNTL, Figure 0(e))
a is set to the largest possible value, i.e., b, and b is set accordingly to ensure that the average of the distribution is . Two cases occur:
When : , the number of files selected in every subject is sampled from with probability and set to 0 otherwise.
Smallest a (RFNTS, Figure 0(f))
a is set to the smallest possible value, i.e., 0, and b is set accordingly to ensure that the average of the distribution is . Three cases occur:
When : , the number of files selected in every subject is sampled from with probability and set to 0 otherwise.
When : , the number of files selected in every subject is sampled from with probability and set to otherwise.
Ii-C6 Random Unreal (Figure 0(g))
The training set is sampled in a random uniform way, regardless of the file creation times. This method does not respect the time constraints in Equation 1: it will be used as a baseline for comparison.
In each method, we also included the first row of the matrix (first file of each subject) and a random column (all files of a random subject) to avoid cold start issues.
Iii-a Synthetic Dataset
We generated synthetic matrices as shown in Figure 3. Each matrix has 100 files and 100 subjects of different types. Subjects of the same type behave identically and all types contain the same number of subjects 1. Such a decomposition by subject type corresponds to a situation where different subjects may have data of different nature, as is the case in the Human Connectome Project data [van2013wu] where not all the subjects have the same amount of images.
For subjects of a given type, the matrix consists of blocks, where is the number of types. Blocks are defined with all the possible variation patterns: some types do not vary at all, while other ones vary between every block. Such patterns are meant to mimic the logic of data processing pipelines: each block of files represents the files produced at a given stage of the pipeline, which may or may not contain reproducibility errors depending on the subject type.
Iii-B Real Dataset
We processed a set S of 94 subjects randomly selected in the S500 HCP release111https://db.humanconnectome.org of the Human Connectome Project [glasser2013minimal], in three execution conditions with different versions of the CentOS operating system (5.11, 6.8 and 7.2 – referred as C5, C6 and C7), using the PreFreesurfer and Freesurfer pipelines described in [glasser2013minimal] and available on GitHub222https://github.com/Washington-University/Pipelines/releases/tag/v3.19.0. For each pipeline, we identified the set F of files produced for all subjects in all conditions. For each condition pair and each pipeline, we computed a binary reproducibility matrix U of size , where is true if and only if file of subject was different in each condition. Rows of were ordered by ascending file modification time in a random subject in S.
Figure 4 shows the matrices for PreFreesurfer and Freesufer. The reproducibility of these pipelines varies across subjects, but most files behave consistently across all subjects, leading to complete black or white lines.
We conducted two experiments for each reproducibility matrix, to evaluate the performance of our predictions using (1) ALS without biases, and (2) ALS with subject and file biases. We compare the performance of our sampling methods to (1) a dummy classifier that always predicts the value in the majority class and (2) the Random Unreal method, used as the baseline sampling technique. All reported values are averages over 5 repetitions. Due to sampling issues, it is possible that the actual training ratio obtained with some of the sampling methods does not exactly match the target one. We checked that the difference between the target and real training ratios was lower than 0.01. We used Spark’s ALS model as available inpyspark.ml, with 50 factors, a regularization parameter of 0.01, a maximum of 5 iterations and non negative constraints set to true. The code and data used to obtain the results are available through GitHub at https://github.com/big-data-lab-team/paper-reproducibility-collaborative-filtering.
Iv-a Accuracy on Synthetic Data
Iv-A1 ALS without Bias
Figure 5 shows accuracy results for ALS without bias. By construction, the accuracy of the dummy classifier is close to 0.5 for all subject types. Random Unreal performs well for all subject types, which confirms that ALS is working correctly. All the other methods perform better than the dummy classifier, and their accuracy decreases as the number of subject types increases. However, only 3 methods can provide accuracy values above 0.85 for all subject types: Random Subjects (RS), RFNU and RFNTL. Surprisingly, RFNTS does not perform well for more than 2 subject types, even for .
Iv-A2 ALS with Bias
Figure 6 shows accuracy results for ALS with bias. RS, RFNU and RFNTL remain the methods that best compare to Random Unreal, but their accuracy is much lower than without biases. This is presumably due to the fact that file biases are all close to 0.5. In such a situation, biases are detrimental and should not be included.
Iv-B Accuracy on Real Data
Iv-B1 ALS without bias
Figure 7 shows the accuracy for ALS without bias. Among the methods that performed well in the synthetic dataset, RFNU performs the best, with an accuracy higher than 0.95 for all datasets when the training ratio is larger than 0.5. Complete rows and complete columns do not reach the accuracy of the dummy classifier. Random Unreal has an accuracy close to 1, which shows that collaborative filtering works well on this dataset too.
Iv-B2 ALS with bias
Figure 8 shows the accuracy for ALS with bias. From a training ratio of 0.5, all methods perform well for all datasets, except complete columns in Figure 7(b). All methods except RFNU, RFNTL and RFNTS also perform well for a training ratio lower than 0.5. Overall, the accuracy is much higher than without bias, due to the fact that the file biases are very strong. In the remainder, all the results are obtained using ALS without bias.
Iv-C ROC analysis
Figure 9 compares the sampling methods in the ROC space, for a training ratio of 0.9, on the synthetic and Freesurfer dataset. The PreFreesurfer dataset is not included since specificity of RFNTL, RFNU, RS and complete rows is undefined at this training ratio (the test set only contains positive elements). The average sensitivity and specificity values are reported in Table I, which confirms that RFNU is the best performing method on average.
Iv-D Effect of the number of factors
Figure 12 shows the effect of the number of factors used in the ALS optimization for the RFNU method and the synthetic dataset with 8 subject types (3 blocks of files). For training ratios greater than 0.5, the accuracy with 3 factors is substantially higher than with 2 factors. Beyond 3, the number of factors does not have any effect, which shows that the data is best explained by 3 factors. Figure 13 confirms that 3 factors are enough to separate the files of this dataset in 3 blocks, and the subjects in 8 types.
Iv-E Effect of the maximum number of iterations
Figure 12 illustrates the effect of the number of iterations used in the ALS optimization, for the RFNU method and the synthetic dataset with 8 types. From a training ratio of 0.7, the number of iterations has a moderate effect on the accuracy. We used 5 iterations in our experiments, which only slightly degrades the accuracy compared to 15 or 20.
Iv-F Prediction error localization
Figure 14 compares the prediction errors made by RFNU, RFNTL and RFNTS. The training set is represented in black (negatives) and white (positives), and the test set is in green (true positives), yellow (false negatives), gray (true negatives) and red (false positives). This representation provides insights regarding where, and perhaps why, prediction errors occur. At this training ratio, RFNU (Figure 13(a)) uniformly samples the number of files per subject between 0.8 and , which enables the training on the last files of the pipeline, while maintaining a low number of columns with a low training ratio. On the contrary, RFNTL (Figure 13(b)) samples the number of files per subject from 0.85, but the probability to have a complete column is 0 (the one complete column in Figure 13(b) is the one included to prevent cold start issues), which leads to a “stripe” of prediction errors at the bottom of the matrix. RFNTS (Figure 13(c)) does not have this issue, as it includes many complete columns in the training set. However, it comes at the cost of several columns with a number of files lower than 60%: in such columns, the prediction is often entirely wrong, which explains the reduced accuracy compared to RFNU.
Figure 10 shows the RFNU results on the Freesurfer dataset for a training ratio of 0.6. The prediction error is localized (1) at the bottom of the matrix, which corresponds to the end of the execution, and (2) in the regions where file reproducibility varies across subjects, i.e., lines are not entirely black or entirely white. This is consistent with our expectations and confirms the validity of our results.
While collaborative filtering, perhaps unsurprisingly, correctly predicts the missing values in a matrix modeling reproducibility evaluations, the usual random sampling method cannot be used in time-constrained processes. We proposed 6 sampling methods to address this issue, and we found that one of them, RFNU, performs better than the other ones on average. We explain that by the fact that RFNU builds the training set using a balanced mix of nearly complete and nearly empty columns, with a continuum of intermediate configurations. On the contrary, other methods, including RFNTL and RFNTS, bias the sampling toward complete or empty columns, which is sometimes detrimental to accuracy.
For datasets that are strongly dependent on row bias, such as the PreFreesurfer and Freesurfer ones, RFNU provides accuracy values consistently higher than 0.95 for training ratios higher than 0.5, even when biases are not included. For the synthetic dataset, RFNU still performs very well when biases are not included, but it does very poorly when biases are included. For this reason, we recommend to not include biases in the collaborative filtering optimization to solve this problem. Even though biases provide a slight accuracy improvement when datasets are strongly biased (constant lines in the matrix), they can also be very detrimental for more complex datasets such as the synthetic one used here.
From a practical standpoint, our study shows that reproducibility evaluations of the PreFreesurfer and Freesurfer pipelines can be conducted using only 50% of the files produced, with an accuracy above 95%. Potentially, this could reduce the computing time and storage required for such studies by a factor of 2. However, such studies could not be conducted by processing only half of the subjects entirely, which would correspond to the complete columns sampling method. Instead, the processing of all the subjects should be initiated and terminated in a random uniform way, assuming that files are uniformly produced throughout the execution. On more complex matrices, such as the synthetic dataset studied here, the training ratio required to get a 95% accuracy increases to 0.85.
Our study could be extended to real-valued reproducibility matrices instead of just binary ones. It is indeed common for file differences to be quantified using specific similarity measures or distances, such as the Levenshtein distance between strings, or the sum of squared distances among voxels of an image. Our sampling methods could be directly applied to real-valued matrices, and we expect our conclusions on the best-performing sampling method (RFNU) and the inclusion of biases in the model to remain valid.
Finally, the method described in this paper could possibly be used to predict the outcome of other time-constrained processes, for instance markers of chronic disease activity.
We thank Compute Canada for providing the compute and storage resources required for this study.