Meta-analyses are widely used in medicine and health policy to increase statistical power when individual studies suffer from small sample sizes. Gene expression experiments are a typical example of such designs. The R packages metaMA and metaRNASeq are dedicated to gene expression microarray and NGS meta-analysis, respectively. Although metaMA and metaRNASeq are open source and available on CRAN, performing a meta-analysis with them requires coding skills in R. To facilitate the use and dissemination of these packages, we therefore developed Galaxy wrappers. Galaxy [1, 2, 3] is an open, web-based platform for data-intensive biomedical research. It keeps track of the analysis history, and all analyses can be rerun. The Galaxy community is very active, and many bioinformatics tools are included in Galaxy thanks to a modular system based on XML wrappers. These integrated tools can be shared via the Galaxy toolshed, which serves as an app store.
Overview of R packages integrated into Galaxy
Gene expression microarray data meta-analysis can be performed with the metaMA [4] R package. It provides methods to combine either p-values or moderated effect sizes from different studies in order to find differentially expressed genes. In our pipeline we only keep the inverse normal method [5] to combine the p-values calculated by limma [6] for each single study.
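The inverse normal combination mentioned above can be illustrated with a minimal Python sketch. It is not the metaMA implementation: the function name and the optional per-study weights (for example, derived from sample sizes) are assumptions made for illustration, and p-values are assumed to lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def inverse_normal_combination(pvalues, weights=None):
    """Combine the p-values of one gene across studies (hypothetical helper)."""
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights by default
    # Rescale the weights so that their squares sum to 1; the combined
    # score then follows a standard normal distribution under the null.
    norm = sqrt(sum(w * w for w in weights))
    # A small p-value is mapped to a large positive normal score.
    scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues]
    combined = sum(w / norm * z for w, z in zip(weights, scores))
    return 1.0 - STD_NORMAL.cdf(combined)


# Two studies that each give moderate evidence yield stronger combined evidence.
combined_p = inverse_normal_combination([0.05, 0.05])
```

With equal weights, two p-values of 0.05 combine to roughly 0.01, which shows how the combination gains power over each single study.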
Differences between metaMA and metaRNASeq
The main differences come from the statistical distributions used to model the data and from the way genes exhibiting conflicting expression patterns are handled (i.e., under-expression when comparing one condition to another in one study, and over-expression for the same comparison in another study). Microarray data are usually modelled by Gaussian distributions, while NGS data are modelled by Negative Binomial distributions. As explained in [4] and [7], using one-tailed p-values for each single study before combination avoids directional conflicts in metaMA. In metaRNASeq this approach cannot be used, which necessitates a post-hoc identification of conflicts, a step that is also provided by metaRNASeq.
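To make the role of one-tailed p-values concrete, here is a small Python sketch under stated assumptions. It is not taken from metaMA. The helper names are hypothetical, the one-tailed p-value is derived from a two-sided p-value and the sign of the test statistic, and an unweighted inverse normal combination is used.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_pvalue(two_sided_p, statistic):
    """One-tailed p-value for over-expression (hypothetical helper)."""
    half = two_sided_p / 2.0
    return half if statistic > 0 else 1.0 - half


def two_sided_combined_pvalue(one_sided_pvalues):
    """Unweighted inverse normal combination, then a two-sided p-value."""
    scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in one_sided_pvalues]
    combined = sum(scores) / sqrt(len(scores))
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(combined)))


# Same direction in both studies: the evidence accumulates.
agreeing = two_sided_combined_pvalue(
    [one_sided_pvalue(0.01, 2.9), one_sided_pvalue(0.01, 3.1)]
)
# Opposite directions: the normal scores cancel out.
conflicting = two_sided_combined_pvalue(
    [one_sided_pvalue(0.01, 2.9), one_sided_pvalue(0.01, -3.1)]
)
```

In this sketch a gene that is over-expressed in one study and under-expressed in the other ends up with a combined p-value close to 1, so it is not reported as differentially expressed.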
Description of Galaxy tools
The SMAGEXP tool suite offers two distinct gene expression meta-analysis functionalities: one dedicated to microarray data and one dedicated to RNA-seq data (see Table 1 and Figure 1).
|Tool|Input|Output|
|---|---|---|
|GEOQuery|GEO database ID|Rdata object and .cond file|
|QCNormalization|Raw Affymetrix .CEL files and .conf file|Rdata object and plots|
|Import custom data|Expression data in tabular .txt format|Rdata object and plots|
|Limma analysis|Rdata object from GEOQuery or QCNormalization|Rdata object and HTML report|
|Microarray data meta-analysis|Rdata objects from Limma analysis|HTML report|
|RNA-seq data meta-analysis|Result text files from the Galaxy DESeq2 tool|HTML report|
Microarray data meta-analysis
The GEOQuery tool fetches microarray data directly from the GEO database, based on the GEOquery [10] R package. Given a GSE accession ID, it returns an Rdata object containing the data and a text file (.cond file) summarizing the conditions of the experiment. The .cond file contains one line per sample in the experiment. Each line is made of 3 columns:
Sample ID
Condition of the biological sample
Description of the biological sample
Column names are optional; only the column order matters. Since GEO datasets are expected to be already normalized, the GEOQuery tool does not apply any normalization, apart from an optional log2 transformation.
Raw .CEL files from Affymetrix gene expression microarrays can also be analyzed. The QCnormalization tool checks the quality of the data and normalizes them. Several normalization methods are available:
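As an illustration of the .cond layout described above, the following Python sketch reads such a file. It assumes tab-separated columns in the order sample ID, condition, description; the delimiter, the function names and the sample names are assumptions, not part of the SMAGEXP specification.

```python
import csv
import io
from typing import NamedTuple


class Sample(NamedTuple):
    sample_id: str
    condition: str
    description: str


def read_cond(handle, has_header=False):
    """Parse a .cond-like file; only the column order matters."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if len(row) >= 3]
    if has_header:  # column names are optional, so the caller states it
        rows = rows[1:]
    return [Sample(*row[:3]) for row in rows]


def samples_by_condition(samples):
    """Group sample IDs by condition, e.g. to pick two conditions to compare."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample.condition, []).append(sample.sample_id)
    return groups


# Hypothetical content, for illustration only.
example = io.StringIO(
    "sample_1\tcontrol\tnormal tissue\n"
    "sample_2\ttumor\tcarcinoma\n"
    "sample_3\ttumor\tcarcinoma\n"
)
groups = samples_by_condition(read_cond(example))
```

Grouping the samples by condition mirrors what a downstream differential analysis needs: two sets of sample IDs to compare.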
quantile normalization + log2
background correction + log2
This tool generates several quality control figures: microarray images, boxplots and MA plots. It also outputs an Rdata object containing the normalized data for further analysis with the limma analysis tool.
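For readers unfamiliar with quantile normalization, the idea can be sketched in a few lines of Python. This is a simplified illustration, not the code run by the QCnormalization tool: ties are not handled specially, and applying the log2 transformation after the quantile step is an assumption about the order of operations.

```python
from math import log2


def quantile_normalize(arrays):
    """arrays: one list of probe intensities per microarray, all of equal length."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean of the k-th smallest intensity across arrays.
    reference = [
        sum(values[k] for values in sorted_arrays) / len(arrays)
        for k in range(n_probes)
    ]
    normalized = []
    for values in arrays:
        # Probe indices ordered by increasing intensity on this array.
        order = sorted(range(n_probes), key=lambda i: values[i])
        result = [0.0] * n_probes
        for rank, probe in enumerate(order):
            result[probe] = reference[rank]
        normalized.append(result)
    return normalized


def log2_transform(arrays):
    return [[log2(value) for value in values] for values in arrays]


raw = [[4.0, 1.0, 16.0], [2.0, 8.0, 32.0]]
normalized = log2_transform(quantile_normalize(raw))
```

After this step every array shares the same intensity distribution, which is what makes boxplots of normalized arrays line up.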
Import custom data tool
This tool imports data stored in a tabular text file. Column titles (chip IDs) must match the IDs of the .cond file. Several normalization methods are offered, but the normalization step can be skipped by choosing "none" in the normalization options. This tool is therefore of special interest when the input dataset has already been normalized.
This tool also generates boxplots and MA plots and outputs an Rdata object containing the data for further analysis with the limma analysis tool.
Limma analysis tool
The Limma analysis tool performs a single-study analysis of either data previously retrieved from the GEO database or normalized Affymetrix .CEL file data. Given a .cond file, it runs a standard limma differential expression analysis. The user chooses two conditions extracted from the .cond file (see Figure 3). The tool generates boxplots for rough quality control of the normalization, p-value histograms to check that statistical hypotheses are not violated, and a volcano plot to quickly identify the most meaningful changes. It also outputs a table summarizing the differentially expressed genes and their annotations. Genes are sorted by ascending Benjamini-Hochberg adjusted p-value, and annotations are retrieved from the GEO database. The table is sortable and searchable, and the list of genes can be exported to Excel or CSV format. Furthermore, each row can be expanded to display extended annotation information, including hypertext links to the National Center for Biotechnology Information (NCBI) gene database. Finally, this tool outputs an Rdata object for use in a subsequent meta-analysis and a tabular file containing all the results and annotations of the differential analysis.
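The Benjamini-Hochberg adjustment used to sort the gene table can be sketched as follows. This is a plain Python illustration of the standard procedure, not the limma code; the gene names and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical genes and raw p-values.
raw_pvalues = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.005}
adjusted = dict(zip(raw_pvalues, benjamini_hochberg(list(raw_pvalues.values()))))
# Genes sorted by ascending adjusted p-value, as in the report table.
ranked_genes = sorted(adjusted, key=adjusted.get)
```

Sorting on the adjusted rather than the raw p-value keeps the table order consistent with whatever false discovery rate threshold the user applies.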
Microarray data meta-analysis tool
The meta-analysis relies on the metaMA R package. Prior to the meta-analysis itself, a pre-processing step ensures compatibility between the different sources of data, since the data may come from different types of microarrays. First, we list the Entrez gene ID corresponding to each probe of each dataset. Next, we keep the probes corresponding to the genes that are shared by all the experiments of the meta-analysis. Then, for each dataset, we merge the microarray probes originating from the same Entrez gene ID by computing their mean. Note that merging different technologies induces a loss of information and might generate conflicts, as probes do not necessarily reflect the same biological reality. Finally, the p-value combination method of metaMA is run on the merged dataset. It generates a Venn diagram summarizing the results of the meta-analysis, and a list of indicators to evaluate the performance of the meta-analysis:
DE: number of differentially expressed genes
IDD (Integration-Driven Discoveries): number of genes that are declared differentially expressed in the meta-analysis but were not identified in any of the single studies alone
Loss: number of genes that are identified as differentially expressed in single studies but not in the meta-analysis
IDR (Integration-driven Discovery Rate): corresponding proportion of IDD
IRR (Integration-driven Revision Rate): corresponding proportion of Loss
It also outputs a fully sortable and searchable table, with gene annotations and hypertext links to the NCBI gene database.
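The pre-processing steps described for this tool (mapping probes to Entrez gene IDs, keeping the genes shared by all studies, and averaging probes of the same gene) can be sketched in Python as follows. The data structures, function name and probe names are assumptions chosen for illustration; the actual tool works on R objects.

```python
from collections import defaultdict


def merge_datasets(datasets):
    """datasets: list of (expression, probe_to_gene) pairs.

    expression maps a probe ID to its values (one per sample);
    probe_to_gene maps a probe ID to an Entrez gene ID.
    """
    # Step 1: Entrez gene IDs covered by each dataset.
    gene_sets = [
        {probe_to_gene[p] for p in expression if p in probe_to_gene}
        for expression, probe_to_gene in datasets
    ]
    # Step 2: genes shared by all datasets.
    shared = set.intersection(*gene_sets)
    merged = []
    for expression, probe_to_gene in datasets:
        grouped = defaultdict(list)
        for probe, values in expression.items():
            gene = probe_to_gene.get(probe)
            if gene in shared:
                grouped[gene].append(values)
        # Step 3: average the probes of a gene, sample by sample.
        merged.append({
            gene: [sum(column) / len(column) for column in zip(*rows)]
            for gene, rows in grouped.items()
        })
    return merged


# Hypothetical probes from two different platforms.
study_a = ({"a1": [1.0, 3.0], "a2": [3.0, 5.0], "a3": [7.0, 7.0]},
           {"a1": "7157", "a2": "7157", "a3": "1956"})
study_b = ({"b1": [2.0, 2.0, 2.0], "b2": [9.0, 9.0, 9.0]},
           {"b1": "7157", "b2": "672"})
merged_a, merged_b = merge_datasets([study_a, study_b])
```

Only gene 7157 is present in both studies here, so the other probes are dropped, which illustrates the loss of information mentioned above.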
RNA-seq data meta-analysis
The RNA-seq data meta-analysis tool relies on the results of the Galaxy DESeq2 tool. Given several text files resulting from the DESeq2 tool, the metaRNAseq tool performs a meta-analysis, generates the list of differentially expressed genes, and outputs the DE, IDD, Loss, IDR and IRR indicators.
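The DE, IDD, Loss, IDR and IRR indicators reported by both meta-analysis tools can be derived from sets of gene identifiers, as in this Python sketch. The denominators are assumptions: IDR is taken relative to the number of genes found by the meta-analysis, and IRR relative to the number of genes found in at least one single study. Gene names are hypothetical.

```python
def meta_analysis_indicators(meta_de, single_study_de):
    """meta_de: genes declared DE by the meta-analysis.

    single_study_de: one collection of DE genes per single study.
    """
    meta = set(meta_de)
    union = set().union(*single_study_de)  # DE in at least one single study
    idd = meta - union    # found only thanks to the meta-analysis
    loss = union - meta   # found in a single study but not in the meta-analysis
    return {
        "DE": len(meta),
        "IDD": len(idd),
        "Loss": len(loss),
        # Assumed denominators, see the paragraph above.
        "IDR": len(idd) / len(meta) if meta else 0.0,
        "IRR": len(loss) / len(union) if union else 0.0,
    }


indicators = meta_analysis_indicators(
    meta_de={"g1", "g2", "g3", "g5"},
    single_study_de=[{"g1", "g2"}, {"g2", "g4"}],
)
```

In this toy example the meta-analysis finds two genes that no single study found (IDD = 2) and misses one gene found by a single study (Loss = 1).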
Microarray meta-analysis example
SMAGEXP was applied to two GEO datasets identified by the following IDs: GSE3524 and GSE13601. These two datasets contain human oral squamous cell carcinoma (SCC) data. See Figure 4 for an overview of the workflow of this analysis.
First, we fetch the data of GSE3524 using the GEOQuery tool (with parameter "log2 transformation" = auto). Then we launch the limma analysis, using the output of the GEOQuery tool. It generates an Rdata output, which will be useful for the meta-analysis. Results are shown in Figure 5 and Figure 6.
Secondly, the same kind of analysis is run from raw .CEL files. We choose to keep six .CEL files from the GSE13601 dataset (IDs from GSM342582 to GSM342587). Quality control and normalization are done with the QCnormalization tool. Then, as previously, the limma analysis tool is run to generate an HTML report and an Rdata output.
Run a metaMA analysis
To run the microarray meta-analysis tool, we only need the Rdata output of each single study, generated by the limma analysis tool. The tool generates a Venn diagram to compare the results of each study with those of the meta-analysis. It also outputs the indicators listed in the description of the tool (see Figure 7). As for the limma tool, annotated differentially expressed genes are displayed in a table which can be sorted and searched.
RNA-seq data meta-analysis example
The RNA-seq data meta-analysis tool relies on DESeq2 results (see Figure 8).
It outputs a Venn diagram and the same indicators as the microarray data meta-analysis tool, for both the Fisher and the inverse normal p-value combinations. It also generates a text file summarizing the results of each single analysis and of the meta-analysis. Potential conflicts between single analyses are indicated by zero values in the "signFC" column.
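The Fisher combination and the conflict flag can be illustrated with the Python sketch below. It is not the metaRNASeq code. The function names are hypothetical, p-values are assumed to be strictly positive, and the chi-square tail probability uses the closed form available for an even number of degrees of freedom.

```python
from math import exp, log


def fisher_combination(pvalues):
    """Fisher's method: -2 * sum(log p) follows a chi-square with 2k df."""
    k = len(pvalues)
    half_statistic = -sum(log(p) for p in pvalues)  # statistic divided by 2
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total


def sign_fc(log_fold_changes):
    """Common sign of the fold changes across studies, 0 if they disagree."""
    signs = {(fc > 0) - (fc < 0) for fc in log_fold_changes}
    return signs.pop() if len(signs) == 1 else 0


combined_p = fisher_combination([0.05, 0.05])
consistent = sign_fc([1.2, 0.4])    # same direction in both studies
conflict = sign_fc([1.2, -0.4])     # opposite directions, flagged with 0
```

A gene with a small combined p-value but a zero flag would then be a candidate for removal from the final list, in line with the post-hoc identification of conflicts.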
We developed SMAGEXP, a tool suite dedicated to gene expression data meta-analysis. This tool suite provides quality controls, single analyses and meta-analyses of microarray and RNA-seq data, and suggests appropriate pipelines for each type of data. It delivers fully annotated results of differentially expressed genes, exportable in several common formats. Integrated into Galaxy, SMAGEXP is easy to use for biologists and life scientists. The R packages metaMA and metaRNAseq thus inherit reproducibility and accessibility support from Galaxy.
SMAGEXP is available on the Galaxy main toolshed [11].
Source code is available on GitHub at: https://github.com/sblanck/smagexp.
Furthermore, thanks to Docker, we made these Galaxy tools and their dependencies easy to deploy. A fully dockerized instance of Galaxy containing SMAGEXP is available at :
- [1] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences,” Genome Biology, vol. 11, no. 8, p. R86, 2010.
- [2] D. Blankenberg, G. Von Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor, “Galaxy: a web-based genome analysis tool for experimentalists,” Current Protocols in Molecular Biology, pp. 19–10, 2010.
- [3] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. C. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: a platform for interactive large-scale genome analysis,” Genome Research, vol. 15, no. 10, pp. 1451–1455, 2005.
- [4] G. Marot, J.-L. Foulley, C.-D. Mayer, and F. Jaffrézic, “Moderated effect size and p-value combinations for microarray meta-analyses,” Bioinformatics, vol. 25, no. 20, pp. 2692–2699, 2009.
- [5] L. Hedges and I. Olkin, Statistical Methods for Meta-Analysis. London: Academic Press, 1985.
- [6] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth, “limma powers differential expression analyses for RNA-sequencing and microarray studies,” Nucleic Acids Research, vol. 43, no. 7, p. e47, 2015.
- [7] A. Rau, G. Marot, and F. Jaffrézic, “Differential meta-analysis of RNA-seq data from multiple studies,” BMC Bioinformatics, vol. 15, no. 1, pp. 1–10, 2014.
- [8] R. A. Fisher, Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1932.
- [9] M. I. Love, W. Huber, and S. Anders, “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2,” Genome Biology, vol. 15, p. 550, 2014.
- [10] S. Davis and P. Meltzer, “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor,” Bioinformatics, vol. 23, no. 14, pp. 1846–1847, 2007.
- [11] D. Blankenberg, G. Von Kuster, E. Bouvier, D. Baker, E. Afgan, N. Stoler, J. Taylor, and A. Nekrutenko, “Dissemination of scientific software with Galaxy ToolShed,” Genome Biology, vol. 15, no. 2, pp. 1–3, 2014.