EBIC: an open source software for high-dimensional and big data biclustering analyses

07/26/2018
by   Patryk Orzechowski, et al.
University of Pennsylvania

Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is support for big data, which makes it possible to run large genomic data mining analyses efficiently. Additional enhancements include integration with R and Bioconductor and an option to remove the influence of missing values on the final result. Results: EBIC was applied to datasets of different sizes, including a large DNA methylation dataset with 436,444 rows. For the largest dataset we observed over 6.6-fold speedup in computation time on a cluster of 8 GPUs compared to running the method on a single GPU, which indicates that the algorithm scales well. Availability: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic . Installation and usage instructions are also available online.


1 Introduction

Biclustering is an unsupervised machine learning technique which attempts to detect meaningful data patterns that are distributed across different columns and rows of the input dataset. This allows biclustering to capture heterogeneous patterns that manifest only in subsets of genes and subsets of samples. Biclustering has been commonly applied to genomic datasets Padilha and Campello (2017) and has proven successful in revealing potential diagnostic biomarkers Liu et al. (2018) and tumor transcription profiles in breast cancer Singh et al. (2018).

With exponentially increasing sizes of input datasets, there is an emerging need for effective and efficient methods that scale well with growing amounts of data. Although the possibility of applying biclustering to larger datasets has been discussed Kasim et al. (2016); Padilha and Campello (2017), hardly any biclustering study has involved a large genomic dataset. This motivated the emergence of parallel biclustering methods. Some of the most recent parallel methods use multiple threads (e.g. runibic Orzechowski et al. (2017)), the Message Passing Interface (MPI) (e.g. ParBiBit González-Domínguez and Expósito (2018)), or GPUs (e.g. CCS Bhattacharya and Cui (2017); Orzechowski et al. (2018)).

One of the recent advancements in the biclustering area was the introduction of EBIC, a parallel biclustering method which takes advantage of multiple evolutionary computation strategies Orzechowski et al. (2018). This representative of hybrid biclustering algorithms Orzechowski and Boryczko (2016a, c, b) has been shown to outperform multiple state-of-the-art methods in terms of accuracy. Although the original concept of EBIC provided theoretical support for multiple GPUs, all previous evaluations were made using a single GPU. Thus, the rationale for involving multiple GPUs was not clear. Another constraint of EBIC was a hardware limit on the size of the dataset of 65,535 rows per GPU. Analyses of larger datasets therefore required large clusters of GPUs, which greatly restricted the application of the method.

In this paper we introduce the open source package built on top of the upgraded version of the method. First and foremost, full support for multiple GPUs is added, which allows datasets with almost unlimited numbers of rows to be analyzed (available memory is the constraint). Secondly, the method has been integrated with Bioconductor, which enables the user to run the whole analysis from within R. Thirdly, a different way of performing the analysis was added, which depends on the presence or absence of missing values within the data. Last, but not least, some bugs have been fixed and optimizations were made for more efficient memory management. Combined, these changes make this open source software ready out-of-the-box for big data biclustering analysis.

2 Methods

The new version of EBIC provides a comprehensive open source framework for performing biclustering analysis. The major improvements over the original release of the method include:

  • Support for Big Data. In the previous version of EBIC only a very limited number of rows could be processed on a single GPU. The kernel grid constrained the maximal number of rows analyzable by EBIC to 65,535 per GPU. Thus, at least 8 GPUs were needed to analyze large datasets, e.g. modern methylation datasets. Our new implementation overcomes this limitation, allowing analysis of up to

    rows per single GPU (devices with compute capability 3.0). This greatly enhances the flexibility of the method and makes it applicable to almost any type of data. It comes at the cost of limiting the size of the genetic algorithm population to 65,535. This remains a large number, as for the majority of genomic datasets the algorithm converged using a population size of 1,600 individuals or fewer.

  • Handling missing values. We introduce an important feature that removes the impact of missing values on the results of the method. EBIC's search is driven by counting the rows in which a greater-or-equal relation holds between the values in the selected columns. In earlier releases this relation tended to capture missing values instead of the real trends in the data. This was a drawback, especially for datasets with a high percentage of missing values: instead of finding useful patterns in the data, EBIC was drawn towards detecting emptiness. In the current release missing values can be replaced with a predefined value (e.g. 0 or 999), which is no longer counted towards the score of the bicluster. Thus, the method focuses on the real trends in the data instead of emptiness.

  • Support for different input file formats. EBIC allows different delimiters in the input file. The data values may be separated by commas, tabs or semicolons. An upper-left header is not required, which simplifies porting files between EBIC and R.

  • Compatibility with R and Bioconductor. The results returned by EBIC can be easily saved into a format loadable by the Bioconductor R package biclust in order to perform biological validation. In the Supplementary Material we provide a detailed workflow showing how to use EBIC entirely within the R environment.

  • Workflow for analysis of methylation data. EBIC was able to capture biologically meaningful signals in methylation data. A tutorial is presented in the Supplementary Material.
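The effect of the missing-value handling described in the list above can be illustrated with a small sketch. This is not EBIC's actual implementation: the function name `count_supporting_rows`, the toy data and the choice of 0.0 as the missing-value sentinel are all assumptions made for illustration. The sketch counts rows whose values are non-decreasing along a chosen series of columns, with and without skipping rows that contain the sentinel.

```python
def count_supporting_rows(data, columns, missing=None):
    """Count rows whose values are non-decreasing along `columns`.

    If `missing` is given, any row holding that sentinel value in one of
    the selected columns is skipped instead of being counted.
    (Hypothetical helper for illustration, not part of EBIC.)
    """
    count = 0
    for row in data:
        values = [row[c] for c in columns]
        if missing is not None and missing in values:
            continue  # do not let missing values contribute to the score
        if all(a <= b for a, b in zip(values, values[1:])):
            count += 1
    return count


# Toy dataset: two rows with a real trend, two rows that are entirely
# missing (encoded as 0.0), and one decreasing row.
data = [
    [1.0, 2.0, 3.0],
    [0.5, 0.7, 0.9],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [3.0, 2.0, 1.0],
]
columns = [0, 1, 2]

naive = count_supporting_rows(data, columns)                # missing rows inflate the count
masked = count_supporting_rows(data, columns, missing=0.0)  # only real trends are counted
print(naive, masked)
```

In this toy example the rows filled with the sentinel trivially satisfy the greater-or-equal relation, so they inflate the naive count; once they are skipped, only the rows carrying a real trend support the pattern.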

3 Results

In order to assess the running times of the algorithm, we performed tests on 1 to 8 GPUs on datasets with varying numbers of rows and columns. The GEO accession numbers of the datasets as well as the run times of the algorithm are presented in Table 1.

Dataset     Genes      Samples   Description                    Run time
GDS1490     12,483     150       neural tissue profiling        7.1 mins
GSE65194    54,675     178       breast cancer                  18.3 mins
GSE84493    436,444    310       prostate cancer methylation    24.5 mins

Table 1: Datasets used in the experiment as well as an average running time (in minutes) using a cluster of 8 GeForce GTX 1080 Ti GPUs.
Figure 1: Speedups obtained using multiple GPUs (GeForce GTX 1080 Ti) for the datasets from Table 1.

EBIC obtained up to 6.6-fold speedup using 8 GPUs on a dataset with over 436k rows. For the datasets with smaller numbers of rows the speedups were around 1.2x (12k rows) and 2.75x (55k rows). The relation between the number of GPUs used and the obtained speedup is presented in Fig. 1.
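To put these speedups in perspective, they can be converted into parallel efficiency, i.e. speedup divided by the number of GPUs. The short sketch below does this for the three reported values. The mapping of speedups to dataset names follows the row counts in Table 1, and the helper name `parallel_efficiency` is an assumption introduced for illustration.

```python
def parallel_efficiency(speedup, n_gpus):
    """Fraction of ideal linear speedup achieved on `n_gpus` devices."""
    return speedup / n_gpus


# Approximate speedups on 8 GPUs as reported in the text, keyed by the
# dataset whose row count matches (12k, 55k and 436k rows respectively).
reported_speedup = {"GDS1490": 1.2, "GSE65194": 2.75, "GSE84493": 6.6}
N_GPUS = 8

efficiency = {
    name: parallel_efficiency(s, N_GPUS) for name, s in reported_speedup.items()
}
for name, eff in efficiency.items():
    print(f"{name}: {eff:.1%} of ideal speedup")
```

Under these figures the efficiency is roughly 15%, 34% and 83% for the three datasets, which is consistent with the observation that additional GPUs pay off mainly for datasets with many rows.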

4 Conclusions

In this paper we present recent advancements in one of the leading biclustering methods. The algorithm was wrapped into a framework which is conveniently integrated with R and allows multiple input file formats. In the Supplementary Material we also demonstrate that even for a genomic dataset as large as the methylation dataset with 436,444 rows, the results provided by EBIC are biologically meaningful. We conclude that EBIC, released as an open source package, is a convenient tool for gaining insight from large genomic datasets.

Funding

This research was supported in part by PL-Grid Infrastructure and by grants LM012601, TR001263, ES013508 from the National Institutes of Health (USA).

References

  • Bhattacharya and Cui (2017) Bhattacharya, A. and Cui, Y. (2017). A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Scientific Reports, 7(1), 4162.
  • González-Domínguez and Expósito (2018) González-Domínguez, J. and Expósito, R. R. (2018). ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems. PLoS ONE, 13(4), e0194361.
  • Kasim et al. (2016) Kasim, A., Shkedy, Z., Kaiser, S., Hochreiter, S., and Talloen, W. (2016). Applied Biclustering Methods for Big and High-Dimensional Data Using R. CRC Press.
  • Liu et al. (2018) Liu, Y.-C., Chiu, Y.-J., Li, J.-R., Sun, C.-H., Liu, C.-C., and Huang, H.-D. (2018). Biclustering of transcriptome sequencing data reveals human tissue-specific circular RNAs. BMC Genomics, 19(1), 958.
  • Orzechowski and Boryczko (2016a) Orzechowski, P. and Boryczko, K. (2016a). Hybrid biclustering algorithms for data mining. In G. Squillero and P. Burelli, editors, Applications of Evolutionary Computation, pages 156–168, Cham. Springer International Publishing.
  • Orzechowski and Boryczko (2016b) Orzechowski, P. and Boryczko, K. (2016b). Propagation-based biclustering algorithm for extracting inclusion-maximal motifs. Computing and Informatics, 35(2), 391–410.
  • Orzechowski and Boryczko (2016c) Orzechowski, P. and Boryczko, K. (2016c). Text mining with hybrid biclustering algorithms. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, and J. M. Zurada, editors, Artificial Intelligence and Soft Computing, pages 102–113, Cham. Springer International Publishing.
  • Orzechowski et al. (2017) Orzechowski, P., Pańszczyk, A., Huang, X., and Moore, J. H. (2017). runibic: a Bioconductor package for parallel row-based biclustering of gene expression data. bioRxiv.
  • Orzechowski et al. (2018) Orzechowski, P., Sipper, M., Huang, X., and Moore, J. H. (2018). EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics, page bty401.
  • Padilha and Campello (2017) Padilha, V. A. and Campello, R. J. (2017). A systematic comparative evaluation of biclustering techniques. BMC Bioinformatics, 18(1), 55.
  • Singh et al. (2018) Singh, A., Bhanot, G., and Khiabanian, H. (2018). TuBA: Tunable biclustering algorithm reveals clinically relevant tumor transcriptional profiles in breast cancer. bioRxiv, page 245712.