BIDEAL: A Toolbox for Bicluster Analysis – Generation, Visualization and Validation

07/26/2020
by   Nishchal K. Verma, et al.

This paper introduces a novel toolbox named BIDEAL for the generation of biclusters and their analysis, visualization, and validation. The objective is to let researchers use state-of-the-art biclustering algorithms on a single platform. A single toolbox comprising various biclustering algorithms plays a vital role in extracting meaningful patterns from data for detecting diseases, biomarkers, gene-drug associations, etc. BIDEAL consists of seventeen biclustering algorithms, three bicluster visualization techniques, and six validation indices. The toolbox can analyze several types of data, including biological data, through a graphical user interface. It also provides data preprocessing techniques, i.e., binarization, discretization, normalization, and elimination of null and missing values. The effectiveness of the toolbox is demonstrated through testing and validation on the Saccharomyces cerevisiae cell cycle, Leukemia cancer, Mammary tissue profile, and Ligand screen in B-cells datasets. The biclusters of these datasets have been generated using BIDEAL and evaluated in terms of coherency, differential co-expression ranking, and similarity measure. The generated biclusters are also visualized through a heat map and a gene plot.


1 Introduction

Biclustering has become a prevalent and useful data mining technique among researchers for analyzing data. It has been applied to a wide variety of applications such as bioinformatics, information retrieval, text mining, dimensionality reduction, recommender systems, electoral data analysis, disease identification, association rule discovery in databases, and many more [1]. Among these, bioinformatics [2] [3] has made particular use of biclustering for the analysis of gene expression data. During any biological process under different experimental conditions, genes are examined by their expression levels. The data is arranged in matrix form, with rows representing genes and columns representing experimental conditions. The aim is to group genes and conditions into a sub-matrix to obtain crucial biological information, such as the identification of co-regulated patterns among genes. A bicluster B can be represented as

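$$B = [x_{ij}], \quad i \in I \subseteq \{1, \dots, n\}, \quad j \in J \subseteq \{1, \dots, m\} \tag{1}$$

where $x_{ij}$ refers to the expression level of instance $i$ under sample $j$, $I$ and $J$ are the selected sets of instances and samples, $n$ is the number of instances, and $m$ is the number of attributes. Biclustering involves finding the maximal sub-matrices of a data matrix with maximum coherency. Since biclustering is an NP-hard problem, various heuristic and meta-heuristic approaches have been used in the literature to find better solutions [4].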

The traditional clustering algorithms give equal importance to all the columns. These algorithms include k-means clustering [5], hierarchical clustering [6], self-optimal clustering [7], improved mountain clustering [8], fuzzy C-means clustering [9], unsupervised fuzzy clustering [10], etc. Each algorithm has its own advantages. Despite their usefulness, they are of limited help in a variety of problems. For example, in gene expression analysis, not every gene takes part in every condition. Thus, biclustering, which captures combinatorial regulation and joint patterns of gene expression, is essential to understand the complex nature of genes. In [11], a plethora of solutions for biclustering has been presented. Each algorithm in this pool has its own distinctive approach, heuristic or statistical, with its merits and demerits. A single approach cannot be expected to suit all types of data. So, a problem should be tackled with several suitable algorithms and the best result retained. This creates the need for a comprehensive biclustering toolbox where various algorithms can be tested, validated, and visualized. A toolbox can be compared in terms of the following:

  1. Number of algorithms embedded in the toolbox.

  2. Number of validation indices present for qualitative analysis of generated biclusters.

  3. Number of visualization methods available for generated biclusters.

  4. User-friendly interface of the toolbox.

Toolbox | Algorithms | Validation Indices | Visualization Methods | Platform
BicAT [12] | CC [23], ISA [27], OPSM [26], xMotif [30] | None | Heat Map [48] | Java
BiVisu [16] | Greedy version of pCluster | Mean Square Residue, Average Correlation Value | Heat Map, Parallel Coordinate Plots | MATLAB
BicOverlapper 2.0 [13] | Visualization Toolbox | None | Venn-like Diagrams | R, Java
Expander [14] | SAMBA [46] | None | Heat Map | Java
BAT [17] | BiHEA [47] | Pairwise Gene Analysis | Heat Map, Numerical Matrix | Java
BiBench [18] | CC, OPSM, xMotif, kSpectral [28], ISA, Plaid [31], BiMax [32], Bayesian, QUBIC [38], FABIA [34], COALESCE | Jaccard Index [40], F-measure | Heat Map, Bicluster Projection, Parallel Coordinates | Python
BiClust [19] | BiMax, CC, Plaid | Jaccard Index, Constant Variance | Parallel Plot, Heat Map, Bubble Plot | R
BicNET [15] | BicNET | None | Biclustering Network Data | Java
MTBA [20] | CC, BSGP [25], ISA, OPSM, kSpectral, ITL [29], xMotif, BiMax, Plaid, FLOC [24], BiMax, LAS [33] | Jaccard Index, SB Score, Constant and Sign Variance | Heat Map, Gene Plot | MATLAB
CoClust [21] | Modularity Based, Information-Theoretic Based | None | Cluster Plot, Cluster Size, Heat Map, Cluster Graph | Python
BicPAMS [22] | BicPAM, BicNET, Bic2PAM, BiP, BiModule | None | Graphical Display, Heat Map | Java
BIDEAL (Proposed Toolbox) | CC, BSGP, OPSM, ISA, kSpectral, ITL, xMotif, Plaid, FLOC, BiMax, LAS, FABIA, BitBit [35], BiSim [36], MSVD [37], QUBIC, ROBA [39] | Jaccard Index, SB Score, Constant and Sign Variance, Hausdorff, MSE | Heat Map, Gene Plot, Cluster Plot, Numerical Matrix | MATLAB

TABLE I: Summary of the biclustering toolboxes

Based on the above-mentioned features, a toolbox should be diverse in nature. In the past decade, the growing demand for biclustering algorithms has led to intense research on developing biclustering toolboxes. This paper proposes a user-friendly toolbox named “BIDEAL”, which incorporates biclustering algorithms, validation indices, and visualization methods. Table I summarizes various biclustering toolboxes in terms of available algorithms, validity indices, and visualization methods. Regarding visualization methods or result presentation for generated biclusters, BicAT [12], BicOverlapper 2.0 [13], Expander [14], and BicNET [15] provide only a single visualization method. On the other hand, BiVisu [16], BAT [17], BiBench [18], BiClust [19], MTBA [20], CoClust [21], BicPAMS [22], and BIDEAL have multiple visualization methods. Among these, CoClust and BIDEAL offer the maximum number of visualization methods. By default, BIDEAL provides bicluster results in a numerical matrix. Another important feature of a toolbox is the set of validation indices for checking the quality of the obtained biclusters. BiVisu, BAT, BiBench, and BiClust offer only one or two validation indices, whereas BIDEAL has six, the maximum among the listed toolboxes. A Graphical User Interface (GUI) that executes various algorithms on a single platform simplifies the process. The user-friendly interface of BIDEAL makes testing a new dataset easy without any prior knowledge of back-end programming. On the other hand, BiBench, BiVisu, BiClust, CoClust, and MTBA require some familiarity with programming. Moreover, BicAT only allows the execution of algorithms with default parameter settings, which is a constraint, whereas BIDEAL allows these parameters to be changed.

Contributions: This paper introduces the proposed BIDEAL toolbox, its necessity, and importance in comparison with other existing toolboxes in literature. Table II summarizes the comparison of the features available in BIDEAL with respect to existing toolboxes in the literature. In summary, the features of BIDEAL are as follows:

  1. It is developed to integrate the largest number of biclustering algorithms, validation indices, and visualization methods (over existing toolboxes) on a single platform.

  2. It also includes preprocessing methods.

  3. It has a more user-friendly interface than other existing biclustering toolboxes.

  4. To demonstrate the usefulness of BIDEAL, it has been tested on four standard datasets and the resulting validation indices have been compared.

To the best of our knowledge, no existing biclustering toolboxes have all these features incorporated on a single platform.

The paper is organized as follows: Section 2 presents a brief introduction to the biclustering algorithms embedded in BIDEAL, Section 3 describes the validation indices, Section 4 illustrates the GUI of BIDEAL, and Section 5 provides the results on four standard datasets using BIDEAL. Finally, Section 6 concludes the paper.

Features | BicAT [12] | BiVisu [16] | BicOverlapper 2.0 [13] | Expander [14] | BAT [17] | BiBench [18] | BiClust [19] | BicNET [15] | MTBA [20] | CoClust [21] | BicPAMS [22] | BIDEAL (Proposed Toolbox)
No. of Algorithms | 5 | 1 | 1 | 1 | 1 | 4 | 3 | 1 | 12 | 3 | 5 | 17
No. of Validation Indices | 0 | 2 | 0 | 0 | 1 | 2 | 2 | 0 | 4 | 0 | 0 | 6
No. of Visualization Methods | 1 | 2 | 1 | 1 | 3 | 3 | 3 | 2 | 2 | 4 | 2 | 5
Graphical User Interface (GUI) | Yes | No | Yes | Yes | Yes | No | No | Yes | No | No | Yes | Yes

The values shown in bold represent the best feature among all the toolboxes.

TABLE II: Comparison of the features of various biclustering toolboxes

2 BIDEAL: Ready for use Biclustering Algorithms

This section provides a brief overview of biclustering algorithms embedded in BIDEAL.

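Cheng and Church (CC) [23] proposed an algorithm to process expression data on the basis of the Mean Squared Residue (MSR) score. For a bicluster with row set $I$ and column set $J$, it is defined as

$$\mathrm{MSR}(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} \left( x_{ij} - x_{iJ} - x_{Ij} + x_{IJ} \right)^2 \tag{2}$$

where $x_{ij}$ is the expression level of gene $i$ under condition $j$, and $x_{iJ}$, $x_{Ij}$, and $x_{IJ}$ are the row mean, the column mean, and the overall mean of the bicluster, respectively.

MSR measures the coherency of genes and conditions using mean values and is used to extract δ-biclusters. Another effective algorithm is FLexible Overlapped biClustering (FLOC) [24]. It performs probabilistic steps to find overlapped biclusters, which are further refined using the MSR score to overcome the effect of missing values in biclusters. Missing values often create random disturbances which degrade the quality of the biclusters and slow down their identification. Compared to CC, the biclusters acquired by FLOC are larger and have a smaller MSR.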
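As a rough illustration of the MSR score, the following Python sketch computes it for a sub-matrix selected by row and column indices. It assumes the usual Cheng and Church definition: the mean, over all cells of the sub-matrix, of the squared residue (cell value minus row mean minus column mean plus overall mean). The function name, the list-of-lists layout, and the toy matrix are assumptions made for this example and are not part of BIDEAL, which is a MATLAB toolbox.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the sub-matrix picked by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Toy data (hypothetical): on the first three columns the rows differ only
# by a constant shift; the fourth column breaks that pattern.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
]
coherent = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
noisy = mean_squared_residue(data, [0, 1, 2], [0, 1, 2, 3])
print(coherent, noisy)
```

A perfectly additive sub-matrix gives a score of (numerically) zero, while adding the incoherent column raises the score, which is the behaviour a δ threshold relies on.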

Dhillon [25] used Bipartite Spectral Graph Partitioning (BSGP), which models the data matrix as a bipartite graph between rows and columns. It is based on an exhaustive bicluster enumeration approach, which tries to find the minimum-cut vertex partitions of this bipartite graph. It is quite expensive in terms of time and memory. The BSGP approach can be represented as

(3)

The Order Preserving Sub-Matrices (OPSM) [26] algorithm finds sub-matrices whose expression levels are in strictly increasing linear order. The algorithm uses a heuristic approach for biclustering. A sub-matrix is said to be order preserving if there is a permutation of its conditions under which the expression values of every gene are linearly increasing or decreasing.
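The order-preserving property can be checked directly for a candidate sub-matrix. The sketch below is only an illustration of the definition, not the heuristic search used by OPSM: it takes the column order that sorts the first row and verifies that every other row is strictly increasing under that same order. The function name and the toy data are hypothetical.

```python
def is_order_preserving(matrix, rows, cols):
    """True if one column permutation makes every selected row strictly increasing."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    if not sub or not sub[0]:
        return False
    # If a valid permutation exists, it must be the one that sorts the first row.
    order = sorted(range(len(cols)), key=lambda j: sub[0][j])
    for row in sub:
        permuted = [row[j] for j in order]
        if any(a >= b for a, b in zip(permuted, permuted[1:])):
            return False
    return True


# Toy data (hypothetical): the first three rows share the ordering
# column 1 < column 2 < column 0; the last row does not.
data = [
    [3.0, 1.0, 2.0],
    [30.0, 10.0, 20.0],
    [0.3, 0.1, 0.25],
    [1.0, 2.0, 3.0],
]
print(is_order_preserving(data, [0, 1, 2], [0, 1, 2]))
print(is_order_preserving(data, [0, 1, 2, 3], [0, 1, 2]))
```

A sub-matrix whose rows all decrease under some permutation also passes, since reversing that permutation makes them all increase.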

Another approach, the Iterative Signature Algorithm (ISA) proposed by Bergmann et al. [27], is based on coherent, overlapping biclusters, also referred to as Transcription Modules (TM). It extracts biclusters from the gene expression data matrix by iterative search.

In [28], Kluger et al. proposed a spectral technique known as kSpectral to find biclusters based on the eigenvectors of the data matrix. First, the dataset is normalized, and then singular value decomposition is applied to the microarray data, where piecewise-constant eigenvectors reveal the checkerboard patterns in the sub-matrix. Finally, k-means clustering is applied to obtain the checkerboard structures from the data matrix.

In [29], the authors presented the information-theoretic (ITL) formulation for biclustering. In this formulation, an optimization approach is followed in which the numbers of row and column clusters are constraints and the task is to maximize the mutual information between the clustered random variables. It can reduce the problems of high dimensionality and sparsity.

Murali et al. [30] proposed a representation for gene expression data called conserved gene expression motifs, or xMotifs. The algorithm tries to find large conserved gene expression motifs in the given discretized data matrix. It uses a greedy approach that looks for conserved rows. A sub-matrix is said to be a conserved motif if the expression level of each gene is consistent within the sub-matrix. Comparing distinct gene motifs for distinct conditions reveals genes which are conserved in multiple conditions but are in different states in different conditions.

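In Plaid [31], a bicluster is assumed to follow a statistical model, and binary least squares is used to fit the bicluster membership parameters. In this model, the data matrix is considered a superposition of layers, where each layer is a subset of genes and conditions of the data matrix. The plaid model fitted to the data can be expressed as

$$x_{ij} = \mu_0 + \sum_{k=1}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right) \rho_{ik}\, \kappa_{jk} \tag{4}$$

where $\mu_0$ is the background level, $\mu_k$, $\alpha_{ik}$, and $\beta_{jk}$ are the mean, row, and column effects of layer $k$, and $\rho_{ik}, \kappa_{jk} \in \{0, 1\}$ indicate whether gene $i$ and condition $j$ belong to layer $k$.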

Binary Inclusion Maximal (BiMax) is based on a fast divide-and-conquer approach [32]. It tries to find all the inclusion-maximal biclusters, i.e. biclusters which contain only 1s and cannot be extended by any row or column. The algorithm requires discretization of the gene expression matrix into a binary matrix by choosing a threshold.
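To make the notion of an inclusion-maximal bicluster concrete, the sketch below tests whether a candidate set of rows and columns of an already binarized matrix is all 1s and cannot be extended by any further row or column. It only checks a given candidate and does not reproduce the divide-and-conquer search of BiMax; the function names and the toy matrix are hypothetical.

```python
def is_all_ones(binary, rows, cols):
    """True if every selected cell of the binary matrix equals 1."""
    return all(binary[i][j] == 1 for i in rows for j in cols)


def is_inclusion_maximal(binary, rows, cols):
    """True if (rows, cols) is all 1s and no extra row or column keeps it all 1s."""
    if not rows or not cols or not is_all_ones(binary, rows, cols):
        return False
    row_set, col_set = set(rows), set(cols)
    # Any larger all-ones bicluster would contain at least one extra row or
    # column that is all 1s on the current selection, so single extensions suffice.
    for i in range(len(binary)):
        if i not in row_set and all(binary[i][j] == 1 for j in cols):
            return False
    for j in range(len(binary[0])):
        if j not in col_set and all(binary[i][j] == 1 for i in rows):
            return False
    return True


# Toy binarized matrix (hypothetical).
binary = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
print(is_inclusion_maximal(binary, [0, 1], [0, 1]))
print(is_inclusion_maximal(binary, [0], [0, 1]))
```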

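Large Average Sub-Matrix (LAS) [33] is a statistically motivated algorithm which uses a Gaussian null model for gene expression data. It searches for the bicluster with the largest significance score. For a $k \times l$ sub-matrix $U$ with average value $\tau$ in an $n \times m$ data matrix, the score is defined as

$$S(U) = -\log \left[ \binom{n}{k} \binom{m}{l}\, \Phi\left( -\tau \sqrt{kl} \right) \right] \tag{5}$$

where $\Phi$ is the standard normal cumulative distribution function. Once the highest-scoring sub-matrix is found, its average is subtracted from its elements in the data matrix to form a residual matrix. The search is repeated iteratively on the residual matrix until the best score (5) falls below a predefined threshold.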
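A minimal sketch of the LAS significance score is given below. It assumes the commonly cited form of the score from [33], the negative logarithm of the number of k-by-l sub-matrices times the standard normal tail probability of the scaled sub-matrix average, and it assumes the data have been standardized to match the Gaussian null model. The function name and argument names are hypothetical.

```python
import math


def las_score(n_rows, n_cols, k, l, tau):
    """Score of a k-by-l sub-matrix with average tau in an n_rows-by-n_cols matrix.

    Assumed form: -log( C(n_rows, k) * C(n_cols, l) * Phi(-tau * sqrt(k * l)) ),
    where Phi is the standard normal CDF.
    """
    log_count = math.log(math.comb(n_rows, k)) + math.log(math.comb(n_cols, l))
    # Phi(-x) = 0.5 * erfc(x / sqrt(2)); this underflows to 0 for very large x,
    # so this simple version is only suitable for moderate tau * sqrt(k * l).
    tail = 0.5 * math.erfc(tau * math.sqrt(k * l) / math.sqrt(2.0))
    return -(log_count + math.log(tail))


low = las_score(100, 50, 5, 4, 1.0)
high = las_score(100, 50, 5, 4, 2.0)
print(low, high)
```

For a fixed sub-matrix size, a larger average gives a larger score, which is why the search favours large sub-matrices with high averages.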

Hochreiter et al. [34] presented a multiplicative-model biclustering algorithm, Factor Analysis for Bicluster Acquisition (FABIA), that takes linear dependencies between genes and conditions into account. In this model, the row vectors of a bicluster are multiples of each other, and so are its column vectors. FABIA models the data matrix as the sum of biclusters and additive noise. Here, the linear dependency of subsets of rows and columns is described by the outer product $\lambda z^{T}$ of two vectors. The overall model is given by

$$X = \sum_{k=1}^{K} \lambda_k z_k^{T} + \Upsilon \tag{6}$$

where $K$ is the number of biclusters and $\Upsilon$ is the additive noise.
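The multiplicative structure can be illustrated by building the noise-free part of such a model, i.e. a sum of outer products of sparse vectors. The sketch below is only an illustration of that structure, not FABIA's fitting procedure; the function names and the toy vectors are hypothetical.

```python
def outer_product(column_vector, row_vector):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[a * b for b in row_vector] for a in column_vector]


def sum_of_biclusters(lambdas, zs):
    """Noise-free model: the sum over biclusters of the outer products."""
    n_rows, n_cols = len(lambdas[0]), len(zs[0])
    total = [[0.0] * n_cols for _ in range(n_rows)]
    for lam, z in zip(lambdas, zs):
        layer = outer_product(lam, z)
        for i in range(n_rows):
            for j in range(n_cols):
                total[i][j] += layer[i][j]
    return total


# Two toy biclusters (hypothetical): zeros mark genes or conditions
# that do not belong to the bicluster.
lam1, z1 = [1.0, 2.0, 0.0, 0.0], [1.0, 3.0, 0.0]
lam2, z2 = [0.0, 0.0, 1.0, -1.0], [0.0, 2.0, 2.0]
signal = sum_of_biclusters([lam1, lam2], [z1, z2])
for row in signal:
    print(row)
```

Within each bicluster the rows are multiples of each other, for example the first two rows of the result differ by a factor of 2.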

In [35], bit-patterns are extracted from the data matrix using a two-phase process known as the BitBit algorithm. The first phase includes a novel encoding process which divides the columns of the data matrix into groups of a certain length determined by the minimum number of columns. In the second phase, biclustering of bit patterns takes place using selective search. Each pair of rows generates a pattern. In BitBit, the comparison between rows takes place at the bit level. To reduce computation, BiSim [36] uses an iterative approach instead of the divide-and-conquer approach of BiMax, thereby avoiding recursion and additional traversals of the matrix.

Wang et al. [37] proposed the Modular Singular Value Decomposition Multi-Objective Evolutionary biclustering (MSVD) algorithm. MSVD splits the gene expression data matrix into a set of sub-matrices with equal dimensions in a non-overlapping manner. Then, it projects the data onto the desired number of eigenvalues and applies k-means clustering to cluster them.

Another algorithm embedded in BIDEAL is QUalitative BIClustering (QUBIC) [38], which is based on a graph-theoretic approach. In QUBIC, the expression level of a gene under multiple conditions is expressed in a qualitative or semi-qualitative manner as an integer value.

Tchagang et al. proposed ROBA [39], which uses basic linear algebra techniques. There are three main steps in this algorithm. The first step preprocesses the data to handle missing values and noise. The second step decomposes the given data matrix into binary matrices. The last step identifies biclusters according to the type of bicluster.

3 BIDEAL: Accessible Validation Indices for Performance Measures

Various validation indices as performance measures are used to check the quality of biclusters as described in further subsections.

3.1 Jaccard Index

Jaccard index [40] compares the biclusters obtained by applying two biclustering algorithms by finding the number of similar biclusters between them. It gives a value of 0 if the biclusters are completely dissimilar and 1 if they are identical. For two biclusters $B_1$ and $B_2$, the Jaccard index is defined as

$$J(B_1, B_2) = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|} \tag{7}$$
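One common way to apply the Jaccard index to a pair of biclusters is to treat each bicluster as the set of matrix cells it covers and divide the size of the intersection by the size of the union. The sketch below follows that convention, which is an assumption for illustration and may differ from the exact variant implemented in BIDEAL; the function names and the toy biclusters are hypothetical.

```python
def bicluster_cells(rows, cols):
    """Set of (row, column) cells covered by a bicluster."""
    return {(i, j) for i in rows for j in cols}


def jaccard_index(bic_a, bic_b):
    """Jaccard index of two biclusters, each given as a (rows, cols) pair."""
    cells_a = bicluster_cells(*bic_a)
    cells_b = bicluster_cells(*bic_b)
    union = cells_a | cells_b
    if not union:
        return 0.0
    return len(cells_a & cells_b) / len(union)


# Toy biclusters (hypothetical): they share rows 1, 2 and column 1.
bic_a = ([0, 1, 2], [0, 1])
bic_b = ([1, 2, 3], [1, 2])
print(jaccard_index(bic_a, bic_b))
```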

3.2 SB Score

The differential co-expression ranking score, a.k.a. SB score, was proposed in [41]. Consider two biclusters formed by the same gene, the first under one set of conditions and the second under a different set of conditions. Chia et al. proposed an algorithm to compare the goodness of a gene w.r.t. two non-identical sets of conditions. If the gene is a good gene, then there will be co-expression between the gene and the first set of conditions, and differential co-expression between the gene and the second set of conditions. The differential co-expression of the gene can be measured as

(8)

where is used to offset the large ratios.

3.3 Constant Variance

In [7], the variance of the genes/conditions is taken into consideration, where the variance is the average of the sum of Euclidean distances between rows and columns of the bicluster. The higher the value of the variance, the lower the quality of the bicluster. The expression of the variance is given by

(9)

3.4 Sign Variance

For a coherent bicluster, the value of sign variance is lower [20]. It is the same as constant variance, except that it first converts the data matrix into a sign matrix and then estimates the variance.

3.5 Hausdorff Distance

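The Hausdorff distance [42] calculates the distance between a pair of sub-matrices obtained from the gene expression data matrix. It is the largest distance from an element of the first bicluster to the nearest element of the second bicluster, and it signifies dissimilarity. For two biclusters $B_1$ and $B_2$ and a distance $d$ between their elements, it can be written as

$$d_H(B_1, B_2) = \max \left\{ \max_{a \in B_1} \min_{b \in B_2} d(a, b),\; \max_{b \in B_2} \min_{a \in B_1} d(a, b) \right\} \tag{10}$$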
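The sketch below computes the standard symmetric Hausdorff distance between two sets of points, taking the larger of the two directed distances. Treating each row of a bicluster as a point and using the Euclidean distance between rows are assumptions for illustration, as are the function names and the toy points; the rows of both biclusters must then have the same length.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


# Toy biclusters (hypothetical), each row is one gene profile over two conditions.
bic_a = [(0.0, 0.0), (1.0, 0.0)]
bic_b = [(0.0, 0.0), (4.0, 0.0)]
print(directed_hausdorff(bic_a, bic_b))
print(hausdorff_distance(bic_a, bic_b))
```

The directed distance is not symmetric (here 1.0 from the first set to the second, 3.0 in the other direction), which is why the maximum of the two is taken.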

3.6 Mean Squared Residue

To calculate the mean squared residue, the mean square error (MSE) of each bicluster is calculated [23]. Then the overall MSE is calculated by taking the mean of the individual values.

4 BIDEAL: Key Features and GUI

BIDEAL integrates various biclustering algorithms into a stand-alone application with a graphical user interface (GUI) developed using MATLAB. It is executable on Windows as well as on Linux operating systems. BIDEAL includes several functions to preprocess the raw data and to validate and visualize the biclusters. The key features of BIDEAL are as follows:

Fig. 1: Graphical User Interface of BIDEAL. From left to right: Homepage, Visualization page, and Validation page.

4.1 Data Preprocessing

BIDEAL includes four preprocessing methods, i.e. filtering, binarization, discretization, and normalization. Filtering is used to eliminate the effect of Not a Number (NaN) entries and missing values in the data. Binarization converts a numerical feature vector into a Boolean one; it is mostly useful for downstream probabilistic estimators which assume that the input data is distributed according to a multivariate Bernoulli distribution. Discretization, a.k.a. quantization or binning, transforms continuous features into discrete values. In some datasets, continuous features are not linearly correlated with the target and cannot be handled by feature selection methods. In such cases, obtaining an interpretable explanation of such features is not easy. However, this type of data may benefit from discretization, because it transforms a dataset of continuous attributes into one with only nominal attributes. Normalization scales the individual samples to have unit norm. In general, min-max and z-score normalization are used when the data come from a normal distribution. However, biomedical data and most clinical research data do not follow the normal distribution because they are mostly skewed. For such data, logarithmic transformation, bistochastization, and independent re-scaling of rows and columns are used. The log transformation decreases the variability of the data. Bistochastization repeatedly normalizes the matrix until convergence so that all rows and columns have the same mean value, whereas independent row and column normalization makes the rows sum to one constant and the columns sum to a different constant [28].
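The four preprocessing steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BIDEAL's MATLAB implementation: filtering drops whole rows that contain a missing value, normalization is global min-max scaling, binarization uses a "greater than or equal to threshold" rule, and discretization uses equal-width bins. All function names and the toy data are hypothetical.

```python
import math


def filter_rows(matrix):
    """Drop every row that contains None or NaN."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [row for row in matrix if not any(is_missing(v) for v in row)]


def min_max_normalize(matrix):
    """Scale all values to [0, 1] using the global minimum and maximum."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / span for v in row] for row in matrix]


def binarize(matrix, threshold):
    """Map values at or above the threshold to 1 and all others to 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def discretize(matrix, n_bins):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    scaled = min_max_normalize(matrix)
    return [[min(int(v * n_bins), n_bins - 1) for v in row] for row in scaled]


# Toy data (hypothetical) with one missing value.
raw = [[1.0, 5.0], [float("nan"), 2.0], [3.0, 9.0]]
clean = filter_rows(raw)
scaled = min_max_normalize(clean)
print(clean)
print(scaled)
print(binarize(scaled, 0.5))
print(discretize(clean, 4))
```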

4.2 Largest Number of Biclustering Algorithms

For bicluster generation, seventeen biclustering algorithms have been embedded in BIDEAL, which is the maximum among all the toolboxes listed in Table II. This provides the flexibility to select biclustering algorithms according to the nature of the data. The availability of all algorithms on a single platform allows the data to be analyzed with minimal effort.

4.3 Initial Parameter Setting of Algorithms

Without prior knowledge of the algorithms, setting the parameters is quite challenging for a naive user. BIDEAL initializes the parameters with the values provided in the original published work, which users can easily change if needed.

4.4 Robust Bicluster Generation

BIDEAL offers several ways to ensure smooth and robust bicluster generation. For example, the filtering option is available to reduce the effect of NaN and missing values present in the dataset.

4.5 Identification of Cluster Type

BIDEAL offers validation indices to determine the type of biclusters. For example, the constant variance can identify constant biclusters, whereas the sign variance can identify biclusters with coherent sign changes on rows and columns.

4.6 Similarity Measures

BIDEAL offers two validation indices, i.e. Jaccard index and Hausdorff distance, to measure the similarity and dissimilarity, respectively, between two biclusters. The value of the Jaccard index of a particular biclustering algorithm varies from 0 to 1 depending upon the level of similarity. Hausdorff distance, widely used in several applications, can also measure the distance between two distinct biclusters. For example, for the Yeast dataset, Jaccard index values were calculated w.r.t. the CC algorithm, and the results obtained from the other algorithms were dissimilar from CC, as the Jaccard index values were very low for all other algorithms.

4.7 User-Friendly Interface

BIDEAL offers a user-friendly GUI which is easy to use for bicluster analysis, including generation, visualization, and validation. The unique features of this interface are:

  1. BIDEAL is a self-contained, concise toolbox with all the relevant information present in it. It provides immediate visual results and shows the effect of each action.

  2. In many cases, the installation of a toolbox depends on other components, such as the language runtime, which in general are not shipped with the toolbox package. To ease installation, the stand-alone executable files are packaged with the MATLAB run-time compiler in BIDEAL. This enables the user to just click and install the ready-to-use biclustering algorithms.

4.8 Implementation and GUI

BIDEAL has been developed using MATLAB, which integrates various features into a stand-alone application. Using the GUI of the BIDEAL toolbox involves the following steps for bicluster generation, validation, and visualization:

  1. The home page of BIDEAL is shown in Fig. 1. First, load the dataset. It can be either a sample dataset or a user-defined dataset.

  2. Preprocess the data using filtering, binarization, normalization, or discretization.

  3. Select the required algorithm to generate biclusters. The user is prompted to enter the input parameters; otherwise, BIDEAL uses the default values.

  4. Save the generated results in a .mat file.

  5. Click the Bicluster Visualization button on the home page to visualize the biclusters. Any of the three options available on the visualization page, i.e. heat map, cluster plot, or gene profile, can be clicked to visualize the result.

  6. Click the Bicluster Quality Index button to access the validation indices. The validation page displays the result for an individual bicluster or for all biclusters.

  7. Press the Reset button to return to the home page.

Fig. 2: Validation indices on various datasets.
Algorithm | Yeast [23] | ALL vs. AML [43] | GDS205 [44] | GDS301 [45]
CC [23] | 100 | 1 | 1 | 1
BSGP [25] | 10 | 0 | 7 | 0
OPSM [26] | 16 | 37 | 7 | 10
ISA [27] | 16 | 500 | 13 | 0
kSpectral [28] | 0 | 0 | 6 | 0
ITL [29] | 1 | 100 | 0 | 100
xMotif [30] | 97 | 89 | 5 | 39
Plaid [31] | 4 | 4 | 0 | 0
FLOC [24] | 20 | 20 | 20 | 20
BiMax [32] | 75 | 100 | 11 | 100
LAS [33] | 20 | 52 | 5 | 5
FABIA [34] | 2 | 5 | 5 | 4
BitBit [35] | 212 | 0 | 0 | 1
BiSim [36] | 1547 | 0 | 1 | 1
MSVD [37] | 13 | 100 | 3 | 1
QUBIC [38] | 10 | 0 | 10 | 0
ROBA [39] | 10104 | 32591 | 3925 | 0

TABLE III: Number of biclusters obtained with biclustering algorithms available in BIDEAL on four datasets

5 BIDEAL: Testing and Validations on Benchmark Datasets

Fig. 3: Left: Heat map plot and Right: Gene plot using CC algorithm for generated bicluster on Yeast dataset.

Generating patterns or biclusters for gene expression profiling through a single platform decreases user effort. Hence, BIDEAL offers a user-friendly interface to reduce the cumbersome work involved in bicluster formation. In this section, experiments and validation on four benchmark datasets are presented using BIDEAL. The four datasets used are the Saccharomyces cerevisiae cell cycle dataset (Yeast) [23] with genes and conditions, the Leukemia (ALL vs. AML) dataset [43] with genes and conditions, the Mammary tissue profile dataset (GDS205) [44] with genes and conditions, and the Ligand screen in B cells dataset (GDS301): Epstein Barr virus-induced molecule-1 [45] with genes and conditions. The biclusters formed on these four benchmark datasets are further validated using the validation indices available in BIDEAL, as depicted in Fig. 2. Table III tabulates the number of biclusters obtained using the biclustering algorithms embedded in BIDEAL. Since the Yeast [23] and ALL vs. AML [43] datasets are already preprocessed, only GDS205 and GDS301 were preprocessed before the execution of the biclustering algorithms. In the following subsections, the findings of BIDEAL are discussed in detail.

5.1 Saccharomyces Cerevisiae Cell Cycle (Yeast) Dataset

Yeast dataset [23] comprises genes and conditions. The objective of this dataset is the identification of genes whose mRNA levels are regulated by the cell cycle. The number of biclusters generated on the Yeast dataset using BIDEAL is reported in Table III. The table shows that, among all algorithms, ROBA generates the highest number of biclusters, whereas kSpectral fails to produce any bicluster. This is because kSpectral did not find any distinctive checkerboard patterns in the Yeast dataset. On the other hand, ROBA uses simple linear algebraic methods instead of complex optimization and extracted the highest number of biclusters, i.e. 10104. Since the ranking of biclustering algorithms is application specific, their utility cannot be measured by the number of biclusters alone: for example, BiSim forms 1547 biclusters (Table III), whereas ITL and FABIA extracted only 1 and 2 biclusters, respectively. However, all of them have their own biological significance. CC forms 100 biclusters (Table III), which cover approximately genes and approx. of conditions. Fig. 3 shows a sample heat map and gene plot using the CC algorithm for a generated bicluster on the Yeast dataset. BSGP and QUBIC reported 10 biclusters each, whereas FABIA and Plaid had very few biclusters with fewer genes and conditions. BitBit gave 212 biclusters, while kSpectral failed to produce any bicluster, which signifies that this model does not fit the given dataset. OPSM and ISA reported the same number of biclusters. Considering the quality of the obtained biclusters, the biclusters obtained using BiSim, FABIA, and kSpectral had no similarity w.r.t. CC in terms of the Jaccard index. On the other hand, ITL, Plaid, BitBit, and ISA had very low similarity. BSGP and MSVD gave higher similarity, while ROBA had the maximum similarity among all. According to the sign variance metric, the biclusters obtained using CC, Plaid, ISA, and FABIA were less coherent, while ROBA, BSGP, and BiMax gave strongly coherent biclusters. LAS, BiSim, and MSVD gave biclusters of average coherency. When measuring the quality of biclusters using constant variance, it was inferred that BSGP, MSVD, and BiMax formed better biclusters, while ISA and Plaid gave higher values of constant variance, indicating lower quality of biclusters. LAS, CC, BitBit, ITL, and FLOC gave biclusters of average quality.

5.2 Leukemia (ALL vs. AML) Dataset

The Leukemia dataset comprises two subtypes of leukemia cancer, i.e. Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL). It has genes and conditions. For the ALL vs. AML dataset as well, ROBA reported the highest number of biclusters (32591, Table III) due to its ability to extract more than one type of bicluster in a given dataset. As mentioned earlier, various biclustering algorithms are able to extract specific patterns from a dataset. For example, BSGP works better when the dataset can be modelled efficiently using a bipartite graph, whereas kSpectral is well known for extracting checkerboard patterns in data. For this dataset, neither pattern was applicable; therefore, 0 biclusters were reported. On the other hand, BitBit and BiSim are known to search for patterns in less time by traversing the binarized data matrix with tuned parameters. As shown in Table III, BSGP, kSpectral, BitBit, and BiSim failed to produce any bicluster. BiMax successfully extracts 100 inclusion-maximal biclusters from this dataset. It is interesting to note that ITL, BiMax, and MSVD produced the same number of biclusters, i.e. 100, though their objective functions differ from each other. ITL tries to preserve mutual information, whereas BiMax follows a divide-and-conquer strategy and MSVD is inspired by linear algebra techniques. CC formed only one bicluster, which contains all genes and conditions. LAS, OPSM, and xMotif produced 52, 37, and 89 biclusters, respectively. FABIA and Plaid extracted only 5 and 4 biclusters, due to the presence of few conditions and few layers as per the plaid model. Considering the Jaccard index similarity, xMotif and CC values were high. CC and xMotif had a negative score, which indicates differential co-expression. According to sign variance, CC gave coherent biclusters as it had the lowest value, while the high values of FLOC and BiMax indicate less coherent clusters. The rest of the algorithms generated biclusters with average coherency. From the constant variance values, it can be inferred that ISA gave very low quality biclusters.

5.3 Mammary Tissue Profile (GDS205) Dataset

GDS205 [44] comprises genes and conditions. For this dataset, ROBA again produced a high number of biclusters, i.e. 3925. This indicates that there are overlapping gene and sample sets where genes are involved in several biological pathways. The rest of the biclustering algorithms embedded in BIDEAL extracted only approximately 12 biclusters. BiMax successfully extracted 11 subsets of genes and conditions, whereas BiSim extracted only 1 bicluster. FABIA extracted only 5 biclusters, which signifies that the GDS205 dataset is not influenced by a heavy-tailed distribution. For this dataset, the advantage of the FLOC algorithm over CC is clearly shown. FLOC produced 20 biclusters (Table III) without being affected by random interference, whereas CC produced only 1 bicluster. BSGP and OPSM both gave 7 biclusters, indicating the presence of order-preserving sub-matrices in GDS205. kSpectral and xMotif produced 6 and 5 biclusters, respectively. LAS and MSVD discovered 5 and 3 biclusters, respectively. QUBIC identified the checkerboard pattern present in the data. For this dataset, ITL, Plaid, and BitBit failed to provide any bicluster. Plaid did not find any shift biclusters in this dataset, whereas ITL failed to find co-entropy-based subsets of genes and conditions. Considering the validity of these biclusters, in terms of sign variance, CC and QUBIC resulted in very low values, i.e. more coherent biclusters, whereas the biclusters produced by LAS were not coherent and hence had a high value of sign variance. According to the constant variance, CC and QUBIC produced the best biclusters, but FLOC gave a high value of constant variance, which means that the quality of its biclusters was not good. As for the other datasets, Jaccard indices were calculated w.r.t. CC. They show that the biclusters formed by BSGP and MSVD had the lowest similarity with the biclusters formed by CC. It can also be concluded that CC and QUBIC produced better biclusters for this dataset.

5.4 Ligand Screen in B Cells (GDS301) Dataset

The GDS301 dataset comprises  genes and 11 conditions, collected by culturing B cells with ligands to perform temporal analysis. As shown in Table III, BiMax produced the maximum number of biclusters: 100 biclusters consisting of 1s were found by enumeration. ITL discovered the same number of biclusters by extracting mutual information between genes and conditions. BSGP, kSpectral, and Plaid failed to produce any bicluster. Plaid discovers interesting patterns in multivariate data, whereas kSpectral identifies biclusters only if genes are co-regulated with expression levels. FABIA extracted 4 biclusters. CC, ISA, BitBit, and BiSim each reported a single bicluster containing all genes and conditions, which means that these algorithms failed to extract patterns from the dataset. MSVD also formed a single bicluster in which all conditions were present, but only  genes were matched. In terms of the Jaccard index, BitBit and BiSim had maximum similarity with CC, whereas ITL and BiMax had less similarity with CC. In terms of sign variance, xMotif and CC gave coherent biclusters, but the biclusters formed by FLOC were not coherent enough. Constant variance values were mostly similar; FABIA produced the maximum constant variance among all algorithms.
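The similarity pattern reported above can be illustrated with a minimal sketch. It assumes the Jaccard index is computed over the (gene, condition) cells covered by each bicluster, which is one common convention and not necessarily the exact definition used in BIDEAL. The matrix size and the helper names `bicluster_cells` and `jaccard_index` are hypothetical and chosen only for illustration.

```python
from itertools import product


def bicluster_cells(genes, conditions):
    """Return the set of (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))


def jaccard_index(bic_a, bic_b):
    """Jaccard index |A & B| / |A | B| over the cells of two biclusters.

    Each bicluster is a (genes, conditions) pair of index collections.
    Cell-level overlap is an assumed convention, not taken from the paper.
    """
    cells_a = bicluster_cells(*bic_a)
    cells_b = bicluster_cells(*bic_b)
    union = cells_a | cells_b
    if not union:
        return 0.0
    return len(cells_a & cells_b) / len(union)


# Toy matrix with 6 genes and 3 conditions (illustrative sizes only).
all_genes = range(6)
all_conditions = range(3)

whole_matrix_a = (all_genes, all_conditions)   # one bicluster covering everything
whole_matrix_b = (all_genes, all_conditions)   # a second algorithm returning the same
gene_subset = (range(4), all_conditions)       # all conditions, only some genes

same = jaccard_index(whole_matrix_a, whole_matrix_b)
partial = jaccard_index(whole_matrix_a, gene_subset)
print(f"identical biclusters: {same:.3f}, gene subset: {partial:.3f}")
```

Under this convention, two algorithms that both return the whole matrix as a single bicluster have a Jaccard index of 1, while a bicluster that keeps all conditions but only a fraction of the genes scores exactly that fraction.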

5.5 Biological Significance

The biological significance of biclustering algorithms refers to the identification of subsets of genes clustered with similar subsets of conditions to form a pattern or bicluster. The biclusters are useful for disease identification, biomarker generation, gene-drug association, etc. The reliability of these biclusters is assessed using various evaluation measures. BIDEAL provides constant variance and sign variance as evaluation measures to check the coherency, significance, and reliability of biclusters obtained using various biclustering algorithms. In terms of coherency, for the Yeast dataset, biclusters generated using the FLOC, BiMax, LAS, and ITL algorithms had low sign variance and constant variance. In the ALL vs. AML dataset, most of the algorithms failed to generate significantly coherent biclusters, except the CC and xMotif algorithms. In the GDS205 dataset, the CC and BiSim algorithms produced coherent biclusters, whereas in the GDS301 dataset, the CC, ITL, and ISA algorithms produced coherent biclusters. Another evaluation measure, the SB score, is also embedded in BIDEAL. The SB score was quite low for the Yeast dataset, except for the biclusters generated using the BSGP algorithm. This shows that the obtained biclusters had a higher co-expression level for two conditions among genes. In the ALL vs. AML dataset, the generated biclusters have differential co-expression among genes and conditions because the SB score was almost absent. The GDS205 dataset reported a high SB score, which signifies a higher co-expression ranking among genes w.r.t. two sets of conditions. In each dataset, at least one algorithm reported biclusters similar to those of the CC algorithm, for example ITL in the case of the ALL vs. AML dataset and BiSim in the GDS205 dataset.
As presented in Table III, it can be seen that for  datasets, ROBA gave an exceptionally large number of biclusters, which means overlapping biclusters were generated; FABIA and Plaid resulted in fewer biclusters for all datasets; and FLOC generated a constant number of biclusters, i.e. . For GDS301, only CC, OPSM, ITL, xMotif, FLOC, BiMax, LAS, ISA, MSVD, and FABIA gave results, and the results of BiSim and BitBit were quite similar to CC. In the case of the Yeast dataset, kSpectral failed to produce any bicluster, while ITL, Plaid, and BitBit gave no bicluster on the GDS205 dataset. Most of the biclusters formed using xMotif, BiSim, QUBIC, BSGP, and CC are of -type, which indicates clusters with a strong instance and attribute effect. The biclusters generated by MSVD, FLOC, ISA, and BiMax are of T-type and hence have a strong instance effect.
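To make the idea of a variance-based coherency check concrete, the sketch below scores a bicluster by the population variance of its entries around the bicluster mean. This is an assumed, simplified stand-in: the exact definitions of constant variance and sign variance used in BIDEAL are given elsewhere in the paper and are not reproduced here. The function names `submatrix` and `variance_score` and the toy matrix are hypothetical.

```python
from statistics import pvariance


def submatrix(data, rows, cols):
    """Extract the bicluster given by row and column indices."""
    return [[data[r][c] for c in cols] for r in rows]


def variance_score(data, rows, cols):
    """Population variance of all bicluster entries (assumed score).

    A value of 0 means every entry equals the bicluster mean, i.e. a
    perfectly constant bicluster; larger values mean less coherence.
    """
    values = [v for row in submatrix(data, rows, cols) for v in row]
    return pvariance(values)


# Toy expression matrix: rows are genes, columns are conditions.
data = [
    [2.0, 2.0, 9.0],
    [2.0, 2.0, 1.0],
    [5.0, 7.0, 3.0],
]

constant_bic = variance_score(data, rows=[0, 1], cols=[0, 1])  # entries all 2.0
noisy_bic = variance_score(data, rows=[0, 2], cols=[1, 2])     # entries 2, 9, 7, 3
print(f"constant bicluster: {constant_bic}, noisy bicluster: {noisy_bic}")
```

Under this assumed score, a lower value corresponds to a more coherent bicluster, which matches how low variance values are read in the discussion above.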

5.6 Execution Time and Size of Dataset

The proposed toolbox integrates various biclustering algorithms on a single platform; therefore, to measure the execution time, one needs to note the execution time of each algorithm. Since the complexity of the biclustering problem depends on the dataset and the objective function, the execution time can vary from a few seconds to hours. For example, on the Yeast dataset, CC, xMotif, and BiMax take less than 5 seconds to compute biclusters; BSGP, ISA, kSpectral, and FLOC take around 1 minute; BitBit and QUBIC extract biclusters in 30 minutes; and BiSim executes in 90 minutes. Regarding the maximum file size that can be handled, the proposed toolbox has been validated for datasets of up to 25 MB. The test has been performed on the Yeast dataset of file size 198 KB, the ALL vs. AML dataset of file size 656 KB, the GDS205 dataset of file size 120 KB, and the GDS301 dataset of file size 25 MB.
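Per-algorithm timing of this kind can be sketched as follows. The example is a generic illustration and not BIDEAL's own code: `cheap_scan` and `costly_pairwise` are hypothetical stand-ins for biclustering algorithms of different cost, and the toy matrix is arbitrary.

```python
import time


def time_algorithms(algorithms, data):
    """Run each callable on `data` and return its wall-clock time in seconds."""
    timings = {}
    for name, run in algorithms.items():
        start = time.perf_counter()
        run(data)
        timings[name] = time.perf_counter() - start
    return timings


# Hypothetical stand-ins for algorithms with different computational cost.
def cheap_scan(data):
    """Single pass over the matrix: one sum per row."""
    return [sum(row) for row in data]


def costly_pairwise(data):
    """Dot product between every pair of rows (quadratic in the row count)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in data] for r1 in data]


# Arbitrary 60 x 20 toy matrix.
toy_data = [[float((i * j) % 7) for j in range(20)] for i in range(60)]

timings = time_algorithms(
    {"cheap_scan": cheap_scan, "costly_pairwise": costly_pairwise},
    toy_data,
)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Measured times depend on the machine and the dataset, so such numbers are best compared only between algorithms run on the same data and hardware.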

6 Conclusions

The proposed “BIDEAL” toolbox has been developed to generate, validate, and visualize biclusters from any data on a single platform. It integrates well-known biclustering algorithms, validation indices, and visualization methods for comprehensive data interpretation. Additionally, it provides a preprocessing module to remove outliers and NaN spots from the data, which helps to rectify issues related to null values, discrete matrices, etc. The proposed toolbox has been tested and validated on four benchmark gene expression datasets, i.e. Yeast, ALL vs. AML, GDS205, and GDS301. It was inferred that each algorithm of BIDEAL can generate a distinct set of biclusters from the same data; therefore, the selection of an appropriate technique is required. The diverse nature of BIDEAL, with various validation indices and visualization methods, has proven effective for the selection of the best biclusters. Information retrieval from data mainly depends on the type of local patterns, i.e. whether the data has overlapping and constant biclusters or is noisy. We hope that the availability of BIDEAL will help the research community through widespread use of biclustering algorithms to identify coherent groups in data, which is very useful in disease subtype identification. Furthermore, the toolbox can help cater to data analysis needs, and it is being offered free to the community.

References

  • [1] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: a survey,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 1, no. 1, pp. 24-45, Jan.-March 2004.
  • [2] V. Singh, N. K. Verma, and Y. Cui, “Type-2 fuzzy PCA approach in extracting salient features for molecular cancer diagnostics and prognostics,” IEEE Trans. on NanoBioscience, vol. 18, no. 3, pp. 482-489, July 2019.
  • [3] R. K. Sevakula, V. Singh, N. K. Verma, C. Kumar, and Y. Cui, “Transfer learning for molecular cancer classification using deep neural networks,” IEEE/ACM Trans. Comput. Biol. Bioinf., 2018. (Early Access)
  • [4] B. Pontes, R. Giráldez, and J. S. Aguilar-Ruiz, “Biclustering on expression data: A review,” Journal of Biomedical Informatics, vol. 57, pp. 163-180, 2015.
  • [5] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281-297, 1967.
  • [6] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241-254, 1967.
  • [7] N. K. Verma and A. Roy, “Self optimal clustering technique using optimized threshold function,” IEEE Syst. Journal, vol. 99, pp. 1-14, 2013.
  • [8] N. K. Verma, A. Roy, and Y. Cui, “Improved mountain clustering algorithm for gene expression data analysis,” Journal of Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 30-35, 2011.
  • [9] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Journal Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
  • [10] A. B. Geva and D. H. Kerem, “Forecasting generalized epileptic seizures from the EEG signal by wavelet analysis and dynamic unsupervised fuzzy clustering,” IEEE Trans. on Biomedical Engg., vol. 45, no. 10, pp. 1205-1216, 1998.
  • [11] N. K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, and Y. Cui, “A comparison of biclustering algorithms,” in Int. Conf. on Systems in Medicine and Biology, pp. 90-97, 2010.
  • [12] S. Barkow et al., “BicAT: A biclustering analysis toolbox,” Bioinformatics, vol. 22, pp. 1282-1283, 2006.
  • [13] R. Santamaría, R. Therón, and L. Quintales, “BicOverlapper 2.0: Visual analysis for gene expression,” Bioinformatics, vol. 30, no. 12, pp. 1785-1786, 2014.
  • [14] R. Shamir et al., “EXPANDER - An integrative program suite for microarray data analysis,” Bioinformatics, vol. 6, no. 1, pp. 232, 2005.
  • [15] R. Henriques and S. C. Madeira, “BicNET: Flexible module discovery in large-scale biological networks using biclustering,” Algorithms for Molecular Biology, vol. 11, no. 1, pp. 1, 2011.
  • [16] K. O. Cheng et al., “BiVisu: Software tool for bicluster detection and visualization,” Bioinformatics, vol. 23, no. 17, pp. 2342-2344, 2007.
  • [17] C. A. Gallo, J. S. Dussaut, J. A. Carballido, and I. Ponzoni, “BAT: A new biclustering analysis toolbox,” LNCS in Advances in Bioinfo. and Compt. Biology, pp. 67-70, 2010.
  • [18] K. Eren, “Application of biclustering algorithms to biological data,” Diss. The Ohio State University, 2012.
  • [19] S. Kaiser and F. Leisch, “BiClust: A toolbox for biclustering analysis in R,” 2008.
  • [20] J. Gupta, S. Singh, and N. K. Verma, “MTBA: MATLAB toolbox for biclustering analysis,” IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions, IIT Kanpur, India, pp.148-152, 2013.
  • [21] R. François, M. Stanislas, and N. Mohamed, “CoClust: A python package for co-clustering,” Journal of Statistical Software, vol. 88, no. 7, pp. 1-29, 2018.
  • [22] H. Rui, F. Ferreira, and S. Madeira, “BicPAMS: Software for biological data analysis with pattern-based biclustering,” BMC Bioinformatics, vol. 18, no. 1, 2017.
  • [23] Y. Cheng and G. Church, “Biclustering of expression data,” Conf. on Intelligent Systems for Molecular Biology, vol. 8, pp. 93-103, 2000.
  • [24] J. Yang, H. Wang, W. Wang, and P. S. Yu, “An improved biclustering method for analyzing gene expression profiles,” Int. Journal on Artificial Intelligence Tools, vol. 14, no. 5, pp. 771-789, 2005.
  • [25] I. S. Dhillon, “Co-clustering documents and words using bipartite spectral graph partitioning,” Int. Conf. on Knowl. discovery and data mining, pp. 269-274, 2001.
  • [26] A. Ben-Dor et al., “Discovering local structure in gene expression data: the order-preserving submatrix problem,” Int. Conf. on Computational biology, vol. 10, pp. 49-57, 2000.
  • [27] S. Bergmann, J. Ihmels, and N. Barkai, “Iterative signature algorithm for the analysis of large-scale gene expression data,” Physical Review E, vol. 67, no. 3, pp. 031902, 2003.
  • [28] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein, “Spectral biclustering of microarray data: coclustering genes and conditions,” Genome research, vol. 13, no. 4, pp. 703-716, 2003.
  • [29] I. S. Dhillon, S. Mallela, and D. S. Modha, “Information-theoretic co-clustering,” Int. Conf. on Knowl. discovery and data mining, pp. 89-98, 2003.
  • [30] T. M. Murali and S. Kasif, “Extracting conserved gene expression motifs from gene expression data,” in Proc. of Pacific Symposium Biocomputing, vol. 3, pp. 77-88, 2003.
  • [31] L. Lazzeroni and A. Owen, “Plaid models for gene expression data,” Statistica Sinica, vol. 12, pp. 61-86, 2002.
  • [32] A. Prelić et al., “A systematic comparison and evaluation of biclustering methods for gene expression data,” Bioinformatics, vol. 22, no. 9, pp. 1122-1129, 2006.
  • [33] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel, “Finding large average sub-matrices in high dimensional data,” The Annals of Applied Statistics, pp. 985-1012, 2009.
  • [34] S. Hochreiter et al., “FABIA: Factor analysis for bicluster information acquisition,” Bioinformatics, vol. 26, no. 12, pp. 1520-1527, 2010.
  • [35] D. S. Rodriguez-Baena, A. J. Perez-Pulido, and J. S. Aguilar-Ruiz, “A bi-clustering algorithm for extracting bit-patterns from binary datasets,” Bioinformatics, vol. 27, no. 19, pp. 2738-45, 2011.
  • [36] N. Noureen and M. A. Qadir, “BiSim: A simple and efficient biclustering algorithm,” Int. Conf. on Soft Computing and Pattern Recognition, pp. 1-6, 2009.
  • [37] D. Wang and Zheng, “MSVD-MOEB algorithm applied to cancer gene expression data,” Int. Conf. on Awareness Science and Technology (iCAST), pp. 119-124, 2015.
  • [38] L. Guojun, Q. Ma, H. Tang, A. H. Paterson, and Y. Xu, “QUBIC: A qualitative biclustering algorithm for analyses of gene expression data,” Nucleic acids research, pp. gkp491, 2009.
  • [39] A. B. Tchagang and A. H. Tewfik, “Robust biclustering algorithm (ROBA) for DNA microarray data analysis,” Proc. IEEE/SP 13th Workshop on Statistical Signal Processing, pp. 984-989, 2005.
  • [40] M. Filippone, F. Masulli, and S. Rovetta, “Stability and performances in biclustering algorithms,” Int. Meeting on Comput. Intelligence Methods for Bioinformatics and Biostatistics, pp. 91-101, 2008.
  • [41] B. K. H. SB and R. K. M. Karuturi, “Differential co-expression framework to quantify goodness of biclusters and compare biclustering algorithms,” Algorithms for Molecular Biology, vol. 5, no. 1, pp. 23, 2010.
  • [42] N. K. Verma, E. Dutta, and Y. Cui, “Hausdorff distance and global silhouette index as novel measures for estimating quality of biclusters,” Int. Conf. on Bioinformatics and Biomedicine, pp. 267-272, 2015.
  • [43] T. R. Golub, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531-537, 1999.
  • [44] S. P. Suchyta et al., “Bovine mammary gene expression profiling using a cDNA microarray enhanced for mammary-specific transcripts,” Physiol Genomics, vol. 16, no. 1, pp. 8-18, 2003. Available Online: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS205.
  • [45] Available Online: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS301.
  • [46] A. Tanay, R. Sharan, and R. Shamir, “Discovering statistically significant biclusters in gene expression data,” Bioinformatics, vol. 18, pp. 136-144, 2002.
  • [47] C. A. Gallo, J. A. Carballido, and I. Ponzoni, “Bihea: A hybrid evolutionary approach for microarray biclustering,” Symposium on Bioinformatics, Springer, pp. 36-47, 2009.
  • [48] L. Wilkinson and M. Friendly, “The history of the cluster heat map,” The American Statistician, vol. 63, no. 2, pp. 179-184, 2009.