
Triclustering of Gene Expression Microarray Data Using Coarse-Grained Parallel Genetic Algorithm

08/31/2019
by   Shubhankar Mohapatra, et al.

Microarray data analysis is one of the major areas of research in the field of computational biology. Techniques such as clustering and biclustering are often applied to microarray data to extract meaningful outcomes, which play key roles in practical healthcare tasks such as disease identification and drug discovery. However, these techniques become inadequate when time is considered as an additional factor in such data. This problem motivates the use of triclustering on 3D gene expression microarray data. In this article, a new methodology based on a coarse-grained parallel genetic approach is proposed to locate meaningful triclusters in gene expression data. The outcomes are more effective than those of traditional state-of-the-art genetic approaches previously applied to the triclustering of 3D GCT microarray data.


1 Introduction

In microarray research, clustering and biclustering techniques are commonly used to find groups of genes exhibiting similar expression [14][9]. However, these techniques become inefficient when time affects the behavior of the expression profiles [7]. Longitudinal experiments of this type are gaining interest in areas of molecular activity where evaluation over time is essential. For example, cell cycles, the evolution of diseases and development at the molecular level are all time-dependent processes [1]. Hence, triclustering appears to be a valuable mechanism, as it allows the expression profiles to be evaluated under a subset of conditions and a subset of time points simultaneously.

Figure 1: Illustration of a tricluster

A coherent tricluster is defined as a set of genes that exhibits either coherent values or coherent behaviors. Such clusters may carry useful information that identifies significant phenotypes or potential genes related to the phenotypes and their regulatory relations [17]. Triclustering is computationally more expensive than biclustering (which is already NP-hard), so heuristic-based algorithms are a suitable choice for triclustering.

Genetic Algorithms (GAs) are search algorithms motivated by the characteristics of genetics and natural selection [10]. GAs usually go through phases such as reproduction, mutation, fitness evaluation and selection. Sequential GAs are competent in many applications and domains. However, they have some problems when applied to tasks like triclustering. For example, the fitness evaluation in sequential GAs is usually very time-consuming. Sequential GAs may also get trapped in a sub-optimal region of the search space and thus fail to find better quality solutions. Parallel GAs (PGAs) therefore seem to be a better alternative to traditional sequential GAs. Parallel GAs with static subpopulations and migration are characterized by multiple demes together with a migration operator. Coarse-grained parallel genetic algorithms (CgPGA) follow this subpopulation model with a fairly small number of demes, each containing many individuals. Coarse-grained parallel GAs are very often treated as distributed GAs, since they are generally implemented on distributed-memory MIMD computers. This approach can also be configured for heterogeneous networks.

In this paper, an algorithm based on the coarse-grained parallel genetic algorithm (CgPGA) approach is proposed. This algorithm finds groups of genes with similar patterns in a three-dimensional space, where genes, conditions and time are taken into consideration.

The rest of this paper is organized as follows. A review of the literature is presented in Section 2. The proposed methodology, along with the details of the fitness function and the genetic operators used, is described in Section 3. The simulation results and their GO term validation are discussed in Section 4. Finally, Section 5 summarizes the research findings of the proposed scheme and the prospects for future work.

2 Related Work

Zhao and Zaki introduced the triCluster algorithm in 2005 [21]. In this work, patterns are discovered in three-dimensional (3D) gene expression data, along with a set of metrics for measuring their quality. An approach that finds coherent triclusters containing the regulatory relationships among the genes is presented in [20], and time-delayed clusters are subsequently extracted in [18].

LagMiner, introduced in [19], is a technique to detect time-lagged 3D clusters. Evolutionary computation in the form of a multi-objective algorithm has also been employed in the search for triclusters [13]. Bhar et al. presented the δ-TRIMAX algorithm in 2012 [2]. In 2013, the same authors applied δ-TRIMAX to estrogen-induced breast cancer cell datasets, which provided insights into breast cancer prognosis [3]. Gutiérrez-Avilés et al. presented a novel triclustering algorithm called TriGen in 2013 [8]. The novelty of TriGen lies in the use of a genetic approach to mine three-dimensional gene expression microarray data. In 2015, Laishram and Vipsita applied a coarse-grained parallel genetic algorithm (CgPGA) with migration to mine biclusters in gene expression microarray data [12]. In 2016, Kakati et al. presented a fast gene expression analysis that uses distributed triclustering and parallel biclustering [11]. In their work, the initial biclusters are found with a parallel (shared-memory) approach, and the triclusters are then extracted with a distributed (shared-nothing) approach. Swathypriyadharsini and Premalatha in 2016 presented TrioCuckoo [16], which implements triclustering using the cuckoo search technique.

3 Proposed Methodology

The proposed algorithm is evaluated on the standard yeast cell cycle dataset (Saccharomyces cerevisiae) [15]. The biological validation is then carried out with the GO Term Finder tool (Version 0.83) [4] to obtain the functional annotations of the genes in the output triclusters.

3.1 Encoding of individuals

Every individual in the population encodes a tricluster. Triclusters are represented as binary strings of length G+C+T, where G is the number of genes (rows), C the number of conditions (columns) and T the number of time points (height) of the 3D expression matrix. A bit set to 1 indicates that the respective row, column or height belongs to the tricluster.
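The encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `decode_individual` and the bit ordering (genes first, then conditions, then times) are assumptions made for the example.

```python
def decode_individual(bits, n_genes, n_conditions, n_times):
    """Split a binary string of length G+C+T into selected indices per dimension.

    Assumed layout (hypothetical): first G bits are genes, next C bits are
    conditions, last T bits are time points.
    """
    if len(bits) != n_genes + n_conditions + n_times:
        raise ValueError("individual must have length G + C + T")
    genes = [i for i in range(n_genes) if bits[i] == 1]
    conditions = [j for j in range(n_conditions) if bits[n_genes + j] == 1]
    offset = n_genes + n_conditions
    times = [k for k in range(n_times) if bits[offset + k] == 1]
    return genes, conditions, times


# Toy example: 4 genes, 2 conditions, 3 time points (string length 9)
individual = [1, 0, 1, 0,  1, 1,  0, 1, 1]
genes, conditions, times = decode_individual(individual, 4, 2, 3)
print(genes, conditions, times)  # [0, 2] [0, 1] [1, 2]
```

With this layout, standard bit-level crossover and mutation operators act on all three dimensions of the tricluster at once.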

3.2 Fitness Function

A fitness function is used to select the best candidates. It is based on a three-dimensional extension of the mean squared residue (MSR) measure, which has long been an effective biclustering measure for gene expression analysis [5]. The function is defined for every tricluster (TC). As it is a minimization function, lower values are favourable.
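As an illustration of a three-dimensional MSR, the sketch below uses one common generalization of the Cheng and Church residue, in which each value is compared against its gene, condition and time means. The exact formula of the paper is not available in this copy, so the residue definition, the function name `msr_3d` and the nested-list data layout `data[g][c][t]` are all assumptions.

```python
from statistics import fmean


def msr_3d(data, genes, conditions, times):
    """Mean squared residue of the sub-tensor selected by the three index lists.

    Assumed residue (not confirmed by the source):
        r = x_gct - mean_g - mean_c - mean_t + 2 * mean_all
    It is zero for a perfectly additive tricluster.
    """
    cells = [(g, c, t) for g in genes for c in conditions for t in times]
    if not cells:
        raise ValueError("tricluster must be non-empty in every dimension")
    mean_all = fmean(data[g][c][t] for g, c, t in cells)
    mean_g = {g: fmean(data[g][c][t] for c in conditions for t in times) for g in genes}
    mean_c = {c: fmean(data[g][c][t] for g in genes for t in times) for c in conditions}
    mean_t = {t: fmean(data[g][c][t] for g in genes for c in conditions) for t in times}
    total = 0.0
    for g, c, t in cells:
        residue = data[g][c][t] - mean_g[g] - mean_c[c] - mean_t[t] + 2 * mean_all
        total += residue * residue
    return total / len(cells)


# Additive toy data: value = gene effect + condition effect + time effect
data = [[[10.0 * g + 3.0 * c + t for t in range(3)] for c in range(2)] for g in range(4)]
coherent = msr_3d(data, [0, 2], [0, 1], [1, 2])   # close to zero
data[0][0][1] += 5.0                               # break the pattern in one cell
perturbed = msr_3d(data, [0, 2], [0, 1], [1, 2])  # strictly larger
print(coherent, perturbed)
```

A perfectly additive tricluster scores (numerically) zero, and any deviation from the additive pattern raises the score, which matches the minimizing behaviour described above.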

3.3 Weights

The weights term is defined as:

Here the three weights apply to the number of genes, conditions and times in a tricluster solution, respectively. High values of the weights are favorable.

3.4 Distinction

The distinction term is defined as:

Here the coordinate distinction counts for g, c and t are, respectively, the numbers of gene, condition and time coordinates of the tricluster being evaluated that are absent from the previously found triclusters, and each count has its own distinction weight for genes, conditions and times. Distinction measures the uniqueness of the tricluster currently being evaluated: a higher distinction value favors solutions that do not overlap with previously found results. The definitions in this section use the following quantities:

  • Tricluster gene coordinates subset.

  • Tricluster condition coordinates subset.

  • Tricluster time coordinates subset.

  • No. of time co-ord of the tricluster.

  • No. of condition co-ord of the tricluster.

  • No. of gene co-ord of the tricluster.

  • Expression value of gene g under condition c at time t from the expression matrix.
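The coordinate counting behind the distinction term can be sketched as below. The exact formula is not recoverable from this copy, so this is only one plausible reading: each count of new coordinates is divided by the size of that dimension and multiplied by its distinction weight. The normalization and the names `coordinate_distinction` and `distinction` are assumptions.

```python
def coordinate_distinction(current, previous):
    """Count coordinates in `current` that appear in none of the `previous` lists."""
    covered = set().union(*previous)
    return len(set(current) - covered)


def distinction(tricluster, found, wd_g, wd_c, wd_t):
    """Weighted share of new coordinates per dimension (assumed normalization).

    `tricluster` and every element of `found` are (genes, conditions, times)
    tuples of index lists.
    """
    score = 0.0
    for axis, weight in enumerate((wd_g, wd_c, wd_t)):
        coords = tricluster[axis]
        if coords:
            new = coordinate_distinction(coords, [f[axis] for f in found])
            score += weight * new / len(coords)
    return score


# One previously found tricluster covers genes 0 and 1
found = [([0, 1], [0], [0, 1, 2])]
candidate = ([0, 1, 2, 3], [0, 1], [0, 1, 2])
# Gene-only distinction weights, matching the emphasis on the gene dimension
print(distinction(candidate, found, 1.0, 0.0, 0.0))  # 0.5
```

In the example, two of the four candidate genes are new, so the gene-only distinction is 0.5; a candidate that repeats only known genes would score 0.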

3.5 Tri-CgPGA

Tri-CgPGA is based on coarse-grained genetic algorithms, which belong to the parallel genetic algorithm family. Like other coarse-grained algorithms, it executes in several steps, which are illustrated in the flowchart and pseudo-code below.

Figure 2: Tri-CgPGA Algorithm Workflow (flowchart: load expression matrix, specify number of cores, initialize and evaluate population, apply genetic operators, migrate the best individual to the other demes when the migration interval is reached, select the final individual from the best individuals of all demes, add the final tricluster to the output set)
Input: Expression Matrix
Output: Coherent Triclusters
load the expression matrix
specify the number of cores to be used in parallel
for tricluster number I = 1 to maximum_triclusters do
      initialise the initial population
      evaluate the population
      for generation number J = 1 to maximum_generations do
            select parents
            cross over each pair of parents to generate offspring
            mutate the generated offspring
            evaluate the new individuals
            select the individuals with better fitness
            if migration_interval = count then
                  choose the best individual of the best deme and use it to replace the worst individual of the other parallel demes
      select the best individuals from all demes
      select the final individual from best_individuals
      add the final tricluster to output_set
return output_set
Algorithm 1 Tri-CgPGA Pseudo Code
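The migration scheme in Algorithm 1 can be illustrated with a small island-model GA. This is a hedged sketch, not the Tri-CgPGA implementation: the demes are evolved one after another in a single process (a real run would place each deme on its own core), the operators (binary tournament, one-point crossover, per-bit mutation, elitist survivor selection) and all default parameter values are assumptions, and the fitness function is a toy placeholder to be minimized.

```python
import random


def evolve_deme(deme, fitness, rng, p_cross, p_mut):
    """One generation: tournament selection, crossover, mutation, elitist survival."""
    offspring = []
    while len(offspring) < len(deme):
        p1 = min(rng.sample(deme, 2), key=fitness)
        p2 = min(rng.sample(deme, 2), key=fitness)
        child = list(p1)
        if rng.random() < p_cross:
            cut = rng.randrange(1, len(p1))          # one-point crossover
            child = p1[:cut] + p2[cut:]
        child = [b ^ 1 if rng.random() < p_mut else b for b in child]
        offspring.append(child)
    # Keep the best individuals among parents and offspring (minimization)
    return sorted(deme + offspring, key=fitness)[:len(deme)]


def island_ga(fitness, n_bits, n_demes=4, deme_size=20, generations=50,
              migration_interval=10, p_cross=0.8, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    demes = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(deme_size)]
             for _ in range(n_demes)]
    for generation in range(1, generations + 1):
        demes = [evolve_deme(d, fitness, rng, p_cross, p_mut) for d in demes]
        if generation % migration_interval == 0:
            # Best individual of the best deme replaces the worst of every other deme
            best_deme = min(range(n_demes), key=lambda i: fitness(min(demes[i], key=fitness)))
            migrant = min(demes[best_deme], key=fitness)
            for i, deme in enumerate(demes):
                if i != best_deme:
                    worst = max(range(len(deme)), key=lambda j: fitness(deme[j]))
                    deme[worst] = list(migrant)
    return min((ind for d in demes for ind in d), key=fitness)


# Toy fitness (placeholder): Hamming distance to a fixed target string
target = [1, 0] * 8


def hamming(individual):
    return sum(a != b for a, b in zip(individual, target))


best = island_ga(hamming, n_bits=len(target))
print(best, hamming(best))
```

Because the demes only exchange a single individual every `migration_interval` generations, they can evolve independently in between, which is what makes the coarse-grained model straightforward to parallelize.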

4 Experimental results and discussions

All the computational simulations are performed on a multiprocessor machine with 4 Intel Core i7 3.60 GHz processors, 4 GB of RAM and the 64-bit Windows 8.1 operating system. The yeast cell cycle dataset (Saccharomyces cerevisiae) [15] is used to establish the efficacy of the proposed algorithm. This dataset contains 6179 genes, 4 conditions, and 14 time points. The experiment is performed on this dataset and on two synthetic versions of it, but results are reported only for the former.

4.1 List of the Parameters

During execution, several parameters are set: the crossover probability, the mutation probability, the weights for genes, conditions and times, and the distinction weights for genes, conditions and times. Their values are given in Table 1. As the algorithm is designed for gene filtration (to obtain the solution with a minimum number of genes), the weight for genes is set to 0.8 so that the maximum number of genes can participate in the solution. For the distinction term, a higher value is given to the gene weight in order to cover as much space as possible in this dimension.

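| Parameter | Value |
|---|---|
| Crossover probability | 0.8 |
| Mutation probability | 0.5 |
| Weight for genes | 0.8 |
| Weight for conditions | 0.1 |
| Weight for times | 0.1 |
| Distinction weight for genes | 1.0 |
| Distinction weight for conditions | 0.0 |
| Distinction weight for times | 0.0 |

Table 1: Values of the parameters taken during algorithm execution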

4.2 Results on Yeast Dataset

The simulation results are analyzed for different numbers of generations. The analysis indicates that the values increase as the number of generations increases. Thus, for larger numbers of generations, better homogeneity among the genes is obtained, as presented in the following graphs.

(a) 50 Generations
(b) 100 Generations
(c) 200 Generations
(d) 400 Generations
Figure 3: Fitness Value Plots

4.3 Comparative Study

The algorithm performs well in terms of execution time and the volume of the output triclusters. As the fitness function is minimizing, the lower the MSR value, the better the fitness of the tricluster. The results of Tri-CgPGA are compared with those obtained by the TriGen algorithm [8]. The comparison is based on the computational time taken to execute the code and derive the output. Tri-CgPGA took approximately 30 seconds to run for 1000 genes and 50 generations, whereas TriGen [8] required 118 seconds for the same task, which makes Tri-CgPGA roughly four times faster. Hence, exploiting parallelism with the genetic approach for triclustering of gene expression microarray data is preferable to traditional GAs, as it reduces the computation time. Other relevant information regarding the results obtained by Tri-CgPGA is presented in Table 2.

| Gene size | Avg. MSR | Avg. volume | Avg. no. of genes | Avg. no. of conditions | Avg. no. of time points |
|---|---|---|---|---|---|
| 1000 | 493.35 | 5124.65 | 616 | 1.35 | 6.5 |
| 3000 | 1322.88 | 33889.5 | 1651.37 | 3 | 7 |
| 6178 | 2669.086 | 67798.75 | 3334 | 2.65 | 7.65 |

Table 2: Detailed Information about triclusters found by Tri-CgPGA algorithm

4.4 GO Term Analysis

The results are validated with the Gene Ontology (GO) project [6]. GO provides an ontology of terms that describe gene product annotation data along with its characteristics. The ontology describes attributes such as molecular functions, cellular components and the relevant biological processes. The genes in the output triclusters are queried in GO using the GO Term Finder (Version 0.83) [4]. The findings of the GO term analysis are presented in Table 3.

| Cluster ID | Biological Process | Molecular Function | PI (P-value < 0.01) |
|---|---|---|---|
| 0044699 | Single-Organism Process | Only one organism is being involved | 3.02E-10 |
| 0016043 | Cellular Component Organization | Assembling or de-assembling of a cellular component constituent parts | 4.87E-08 |
| 0065007 | Biological Regulation | Biological process regulation of quality or function | 0.00725 |
| 0080090 | Single-Organism Cellular Process | Cellular level activity, occurring within a single organism | 1.61E-06 |
| 0060255 | Single-Organism Process | Only one organism is being involved | 1.91E-06 |
| 0019222 | Single-Organism Process | Only one organism is being involved | 0.00019 |
| 0044763 | Single-Organism Cellular Process | Cellular level activity, occurring within a single organism | 2.69E-06 |
| 0050789 | Single-Organism Process | Only one organism is being involved | 4.10E-05 |
| 2000112 | Single-Organism Cellular Process | Cellular level activity, occurring within a single organism | 4.89E-06 |
| 0010556 | Single-Organism Cellular Process | Cellular level activity, occurring within a single organism | 0.00315 |
| 0071840 | Cellular Component Organization | Biosynthesis of constituent macromolecules, assembly, arrangement of constituent parts, or disassembly of a cellular component | 0.00753 |
| 0051171 | Cellular Component Organization or Biogenesis | Biosynthesis of constituent macromolecules, assembly, arrangement of constituent parts, or disassembly of a cellular component | 0.00026 |
| 0006996 | Organelle Organization | Cellular level assembly, arrangement of constituent parts, or disassembly of an organelle within a cell | 6.00541 |
| 0010468 | Organelle Organization | Cellular level assembly, arrangement of constituent parts, or disassembly of an organelle within a cell | 0.00563 |
| 0032774 | Biological Regulation | Biological process regulation of quality or function | 0.00939 |

Table 3: GO for Yeast Cell Cycle Results

5 Conclusion

This work proposes Tri-CgPGA, a new framework based on the coarse-grained parallel genetic approach (CgPGA), to generate triclusters from gene expression data. The results of the framework are compared with those of another state-of-the-art technique, the TriGen algorithm. The comparison shows that the proposed scheme is more efficient than the existing scheme in terms of computation time, so parallel GAs are preferable to traditional GAs for the triclustering of 3D gene expression microarray data. There are a number of future directions that might further improve this framework: (1) acquiring large-scale data from other standard datasets to measure the performance of the framework; (2) investigating other competent evaluation measures, with the suggested or other existing versions of PGAs, to further improve the coherence and the computation time and to obtain more meaningful triclusters.

References

  • [1] Z. Bar-Joseph (2004) Analyzing time series gene expression data. Bioinformatics 20 (16), pp. 2493–2503.
  • [2] A. Bhar, M. Haubrock, A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, and E. Wingender (2012) δ-TRIMAX: extracting triclusters and analysing coregulation in time series gene expression data. In International Workshop on Algorithms in Bioinformatics, pp. 165–177.
  • [3] A. Bhar, M. Haubrock, A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, and E. Wingender (2013) Coexpression and coregulation analysis of time-series gene expression data in estrogen-induced breast cancer cell. Algorithms for Molecular Biology 8 (1), pp. 9.
  • [4] E. I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. M. Cherry, and G. Sherlock (2004) GO::TermFinder - open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics 20 (18), pp. 3710–3715.
  • [5] Y. Cheng and G. M. Church (2000) Biclustering of expression data. In ISMB, Vol. 8, pp. 93–103.
  • [6] Gene Ontology Consortium (2004) The gene ontology (GO) database and informatics resource. Nucleic Acids Research 32 (suppl_1), pp. D258–D261.
  • [7] F. Gómez-Vela, F. Martínez-Álvarez, C. D. Barranco, N. Díaz-Díaz, D. S. Rodríguez-Baena, and J. S. Aguilar-Ruiz (2011) Pattern recognition in biological time series. In Conference of the Spanish Association for Artificial Intelligence, pp. 164–172.
  • [8] D. Gutiérrez-Avilés, C. Rubio-Escudero, F. Martínez-Álvarez, and J. C. Riquelme (2014) TriGen: a genetic algorithm to mine triclusters in temporal gene expression data. Neurocomputing 132, pp. 42–53.
  • [9] J. A. Hartigan (1972) Direct clustering of a data matrix. Journal of the American Statistical Association 67 (337), pp. 123–129.
  • [10] J. Holland and D. Goldberg (1989) Genetic algorithms in search, optimization and machine learning. Massachusetts: Addison-Wesley.
  • [11] T. Kakati, H. A. Ahmed, D. K. Bhattacharyya, and J. K. Kalita (2016) A fast gene expression analysis using parallel biclustering and distributed triclustering approach. In Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, pp. 122.
  • [12] A. Laishram and S. Vipsita (2015) Bi-clustering of gene expression microarray using coarse grained parallel genetic algorithm (CgPGA) with migration. In India Conference (INDICON), 2015 Annual IEEE, pp. 1–6.
  • [13] J. Liu, Z. Li, X. Hu, and Y. Chen (2008) Multi-objective evolutionary algorithm for mining 3D clusters in gene-sample-time microarray data. In Granular Computing, 2008. GrC 2008. IEEE International Conference on, pp. 442–447.
  • [14] C. Rubio-Escudero, I. Zwir, et al. (2008) Classification of gene expression profiles: comparison of k-means and expectation maximization algorithms. In Eighth International Conference on Hybrid Intelligent Systems, pp. 831–836.
  • [15] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher (1998) Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9 (12), pp. 3273–3297.
  • [16] P. Swathypriyadharsini and K. Premalatha (2018) TrioCuckoo: a multi objective cuckoo search algorithm for triclustering microarray gene expression data. Journal of Information Science and Engineering 34 (6), pp. 1617–1631.
  • [17] A. B. Tchagang, S. Phan, F. Famili, H. Shearer, P. Fobert, Y. Huang, J. Zou, D. Huang, A. Cutler, Z. Liu, et al. (2012) Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm. BMC Bioinformatics 13 (1), pp. 54.
  • [18] G. Wang, L. Yin, Y. Zhao, and K. Mao (2010) Efficiently mining time-delayed gene expression patterns. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 40 (2), pp. 400–411.
  • [19] X. Xu, Y. Lu, K. Tan, and A. K. Tung (2009) Finding time-lagged 3D clusters. In Data Engineering, 2009. ICDE’09. IEEE 25th International Conference on, pp. 445–456.
  • [20] Y. Yin, Y. Zhao, B. Zhang, and G. Wang (2007) Mining time-shifting co-regulation patterns from gene expression data. In Advances in Data and Web Management, pp. 62–73.
  • [21] L. Zhao and M. J. Zaki (2005) Tricluster: an effective algorithm for mining coherent clusters in 3D microarray data. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 694–705.