netgwas: An R Package for Network-Based Genome-Wide Association Studies

Graphical models are powerful tools for modeling and making statistical inferences regarding complex relationships among variables in multivariate data. They are widely used in statistics and machine learning, for example, to analyze biological networks. In this paper we introduce the R package netgwas, which is designed to accomplish three important and interrelated goals in genetics: linkage map construction, reconstructing intra- and inter-chromosomal interactions, and exploring high-dimensional genotype-phenotype and genotype-phenotype-environment networks. The netgwas package can deal with species of any ploidy level. The package implements recent improvements in both linkage map construction (Behrouzi and Wit, 2017a), and reconstructing conditional independence network for non-Gaussian data, discrete data, and mixed discrete and continues data (Behrouzi and Wit, 2017b). Such datasets routinely occur in genetics and genomics such as genotype data, genotype-phenotype dataset, and genotype-phenotype including environmental variables. The package uses a parallelization strategy on multi-core processors to speed-up computations for large datasets. In addition, it contains several functions for simulation and visualization, as well as three multivariate example datasets taken from the literature and used to illustrate the package capabilities. The paper includes a brief overview of the statistical methods which have been implemented in the package. The main body of the paper explains how to use the package. We also illustrate the package functionality with real examples.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

01/21/2015

BDgraph: An R Package for Bayesian Structure Learning in Graphical Models

Graphical models provide powerful tools to uncover complicated patterns ...
10/03/2017

De novo construction of q-ploid linkage maps using discrete graphical models

Linkage maps are important for fundamental and applied genetic research....
10/29/2018

Multivariate Analysis and Visualization using R Package muvis

Increased application of multivariate data in many scientific areas has ...
06/26/2020

The huge Package for High-dimensional Undirected Graph Estimation in R

We describe an R package named huge which provides easy-to-use functions...
10/23/2017

bridgesampling: An R Package for Estimating Normalizing Constants

Statistical procedures such as Bayes factor model selection and Bayesian...
07/17/2019

Spherical data handling and analysis with R package rcosmo

The R package rcosmo was developed for handling and analysing Hierarchic...
04/19/2018

varrank: an R package for variable ranking based on mutual information with applications to observed systemic datasets

This article describes the R package varrank. It has a flexible implemen...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Graphical models (Lauritzen, 1996)

are commonly used, particularly in statistics and machine learning, to describe conditional independence relationships among variables in multivariate data. In graphical models, each random variable is associated with a node in a graph, and links represent conditional dependency between variables; the absence of a link implies that the variables are conditionally independent given the rest of the variables – the pairwise Markov property.

The netgwas package reconstructs undirected graphs for non-Gaussian, discrete, and mixed discrete-and-continuous datasets which arise routinely in genetics and genomics, particularly in systems genetics. The package includes various functional modules, including ordinal (genotype) data generation for simulation studies, several methods for reconstructing underlying undirected conditional independence graphs, and a visualization tool. Our package includes three key functions: (i) linkage map construction, (ii) LD networks, and (iii) genotype–phenotype networks reconstruction.

Many algorithms exist to construct linkage maps for diploid species. Some only order markers, such as Try and Ripple (Lander et al., 1987), seriation (SER) (Buetow and Chakravarti, 1987), rapid chain delineation (RCD) (Doerge et al., 1996), recombination counting and ordering (RECORD) (Van Os et al., 2005), unidirectional growth (UG) (Tan and Fu, 2006), CARTHAGEENE (Schiex and Gaspin, 1997), and HighMap (Liu et al., 2014)

. Others also estimate the genetic map, detecting both linkage groups (LGs) and order markers within the LGs. Some of them have been implemented into user-friendly software packages, such as R/qtl

(Broman et al., 2003), JOINMAP (Jansen et al., 2001), OneMap (Margarido et al., 2007), Pheno2Geno (Zych et al., 2015), and MSTMAP (Wu et al., 2008). For polyploids, despite their importance especially in crop research, statistical tools for linkage map construction are underdeveloped. Bourke et al. (2017) and Grandke et al. (2017)

recently developed a method to construct polyploid linkage map. Their method is based on calculating logarithm of odds (LOD) score; former literature uses multi–dimensional scaling and the later uses hierarchical clustering and an optimal leaf algorithm to detect chromosomes and order markers. However, the proposed method in

Grandke et al. (2017) can be computationally expensive, even for a small number of markers, and the tool presented in Bourke et al. (2017) needs manual interactions. Furthermore, the literature has focused on constructing genetic linkage maps only for a specific type of tetraploid species called autotetraploid (Hackett and Luo, 2003, Bourke et al., 2016). One has been implemented in software, TetraploidMap (Preedy and Hackett, 2016), but this also needs manual interaction and visual inspection, which limit its usability. Existing approaches for polyploid map construction are based mainly on estimation of recombination frequency and LOD scores (logarithm of the odds ratio) (Wang et al., 2016). The netgwas package, on the other hand, uses graphical models and the conditional independence concept to construct a linkage map for both diploid and polyploid species based on the method developed by Behrouzi and Wit (2017a).

Revealing the patterns of linkage disequilibrium (LD) is important for association mapping study as well as for studying the genomic architecture of a genome. One common approach to detect LDs is to perform an exhaustive genome scan for pairs of loci. The drawback of such approaches is that hypothesis testing in the genome–scale is underpowered, so that weak long–range LD will go undetected, especially after adjusting for multiple testing. Also, such methods can not simultaneously take the information of more than two loci into account. Moreover, they do not make full and efficient use of modern multi–loci data. Here, we implemented the method proposed by Behrouzi and Wit (2017b) to detect LD patterns in diploid and polyploid species. The netgwas uncovers short- and long–range LD patterns between different loci of a genome while correcting for the effect of other loci. Technically, this requires estimating a sparse adjacency matrix from a multi-loci genotype data, which usually contains large number of markers, where the number of markers can far exceed the number of individuals. The non-zero patterns of the adjacency matrix shows the conditionally dependent short– and long–range LD structures of the genome, and thus provides a basis for identifying associations between distant markers that are due to epistatic selection rather than gametic linkage. Also, the strength of LD structures is determined using partial correlations.

Genome-wide association study is an increasingly popular approach, due mostly to the thousands of genetic factors found to be significantly associated with complex traits (Welter et al., 2013). However, in GWAS, SNPs are often tested individually for association with the phenotype. Since genome-wide scans analyze thousands or even millions of markers, the issue of multiple testing is usually addressed by using a stringent significance threshold of (Panagiotou et al., 2011). This method works only if the associations are strong enough to pass the stringent threshold. However, even if that is the case, this type of analysis has several limitations, which have been discussed in the literature (Hoggart et al., 2008, He and Lin, 2010, Rakitsch et al., 2012, Buzdugan et al., 2016). Particularly, the main issue of this type of analysis is when we test a SNP individually, we ignore the effects of all other SNPs. This leads to high false positive rates. Because if we analyze marginally two correlated SNPs, which only one is causal for the phenotype, both may show an association. In our framework, the netgwas tackles this issue by using Gaussian copula graphical model, which simultaneously models the associations between SNPs and phenotypes. In the genotype–phenotype networks, nodes in the graph are either genetic markers or phenotypes, and each phenotype is connected by an edge to a marker or a group of markers if they directly affect each other after controlling for others. Different phenotypes may also interconnect. Furthermore, in genotype–phenotype–environment networks, nodes in the graph are either genetic markers, phenotypes or environmental variables, and each node is connected by an edge to another node if they directly affect each other given the rest of the variables.

To make our method computationally faster for large datasets, the netgwas package uses multi-core computing capabilities based on the parallel package. To make it easy to use, the netgwas package uses S3 classes as return values of its functions. Our package is available under general public license (GPL ) at the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/packages=netgwas.

In the next section we describe the main functions that are provided in the netgwas package. In addition, we explain the user interface and the performance of the package in several real data sets.

2 Package netgwas

The netgwas package implements the Gaussian copula graphical models (Behrouzi and Wit, 2017b) to (i) construct linkage maps for bi-parental species with any ploidy level, namely diploid ( sets), triploid ( sets), tetraploid ( sets) and so on; (ii) explore high–dimensional short– and long–range linkage disequilibrium (LD) networks among SNP markers while controlling for the effect of other SNPs. The inferred LD networks reveal epistatic interactions across a genome when viability of the particular genetic recombination of the parental lines is considered as phenotype; (iii) infer genotype-phenotype networks from multi–loci multi–trait data, where it measures the pairwise associations with adjusting for the effect of other markers and phenotypes. Moreover, it detects markers that directly are responsible for that phenotype (disease), and reports the strength of their associations in terms of partial correlations. In addition, the package is able to reconstruct conditional dependence networks among SNPs, phenotypes, and environmental variables. We describe below the user interface and the three main functions (see Fig 1) of the package.

User interface

In the R environment, the netgwas package can be loaded using the following commands:

R> install.packages( "netgwas" )
R> library( "netgwas" )

By loading the package, the igraph (Csardi and Nepusz, 2006), Matrix (Bates and Maechler, 2014) , and parallel packages will automatically be loaded, since the netgwas package depends on them. These packages are available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org. We use the igraph package for graph visualization; the Matrix package for memory-optimization, using the sparse matrix output. To speed up computations, we use the parallel package to support parallel computing on a multi-core machine to deal with large inference problems.

The netgwas package consists of three modules:

Figure 1: The main functions in netgwas package.

Module 1. Genotype data simulation: this simulates diploid genotype data in two different ways:

  1. Based on a Gaussian copula graphical model we simulated ordinal variables with a genome-like network structure. Inbred genotype data can be generated for

    p number of SNP markers, for n number of individuals, for k genotype states in a q-ploid species where represents the ploidy level of the chromosomes.

    The simulated data mimic a genome-like graph structure: First, there are g linkage groups (each of which represents a chromosome); then within each linkage group adjacent markers, adjacent

    , are linked via an edge as a result of genetic linkage. Also, with probability

    alpha a pair of non-adjacent markers in the same chromosome are given an edge. Inter-chromosomal edges are simulated with probability beta. These links represent long-range linkage disequilibriums. The corresponding positive definite precision matrix

    has a zero pattern corresponding to the non-present edges. The underlying variable vector

    is simulated from either a multivariate normal distribution,

    , or a multivariate t-distribution with degrees of freedom

    d and covariance matrix

    . We generate the genotype marginals using random cutoff-points from a uniform distribution, and partition the latent space into

    k states. The function can be called with the following arguments

    R> sim <- simgeno(p = 90, n = 200, k = 3, g = 5,
    +ΨΨ       adjacent = 3, alpha = 0.1, beta = 0.02,
    +ΨΨ       con.dist = "Mnorm", d = NULL, vis = TRUE)
    R> head(sim$data, n=3)
    
            [,1] [,2] [,3] [,4] [,5] [,6] ...  [,87] [,89] [,90]
    [1,]     0    0    0    1    1    1   ...    1     2     2
    [2,]     1    1    1    1    0    0   ...    0     2     2
    [3,]     2    2    2    2    0    2   ...    0     0     0
    
    R> plot(sim)
    
    Figure 2: A model-based simulation. (left) Each color corresponds to a chromosome, (right) the correspondent adjacency matrix. The five chromosomes are shown in the diagonal of the matrix.

    The output of the example is shown in Fig 2.

  2. We generate diploid recombinant inbred lines (RILs) using recombination fraction and a CentiMorgan position of markers across the chromosomes. The function can be called with the following arguments

    R> simRIL(g = 5, d = 25, n = 200, cM = 100, selfing = 2)
    

    where g and d represent the number of chromosomes and the number of markers in each chromosome, respectively. The number of sample size can be specified by n. The arguments cM and selfing show the length of chromosome based on centiMorgan position and the number of selfing in the RIL population, respectively.

Module 2. Inference Method: The functions netmap(), netsnp(), and netphenogeno() provide three methods to estimate undirected graphs as follows: a Gaussian copula graphical model using (i) the Gibbs sampling algorithm described in Behrouzi and Wit (2017b); and (ii) the approximation algorithm described in Behrouzi and Wit (2017b); (iii) the nonparanormal skeptic method (Liu et al., 2012) as alternative, along with the Gaussian copula models.

Module 3. Output: This module includes two types of functions:

  • Graph selection: The function selectnet tunes the penalty parameter, based on an information criterion, and provides the selected graph.

  • Visualization: The plotting function plot.netgwas provides a visualization plot to monitor the path of estimated networks for a range of penalty terms; the functions plot.netgwasmap, plot.select and plot.simgeno visualize the corresponding network, the selected graph and the conditional dependence structures of the model-based simulated data.

Figure 3: Linkage map construction in potato. (a) Estimated precision matrix for unordered genotype data of tetraploid potato. (b) Estimated precision matrix after ordering markers. All 12 potato chromosomes detected correctly.

netmap

The function netmap() reconstructs linkage maps for diploid and polyploid organisms. Diploid organisms contain two copies of each chromosomes, one from each parent, whereas polyploids contain more than two copies of each chromosome. In polyploids the number of chromosome sets reflects their level of ploidy: triploids have three copies, tetraploids have four, pentaploids have five, and so forth.

Typically, mating is between two parental lines that have recent common biological ancestors; this is called inbreeding. If they have no common ancestors up to roughly - generations, then this is called outcrossing. In both cases the genomes of the derived progenies are random mosaics of the genome of the parents. However, in the case of inbreeding parental alleles are distinguishable in the genome of the progeny; in outcrossing this does not hold.

Some inbreeding designs, such as Doubled haploid (DH), lead to a homozygous population where the derived genotype data include only homozygous genotypes of the parents namely AA and aa (conveniently coded as and ) for diploid species. Other inbreeding designs, such as F2, lead to a heterozygous population where the derived genotype data contain heterozygous genotypes as well as homozygous ones, namely AA, Aa, and aa (conveniently coded as , and ) for diploid species. We remark that the Gaussian copula graphical models help us to keep heterozygous markers in the linkage map construction, rather than turn them to missing values as most other methods do in map construction for RIL populations.

Outcrossing or outbred experimental designs, such as full-sib families, derive from two non-homozygous parents. Thus, the genome of the progenies includes a mixed set of many different marker types containing fully informative markers (e.g. segregating 1:1:1:1 in diploid parents) and partially informative markers (missing markers, and e.g. segregating 1:2:1, 3:1, and 1:1 in diploid parents). Markers are called fully informative when all of the resulting gamete types can be phenotypically distinguished on the basis of their genotypes; markers are called partially informative when they have identical phenotypes (Wu et al., 2002).

The netmap() function handles various inbred and outbred mapping populations, including recombinant inbred lines (RILs), F2, backcross, doubled haploid, and full-sib families, among others. Not all existing methods for linkage mapping support all inbreeding and outbreeding experimental designs. However, our proposed algorithm constructs a linkage map for any type of experimental design of biparental inbreeding and outbreeding. The proposed method is broad and can handle any population type that contains at least two genotype states.

The function can be called with the following arguments

R> netmap(data, method = "npn", cross= NULL, rho = NULL, n.rho = NULL,
+            rho.ratio = NULL, min.m= NULL, use.comu= FALSE,
+            ncores = "all", em.iter = 5, verbose = TRUE)

The main task in constructing a linkage map is to explore the conditional dependence relationships between markers. The argument method is used to specify which method is to be performed. The estimation procedure relies on maximum penalized log-likelihood, where the argument rho controls the sparsity level. To give an example, we show the steps to construct a linkage map for the example data set TetraPotato. This example regards the tetraploid potato. The data are derived from a cross between “Jacqueline Lee” and “MSG227-2”, where F1 plants were genotyped across genetic markers (Massa et al., 2015). Five allele dosages are possible in this full-sib autotetraploid mapping population (AAAA, AAAB, AABB, ABBB, BBBB), where the genotypes are coded as The dataset includes missing observations. In the following code we estimate the linkage map and plot the results.

R> data(tetraPotato)
# Shuffle the order of markers
R> dat <- tetraPotato[ , sample(1:ncol(tetraPotato), ncol(tetraPotato))]
R> potato.map <- netmap(dat, cross = "outbred"); potato.map
Number of linkage groups:  12
Number of markers per linkage group: 165 152 183 173 148 129 187 196 153 161 146 157
Total number of markers in the linkage map: 1950.
(22  markers removed from the input genotype data)
Number of sample size: n = 156
Number of categories in data: 5  ( 0 1 2 3 4 )
The estimated linkage map is inserted in <YOUR OUTPUT NAME>$map
To visualize the associated network consider plot(<YOUR OUTPUT NAME>)
-----------------------
To visualize the other associated networks consider plot(<YOUR OUTPUT NAME>$allres)
To build a linkage map for your desired network consider builMap() function
R> plot(potato.map, vis = "unordered markers")
R> plot(potato.map, vis = "ordered markers")

The argument cross needs to be specified for ordering markers because we introduce different ordering methods in inbred and outbred populations. In inbred populations, markers in the genome of the progenies can be assigned to their parental homologues, resulting in a simpler conditional independence pattern between neighboring markers. In the case of inbreeding, we use multidimensional scaling (MDS). A metric MDS is a classical approach that maps the original high-dimensional space to a lower dimensional space, while attempting to maintain pairwise distances. An outbred population derived from mating two non-homozygous parents results in markers in the genome of progenies that cannot be easily assigned to their parental homologues. Neighboring markers that vary only on different haploids will appear as independent, therefore requiring a different ordering algorithm. In that case, we use the reverse Cuthill-McKee (RCM) algorithm (Cuthill and McKee, 1969) to order markers. The RCM algorithm is based on graph models. It reduces the bandwidth of the associated adjacency matrix, , for the sparse matrix , where .

Fig 3 visualizes a summary of mapping process. The argument vis in the above plot function can be fixed to "interactive", which it gives a better network resolution particularly for a large number of markers. Fig 3a shows the conditional dependence pattern between unordered SNP markers in the Jacqueline Lee MSG227-2 population. Fig 3b shows the structure of the selected graph after ordering markers. All 12 potato chromosomes detected correctly and markers ordered with reasonable precision (Behrouzi and Wit, 2017a).

Figure 4: Intra– and inter–chromosomal conditional interactions network between markers across A.thaliana genome. (a) Each color corresponds to a different chromosome: orange, yellow, red, blue, and green represent chromosomes 1 to 5, respectively. Different edge colors show positive (blue) and negative (red) entries of precision matrix. (b) Represents associated partial correlation matrix.

netsnp

The function netsnp() reconstructs conditional independence relationships simultaneously among all genetic markers in a genome. In other words, it constructs intra– and inter–chromosomal conditional interaction networks. The function is called via

R> netsnp( data, method = "gibbs", rho = NULL, n.rho = NULL,
+      rho.ratio = NULL, ncores = "all", em.iter = 5, em.tol
+      = .001, verbose = TRUE )

The input data can be any biparental genotype data containing at least two genotype states. The genotype data from the netmap function can also be inserted here. This function can be used to reveal the high-dimensional linkage disequilibrium interactions network for polyploid genotype data. It handles missing observations.

As an example we implement this function to the Arabidopsis thaliana genotype data that are derived from a RIL cross between Columbia-0 (Col-0) and Cape Verde Island (Cvi-0), where individual plants were genotyped across genetic markers (Simon et al., 2008). The data contain 3 possible genotype states: A (homozygous) denoted by 0, H (heterozygous) denoted by 1, and B (homozygous) denoted by 2.

data(CviCol)
R> out <- netsnp(CviCol)
R> sel <- selectnet(out)
R> plot(sel, vis="interactive")
# Partial correlations between markers on genome
R> image(sel$par.cor, xlab="markers", ylab="markers", sub="")

Fig 4 shows that in the population our method finds some trans-chromosomal regions that do interact. In particular, the bottom of chromosome 1 and the top of chromosome 5 do not segregate independently of each other. Besides this, interactions between the tops of chromosomes 1 and 3 involve pairs of loci that also do not segregate independently. Bikard et al. (2009) studied this genotype extensively. They reported that the first interaction we found causes arrested embryo development, resulting in seed abortion, whereas the latter interaction causes root growth impairment. In addition to these two regions, we have discovered a few other trans-chromosomal interactions in the A.thaliana genome. In particular, two adjacent markers, c1-13869 and c1-13926 in the middle of the chromosome 1, interact epistatically with the adjacent markers, c3-18180 and c3-20729, at the bottom of chromosome 3. The sign of their conditional correlation score is negative, indicating strong negative epistatic selection during inbreeding. These markers therefore seem evolutionarily favored to come from different grandparents. This suggests some positive effect of the interbreeding of the two parental lines: it could be that the paternal-maternal combination at these two loci protects against some underlying disorder, or that it actively enhances the fitness of the resulting progeny.

netphenogeno

It is now generally accepted that complex genetic traits are influenced by multiple interacting loci, each with a possibly small effect. Thus, to overcome the limitations of traditional analysis, such as single-locus association analysis (looking for main effects of single marker loci), multiple testing, and QTL analysis, we have developed a method based on discrete graphical models to investigate the simultaneous associations between phenotypes and SNPs. Our method is different and allows for a more powerful interpretation of the findings than the traditional methods, which only analyze one SNP at a time.This is because we adjust for the effect of all other SNPs and phenotypes while measuring the pairwise associations between them. Statistically speaking this implies inferring conditional dependence relationships between variables in the data.

Networks or graphs are used to model interactions. In a genotype-phenotype network, nodes are either phenotypes or SNPs and edges are direct associations after adjustments. In our modeling framework, a genotype–phenotype network is a complex network made up of interactions among: (i) genetic markers, (ii) phenotypes (e.g. disease), and (iii) between genetic markers and phenotypes. It may happen that some phenotypes are associated with a SNP marker, or with multiple SNP markers. We remark that due to the conditional dependence feature often we reduce the number of possible SNPs from hundreds of SNPs to fewer SNPs.

It is of great interest to geneticists and biologists to discover such graph structure. The first problem with this is that such data consist of mixed ordinal-and-continuous variables, where the markers have ordinal values, and phenotypes (disease) can be measured in continuous or discrete scales. We deal with mixed variables by means of copula. A second issue relates to the high-dimensional setting of the data, where thousands of genetic markers are measured across a few samples; we are dealing with inferring potentially large networks with only few biological samples. Fortunately, biological networks are sparse, in the sense that only few elements interact with each other. This sparsity assumption is incorporated into our statistical methods based on penalized graphical models.

The proposed method is implemented in the netphenogeno() function. The function can be called with the following arguments:

R> netphenogeno(data, method = "gibbs", rho = NULL,
+Ψ   n.rho = NULL, rho.ratio = NULL, ncores = "all",
+    em.iter = 5, em.tol = .001, verbose = TRUE)

The netphenogeno returns an object of S3 class type “netgwas”. The functions plot, print and summary work with the object “netgwas”. The input data can be an () matrix or a data.frame where is the sample size and is the dimension that includes marker data and phenotype measurements. One may consider including more columns, like environmental variables.

The argument method determines the type of methods, “gibbs”, “approx”, or “npn”. Option “gibbs” is based on the Gibbs sampler within Gaussian copula graphical models. It is designed for small data (). Option “approx” is based on the Gaussian copula graphical model with the approximation approach, and “npn” is based on semi-parametric Gaussian copula, nonparanormal. The last two methods are faster compare with “gibbs”. In particular “npn” is designed for very large datasets. All the three methods are designed for exploring the conditional independence network for ordinal data, non-Gaussian data, and mixed discrete-and-continuous data.

In the argument rho a sequence of decreasing positive numbers can be provided to control the regularization. Typical usage is to leave the input rho = NULL and have the program compute its own rho sequence based on n.rho and rho.ratio. The program automatically sets up a sequence of n.rho regularization parameters and estimates the graph path. Option ncores determines the number of cores to use for the calculations. Using ncores = “all” automatically detects the number of available cores and runs the computations in parallel on the available cores minus one. The code is memory-optimized, using the sparse matrix data structure when estimating and storing full regularization paths for large data sets.

Figure 5: Genotype–phenotype conditional interaction networks in A.thaliana. Red nodes show phenotypes; white, yellow, gray, blue, and brown colors stand for chromosomes 1 to 5, respectively. Phenotypes measured in long days (TLN-LD, RLN-LD, DTF-LD) conditionally dependent on a region on top of chromosome 5 given the other locations in the genome. CLN-LD is correlated to a region in chromosome 1. Phenotypes measured in short days are linked mostly to chromosomes 1, 2, and 5.

Genotype–phenotype network in A.thaliana

We have applied our algorithm to a public Arabidopsis thaliana dataset, where the accession Kend-L (Kendalville-Lehle; Lehle-WT-16-03) is crossed with the common lab strain Col (Columbia) (Balasubramanian et al., 2009). The resulting lines were taken through six rounds of selfing without any intentional selection. The resulting 282 KendC (Kend-L Col) lines were genotyped at 181 markers. Flowering time was measured for 197 lines of this population both in long days, which promote rapid flowering in many A. thaliana strains, and in short days. Flowering time was measured using days to flowering (DTF) as well as the total number of leaves (TLN), partitioned into rosette and cauline leaves. In total, eight phenotypes were measured, namely days to flowering (DTF), cauline leaf number (CLN), rosette leaf number (RLN), and total leaf number (TLN) in long days (LD), and DTF, CLN, RLN, and TLN in short days (SD). Thus, the final dataset consists of 197 observations for 189 variables (8 phenotypes and 181 genotypes - SNP markers).

R> data(thaliana)
R> head(thaliana, n = 3)
     DTF_LD CLN_LD RLN_LD TLN_LD DTF_SD CLN_SD RLN_SD TLN_SD snp1
[1,]  17.58   3.42  12.17  15.58  56.92  12.42  50.92  63.33    2
[2,]  17.00   2.58  11.33  13.92  53.33   8.42  41.58  50.00    0
[3,]  27.50   8.08  26.92  35.00  69.17  15.17  66.92  82.08    2


     snp2 ... snp180 snp181
[1,]    2        0    2
[2,]    0        2    2
[3,]   NA        0    0

R> out <- netphenogeno(thaliana)
R> sel <- selectnet(out)
R> plot(sel, vis="interactive")

Fig 5 shows the genotype-phenotype network for this population. The network reveals those SNP markers that are directly correlated with the flowering phenotypes. For example in long days, the phenotype days to flowering (DTF-LD) is directly associated with markers , , , and in chromosome 5 which have assay IDs , , , and . Balasubramanian et al. (2009) have reported that the phenotypes DTF-LD is associated with markers from to with assay ID to . Our finding regarding DTF-LD phenotype is consistent with their finding; however, our result has a much stronger interpretation compared to original findings, because we control for all possible effects. Thus, did not show any association with DTF-LD after adjustments, but and on chromosome 5 show an association with DTF-LD, even after taking into account the effect of all other SNPs and phenotypes. We remark that our method reduces the number of candidate SNPs and gives a small set of much more plausible ones. Balasubramanian et al. (2009) have reported that are associated with TLN-LD phenotype, But our method reduces this set to the smaller set of after we control the effect of all other SNPs. Moreover, it avoids many false positives that can occur when using traditional QTL analysis. Furthermore, the association between phenotype CLN-LD and markers and has remained undetected by the use of traditional QTL analysis. Our method goes beyond the bivariate testing of individual SNPs that looks only marginal association. Instead we use a multivariate approach which includes all the SNPs and phenotypes. In this regards, Balasubramanian et al. (2009) have reported that the TLN-SD phenotype is associated with a region in chromosome 5, whereas our proposed method shows that there is no direct link between the TLN-SD phenotype and a region in chromosome 5; TLN-SD is connected to chromosome 5 through the DTF-SD phenotype.

SNP-disease network in mice

The Mus Musculus HDL data (hdl) were obtained from an F2 inner-cross between inbred MRL/MpJ and SM/J strains of mice (Leduc et al., 2012). The original data included gene expression traits for males and females. After filtering based on the location and significance of QTL, the data include phenotypes ( genes and HDL level) and their 5 SNP markers, corresponding to their QTL. The final dataset consists of observations of variables– SNP markers and phenotypes, normalized gene expression and HDL levels. Data are shown as follows:

R> data(hdl)
R> head(hdl, n=3)
     c1 c2 c4 c7 c12   HDL  Pla2g4a Nr1i3 Cyp2b10 Ppap2a Kdsr
1    2  1  2  1   3 -0.16   0.67  -0.57    0.97   0.66 0.38
2    3  3  3  2   2 -0.84   -0.75  1.14   -0.06   0.41 1.96
3    3  1  2  2   1 -1.17   -0.08 -0.52   -0.31  -1.01 0.37
     Degs1 Neu1 Spgl1 Apoa2
1    0.40 0.69  1.35  1.14
2   -0.90 0.74 -1.37  2.19
3    0.61 1.45 -0.03  0.79
Figure 6: SNP-Disease network in hdl data in mice

Three possible genotype states MM (homozygous) are denoted by 1, H (heterozygous) by 2, and SS (homozygous) by 3. The genotypes are ordinal variables while the phenotypes are continuous variables with 10 columns in data frame hdl.

Fig 6 reconstructs the genotype–phenotype network between the five SNP markers, the HDL level, and the gene expressions. Node c1 is a hub locus in the network, as it is directly connected to the HDL level and genes such as apolipoprotein A-II (Apoa2), degenerative spermatocyte homolog 1 (Degs1), and 3-ketodihydrosphingosine reductase (Kdsr). Furthermore, genes Apoa2, Degs1, and Neu1 are directly associated with the HDL level. while in the original study (Leduc et al., 2012) associations between HDL and Degs1, Neu1, Spgl1 genes remained undetected.

Genotype–phenotype network in mice

To better understand the genetic basis of essential hypertension, we reconstruct a conditional independence network between genotypes and phenotypes in mice. The data are from an intercross between BALB/cJ and CBA/CaJ mouse strains (Sugiyama et al., 2002). Only male offspring were considered. The data consist of 93 SNP markers across the genome, and four phenotypes: blood pressure (bp), heart rate (hr), body weight (bw), and heart weight (heart-weight), as measured for 163 individuals. Data are shown as follows:

R> data(bp)
R> head(bp, n = 3)

  bp   hr    bw  heart_wt D1MIT171 D1MIT46 D1MIT10  ...
1 104  517  37.0     133       0       0        0
2 108  690  38.9     135       0       1        1
3 115  653  43.8     159       0       2        2
    D19MIT11 D19MIT71
1       2        2
2       0        0
3       0        0

There are 3 possible genotype states: CC (homozygous) denoted by 0, CB (heterozygous) denoted by 1, and BB (homozygous) denoted by 2. In data frame bp, the genotypes are ordinal variables, whereas the phenotypes are continuous. The data also include some missing observations.

Fig 7 shows the conditional dependence network between the genetic markers across the mice genome and the phenotypes: blood pressure (bp), heart rate (hr), body weight (bw), and heart weight (heart-wt). The conditional independence network in Fig 7 explores genomic regions that regulate blood pressure, heart rate, and heart weight. We identified that two loci“D15MIT184” and “D15MIT175” on chromosome 15 and locus “D7MIT31” on chromosome 7 are associated with blood pressure even after correcting the effects of other SNPs and phenotypes. Our finding regarding blood pressure is consistent with Sugiyama et al. (2002). However, for heart rate phenotype they have reported the two loci “D2Mit304” and “D15Mit184” on chromosomes 2 and 15 were significantly associated with heart rate, but in our findings only two loci “D2MIT304” and “D2MIT274” on chromosome 2 are associated with heart rate phenotype after controlling for the effect of all other SNPs.

Figure 7: Conditional independence network between phenotypes blood pressure (bp), heart rate (hr), body weight (bw), and heart weight (heart-wt) and genetic map in mice.

3 Discussion

We have presented the netgwas package, which is designed based on graphical models, to accomplish three fundamental goals in genetics: linkage map construction, reconstruction of linkage disequilibrium networks, and exploration of high–dimensional genotype–phenotype (disease) networks. The novelty of the underlying methodology is the use of graphical model to accomplish these tasks in a unified way. Moreover, the netgwas package can deal with species of any ploidy level. Due to the fact that we control the effect of all other variables while measuring the pairwise associations, this allows us for a more powerful interpretation of our findings than classical approaches, which tests only the marginal associations.

The package implements the methods developed by Behrouzi and Wit (2017a) and Behrouzi and Wit (2017b) for linkage map construction and inferring of conditional independence networks for non-Gaussian data, discrete data, and mixed discrete-and-continuous data. We note that reproducibility of our results and all the example data that used to illustrate the package is supported by the open-source R package netgwas. We will in the future maintain and develop the package, and extend our package to calculate the genetic distances between markers in the linkage map construction, and to adjust for population structure in the linkage disequilibriums and genotype–phenotype networks.

4 Materials and Methods

In graphical models, each random variable is associated with a node on a graph. The conditional dependence relationships among the random variables are presented as a graph in which specifies a set of nodes and a set of existing links (Lauritzen, 1996). Our focus here is on undirected graphs, in which

. The absence of a link between two nodes specifies the pairwise conditional independence of those two associated random variables given the remaining variables, while a link between two variables indicates their conditional dependence. In Gaussian graphical models, the observed data follow a multivariate Gaussian distribution

. Here, conditional independence is implied by the zero structure of the precision matrix . Based on the pairwise Markov property, variables and are conditionally independent given the remaining variables, if and only if . This property implies that the links in graph correspond with the nonzero elements of the precision matrix , i.e. .

Sparse latent graphical model

A p-dimensional copula is a multivariate distribution with uniform margins on

. Any joint distribution function can be written in terms of its marginals and a copula which encodes the dependence structure. Here we consider a subclass of joint distributions encoded by the Gaussian copula,

where

is the cumulative distribution function (CDF) of p-variate Gaussian distribution with correlation matrix

; is the univariate standard normal CDF; and is the CDF of . Note that and are independent if and only if .

A Gaussian copula can be written in terms of latent variables : Let be the pseudo-inverse of the marginals and be the covariance matrix whose diagonal has normalized with as its correlation matrix. Then a Gaussian copula is defined as:

where and represent the non-Gaussian observed variables and Gaussian latent variables, respectively. We denote the associated latent data as , where .

In order to learn the graphical model, our objective is to estimate precision, the inverse of covariance matrix from independent observations , where . It is well known that the conditional independence between two variables, given other variables, is equivalent to the corresponding element in the precision matrix being zero, i.e. ; or put another way, a missing edge between two variables in a graph G represents conditional independence between the two variables given all other variables. In other words, conditional independence is quantified in terms of partial correlations.

In the classical low-dimensional setting, in which is smaller than , it is natural to implement a maximum likelihood approach to obtain the inverse of the sample covariance matrix. However, in modern applications like genetic networks, including linkage map construction, intra– and inter– chromosomal interactions and network–based QTL analysis, the dimension is routinely far larger than , meaning that the inverse sample covariance matrix does not exist. Motivated by the sparseness assumption of the graph, i.e., most are zero, we tackle the high-dimensional inference problem by using the penalized log-likelihood estimation procedure. We consider the penalized likelihood,

(1)

where we use a sparsity penalty function such as the norm penalty or smoothly clipped absolute deviation (SCAD) penalty on the precision matrix. The norm is defined as

which leads to a desirable optimization problem. Alternatively, we define the SCAD penalty in terms of its first order derivative, given by

for , where and are two tuning parameters. This penalty function produces sparse solution and approximately unbiased coefficient estimates for large coefficients. For the numerical studies we use as recommended by Fan and Li (2001).

Since includes discrete variables, those integrals in (1) are intractable, and instead we solve (1

) by a penalized expectation maximization (EM) algorithm.

(2)

where

(3)

and is the iteration number within the EM algorithm. We compute the conditional expectation inside (3

) using two different approaches: numerically through a Monte Carlo (MC) sampling method, and through a first order approximation. The most flexible and generally applicable approach for obtaining a sample in each iteration of an MCEM algorithm is through a Markov chain Monte Carlo (MCMC) routine like Gibbs and Metropolis–Hastings samplers (more details in

Behrouzi and Wit (2017b)). Alternatively, the conditional expectation in equation (3

) can be computed by using an efficient approximation approach which calculates elements of the empirical covariance matrix using the first and second moments of a truncated normal distribution with mean and variance as follows (see

Behrouzi and Wit (2017b) for more details):

The proposed method is practical when some observations are missing. If genotype information for genotype is missing, it is still possible to draw Gibbs samples for or approximate the empirical covariance matrix, as the corresponding conditional distribution is Gaussian.

The optimization problem in (2) can be solved efficiently in various ways by using glasso or CLIME approaches (Friedman et al., 2008, Hsieh et al., 2011). Convergence of the EM algorithm for penalized likelihood problems has been proved in Green (1990). Our experimental study shows that the algorithm usually converges after several iterations . Note that in both cases the penalty parameter needs to be selected appropriately in the last EM iteration to recover the precision matrix. Thus, in line with Behrouzi and Wit (2017b) we perform model selection using eBIC or AIC to choose a suitable regularization parameter in (1) to produce a sparse undirected graph with a sparsity pattern corresponding to . Alternatively, instead of using the EM algorithm, a nonparanormal skeptic approach can be used to estimate graph structure through Spearman’s rho and Kendall’s tau statistics; details can be found in Liu et al. (2012) and Behrouzi and Wit (2017a).

References

  • Balasubramanian et al. (2009) Balasubramanian, S., C. Schwartz, A. Singh, N. Warthmann, M. C. Kim, J. N. Maloof, O. Loudet, G. T. Trainer, T. Dabi, J. O. Borevitz, et al. (2009). Qtl mapping in new arabidopsis thaliana advanced intercross-recombinant inbred lines. PLoS One 4(2), e4318.
  • Bates and Maechler (2014) Bates, D. and M. Maechler (2014). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.1-2.
  • Behrouzi and Wit (2017a) Behrouzi, P. and E. Wit (2017a). De novo construction of q-ploid linkage maps using discrete graphical models. arXiv preprint arXiv:1710.01063v2.
  • Behrouzi and Wit (2017b) Behrouzi, P. and E. Wit (2017b). Detecting epistatic selection with partially observed genotype data using copula graphical models. arXiv preprint arXiv:1710.00894.
  • Bikard et al. (2009) Bikard, D., D. Patel, C. Le Metté, V. Giorgi, C. Camilleri, M. J. Bennett, and O. Loudet (2009). Divergent evolution of duplicate genes leads to genetic incompatibilities within a. thaliana. Science 323(5914), 623–626.
  • Bourke et al. (2017) Bourke, P., G. van Geest, R. E. Voorrips, J. Jansen, T. Kranenburg, A. Shahin, R. G. Visser, P. Arens, M. J. Smulders, and C. Maliepaard (2017). polymapr: linkage analysis and genetic map construction from f1 populations of outcrossing polyploids. bioRxiv, 228817.
  • Bourke et al. (2016) Bourke, P. M., R. E. Voorrips, T. Kranenburg, J. Jansen, R. G. Visser, and C. Maliepaard (2016). Integrating haplotype-specific linkage maps in tetraploid species using snp markers. Theoretical and Applied Genetics 129(11), 2211–2226.
  • Broman et al. (2003) Broman, K. W., H. Wu, Ś. Sen, and G. A. Churchill (2003). R/qtl: Qtl mapping in experimental crosses. Bioinformatics 19(7), 889–890.
  • Buetow and Chakravarti (1987) Buetow, K. H. and A. Chakravarti (1987). Multipoint gene mapping using seriation. i. general methods. American journal of human genetics 41(2), 180.
  • Buzdugan et al. (2016) Buzdugan, L., M. Kalisch, A. Navarro, D. Schunk, E. Fehr, and P. Bühlmann (2016). Assessing statistical significance in multivariable genome wide association analysis. Bioinformatics 32(13), 1990–2000.
  • Csardi and Nepusz (2006) Csardi, G. and T. Nepusz (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695.
  • Cuthill and McKee (1969) Cuthill, E. and J. McKee (1969). Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th national conference, pp. 157–172. ACM.
  • Doerge et al. (1996) Doerge, R. et al. (1996). Constructing genetic maps by rapid chain delineation. Journal of Agricultural Genomics 2(6).
  • Fan and Li (2001) Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456), 1348–1360.
  • Friedman et al. (2008) Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
  • Grandke et al. (2017) Grandke, F., S. Ranganathan, N. van Bers, J. R. de Haan, and D. Metzler (2017). Pergola: fast and deterministic linkage mapping of polyploids. BMC Bioinformatics 18(1), 12.
  • Green (1990) Green, P. J. (1990). On use of the em for penalized likelihood estimation. Journal of the Royal Statistical Society. Series B (Methodological), 443–452.
  • Hackett and Luo (2003) Hackett, C. and Z. Luo (2003). Tetraploidmap: construction of a linkage map in autotetraploid species. Journal of Heredity 94(4), 358–359.
  • He and Lin (2010) He, Q. and D.-Y. Lin (2010). A variable selection method for genome-wide association studies. Bioinformatics 27(1), 1–8.
  • Hoggart et al. (2008) Hoggart, C. J., J. C. Whittaker, M. De Iorio, and D. J. Balding (2008). Simultaneous analysis of all snps in genome-wide and re-sequencing association studies. PLoS genetics 4(7), e1000130.
  • Hsieh et al. (2011) Hsieh, C.-J., I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in neural information processing systems, pp. 2330–2338.
  • Jansen et al. (2001) Jansen, J., A. De Jong, and J. Van Ooijen (2001). Constructing dense genetic linkage maps. Theoretical and Applied Genetics 102(6-7), 1113–1122.
  • Lander et al. (1987) Lander, E. S., P. Green, J. Abrahamson, A. Barlow, M. J. Daly, S. E. Lincoln, and L. Newburg (1987). Mapmaker: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1(2), 174–181.
  • Lauritzen (1996) Lauritzen, S. (1996). Graphical Models, Volume 17. Oxford University Press, USA.
  • Leduc et al. (2012) Leduc, M. S., R. H. Blair, R. A. Verdugo, S.-W. Tsaih, K. Walsh, G. A. Churchill, and B. Paigen (2012). Using bioinformatics and systems genetics to dissect hdl cholesterol levels in an mrl/mpj x sm/j intercross. Journal of lipid research, jlr–M025833.
  • Liu et al. (2014) Liu, D., C. Ma, W. Hong, L. Huang, M. Liu, H. Liu, H. Zeng, D. Deng, H. Xin, J. Song, et al. (2014). Construction and analysis of high-density linkage map using high-throughput sequencing data. Plos one 9(6), e98855.
  • Liu et al. (2012) Liu, H., F. Han, M. Yuan, J. Lafferty, L. Wasserman, et al. (2012). High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics 40(4), 2293–2326.
  • Margarido et al. (2007) Margarido, G., A. Souza, and A. Garcia (2007). Onemap: software for genetic mapping in outcrossing species. Hereditas 144(3), 78–79.
  • Massa et al. (2015) Massa, A. N., N. C. Manrique-Carpintero, J. J. Coombs, D. G. Zarka, A. E. Boone, W. W. Kirk, C. A. Hackett, G. J. Bryan, and D. S. Douches (2015). Genetic linkage mapping of economically important traits in cultivated tetraploid potato (solanum tuberosum l.). G3: Genes, Genomes, Genetics 5(11), 2357–2364.
  • Panagiotou et al. (2011) Panagiotou, O. A., J. P. Ioannidis, and G.-W. S. Project (2011). What should the genome-wide significance threshold be? empirical replication of borderline genetic associations. International journal of epidemiology 41(1), 273–286.
  • Preedy and Hackett (2016) Preedy, K. and C. Hackett (2016). A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theoretical and Applied Genetics 129(11), 2117–2132.
  • Rakitsch et al. (2012) Rakitsch, B., C. Lippert, O. Stegle, and K. Borgwardt (2012). A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29(2), 206–214.
  • Schiex and Gaspin (1997) Schiex, T. and C. Gaspin (1997). Cartagene: Constructing and joining maximum likelihood genetic maps. In Proceedings of the fifth international conference on Intelligent Systems for Molecular Biology.
  • Simon et al. (2008) Simon, M., O. Loudet, S. Durand, A. Bérard, D. Brunel, F. Sennesal, M. Durand-Tardif, G. Pelletier, and C. Camilleri (2008). Qtl mapping in five new large ril populations of arabidopsis thaliana genotyped with consensus snp markers. Genetics 178, 2253–2264.
  • Sugiyama et al. (2002) Sugiyama, F., G. A. Churchill, R. Li, L. J. Libby, T. Carver, K.-i. Yagami, S. W. John, and B. Paigen (2002). Qtl associated with blood pressure, heart rate, and heart weight in cba/caj and balb/cj mice. Physiological genomics 10(1), 5–12.
  • Tan and Fu (2006) Tan, Y.-D. and Y.-X. Fu (2006). A novel method for estimating linkage maps. Genetics 173(4), 2383–2390.
  • Van Os et al. (2005) Van Os, H., P. Stam, R. G. Visser, and H. J. Van Eck (2005). Record: a novel method for ordering loci on a genetic linkage map. Theoretical and Applied Genetics 112(1), 30–40.
  • Wang et al. (2016) Wang, H., F. A. van Eeuwijk, and J. Jansen (2016). The potential of probabilistic graphical models in linkage map construction. Theoretical and Applied Genetics, 1–12.
  • Welter et al. (2013) Welter, D., J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, et al. (2013). The nhgri gwas catalog, a curated resource of snp-trait associations. Nucleic acids research 42(D1), D1001–D1006.
  • Wu et al. (2002) Wu, R., C.-X. Ma, I. Painter, and Z.-B. Zeng (2002). Simultaneous maximum likelihood estimation of linkage and linkage phases in outcrossing species. Theoretical population biology 61(3), 349–363.
  • Wu et al. (2008) Wu, Y., P. R. Bhat, T. J. Close, and S. Lonardi (2008). Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS genetics 4(10), e1000212.
  • Zych et al. (2015) Zych, K., Y. Li, J. K. van der Velde, R. V. Joosen, W. Ligterink, R. C. Jansen, and D. Arends (2015). Pheno2geno-high-throughput generation of genetic markers and maps from molecular phenotypes for crosses between inbred strains. BMC bioinformatics 16(1), 51.