DOCUMENTATION FOR THE gINLAnd PACKAGE

Gilles Guillot - Technical University of Denmark

gilles.b.guillot@gmail.com

Overview

The gINLAnd package consists of a set of R functions to detect signature of selection in genomic data. The rationale of the underlying model is that loci displaying an outstanding statistical dependence with a certain environmental variable are likely to belong to a genomic region under selection. The program essentially quantifies the magnitude of this statistical dependence and computes a statistical measure of how likely it is to observe this dependence by chance. Computations are based on the Integrated Nested Laplace Approximation (INLA) method proposed by Rue et al. (2009) and the connection between Stochastic Partial Differential Equations (SPDEs) and Gaussian Markov Random Field (GMRF) introduced by Lindgren et al (2011). The gINLAnd package bundles functions that mainly wrap some R code calling the INLA package. The main output of the program is a list of loci ranked by decreasing evidence of selection.

These functions perform a combination of the four tasks below:

The main tasks, namely inference and Bayes factors computation, can be carried out via the graphical user interface that does not require any knowledge about R.

Installation

install.packages(pkgs='gINLAnd_0.0.0.tar.gz' , repos=NULL , type='source')

Input data

The data required are

Output

The main outputs are Bayes factor for each combination locus x environmental variable. Such a Bayes factor reflects evidence of selection at this locus.

Loading the package

library(gINLAnd)
library(INLA)

Computation using the Graphical User Interface

gINLAnd.GUI()

Computation using R scripts

In the example below we use the data provided as example in the data folder of the package. These data can be loaded as follows:

data(coord,package='gINLAnd')
data(allele.counts,package='gINLAnd')
data(pop.size,package='gINLAnd')
data(env.var,package='gINLAnd')

Inference of covariance structure

In a first step, we need to estimate the covariance structure of the data. This step makes use of the allele counts with the gINLAnd.inference function.

res.infcov <- gINLAnd.inference(s=coord,sphere=FALSE,
                                z=allele.counts,codominant=TRUE,
                                pop.size=pop.size,
                                inference.cov=TRUE)

The result is a list with among others, two components named tau and kappa which can be accessed as:

res.infcov$tau
res.infcov$kappa

The task above will attempt to process the whole data matrix in a single INLA run. This may become prohibitive for large SNP datasets. However, in our experience, this task can be performed on a small subset of loci (500-1000 loci) without noticeable loss of accuracy. This can be done by e.g.

subs = sample(x=1:ncol(allele.counts),size=500,replace=FALSE)
res.infcov <- gINLAnd.inference(s=coord,sphere=FALSE,
                                z=allele.counts,codominant=TRUE,
                                pop.size=pop.size,
                                inference.cov=TRUE,
                                subset.loci.inf.cov = subs)

Computation of Bayes factors

Now we use this knowledge about the spatial structure to compute Bayes factors.

In the example below, we compute Bayes factors that

res.bylocus <- gINLAnd.inference(s=coord,sphere=FALSE,
                                 z=allele.counts,codominant=TRUE,
                                 pop.size=pop.size,y=env.var[,1],
                                 tau=res.infcov$tau,kappa=res.infcov$kappa,
                                 mlik=TRUE,
                                 models.mlik=list("z~1"=FALSE,"z~1+x"=TRUE,"z~1+y"=FALSE,"z~1+x+y"=TRUE),
                                 inference.cov=FALSE)

The task above will process each locus sequentially and can therefore be splitted in several sub-tasks corresponding to several batches of loci.

The result returned is a list. One of its components is a matrix named evidence. Each row of this matrix contains the log-evidence or log-marginal likelihood of a model, namely \[ \ln \int p_{m}(Data | \theta) p_m(\theta) d \theta \] The full model is a model with a spatially structured random effect and a fixed effect of an environmental variable. The evidence for this model can be compared to that of a model that assumes no effect of the environmental variable. For two models \( m \) and \( m' \), we have the Bayes factor \[ \int p_{m}(Data | \theta) p_m(\theta) d \theta \left/ \int p_{m'}(Data | \theta) p_{m'}(\theta) d \theta \right. \]

The log- Bayes factors for the various loci are also stored in the result above. We can plot them by e.g.

plot(res.bylocus$logBF)

The command below will return the index of loci by decresasing evidence of being correlated with the environmental variable considered.

order(res.bylocus$logBF,decreasing=TRUE)

Simulating data

TODO

Conversion between various data formats

TODO

References