`gINLAnd`

PACKAGEGilles Guillot - Technical University of Denmark

The `gINLAnd`

package consists of a set of R functions to detect signature of selection in genomic data. The rationale of the underlying model is that loci displaying an outstanding statistical dependence with a certain environmental variable are likely to belong to a genomic region under selection.
The program essentially quantifies the magnitude of this statistical dependence and computes a statistical measure of how likely it is to observe this dependence by chance.
Computations are based on the Integrated Nested Laplace Approximation (INLA) method proposed by Rue et al. (2009) and the connection between Stochastic Partial Differential Equations (SPDEs) and Gaussian Markov Random Field (GMRF) introduced by Lindgren et al (2011).
The `gINLAnd`

package bundles functions that mainly wrap some R code calling the `INLA`

package.
The main output of the program is a list of loci ranked by decreasing evidence of selection.

These functions perform a combination of the four tasks below:

- Inference of spatial covariance structure
- Computation of Bayes factors (that can be used to rank loci by evidence of selection)
- Simulation of data
- Conversion between various data formats

The main tasks, namely inference and Bayes factors computation, can be carried out via the graphical user interface that does not require any knowledge about R.

- Install the
`INLA`

package following instructions from http://www.r-inla.org/download - Download the file http://www2.imm.dtu.dk/~gigu/gINLAnd/gINLAnd_0.0.0.tar.gz.
- Launch R
- In the R prompt type:

```
install.packages(pkgs='gINLAnd_0.0.0.tar.gz' , repos=NULL , type='source')
```

The data required are

- Allele counts for various sampling units with the following requirements:
- This should be a matrix with one row per geographical sampling site and one column per locus
- The markers should be bi-allelic co-dominant or dominant.
- The sampling units can be either individuals or groups of individuals observed at the same site
- Missing data are allowed

- Sample sizes
- This should be a matrix with one row per geographical sampling site and one column per locus
- This should be haploid sample sizes, so two times the number of individuals for diploid organisms and so on.

- Spatial coordinates of the sampling units.
- This should be a matrix with two columns and one row per sampling site. It can be Lon-Lat coordinates or UTM coordinates.

- Measurements of environmental variables at the same geographcial locations as genetic data.
- This should be a matrix with one row per sampling site and one column per environmental variable.

The main outputs are Bayes factor for each combination locus x environmental variable. Such a Bayes factor reflects evidence of selection at this locus.

- In an R shell, type:

```
library(gINLAnd)
library(INLA)
```

- In an R shell, type:

```
gINLAnd.GUI()
```

In the example below we use the data provided as example in the `data`

folder of the package.
These data can be loaded as follows:

```
data(coord,package='gINLAnd')
data(allele.counts,package='gINLAnd')
data(pop.size,package='gINLAnd')
data(env.var,package='gINLAnd')
```

In a first step, we need to estimate the covariance structure of the data.
This step makes use of the allele counts with the `gINLAnd.inference`

function.

```
res.infcov <- gINLAnd.inference(s=coord,sphere=FALSE,
z=allele.counts,codominant=TRUE,
pop.size=pop.size,
inference.cov=TRUE)
```

The result is a list with among others, two components named `tau`

and `kappa`

which can be accessed as:

```
res.infcov$tau
res.infcov$kappa
```

The task above will attempt to process the whole data matrix in a single INLA run. This may become prohibitive for large SNP datasets. However, in our experience, this task can be performed on a small subset of loci (500-1000 loci) without noticeable loss of accuracy. This can be done by e.g.

```
subs = sample(x=1:ncol(allele.counts),size=500,replace=FALSE)
res.infcov <- gINLAnd.inference(s=coord,sphere=FALSE,
z=allele.counts,codominant=TRUE,
pop.size=pop.size,
inference.cov=TRUE,
subset.loci.inf.cov = subs)
```

Now we use this knowledge about the spatial structure to compute Bayes factors.

In the example below, we compute Bayes factors that

```
res.bylocus <- gINLAnd.inference(s=coord,sphere=FALSE,
z=allele.counts,codominant=TRUE,
pop.size=pop.size,y=env.var[,1],
tau=res.infcov$tau,kappa=res.infcov$kappa,
mlik=TRUE,
models.mlik=list("z~1"=FALSE,"z~1+x"=TRUE,"z~1+y"=FALSE,"z~1+x+y"=TRUE),
inference.cov=FALSE)
```

The task above will process each locus sequentially and can therefore be splitted in several sub-tasks corresponding to several batches of loci.

The result returned is a list. One of its components is a matrix named `evidence`

.
Each row of this matrix contains the log-evidence or log-marginal likelihood of a model, namely
\[ \ln \int p_{m}(Data | \theta) p_m(\theta) d \theta \]
The full model is a model with a spatially structured random effect and a fixed effect of an environmental variable.
The evidence for this model can be compared to that of a model that assumes no effect of the environmental variable.
For two models \( m \) and \( m' \), we have the Bayes factor
\[ \int p_{m}(Data | \theta) p_m(\theta) d \theta \left/ \int p_{m'}(Data | \theta) p_{m'}(\theta) d \theta \right. \]

The log- Bayes factors for the various loci are also stored in the result above. We can plot them by e.g.

```
plot(res.bylocus$logBF)
```

The command below will return the index of loci by decresasing evidence of being correlated with the environmental variable considered.

```
order(res.bylocus$logBF,decreasing=TRUE)
```

TODO

TODO

Coop G., Witonsky D., Di Rienzo A., Pritchard J.K. Using Environmental Correlations to Identify Loci Underlying Local Adaptation. Genetics. 2010

G. Guillot, R. Vitalis, A. le Rouzic, M. Gautier. Detecting correlation between allele frequencies and environmental variables as a signature of selection. A fast computational approach for genome-wide studies, Spatial Statistics, 8 (2014) 145-155. dx.doi.org/10.1016/j.spasta.2013.08.001dx.doi.org/10.1016/j.spasta.2013.08.001.

S Joost, A Bonin, MW Bruford, L Després, C Conord, G Erhardt, P Taberlet. A spatial analysis method (SAM) to detect candidate loci for selection: towards a landscape genomics approach to adaptation. Molecular Ecology 16 (18), 3955-3969

F. Lindgren, H. Rue, E. Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach Journal of the Royal Statistical Society, Series B, 73 (4) (2011), pp. 423–498

F. Lindgren, H. Rue. Bayesian Spatial and Spatio-temporal Modelling with R-INLA, 2013. http://www.math.ntnu.no/inla/r-inla.org/papers/jss/lindgren.pdf

E.T. Krainski and F. Lindgren, The R-INLA tutorial: SPDE models, 2013. http://www.math.ntnu.no/inla/r-inla.org/tutorials/spde/spde-tutorial.pdf

H. Rue, S. Martino, N. Chopin Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations Journal of the Royal Statistical Society, Series B, 71 (2) (2009), pp. 1–35