R/fst_WC84.R
fst_WC84.Rd
The function calculates Weir and Cockerham (1984)
Fst for diploid genomes. Both overall and pairwise Fst can be estimated with
confidence intervals based on bootstrap of markers (resampling with replacement).
The function gives identical results at the 9th decimal when tested
against genet.dist
in hierfstat
. Using the
argument snprelate = TRUE
will compute the Fst with
SNPRelate. This implementation
gives slightly upward bias values but provided the fastest computations I know,
but it doesn't compute confidence intervals, for now.
For an R implementation, fst_WC84
is very fast.
The computations takes advantage of dplyr, tidyr, purrr,
parallel and SNPRelate.
The impact of unbalanced design on estimates can be tested by using the
subsample argument (see advance mode section).
Special concerns for genome-wide estimate and filtering bias
During computation, the function first starts by keeping only the polymorphic markers in common between the populations. Keep this in mind when filtering your markers to use this function characteristic strategically to get better genome-wide estimate. This is even more important when your project involves more than 2 populations that evolved more by neutral processes (e.g. genetic drift) than by natural selection (see the vignette for more details).
fst_WC84(
data,
snprelate = FALSE,
strata = NULL,
pop.levels = NULL,
pairwise = FALSE,
ci = FALSE,
iteration.ci = 100,
quantiles.ci = c(0.025, 0.975),
heatmap.fst = FALSE,
digits = 9,
filename = NULL,
parallel.core = parallel::detectCores() - 2,
verbose = FALSE,
...
)
A tidy data frame object in the global environment or
a tidy data frame in wide or long format in the working directory.
How to get a tidy data frame ?
Look into radiator tidy_genomic_data
.
You can also use this function to filter your dataset using
whitelist of markers, blacklist of individuals and genotypes.
(optional, logical) Use SNPRelate to compute the Fst. It's the fastest computation I've seen so far!
However, testing with different RADseq datasets as shown several upward bias
with SNPRelate::snpgdsFst
(last version tested was v.1.16.0).
I compared the results with assigner, hierfstat and strataG
(results available upon request).
The SNPRelate author as not given me good reason to belive the issue is fully
resolved, consequently, the option is no longer available, until further notice.
Default: snprelate = FALSE
(optional, data frame) A tab delimited file with 2 columns with header:
INDIVIDUALS
and STRATA
.
If a strata
file is specified, the strata file will have
precedence over any grouping found data file (data
).
The STRATA
column can be any hierarchical grouping.
Default: strata = NULL
.
(optional, string) This refers to the levels in a factor. In this
case, the id of the pop.
Use this argument to have the pop ordered your way instead of the default
alphabetical or numerical order. e.g. pop.levels = c("QUE", "ONT", "ALB")
instead of the default pop.levels = c("ALB", "ONT", "QUE")
.
Default: pop.levels = NULL
.
(optional, logical) With pairwise = TRUE
, the
pairwise WC84 Fst is calculated between populations.
Default: pairwise = FALSE
.
(optional, logical) Compute bootstrapped confidence intervals.
Default: ci = FALSE
.
(optional, integer) The number of iterations for
the boostraps (resampling with replacement of markers).
Default: iteration.ci = 100
.
(optional, double)
The quantiles for the bootstrapped confidence intervals.
Default: quantiles.ci = c(0.025,0.975)
.
(logical) Generate a heatmap with the Fst values in
lower matrix and CI in the upper matrix.
The heatmap can also be generated separately after the Fst
analysis using the separate function: heatmap_fst
.
Default: heatmap.fst = FALSE
.
(optional, integer) The number of decimal places to be used in
results.
Default: digits = 9
.
(optional, character) Give filename prefix, this will trigger
saving results in a directory.
Default: filename = NULL
.
(optional, integer) The number of core for parallel computation
of pairwise Fst. See also the advance mode section below.
Default: parallel.core = parallel::detectCores() - 1
.
(optional, logical) verbose = TRUE
to be chatty
during execution.
Default: verbose = FALSE
.
other parameters passed to the function.
The function returns a list with several objects.
When sumsample is selected the objects end with .subsample
.
$subsampling.individuals
: the combinations of individuals and subsamples,
$sigma.loc
: the variance components per locus, with
(lsiga
: among populations,
lsigb
: among individuals within populations,
lsigw
: within individuals)
$fst.markers
: the fst by markers,
$fst.ranked
: the fst ranked,
$fst.overall
: the mean fst overall markers and the number of markers
$fis.markers
: the fis by markers,
$fis.overall
: the mean fis overall markers and the number of markers,
$fst.plot
: the histogram of the overall Fst per markers,
$pairwise.fst
: the pairwise fst in long/tidy data frame and the number of markers ,
$pairwise.fst.upper.matrix
: the pairwise fst in a upper triangle matrix,
$pairwise.fst.full.matrix
: the pairwise fst matrix (duplicated upper and lower triangle),
$pairwise.fst.ci.matrix
: matrix with pairwise fst in the upper triangle
and the confidence intervals in the lower triangle.
when subsample is selected $pairwise.fst.subsample.mean
is a summary
of all pairwise comparisons subsample. The mean is calculated accross summary
statistics.
Negative Fst are technical artifact of the computation (see Roesti el al. 2012) and are automatically replaced with zero inside this function.
Why no p-values ?
There is no null hypothesis testing with P-values. Confidence intervals provided with the F-statistics enables more reliable conclusions about the biological trends in the data.
dots-dots-dots ... allows to pass several arguments for fine-tuning the function:
filter.monomorphic
(logical, optional) By default monomorphic
markers present in the dataset are removed (and it should stay that way...).
Default: filter.monomorphic = TRUE
.
holdout.samples
(optional, data frame) Samples that don't participate in the Fst
computation (supplementary). Data frame with one column INDIVIDUALS
.
This argument is used inside assignment analysis.
Default: holdout.samples = NULL
.
subsample
(Integer or character)
With subsample = 36
, 36 individuals in each populations are chosen
randomly to represent the dataset. With subsample = "min"
, the
minimum number of individual/population found in the data is used automatically.
Default is no subsampling, subsample = NULL
.
iteration.subsample
(Integer) The number of iterations to repeat
subsampling.
With subsample = 20
and iteration.subsample = 10
,
20 individuals/populations will be randomly chosen 10 times.
Default: iteration.subsample = 1
.
calibrate.alleles
(logical)
Un-calibrated alleles can bias estimate and by default the function expect that
the REF/ALT alleles are calibrated. Using calibrate.alleles = TRUE
,
can take a bit more time.
Default: calibrate.alleles = FALSE
.
forking
(logical)
For macOS, forking is a much faster option, despite having the reputation of
being unreliable in GUI environment like RStudio.
The implementation in assigner is stable and up to 3-4 times faster.
It's easier on memory.
By default the function uses multisession
from the future package
to implement parallel processing. If you have macOS, test by toggling to
TRUE
this argument.
Default: forking = FALSE
.
Excoffier L, Smouse PE, Quattro JM. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics. 1992;131: 479-491.
Meirmans PG, Van Tienderen PH (2004) genotype and genodive: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792-794.
Michalakis Y, Excoffier L. A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics. 1996;142: 1061-1064.
Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358-1370.
Roesti M, Salzburger W, Berner D. (2012) Uninformative polymorphisms bias genome scans for signatures of selection. BMC Evol Biol., 12:94. doi:10.1111/j.1365-294X.2012.05509.x
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28: 3326-3328. doi:10.1093/bioinformatics/bts606
From GenoDive manual:
'In general, rather than to test differentiation between all pairs of
populations,
it is advisable to perform an overall test of population differentiation,
possibly using a hierarchical population structure, (see AMOVA)'.
To compute an AMOVA, use GenoDive
or Phi_st_Meirmans
in mmod
.
For Fisher's exact test and p-values per markers
see mmod
diff_test
.
Vignette for this function: how to do the pairwise and overall Fst with confidence intervals and build the phylogenetic tree
if (FALSE) { # \dontrun{
wombat.fst.pairwise <- fst_WC84(
data = "wombat.filtered.tidy.tsv",
pop.levels = c("ATL", "MLE", "BIS", "PMO", "SOL", "TAS", "ECU"),
pairwise = TRUE,
ci = TRUE,
iteration.ci = 10000,
quantiles.ci = c(0.025,0.975),
parallel.core = 8,
verbose = TRUE,
filename = "wombat",
heatmap.fst = TRUE
)
# To get the overall Fst estimate:
wombat.fst.pairwise$fst.overall
# To get the Fst plot:
wombat.fst.pairwise$fst.plot
#To get the pairwise Fst values with confidence intervals in a data frame:
df <- wombat.fst.pairwise$pairwise.fst
} # }