assigner
, diveRsity
, GenoDive
, hierfstat
, SNPRelate
, stAMPP
, strataG
, VCFtools.
vignettes/web_only/fst_comparisons.Rmd
fst_comparisons.Rmd
rm(list = ls()) # clean workspace # install required packages if (!require("diveRsity")) install.packages("diveRsity") if (!require("hierfstat")) install.packages("hierfstat") if (!require("StAMPP")) install.packages("StAMPP") if (!require("strataG")) install.packages("strataG") if (!require("SNPRelate")) { if(!require("BiocManager")) install.packages("BiocManager") BiocManager::install("SNPRelate") } if (!require("ggcorrplot")) install.packages("ggcorrplot")
if (!require("remotes")) install.packages("remotes")
## Loading required package: remotes
if (!require("assigner")) remotes::install_github("thierrygosselin/assigner")
## Loading required package: assigner
## Registered S3 method overwritten by 'spdep':
## method from
## plot.mst ape
For macOS: GenoDive.
VCFtools: install instructions or here
Versions of packages and software:
# Check packages versions packageVersion('assigner') # 0.5.6. packageVersion('diveRsity') # 1.9.90 packageVersion('hierfstat') # 0.4.22 packageVersion('stAMPP') # 1.5.1 packageVersion('strataG') # 2.0.2 packageVersion('SNPRelate') # 1.18.0 #GenoDive: 2.27 # VCFtools.: 0.1.17
Put this Rmd notebook in the same directory as the data or use setwd("/YOUR/NEW/PATH/HERE/")
, to change the working directory.
Let’s see how assigner's
Fst calculations (using Weir and Cockerham, 1984) compares with alternatives. To make differences and bottlenecks really stand out we will use a RADseq dataset with thousands of markers.
We’re going to use the unfiltered vcf sunflower dataset from Owens et al. 2016, kindly provided by Gregory Owens. This dataset as missing data.
Owens GL, Baute GJ, Rieseberg LH. Revisiting a classic case of introgression: hybridization and gene flow in Californian sunflowers. Abbott RJ, Barton NH, Good JM, editors. Molecular Ecology. 2016;25: 2630–2643. doi:10.1111/mec.13569
Download the files:
After running this vignette once, I saw something strange in the results. To explore this further, this second grind of testing add 2 additional datasets: data_assigner_sim_01 and data_assigner_sim_02. The documentation explains how these simulated data were produced. They differ mainly in migration rate which resulted in contrastingly different Fst, admixture/membership probability.
We use radiator::genomic_converter to produce the required files or R objects:
# radiator is installed automatically with assigner sunflower.filtered <- radiator::genomic_converter( data = "sunflower.vcf", strata = "sunflower.strata.tsv", output = c( "genepop", # for GenoDive and diveRsity "gtypes", # for strataG "genlight", # for StAMPP "hierfstat", # for Hierfstat "snprelate" # for SNPRelate ) ) %>% dplyr::glimpse()
VCF stats:
radiator::genomic_converter
will only keep polymorphic markers in common between the strata.The sunflower.filtered
object is a list with several objects inside:
names(sunflower.filtered) #> [1] "hierfstat" "gtypes" "genlight" "snprelate" "tidy.data" # the tidy dataset is used by assigner
The function also produced a folder with several files, including:
To learn more about the output folders and files: radiator::genomic_converter
Note on timing
I could use fancy timing packages (microbenchmark, bench), but for simplicity we’re just going to use system.time
.
data_assigner_sim_01 %<>% dplyr::mutate(POP_ID = factor(x = POP_ID, levels = c("POP_1", "POP_2", "POP_3", "POP_4", "POP_5"))) %>% dplyr::arrange(POP_ID) sim.data1 <- radiator::genomic_converter( data = data_assigner_sim_01, output = c( "genepop", # for GenoDive and diveRsity "gtypes", # for strataG "genlight", # for StAMPP "hierfstat", # for Hierfstat "snprelate" # for SNPRelate ), filename = "sim01", path.folder = "fst_sim01", internal = TRUE, verbose = FALSE ) sim.data1$tidy.data %<>% dplyr::arrange(POP_ID)
data_assigner_sim_02 %<>% dplyr::mutate(POP_ID = factor(x = POP_ID, levels = c("POP_1", "POP_2", "POP_3", "POP_4", "POP_5"))) %>% dplyr::arrange(POP_ID) sim.data2 <- radiator::genomic_converter( data = data_assigner_sim_02, output = c( "genepop", # for GenoDive and diveRsity "gtypes", # for strataG "genlight", # for StAMPP "hierfstat", # for Hierfstat "snprelate" # for SNPRelate ), filename = "sim02", path.folder = "fst_sim02", internal = TRUE, verbose = FALSE ) sim.data2$tidy.data %<>% dplyr::arrange(POP_ID)
Let’s start with the Gold Standard, Jérôme Goudet’s hierfstat package.
# ETA ~ 30 min on my MBP, so yes, you have time for coffee... pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID))# pop string system.time( hierfstat.fst <- hierfstat::pairwise.WCfst( dat = sunflower.filtered$hierfstat, diploid = TRUE ) %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT")) %>% dplyr::arrange(POP1, POP2) ) #> 1233.880sec ~20 min hierfstat.fst
pops <- as.character(unique(sim.data1$tidy.data$POP_ID))# pop string hierfstat.sim01.fst <- hierfstat::pairwise.WCfst( dat = sim.data1$hierfstat, diploid = TRUE ) %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT_SIM01")) %>% dplyr::arrange(POP1, POP2) pops <- as.character(unique(sim.data2$tidy.data$POP_ID))# pop string hierfstat.sim02.fst <- hierfstat::pairwise.WCfst( dat = sim.data2$hierfstat, diploid = TRUE ) %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT_SIM02")) %>% dplyr::arrange(POP1, POP2)
This is Eric Archer’s strataG package.
system.time( stratag.fst <- strataG::popStructTest( g = sunflower.filtered$gtypes, stats = "fst", type = "both", quietly = TRUE, max.cores = parallel::detectCores() - 1, nrep = 0, keep.null = FALSE, write.output = FALSE) %$% pairwise$result %>% dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG = Fst) ) stratag.fst #>657.377sec ~ 11 min
stratag.sim01.fst <- strataG::popStructTest( g = sim.data1$gtypes, stats = "fst", type = "both", quietly = TRUE, max.cores = parallel::detectCores() - 1, nrep = 0, keep.null = FALSE, write.output = FALSE) %$% pairwise$result %>% dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG_SIM01 = Fst) stratag.sim02.fst <- strataG::popStructTest( g = sim.data2$gtypes, stats = "fst", type = "both", quietly = TRUE, max.cores = parallel::detectCores() - 1, nrep = 0, keep.null = FALSE, write.output = FALSE) %$% pairwise$result %>% dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG_SIM02 = Fst)
genepop
file.pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID)) # pop string system.time( diversity.fst <- diveRsity::diffCalc( infile = list.files( path = ".", pattern = "_genepop.gen", full.names = TRUE, recursive = TRUE ), outfile = "sunflower_diversity_output", fst = TRUE, pairwise = TRUE, para = TRUE ) %$% pairwise %$% Fst %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY")) ) #> 42.711 sec Wow! this is very fast, plus other distances are automatically generated! diversity.fst
pops <- as.character(unique(sim.data1$tidy.data$POP_ID)) # pop string diversity.sim01.fst <- diveRsity::diffCalc( infile = list.files( path = ".", pattern = "sim01_genepop.gen", full.names = TRUE, recursive = TRUE ), outfile = "sim_data1_diversity_output", fst = TRUE, pairwise = TRUE, para = TRUE ) %$% pairwise %$% Fst %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY_SIM01")) pops <- as.character(unique(sim.data2$tidy.data$POP_ID)) # pop string diversity.sim02.fst <- diveRsity::diffCalc( infile = list.files( path = ".", pattern = "sim02_genepop.gen", full.names = TRUE, recursive = TRUE ), outfile = "sim_data2_diversity_output", fst = TRUE, pairwise = TRUE, para = TRUE ) %$% pairwise %$% Fst %>% # The result need some work # We transform the matrix into a useful tibble: magrittr::set_colnames(x = ., value = pops) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY_SIM02"))
We need the strata
strata <- radiator::read_strata(strata = "sunflower.strata.tsv") %$% strata
Showing off the global Fst only:
system.time( snprelate.fst <- SNPRelate::snpgdsFst( gdsobj = sunflower.filtered$snprelate, population = strata$STRATA, # factors required sample.id = strata$INDIVIDUALS, snp.id = NULL, method = "W&C84", remove.monosnp = TRUE, maf = NaN, missing.rate = NaN, autosome.only = FALSE, with.id = FALSE, verbose = TRUE ) ) #> 0.029sec wo! names(snprelate.fst) #> snprelate.fst$Fst: weighted Fst estimate #> snprelate.fst$MeanFst: the average of Fst estimates across SNPs #> snprelate.fst$FstSNP: a vector of Fst for each SNP
Build a function to conduct pairwise Fst with SNPRelate:
pairwise_fst_snprelate <- function(pop.pairwise, data, strata) { strata.temp <- dplyr::filter(.data = strata, STRATA %in% pop.pairwise) %>% dplyr::mutate(STRATA = droplevels(STRATA)) fst <- SNPRelate::snpgdsFst( gdsobj = data, population = strata.temp$STRATA, # factors required sample.id = strata.temp$INDIVIDUALS, snp.id = NULL, method = "W&C84", remove.monosnp = TRUE, maf = NaN, missing.rate = NaN, autosome.only = FALSE, with.id = FALSE, verbose = TRUE ) #prepare the results into a tibble: fst <- tibble::tibble( POP1 = pop.pairwise[1], POP2 = pop.pairwise[2], SNPRELATE_MEAN = fst$MeanFst, SNPRELATE_WEIGHTED = fst$Fst ) return(fst) }
To run we need all combination of populations:
pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE)
Run the pairwise implementation:
system.time( snprelate.fst <- purrr::map_dfr( .x = pop.pairwise, .f = pairwise_fst_snprelate, data = sunflower.filtered$snprelate, strata = strata ) ) #> 0.245 sec that's the fastest so far!
strata <- radiator::generate_strata(data = sim.data1$tidy.data) %>% dplyr::rename(STRATA = POP_ID) pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE) snprelate.sim01.fst <- purrr::map_dfr( .x = pop.pairwise, .f = pairwise_fst_snprelate, data = sim.data1$snprelate, strata = strata ) %>% dplyr::rename(SNPRELATE_MEAN_SIM01 = SNPRELATE_MEAN, SNPRELATE_WEIGHTED_SIM01 = SNPRELATE_WEIGHTED) strata <- radiator::generate_strata(data = sim.data2$tidy.data) %>% dplyr::rename(STRATA = POP_ID) pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE) snprelate.sim02.fst <- purrr::map_dfr( .x = pop.pairwise, .f = pairwise_fst_snprelate, data = sim.data2$snprelate, strata = strata ) %>% dplyr::rename(SNPRELATE_MEAN_SIM02 = SNPRELATE_MEAN, SNPRELATE_WEIGHTED_SIM02 = SNPRELATE_WEIGHTED)
genepop
fileAnalysis -> Pairwise Differentiation -> F_st
, choose 1 permutations
.GenoDive
is coded in Objective-C (~ < 3 sec on my MBP)# Import in R # Add the proper columns and rows names pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID)) # pop string genodive.fst <- readr::read_tsv( file = "genodive.fst.tsv", col_names = pops ) %>% as.matrix(.) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE")) genodive.fst
Obviously, here the timing gets doesn’t reflect the whole process:
pops <- as.character(unique(sim.data1$tidy.data$POP_ID)) # pop string genodive.sim01.fst <- readr::read_tsv( file = list.files( path = ".", pattern = "genodive.sim01.fst.tsv", full.names = TRUE, recursive = TRUE ), col_names = pops ) %>% as.matrix(.) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE_SIM01")) pops <- as.character(unique(sim.data2$tidy.data$POP_ID)) # pop string genodive.sim02.fst <- readr::read_tsv( file = list.files( path = ".", pattern = "genodive.sim02.fst.tsv", full.names = TRUE, recursive = TRUE ), col_names = pops ) %>% as.matrix(.) %>% magrittr::set_rownames(x = ., value = pops) %>% radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = TRUE, # FALSE because their's only the lower diag na.diag = TRUE, relative = FALSE ) %>% magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE_SIM02"))
Luke Pembleton’s StAMPP
package use genlight
object as input:
system.time( stampp.fst <- StAMPP::stamppFst( geno = sunflower.filtered$genlight, nboots = 1, percent = 95, nclusters = parallel::detectCores() - 1 ) %>% # The distance matrix need some love # if we want to compare it with the others: radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP")) %>% dplyr::arrange(POP1, POP2) ) #> 39.281 sec!
stampp.sim01.fst <- StAMPP::stamppFst( geno = sim.data1$genlight, nboots = 1, percent = 95, nclusters = parallel::detectCores() - 1 ) %>% # The distance matrix need some love # if we want to compare it with the others: radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP_SIM01")) %>% dplyr::arrange(POP1, POP2) stampp.sim02.fst <- StAMPP::stamppFst( geno = sim.data2$genlight, nboots = 1, percent = 95, nclusters = parallel::detectCores() - 1 ) %>% # The distance matrix need some love # if we want to compare it with the others: radiator::distance2tibble( x = ., remove.diag = TRUE, remove.lower = FALSE, na.diag = TRUE, relative = FALSE ) %>% # we switch the POP id column here to match the others magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP_SIM02")) %>% dplyr::arrange(POP1, POP2)
I’m not particularly fond of VCFtools for Fst. Too much manipulation of files. For the sake of the exercise I’m going to compare it quickly with 2 populations for the empirical dataset only.
pop1 <- dplyr::filter(strata, STRATA == "G100") %>% dplyr::select(INDIVIDUALS) %>% readr::write_tsv(x = ., path = "pop1.txt", col_names = FALSE) pop2 <- dplyr::filter(strata, STRATA == "G102") %>% dplyr::select(INDIVIDUALS) %>% readr::write_tsv(x = ., path = "pop2.txt", col_names = FALSE)
the command used:
vcftools --vcf sunflower.vcf --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --out pop1_vs_pop2
#> Fst between G100 and G102:
#> Weir and Cockerham mean Fst estimate: 0.036903
#> Weir and Cockerham weighted Fst estimate: 0.16222
PLINK users, note that the --fst
option in PLINK v.1.9 is actually a port of VCFtools..
For me, comparison with VCFtools stops here, too painfull to conduct more pairwise, getting the ouput ready for R, etc. This is prone to human error and too much work to automate for no benefit (below you’ll understand how those VCFtools results compares with the other packages).
assigner.fst <- assigner::fst_WC84( data = sunflower.filtered$tidy.data, pairwise = TRUE, parallel.core = parallel::detectCores() - 1, verbose = TRUE ) %$% pairwise.fst %>% dplyr::rename(ASSIGNER = FST) %>% dplyr::select(-N_MARKERS) #> 49 sec assigner.fst
Learn more about assigner::fst_WC84
with the dedicated vignette.
assigner.sim01.fst <- assigner::fst_WC84( data = sim.data1$tidy.data, pairwise = TRUE, parallel.core = parallel::detectCores() - 1, verbose = TRUE ) %$% pairwise.fst %>% dplyr::rename(ASSIGNER_SIM01 = FST) %>% dplyr::select(-N_MARKERS) assigner.sim02.fst <- assigner::fst_WC84( data = sim.data2$tidy.data, pairwise = TRUE, parallel.core = parallel::detectCores() - 1, verbose = TRUE ) %$% pairwise.fst %>% dplyr::rename(ASSIGNER_SIM02 = FST) %>% dplyr::select(-N_MARKERS)
fst.summary <- suppressWarnings( assigner.fst %>% dplyr::full_join(diversity.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(genodive.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(hierfstat.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stampp.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stratag.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(snprelate.fst, by = c("POP1", "POP2")) )
If you haven’t run the codes above:
fst.summary <- readr::read_tsv(file = "fst.summary.tsv", col_types = "ccnnnnnnnn") fst.summary
## # A tibble: 36 x 10
## POP1 POP2 ASSIGNER DIVERSITY GENODIVE HIERFSTAT STAMPP STRATAG
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 G100 G102 0.123 0.123 0.115 0.123 0.123 0.123
## 2 G100 G103 0.210 0.210 0.215 0.210 0.211 0.211
## 3 G100 G108 0.394 0.395 0.39 0.394 0.394 0.394
## 4 G100 G109 0.299 0.299 0.287 0.299 0.299 0.299
## 5 G100 G111 0.369 0.370 0.356 0.369 0.369 0.369
## 6 G100 G118 0.169 0.169 0.156 0.169 0.169 0.169
## 7 G100 G122 0.269 0.270 0.255 0.269 0.269 0.269
## 8 G100 G124 0.366 0.366 0.362 0.366 0.366 0.366
## 9 G102 G103 0.118 0.118 0.115 0.118 0.119 0.119
## 10 G102 G108 0.322 0.323 0.319 0.322 0.322 0.322
## # … with 26 more rows, and 2 more variables: SNPRELATE_MEAN <dbl>,
## # SNPRELATE_WEIGHTED <dbl>
If we highlight the first line, pops G100 and G102:
dplyr::filter(fst.summary, dplyr::row_number() == 1L)
## # A tibble: 1 x 10
## POP1 POP2 ASSIGNER DIVERSITY GENODIVE HIERFSTAT STAMPP STRATAG
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 G100 G102 0.123 0.123 0.115 0.123 0.123 0.123
## # … with 2 more variables: SNPRELATE_MEAN <dbl>, SNPRELATE_WEIGHTED <dbl>
Gold Standard: HIERFSTAT 0.12
There’s obviously differences when we look at the pairwise comparisons separately. The difference between the approaches taken in the packages/software probably depends on numerous things:
Let’s look at it globally with a correlation matrix:
## ASSIGNER DIVERSITY GENODIVE HIERFSTAT STAMPP STRATAG
## ASSIGNER 1.000000 0.999998 0.997723 1.000000 0.999996 0.999996
## DIVERSITY 0.999998 1.000000 0.997635 0.999998 0.999995 0.999995
## GENODIVE 0.997723 0.997635 1.000000 0.997723 0.997702 0.997702
## HIERFSTAT 1.000000 0.999998 0.997723 1.000000 0.999996 0.999996
## STAMPP 0.999996 0.999995 0.997702 0.999996 1.000000 1.000000
## STRATAG 0.999996 0.999995 0.997702 0.999996 1.000000 1.000000
## SNPRELATE_MEAN 0.973732 0.974023 0.965324 0.973732 0.973953 0.973953
## SNPRELATE_WEIGHTED 0.996752 0.996691 0.998345 0.996752 0.996729 0.996729
## SNPRELATE_MEAN SNPRELATE_WEIGHTED
## ASSIGNER 0.973732 0.996752
## DIVERSITY 0.974023 0.996691
## GENODIVE 0.965324 0.998345
## HIERFSTAT 0.973732 0.996752
## STAMPP 0.973953 0.996729
## STRATAG 0.973953 0.996729
## SNPRELATE_MEAN 1.000000 0.962916
## SNPRELATE_WEIGHTED 0.962916 1.000000
I prefer to look at it in a tibble…
fst.corr.tibble <- radiator::distance2tibble( x = fst.corr, na.diag = TRUE, relative = FALSE, pop.levels = c("HIERFSTAT", "STRATAG", "ASSIGNER", "DIVERSITY", "STAMPP", "GENODIVE", "SNPRELATE_MEAN", "SNPRELATE_WEIGHTED")) %>% magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION")) %>% dplyr::arrange(dplyr::desc(CORRELATION)) fst.corr.tibble
## # A tibble: 28 x 3
## ID1 ID2 CORRELATION
## <ord> <ord> <dbl>
## 1 ASSIGNER HIERFSTAT 1
## 2 STAMPP STRATAG 1
## 3 ASSIGNER DIVERSITY 1.00
## 4 DIVERSITY HIERFSTAT 1.00
## 5 ASSIGNER STAMPP 1.00
## 6 HIERFSTAT STAMPP 1.00
## 7 ASSIGNER STRATAG 1.00
## 8 HIERFSTAT STRATAG 1.00
## 9 DIVERSITY STAMPP 1.00
## 10 DIVERSITY STRATAG 1.00
## # … with 18 more rows
Very interesting results!
Let’s plot the results:
plot.fst <- fst.summary %>% tidyr::unite(data = ., col = PAIRS, POP1, POP2, sep = "-") %>% tidyr::gather(data = ., key = SOFTWARE, value = FST, -PAIRS) %>% dplyr::mutate( SOFTWARE = factor( x = SOFTWARE, levels = c("HIERFSTAT", "ASSIGNER", "STAMPP", "STRATAG", "DIVERSITY", "GENODIVE", "SNPRELATE_WEIGHTED", "SNPRELATE_MEAN"), ordered = TRUE)) pairs.levels <- dplyr::group_by(plot.fst, PAIRS) %>% dplyr::summarise(FST = mean(FST, na.rm = TRUE)) %>% dplyr::arrange(FST) %$% PAIRS
## `summarise()` ungrouping output (override with `.groups` argument)
under.the.sea.palette <- c("#1C2344", "#163D7D", "#118386", "#89D6D6", "#6E783D", "#FC564F", "#901546", "#FFD521")#, "#010101") plot.fst %<>% dplyr::mutate(PAIRS = factor(x = PAIRS, levels = pairs.levels, ordered = TRUE)) %>% ggplot2::ggplot(data = ., ggplot2::aes(x = FST, y = PAIRS, colour = SOFTWARE)) + ggplot2::geom_jitter(alpha = 0.8) + ggplot2::scale_colour_manual(values = under.the.sea.palette)
Generate a summary tibble:
fst.summary.sim01<- suppressWarnings( assigner.sim01.fst %>% dplyr::full_join(diversity.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(genodive.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(hierfstat.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stampp.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stratag.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(snprelate.sim01.fst, by = c("POP1", "POP2")) %>% dplyr::mutate_if(.tbl = ., .predicate = is.numeric, .funs = round, 4) )
fst.summary.sim01 <- readr::read_tsv(file = "fst.summary.sim01.tsv", col_types = "ccnnnnnnnn") fst.summary.sim01
## # A tibble: 10 x 10
## POP1 POP2 ASSIGNER_SIM01 DIVERSITY_SIM01 GENODIVE_SIM01 HIERFSTAT_SIM01
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 POP_1 POP_2 0.514 0.514 0.514 0.514
## 2 POP_1 POP_3 0.481 0.481 0.481 0.481
## 3 POP_1 POP_4 0.425 0.425 0.425 0.425
## 4 POP_1 POP_5 0.410 0.410 0.410 0.410
## 5 POP_2 POP_3 0.412 0.412 0.412 0.412
## 6 POP_2 POP_4 0.370 0.370 0.370 0.370
## 7 POP_2 POP_5 0.360 0.360 0.360 0.360
## 8 POP_3 POP_4 0.356 0.356 0.356 0.356
## 9 POP_3 POP_5 0.341 0.341 0.341 0.341
## 10 POP_4 POP_5 0.301 0.301 0.301 0.301
## # … with 4 more variables: STAMPP_SIM01 <dbl>, STRATAG_SIM01 <dbl>,
## # SNPRELATE_MEAN_SIM01 <dbl>, SNPRELATE_WEIGHTED_SIM01 <dbl>
Generate the correlation tibble:
fst.corr.tibble.sim01 <- round(cor(x = dplyr::select(fst.summary.sim01, -POP1, -POP2)), 6) %>% radiator::distance2tibble( x = ., na.diag = TRUE, relative = FALSE, pop.levels = c("HIERFSTAT_SIM01", "STRATAG_SIM01", "ASSIGNER_SIM01", "DIVERSITY_SIM01", "STAMPP_SIM01", "GENODIVE_SIM01", "SNPRELATE_MEAN_SIM01", "SNPRELATE_WEIGHTED_SIM01")) %>% magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION_SIM01")) %>% dplyr::arrange(dplyr::desc(CORRELATION_SIM01)) fst.corr.tibble.sim01
## # A tibble: 28 x 3
## ID1 ID2 CORRELATION_SIM01
## <ord> <ord> <dbl>
## 1 ASSIGNER_SIM01 DIVERSITY_SIM01 1
## 2 ASSIGNER_SIM01 GENODIVE_SIM01 1
## 3 DIVERSITY_SIM01 GENODIVE_SIM01 1
## 4 ASSIGNER_SIM01 HIERFSTAT_SIM01 1
## 5 DIVERSITY_SIM01 HIERFSTAT_SIM01 1
## 6 GENODIVE_SIM01 HIERFSTAT_SIM01 1
## 7 ASSIGNER_SIM01 STAMPP_SIM01 1
## 8 DIVERSITY_SIM01 STAMPP_SIM01 1
## 9 GENODIVE_SIM01 STAMPP_SIM01 1
## 10 HIERFSTAT_SIM01 STAMPP_SIM01 1
## # … with 18 more rows
Generate a summary tibble:
fst.summary.sim02<- suppressWarnings( assigner.sim02.fst %>% dplyr::full_join(diversity.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(genodive.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(hierfstat.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stampp.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(stratag.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::full_join(snprelate.sim02.fst, by = c("POP1", "POP2")) %>% dplyr::mutate_if(.tbl = ., .predicate = is.numeric, .funs = round, 4) )
fst.summary.sim02 <- readr::read_tsv(file = "fst.summary.sim02.tsv", col_types = "ccnnnnnnnn") fst.summary.sim02
## # A tibble: 10 x 10
## POP1 POP2 ASSIGNER_SIM02 DIVERSITY_SIM02 GENODIVE_SIM02 HIERFSTAT_SIM02
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 POP_1 POP_2 0.001 0.001 0.001 0.001
## 2 POP_1 POP_3 0.0032 0.0032 0.0032 0.0032
## 3 POP_1 POP_4 0.0021 0.0021 0.0021 0.0021
## 4 POP_1 POP_5 0.0022 0.0022 0.0022 0.0022
## 5 POP_2 POP_3 0.0001 0.0001 0.0001 0.0001
## 6 POP_2 POP_4 0.0002 0.0002 0.0002 0.0002
## 7 POP_2 POP_5 0.0005 0.0005 0.0005 0.0005
## 8 POP_3 POP_4 0 -0.0006 -0.0006 -0.0006
## 9 POP_3 POP_5 0.0028 0.0028 0.0028 0.0028
## 10 POP_4 POP_5 0.0017 0.0017 0.0017 0.0017
## # … with 4 more variables: STAMPP_SIM02 <dbl>, STRATAG_SIM02 <dbl>,
## # SNPRELATE_MEAN_SIM02 <dbl>, SNPRELATE_WEIGHTED_SIM02 <dbl>
Generate the correlation tibble:
fst.corr.tibble.sim02 <- round(cor(x = dplyr::select(fst.summary.sim02, -POP1, -POP2)), 6) %>% radiator::distance2tibble( x = ., na.diag = TRUE, relative = FALSE, pop.levels = c("HIERFSTAT_SIM02", "STRATAG_SIM02", "ASSIGNER_SIM02", "DIVERSITY_SIM02", "STAMPP_SIM02", "GENODIVE_SIM02", "SNPRELATE_MEAN_SIM02", "SNPRELATE_WEIGHTED_SIM02")) %>% magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION_SIM02")) %>% dplyr::arrange(dplyr::desc(CORRELATION_SIM02)) fst.corr.tibble.sim02
## # A tibble: 28 x 3
## ID1 ID2 CORRELATION_SIM02
## <ord> <ord> <dbl>
## 1 DIVERSITY_SIM02 GENODIVE_SIM02 1
## 2 DIVERSITY_SIM02 HIERFSTAT_SIM02 1
## 3 GENODIVE_SIM02 HIERFSTAT_SIM02 1
## 4 DIVERSITY_SIM02 STAMPP_SIM02 1
## 5 GENODIVE_SIM02 STAMPP_SIM02 1
## 6 HIERFSTAT_SIM02 STAMPP_SIM02 1
## 7 DIVERSITY_SIM02 STRATAG_SIM02 1
## 8 GENODIVE_SIM02 STRATAG_SIM02 1
## 9 HIERFSTAT_SIM02 STRATAG_SIM02 1
## 10 STAMPP_SIM02 STRATAG_SIM02 1
## # … with 18 more rows
The table timing was produced using the empirical dataset.
summary.table <- readr::read_tsv(file = "fst_comparisons_packages_features.tsv", col_types = readr::cols(.default = readr::col_character())) knitr::kable(summary.table) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
## Warning in kableExtra::kable_styling(., bootstrap_options = c("striped", :
## Please specify format in kable. kableExtra can customize either HTML or LaTeX
## outputs. See https://haozhu233.github.io/kableExtra/ for details.
FEATURES | HIERFSTAT | STRATAG | ASSIGNER | DIVERSITY | STAMPP | GENODIVE | SNPRELATE | VCFtools |
---|---|---|---|---|---|---|---|---|
Function | pairwise.WCfst | popStructTest | fst_WC84 | diffCalc | stamppFst | F_st | snpgdsFst | –weir-fst-pop |
W&C 1984 Fst | yes | yes | yes | yes | yes | yes | yes | yes |
Language | R | R, C | R | R, C++ | R | Objective-C | R, C++ | Perl, C++ |
Data | object | object | both | file | object | file | object | file |
Format | hierfstat | gtypes | several | genepop | genlight | several | SNPRelate GDS | VCF |
Polymorphic markers | yes | ? | yes | ? | yes | ? | yes | yes |
Tailor common markers | yes | ? | yes | ? | yes | ? | ? | yes |
Output number of markers | no | no | yes | no | no | no | yes | no |
Global fst | no | yes | yes | yes | no | no | yes | yes |
Pairwise fst | yes | yes | yes | yes | yes | yes | no* | no |
Bootstrap individuals | no | no | no | yes | no | no | no | no |
Boostrap loci | no | yes | yes | yes | yes | yes | no | no |
CI | no | no | yes | yes | yes | no | no | no |
P-Value | no | yes | no | no | yes | yes | no | no |
Locus pairwise | no | no | no | yes | no | no | no | no |
Figures | no | no | yes | no | no | no | no | no |
Tibble | no | yes | yes | no | no | no | no | no |
Full upper and lower matrix | yes | no | yes | no | no | yes | yes | no |
Upper matrix | no | no | yes | no | no | no | no | no |
Lower matrix | no | yes | yes | yes | yes | no | no | no |
Combo matrix (Fst + stats) | no | yes | yes | no | no | no | no | no |
Fst markers | yes | no | yes | yes | no | no | yes | yes |
Fis | yes | no | yes | yes | no | no | no | no |
Write | no | no | yes | yes | no | no | no | yes |
Parallelize | no | yes | yes | yes | yes* | yes | yes | yes |
OS (linux, macOS, windows) | all | all | all | all | all | macOS | all | all |
Impacted by missingness | no | yes | no | yes | yes | yes | yes | yes |
Timing (sec) | 1234 | 657 | 49 | 43 | 39 | 3 | 0.245 | 2* |
Note:
You could go and look at the 36 pairwise comparisons, but the end of the story is very simple:
MeanFst
.SNPRelate
MeanFst as a systematic downward bias and the weighted estimate is usually a little higher.assigner
and hierfstat
have equal values down to a very large number of decimals!assigner
and hierfstat
show consistent behaviors in the presence of missing data.assigner
is different than the others: it transform -
Fst to 0.In the absence of missing data, because overall Fst values are similar, in the end, it’s worth looking at:
Please let me know if I did something wrong or forgot to mention an interesting feature of your favorite package.