Prepare your R session

rm(list = ls()) # clean workspace
# install required packages
if (!require("diveRsity")) install.packages("diveRsity")
if (!require("hierfstat")) install.packages("hierfstat")
if (!require("StAMPP")) install.packages("StAMPP")
if (!require("strataG")) install.packages("strataG")
if (!require("SNPRelate")) {
  if(!require("BiocManager")) install.packages("BiocManager")
  BiocManager::install("SNPRelate")
}
if (!require("ggcorrplot")) install.packages("ggcorrplot")
## Loading required package: remotes
if (!require("assigner")) remotes::install_github("thierrygosselin/assigner")
## Loading required package: assigner
## Registered S3 method overwritten by 'spdep':
##   method   from
##   plot.mst ape

For macOS: GenoDive.

VCFtools: install instructions or here

Versions of packages and software:

# Check packages versions
packageVersion('assigner') # 0.5.6.
packageVersion('diveRsity') # 1.9.90
packageVersion('hierfstat') # 0.4.22
packageVersion('stAMPP') # 1.5.1
packageVersion('strataG') # 2.0.2
packageVersion('SNPRelate') # 1.18.0
#GenoDive: 2.27
# VCFtools.: 0.1.17

Put this Rmd notebook in the same directory as the data or use setwd("/YOUR/NEW/PATH/HERE/"), to change the working directory.

Goal of this exercise

Let’s see how assigner's Fst calculations (using Weir and Cockerham, 1984) compares with alternatives. To make differences and bottlenecks really stand out we will use a RADseq dataset with thousands of markers.

Datasets

Empirical dataset

We’re going to use the unfiltered vcf sunflower dataset from Owens et al. 2016, kindly provided by Gregory Owens. This dataset as missing data.

Owens GL, Baute GJ, Rieseberg LH. Revisiting a classic case of introgression: hybridization and gene flow in Californian sunflowers. Abbott RJ, Barton NH, Good JM, editors. Molecular Ecology. 2016;25: 2630–2643. doi:10.1111/mec.13569

Download the files:

Simulated datasets

After running this vignette once, I saw something strange in the results. To explore this further, this second grind of testing add 2 additional datasets: data_assigner_sim_01 and data_assigner_sim_02. The documentation explains how these simulated data were produced. They differ mainly in migration rate which resulted in contrastingly different Fst, admixture/membership probability.

Generate the output files

We use radiator::genomic_converter to produce the required files or R objects:

Empirical dataset

# radiator is installed automatically with assigner
sunflower.filtered <- radiator::genomic_converter(
  data = "sunflower.vcf", 
  strata = "sunflower.strata.tsv", 
  output = c(
    "genepop", # for GenoDive and diveRsity
    "gtypes", # for strataG
    "genlight", # for StAMPP
    "hierfstat", # for Hierfstat
    "snprelate" # for SNPRelate
  )
) %>% dplyr::glimpse()

VCF stats:

  • the original number of markers is 34258.
  • by default, radiator::genomic_converter will only keep polymorphic markers in common between the strata.
  • 15973 biallelic markers are recovered.
  • Number of populations: 9
  • Number of individuals: 88

The sunflower.filtered object is a list with several objects inside:

names(sunflower.filtered)
#> [1] "hierfstat" "gtypes"    "genlight"  "snprelate" "tidy.data"
# the tidy dataset is used by assigner

The function also produced a folder with several files, including:

  • the genepop file
  • the SNPRelate GDS file
  • the hierfstat file
  • the gtypes file
  • the genlight file
  • the tidy data file

To learn more about the output folders and files: radiator::genomic_converter

Note on timing

I could use fancy timing packages (microbenchmark, bench), but for simplicity we’re just going to use system.time.

Simulated datasets

sim_01

data_assigner_sim_01 %<>% 
  dplyr::mutate(POP_ID = factor(x = POP_ID, levels = c("POP_1", "POP_2", "POP_3", "POP_4", "POP_5"))) %>% 
  dplyr::arrange(POP_ID)

sim.data1 <- radiator::genomic_converter(
  data = data_assigner_sim_01, 
  output = c(
    "genepop", # for GenoDive and diveRsity
    "gtypes", # for strataG
    "genlight", # for StAMPP
    "hierfstat", # for Hierfstat
    "snprelate" # for SNPRelate
  ),
  filename = "sim01",
  path.folder = "fst_sim01",
  internal = TRUE,
  verbose = FALSE
)
sim.data1$tidy.data %<>% dplyr::arrange(POP_ID)

sim_02

data_assigner_sim_02 %<>% 
  dplyr::mutate(POP_ID = factor(x = POP_ID, levels = c("POP_1", "POP_2", "POP_3", "POP_4", "POP_5"))) %>% 
  dplyr::arrange(POP_ID)

sim.data2 <- radiator::genomic_converter(
  data = data_assigner_sim_02, 
  output = c(
    "genepop", # for GenoDive and diveRsity
    "gtypes", # for strataG
    "genlight", # for StAMPP
    "hierfstat", # for Hierfstat
    "snprelate" # for SNPRelate
  ),
  filename = "sim02",
  path.folder = "fst_sim02",
  internal = TRUE,
  verbose = FALSE
)
sim.data2$tidy.data %<>% dplyr::arrange(POP_ID)

hierfstat

Let’s start with the Gold Standard, Jérôme Goudet’s hierfstat package.

Empirical dataset

# ETA ~ 30 min on my MBP, so yes, you have time for coffee...
pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID))# pop string

system.time(
  hierfstat.fst <- hierfstat::pairwise.WCfst(
    dat = sunflower.filtered$hierfstat, 
    diploid = TRUE
  ) %>% 
    # The result need some work
    # We transform the matrix into a useful tibble:
    magrittr::set_colnames(x = ., value = pops) %>% 
    magrittr::set_rownames(x = ., value = pops) %>% 
    radiator::distance2tibble(
      x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE
    ) %>% 
    magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT")) %>% 
    dplyr::arrange(POP1, POP2)
)
#> 1233.880sec ~20 min
hierfstat.fst

Simulated datasets

pops <- as.character(unique(sim.data1$tidy.data$POP_ID))# pop string
hierfstat.sim01.fst <- hierfstat::pairwise.WCfst(
    dat = sim.data1$hierfstat, 
    diploid = TRUE
  ) %>% 
    # The result need some work
    # We transform the matrix into a useful tibble:
    magrittr::set_colnames(x = ., value = pops) %>% 
    magrittr::set_rownames(x = ., value = pops) %>% 
    radiator::distance2tibble(
      x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE
    ) %>% 
    magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT_SIM01")) %>% 
    dplyr::arrange(POP1, POP2)
pops <- as.character(unique(sim.data2$tidy.data$POP_ID))# pop string
  hierfstat.sim02.fst <- hierfstat::pairwise.WCfst(
    dat = sim.data2$hierfstat, 
    diploid = TRUE
  ) %>% 
    # The result need some work
    # We transform the matrix into a useful tibble:
    magrittr::set_colnames(x = ., value = pops) %>% 
    magrittr::set_rownames(x = ., value = pops) %>% 
    radiator::distance2tibble(
      x = ., remove.diag = TRUE, remove.lower = TRUE, na.diag = TRUE, relative = FALSE
    ) %>% 
    magrittr::set_colnames(x = ., value = c("POP1", "POP2", "HIERFSTAT_SIM02")) %>% 
    dplyr::arrange(POP1, POP2)

strataG

This is Eric Archer’s strataG package.

Empirical dataset

system.time(
  stratag.fst <- strataG::popStructTest(
    g = sunflower.filtered$gtypes, 
    stats = "fst", 
    type = "both", 
    quietly = TRUE, 
    max.cores = parallel::detectCores() - 1, 
    nrep = 0, 
    keep.null = FALSE, 
    write.output = FALSE) %$%
    pairwise$result %>% 
    dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG = Fst)
)
stratag.fst
#>657.377sec ~ 11 min

Simulated datasets

stratag.sim01.fst <- strataG::popStructTest(
    g = sim.data1$gtypes, 
    stats = "fst", 
    type = "both", 
    quietly = TRUE, 
    max.cores = parallel::detectCores() - 1, 
    nrep = 0, 
    keep.null = FALSE, 
    write.output = FALSE) %$%
    pairwise$result %>% 
    dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG_SIM01 = Fst)
stratag.sim02.fst <- strataG::popStructTest(
    g = sim.data2$gtypes, 
    stats = "fst", 
    type = "both", 
    quietly = TRUE, 
    max.cores = parallel::detectCores() - 1, 
    nrep = 0, 
    keep.null = FALSE, 
    write.output = FALSE) %$%
    pairwise$result %>% 
    dplyr::select(POP1 = strata.1, POP2 = strata.2, STRATAG_SIM02 = Fst)

diveRsity

  • Kevin Keenan’s diveRsity package requires the genepop file.
  • the function is optimized with C++, so no need for a coffee (< 1 min on my MBP)

Empirical dataset

pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID)) # pop string
system.time(
  diversity.fst <- diveRsity::diffCalc(
    infile = list.files(
      path = ".", 
      pattern = "_genepop.gen", 
      full.names = TRUE,
      recursive = TRUE
    ), 
    outfile = "sunflower_diversity_output", 
    fst = TRUE,
    pairwise = TRUE,
    para = TRUE
  ) %$%
    pairwise %$%
    Fst %>% 
    # The result need some work
    # We transform the matrix into a useful tibble:
    magrittr::set_colnames(x = ., value = pops) %>% 
    magrittr::set_rownames(x = ., value = pops) %>% 
    radiator::distance2tibble(
      x = ., 
      remove.diag = TRUE, 
      remove.lower = FALSE, # FALSE because their's only the lower diag
      na.diag = TRUE, 
      relative = FALSE
    ) %>% 
    # we switch the POP id column here to match the others
    magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY"))
)
#> 42.711 sec Wow! this is very fast, plus other distances are automatically generated!
diversity.fst

Simulated datasets

pops <- as.character(unique(sim.data1$tidy.data$POP_ID)) # pop string
diversity.sim01.fst <- diveRsity::diffCalc(
  infile = list.files(
    path = ".", 
    pattern = "sim01_genepop.gen", 
    full.names = TRUE,
    recursive = TRUE
  ), 
  outfile = "sim_data1_diversity_output", 
  fst = TRUE,
  pairwise = TRUE,
  para = TRUE
) %$%
  pairwise %$%
  Fst %>% 
  # The result need some work
  # We transform the matrix into a useful tibble:
  magrittr::set_colnames(x = ., value = pops) %>% 
  magrittr::set_rownames(x = ., value = pops) %>% 
  radiator::distance2tibble(
    x = ., 
    remove.diag = TRUE, 
    remove.lower = FALSE, # FALSE because their's only the lower diag
    na.diag = TRUE, 
    relative = FALSE
  ) %>% 
  # we switch the POP id column here to match the others
  magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY_SIM01"))
pops <- as.character(unique(sim.data2$tidy.data$POP_ID)) # pop string
diversity.sim02.fst <- diveRsity::diffCalc(
  infile = list.files(
    path = ".", 
    pattern = "sim02_genepop.gen", 
    full.names = TRUE,
    recursive = TRUE
  ),  
  outfile = "sim_data2_diversity_output", 
  fst = TRUE,
  pairwise = TRUE,
  para = TRUE
) %$%
  pairwise %$%
  Fst %>% 
  # The result need some work
  # We transform the matrix into a useful tibble:
  magrittr::set_colnames(x = ., value = pops) %>% 
  magrittr::set_rownames(x = ., value = pops) %>% 
  radiator::distance2tibble(
    x = ., 
    remove.diag = TRUE, 
    remove.lower = FALSE, # FALSE because their's only the lower diag
    na.diag = TRUE, 
    relative = FALSE
  ) %>% 
  # we switch the POP id column here to match the others
  magrittr::set_colnames(x = ., value = c("POP2", "POP1", "DIVERSITY_SIM02"))

SNPRelate

  • Xiuwen Zheng’s SNPRelate package is very fast.
  • No out-of-the-box pairwise Fst analysis. So I implemented my own below.

Empirical dataset

We need the strata

strata <- radiator::read_strata(strata = "sunflower.strata.tsv") %$% strata

Showing off the global Fst only:

system.time(
  snprelate.fst <- SNPRelate::snpgdsFst(
    gdsobj = sunflower.filtered$snprelate,
    population = strata$STRATA, # factors required
    sample.id = strata$INDIVIDUALS,
    snp.id = NULL,
    method = "W&C84",
    remove.monosnp = TRUE,
    maf = NaN,
    missing.rate = NaN,
    autosome.only = FALSE,
    with.id = FALSE,
    verbose = TRUE
  )
)
#> 0.029sec wo!

names(snprelate.fst)
#> snprelate.fst$Fst: weighted Fst estimate
#> snprelate.fst$MeanFst: the average of Fst estimates across SNPs
#> snprelate.fst$FstSNP: a vector of Fst for each SNP

Build a function to conduct pairwise Fst with SNPRelate:

pairwise_fst_snprelate <- function(pop.pairwise, data, strata) {
  strata.temp <- dplyr::filter(.data = strata, STRATA %in% pop.pairwise) %>%
    dplyr::mutate(STRATA = droplevels(STRATA))
  
  fst <- SNPRelate::snpgdsFst(
    gdsobj = data,
    population = strata.temp$STRATA, # factors required
    sample.id = strata.temp$INDIVIDUALS,
    snp.id = NULL,
    method = "W&C84",
    remove.monosnp = TRUE,
    maf = NaN,
    missing.rate = NaN,
    autosome.only = FALSE,
    with.id = FALSE,
    verbose = TRUE
  )
  #prepare the results into a tibble:
  fst <- tibble::tibble(
    POP1 = pop.pairwise[1], 
    POP2 = pop.pairwise[2], 
    SNPRELATE_MEAN = fst$MeanFst,
    SNPRELATE_WEIGHTED = fst$Fst
  )
  return(fst)
}

To run we need all combination of populations:

pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE)

Run the pairwise implementation:

system.time(
  snprelate.fst <- purrr::map_dfr(
    .x = pop.pairwise,
    .f = pairwise_fst_snprelate,
    data = sunflower.filtered$snprelate,
    strata = strata
  )
)
#> 0.245 sec that's the fastest so far!

Simulated datasets

strata <- radiator::generate_strata(data = sim.data1$tidy.data) %>% dplyr::rename(STRATA = POP_ID)
pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE)
snprelate.sim01.fst <- purrr::map_dfr(
  .x = pop.pairwise,
  .f = pairwise_fst_snprelate,
  data = sim.data1$snprelate,
  strata = strata
) %>% 
  dplyr::rename(SNPRELATE_MEAN_SIM01 = SNPRELATE_MEAN, SNPRELATE_WEIGHTED_SIM01 = SNPRELATE_WEIGHTED)

strata <- radiator::generate_strata(data = sim.data2$tidy.data) %>% dplyr::rename(STRATA = POP_ID)
pop.pairwise <- utils::combn(as.character(unique(strata$STRATA)), 2, simplify = FALSE)
snprelate.sim02.fst <- purrr::map_dfr(
  .x = pop.pairwise,
  .f = pairwise_fst_snprelate,
  data = sim.data2$snprelate,
  strata = strata
) %>% 
  dplyr::rename(SNPRELATE_MEAN_SIM02 = SNPRELATE_MEAN, SNPRELATE_WEIGHTED_SIM02 = SNPRELATE_WEIGHTED)

GenoDive

  • Inside Patrick Meirmans’s GenoDive software, import or drag the genepop file
  • Run: Analysis -> Pairwise Differentiation -> F_st, choose 1 permutations.
  • Save the results or copy/paste in MS Excel or text editor: genodive.fst.tsv.
  • no need for a coffee, GenoDive is coded in Objective-C (~ < 3 sec on my MBP)

Empirical dataset

# Import in R
# Add the proper columns and rows names
pops <- as.character(unique(sunflower.filtered$tidy.data$POP_ID)) # pop string
genodive.fst <- readr::read_tsv(
  file = "genodive.fst.tsv",
  col_names = pops
) %>% 
  as.matrix(.) %>% 
  magrittr::set_rownames(x = ., value = pops) %>% 
  radiator::distance2tibble(
    x = ., 
    remove.diag = TRUE, 
    remove.lower = TRUE, # FALSE because their's only the lower diag
    na.diag = TRUE, 
    relative = FALSE
  ) %>% 
  magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE"))
genodive.fst

Obviously, here the timing gets doesn’t reflect the whole process:

  • the data as to be imported in GenoDive
  • several manipulations of the results are required to get from GenoDive to R

Simulated datasets

pops <- as.character(unique(sim.data1$tidy.data$POP_ID)) # pop string
genodive.sim01.fst <- readr::read_tsv(
  file = list.files(
      path = ".", 
      pattern = "genodive.sim01.fst.tsv", 
      full.names = TRUE,
      recursive = TRUE
    ),
  col_names = pops
) %>% 
  as.matrix(.) %>% 
  magrittr::set_rownames(x = ., value = pops) %>% 
  radiator::distance2tibble(
    x = ., 
    remove.diag = TRUE, 
    remove.lower = TRUE, # FALSE because their's only the lower diag
    na.diag = TRUE, 
    relative = FALSE
  ) %>% 
  magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE_SIM01"))
pops <- as.character(unique(sim.data2$tidy.data$POP_ID)) # pop string
genodive.sim02.fst <- readr::read_tsv(
  file = list.files(
      path = ".", 
      pattern = "genodive.sim02.fst.tsv", 
      full.names = TRUE,
      recursive = TRUE
    ),
  col_names = pops
) %>% 
  as.matrix(.) %>% 
  magrittr::set_rownames(x = ., value = pops) %>% 
  radiator::distance2tibble(
    x = ., 
    remove.diag = TRUE, 
    remove.lower = TRUE, # FALSE because their's only the lower diag
    na.diag = TRUE, 
    relative = FALSE
  ) %>% 
  magrittr::set_colnames(x = ., value = c("POP1", "POP2", "GENODIVE_SIM02"))

StAMPP

Luke Pembleton’s StAMPP package use genlight object as input:

Empirical dataset

system.time(
  stampp.fst <- StAMPP::stamppFst(
    geno = sunflower.filtered$genlight, 
    nboots = 1, 
    percent = 95, 
    nclusters = parallel::detectCores() - 1
  ) %>% 
    # The distance matrix need some love
    # if we want to compare it with the others:
    radiator::distance2tibble(
      x = ., 
      remove.diag = TRUE, 
      remove.lower = FALSE, 
      na.diag = TRUE, 
      relative = FALSE
    ) %>% 
    # we switch the POP id column here to match the others
    magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP")) %>% 
    dplyr::arrange(POP1, POP2)
)
#> 39.281 sec!

Simulated datasets

stampp.sim01.fst <- StAMPP::stamppFst(
    geno = sim.data1$genlight, 
    nboots = 1, 
    percent = 95, 
    nclusters = parallel::detectCores() - 1
  ) %>% 
    # The distance matrix need some love
    # if we want to compare it with the others:
    radiator::distance2tibble(
      x = ., 
      remove.diag = TRUE, 
      remove.lower = FALSE, 
      na.diag = TRUE, 
      relative = FALSE
    ) %>% 
    # we switch the POP id column here to match the others
    magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP_SIM01")) %>% 
    dplyr::arrange(POP1, POP2)

stampp.sim02.fst <- StAMPP::stamppFst(
    geno = sim.data2$genlight, 
    nboots = 1, 
    percent = 95, 
    nclusters = parallel::detectCores() - 1
  ) %>% 
    # The distance matrix need some love
    # if we want to compare it with the others:
    radiator::distance2tibble(
      x = ., 
      remove.diag = TRUE, 
      remove.lower = FALSE, 
      na.diag = TRUE, 
      relative = FALSE
    ) %>% 
    # we switch the POP id column here to match the others
    magrittr::set_colnames(x = ., value = c("POP2", "POP1", "STAMPP_SIM02")) %>% 
    dplyr::arrange(POP1, POP2)

VCFtools

I’m not particularly fond of VCFtools for Fst. Too much manipulation of files. For the sake of the exercise I’m going to compare it quickly with 2 populations for the empirical dataset only.

Empirical dataset

pop1 <- dplyr::filter(strata, STRATA == "G100") %>% dplyr::select(INDIVIDUALS) %>% readr::write_tsv(x = ., path = "pop1.txt", col_names = FALSE)
pop2 <- dplyr::filter(strata, STRATA == "G102") %>% dplyr::select(INDIVIDUALS) %>% readr::write_tsv(x = ., path = "pop2.txt", col_names = FALSE)

the command used:

vcftools --vcf sunflower.vcf --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt --out pop1_vs_pop2 
#> Fst between G100 and G102:
#> Weir and Cockerham mean Fst estimate: 0.036903
#> Weir and Cockerham weighted Fst estimate: 0.16222

PLINK users, note that the --fst option in PLINK v.1.9 is actually a port of VCFtools..

For me, comparison with VCFtools stops here, too painfull to conduct more pairwise, getting the ouput ready for R, etc. This is prone to human error and too much work to automate for no benefit (below you’ll understand how those VCFtools results compares with the other packages).

Simulated datasets

not conducted

assigner

Empirical dataset

assigner.fst <- assigner::fst_WC84(
  data = sunflower.filtered$tidy.data, 
  pairwise = TRUE,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE
) %$%
  pairwise.fst %>% 
  dplyr::rename(ASSIGNER = FST) %>% 
  dplyr::select(-N_MARKERS)
#> 49 sec
assigner.fst

Learn more about assigner::fst_WC84 with the dedicated vignette.

Simulated datasets

assigner.sim01.fst <- assigner::fst_WC84(
  data = sim.data1$tidy.data, 
  pairwise = TRUE,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE
) %$%
  pairwise.fst %>% 
  dplyr::rename(ASSIGNER_SIM01 = FST) %>% 
  dplyr::select(-N_MARKERS)

assigner.sim02.fst <- assigner::fst_WC84(
  data = sim.data2$tidy.data, 
  pairwise = TRUE,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE
) %$%
  pairwise.fst %>% 
  dplyr::rename(ASSIGNER_SIM02 = FST) %>% 
  dplyr::select(-N_MARKERS)

Comparisons

Empirical dataset

Generate a summary tibble

fst.summary <- suppressWarnings(
  assigner.fst %>%
    dplyr::full_join(diversity.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(genodive.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(hierfstat.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stampp.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stratag.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(snprelate.fst, by = c("POP1", "POP2"))
)

If you haven’t run the codes above:

fst.summary <- readr::read_tsv(file = "fst.summary.tsv", col_types = "ccnnnnnnnn")
fst.summary
## # A tibble: 36 x 10
##    POP1  POP2  ASSIGNER DIVERSITY GENODIVE HIERFSTAT STAMPP STRATAG
##    <chr> <chr>    <dbl>     <dbl>    <dbl>     <dbl>  <dbl>   <dbl>
##  1 G100  G102     0.123     0.123    0.115     0.123  0.123   0.123
##  2 G100  G103     0.210     0.210    0.215     0.210  0.211   0.211
##  3 G100  G108     0.394     0.395    0.39      0.394  0.394   0.394
##  4 G100  G109     0.299     0.299    0.287     0.299  0.299   0.299
##  5 G100  G111     0.369     0.370    0.356     0.369  0.369   0.369
##  6 G100  G118     0.169     0.169    0.156     0.169  0.169   0.169
##  7 G100  G122     0.269     0.270    0.255     0.269  0.269   0.269
##  8 G100  G124     0.366     0.366    0.362     0.366  0.366   0.366
##  9 G102  G103     0.118     0.118    0.115     0.118  0.119   0.119
## 10 G102  G108     0.322     0.323    0.319     0.322  0.322   0.322
## # … with 26 more rows, and 2 more variables: SNPRELATE_MEAN <dbl>,
## #   SNPRELATE_WEIGHTED <dbl>

If we highlight the first line, pops G100 and G102:

dplyr::filter(fst.summary, dplyr::row_number() == 1L)
## # A tibble: 1 x 10
##   POP1  POP2  ASSIGNER DIVERSITY GENODIVE HIERFSTAT STAMPP STRATAG
##   <chr> <chr>    <dbl>     <dbl>    <dbl>     <dbl>  <dbl>   <dbl>
## 1 G100  G102     0.123     0.123    0.115     0.123  0.123   0.123
## # … with 2 more variables: SNPRELATE_MEAN <dbl>, SNPRELATE_WEIGHTED <dbl>

Gold Standard: HIERFSTAT 0.12

  • min value: SNPRELATE_MEAN 0.08
  • max value: SNPRELATE_WEIGHTED 0.14
  • Weir and Cockerham mean Fst estimate: 0.036903 !!!
  • Weir and Cockerham weighted Fst estimate: 0.16222 !!!
  • that’s a bit scary…

Generate a correlation matrix

There’s obviously differences when we look at the pairwise comparisons separately. The difference between the approaches taken in the packages/software probably depends on numerous things:

  • how common polymorphic markers are handle
  • how sample size differences are managed
  • missing data
  • rouding during steps of the codes

Let’s look at it globally with a correlation matrix:

fst.corr <- round(cor(x = dplyr::select(fst.summary, -POP1, -POP2)), 6)
fst.corr
##                    ASSIGNER DIVERSITY GENODIVE HIERFSTAT   STAMPP  STRATAG
## ASSIGNER           1.000000  0.999998 0.997723  1.000000 0.999996 0.999996
## DIVERSITY          0.999998  1.000000 0.997635  0.999998 0.999995 0.999995
## GENODIVE           0.997723  0.997635 1.000000  0.997723 0.997702 0.997702
## HIERFSTAT          1.000000  0.999998 0.997723  1.000000 0.999996 0.999996
## STAMPP             0.999996  0.999995 0.997702  0.999996 1.000000 1.000000
## STRATAG            0.999996  0.999995 0.997702  0.999996 1.000000 1.000000
## SNPRELATE_MEAN     0.973732  0.974023 0.965324  0.973732 0.973953 0.973953
## SNPRELATE_WEIGHTED 0.996752  0.996691 0.998345  0.996752 0.996729 0.996729
##                    SNPRELATE_MEAN SNPRELATE_WEIGHTED
## ASSIGNER                 0.973732           0.996752
## DIVERSITY                0.974023           0.996691
## GENODIVE                 0.965324           0.998345
## HIERFSTAT                0.973732           0.996752
## STAMPP                   0.973953           0.996729
## STRATAG                  0.973953           0.996729
## SNPRELATE_MEAN           1.000000           0.962916
## SNPRELATE_WEIGHTED       0.962916           1.000000

I prefer to look at it in a tibble…

Generate a tibble for correlation results

fst.corr.tibble <- radiator::distance2tibble(
    x = fst.corr, 
    na.diag = TRUE, 
    relative = FALSE, 
    pop.levels = c("HIERFSTAT", "STRATAG", "ASSIGNER", "DIVERSITY", "STAMPP", "GENODIVE", "SNPRELATE_MEAN", "SNPRELATE_WEIGHTED")) %>%
      magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION")) %>% 
  dplyr::arrange(dplyr::desc(CORRELATION))
fst.corr.tibble
## # A tibble: 28 x 3
##    ID1       ID2       CORRELATION
##    <ord>     <ord>           <dbl>
##  1 ASSIGNER  HIERFSTAT        1   
##  2 STAMPP    STRATAG          1   
##  3 ASSIGNER  DIVERSITY        1.00
##  4 DIVERSITY HIERFSTAT        1.00
##  5 ASSIGNER  STAMPP           1.00
##  6 HIERFSTAT STAMPP           1.00
##  7 ASSIGNER  STRATAG          1.00
##  8 HIERFSTAT STRATAG          1.00
##  9 DIVERSITY STAMPP           1.00
## 10 DIVERSITY STRATAG          1.00
## # … with 18 more rows

Very interesting results!

Figure of results

Let’s plot the results:

plot.fst <- fst.summary %>% 
  tidyr::unite(data = ., col = PAIRS, POP1, POP2, sep = "-") %>% 
  tidyr::gather(data = ., key = SOFTWARE, value = FST, -PAIRS) %>% 
  dplyr::mutate(
    SOFTWARE = factor(
      x = SOFTWARE, 
      levels = c("HIERFSTAT", "ASSIGNER", "STAMPP", "STRATAG", "DIVERSITY", "GENODIVE", "SNPRELATE_WEIGHTED", "SNPRELATE_MEAN"), ordered = TRUE))

pairs.levels <- dplyr::group_by(plot.fst, PAIRS) %>% 
  dplyr::summarise(FST = mean(FST, na.rm = TRUE)) %>% 
  dplyr::arrange(FST) %$% 
  PAIRS
## `summarise()` ungrouping output (override with `.groups` argument)
under.the.sea.palette <- c("#1C2344", "#163D7D", "#118386", "#89D6D6", "#6E783D", "#FC564F", "#901546", "#FFD521")#, "#010101")
plot.fst %<>% 
  dplyr::mutate(PAIRS = factor(x = PAIRS, levels = pairs.levels, ordered = TRUE)) %>% 
  ggplot2::ggplot(data = ., ggplot2::aes(x = FST, y = PAIRS, colour = SOFTWARE)) +
  ggplot2::geom_jitter(alpha = 0.8) +
  ggplot2::scale_colour_manual(values = under.the.sea.palette)

Simulated datasets

sim_01: high Fst

Generate a summary tibble:

fst.summary.sim01<- suppressWarnings(
  assigner.sim01.fst %>%
    dplyr::full_join(diversity.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(genodive.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(hierfstat.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stampp.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stratag.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(snprelate.sim01.fst, by = c("POP1", "POP2")) %>% 
    dplyr::mutate_if(.tbl = ., .predicate = is.numeric, .funs = round, 4)
)
fst.summary.sim01 <- readr::read_tsv(file = "fst.summary.sim01.tsv", col_types = "ccnnnnnnnn")
fst.summary.sim01
## # A tibble: 10 x 10
##    POP1  POP2  ASSIGNER_SIM01 DIVERSITY_SIM01 GENODIVE_SIM01 HIERFSTAT_SIM01
##    <chr> <chr>          <dbl>           <dbl>          <dbl>           <dbl>
##  1 POP_1 POP_2          0.514           0.514          0.514           0.514
##  2 POP_1 POP_3          0.481           0.481          0.481           0.481
##  3 POP_1 POP_4          0.425           0.425          0.425           0.425
##  4 POP_1 POP_5          0.410           0.410          0.410           0.410
##  5 POP_2 POP_3          0.412           0.412          0.412           0.412
##  6 POP_2 POP_4          0.370           0.370          0.370           0.370
##  7 POP_2 POP_5          0.360           0.360          0.360           0.360
##  8 POP_3 POP_4          0.356           0.356          0.356           0.356
##  9 POP_3 POP_5          0.341           0.341          0.341           0.341
## 10 POP_4 POP_5          0.301           0.301          0.301           0.301
## # … with 4 more variables: STAMPP_SIM01 <dbl>, STRATAG_SIM01 <dbl>,
## #   SNPRELATE_MEAN_SIM01 <dbl>, SNPRELATE_WEIGHTED_SIM01 <dbl>

Generate the correlation tibble:

fst.corr.tibble.sim01 <- round(cor(x = dplyr::select(fst.summary.sim01, -POP1, -POP2)), 6) %>% 
 radiator::distance2tibble(
  x = ., 
  na.diag = TRUE, 
  relative = FALSE, 
  pop.levels = c("HIERFSTAT_SIM01", "STRATAG_SIM01", "ASSIGNER_SIM01", "DIVERSITY_SIM01", "STAMPP_SIM01", "GENODIVE_SIM01", "SNPRELATE_MEAN_SIM01", "SNPRELATE_WEIGHTED_SIM01")) %>%
  magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION_SIM01")) %>% 
  dplyr::arrange(dplyr::desc(CORRELATION_SIM01))
fst.corr.tibble.sim01
## # A tibble: 28 x 3
##    ID1             ID2             CORRELATION_SIM01
##    <ord>           <ord>                       <dbl>
##  1 ASSIGNER_SIM01  DIVERSITY_SIM01                 1
##  2 ASSIGNER_SIM01  GENODIVE_SIM01                  1
##  3 DIVERSITY_SIM01 GENODIVE_SIM01                  1
##  4 ASSIGNER_SIM01  HIERFSTAT_SIM01                 1
##  5 DIVERSITY_SIM01 HIERFSTAT_SIM01                 1
##  6 GENODIVE_SIM01  HIERFSTAT_SIM01                 1
##  7 ASSIGNER_SIM01  STAMPP_SIM01                    1
##  8 DIVERSITY_SIM01 STAMPP_SIM01                    1
##  9 GENODIVE_SIM01  STAMPP_SIM01                    1
## 10 HIERFSTAT_SIM01 STAMPP_SIM01                    1
## # … with 18 more rows

sim_02: low Fst

Generate a summary tibble:

fst.summary.sim02<- suppressWarnings(
  assigner.sim02.fst %>%
    dplyr::full_join(diversity.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(genodive.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(hierfstat.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stampp.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(stratag.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::full_join(snprelate.sim02.fst, by = c("POP1", "POP2")) %>% 
    dplyr::mutate_if(.tbl = ., .predicate = is.numeric, .funs = round, 4)
)
fst.summary.sim02 <- readr::read_tsv(file = "fst.summary.sim02.tsv", col_types = "ccnnnnnnnn")
fst.summary.sim02
## # A tibble: 10 x 10
##    POP1  POP2  ASSIGNER_SIM02 DIVERSITY_SIM02 GENODIVE_SIM02 HIERFSTAT_SIM02
##    <chr> <chr>          <dbl>           <dbl>          <dbl>           <dbl>
##  1 POP_1 POP_2         0.001           0.001          0.001           0.001 
##  2 POP_1 POP_3         0.0032          0.0032         0.0032          0.0032
##  3 POP_1 POP_4         0.0021          0.0021         0.0021          0.0021
##  4 POP_1 POP_5         0.0022          0.0022         0.0022          0.0022
##  5 POP_2 POP_3         0.0001          0.0001         0.0001          0.0001
##  6 POP_2 POP_4         0.0002          0.0002         0.0002          0.0002
##  7 POP_2 POP_5         0.0005          0.0005         0.0005          0.0005
##  8 POP_3 POP_4         0              -0.0006        -0.0006         -0.0006
##  9 POP_3 POP_5         0.0028          0.0028         0.0028          0.0028
## 10 POP_4 POP_5         0.0017          0.0017         0.0017          0.0017
## # … with 4 more variables: STAMPP_SIM02 <dbl>, STRATAG_SIM02 <dbl>,
## #   SNPRELATE_MEAN_SIM02 <dbl>, SNPRELATE_WEIGHTED_SIM02 <dbl>

Generate the correlation tibble:

fst.corr.tibble.sim02 <- round(cor(x = dplyr::select(fst.summary.sim02, -POP1, -POP2)), 6) %>% 
 radiator::distance2tibble(
  x = ., 
  na.diag = TRUE, 
  relative = FALSE, 
  pop.levels = c("HIERFSTAT_SIM02", "STRATAG_SIM02", "ASSIGNER_SIM02", "DIVERSITY_SIM02", "STAMPP_SIM02", "GENODIVE_SIM02", "SNPRELATE_MEAN_SIM02", "SNPRELATE_WEIGHTED_SIM02")) %>%
  magrittr::set_colnames(x = ., value = c("ID1", "ID2", "CORRELATION_SIM02")) %>% 
  dplyr::arrange(dplyr::desc(CORRELATION_SIM02))
fst.corr.tibble.sim02
## # A tibble: 28 x 3
##    ID1             ID2             CORRELATION_SIM02
##    <ord>           <ord>                       <dbl>
##  1 DIVERSITY_SIM02 GENODIVE_SIM02                  1
##  2 DIVERSITY_SIM02 HIERFSTAT_SIM02                 1
##  3 GENODIVE_SIM02  HIERFSTAT_SIM02                 1
##  4 DIVERSITY_SIM02 STAMPP_SIM02                    1
##  5 GENODIVE_SIM02  STAMPP_SIM02                    1
##  6 HIERFSTAT_SIM02 STAMPP_SIM02                    1
##  7 DIVERSITY_SIM02 STRATAG_SIM02                   1
##  8 GENODIVE_SIM02  STRATAG_SIM02                   1
##  9 HIERFSTAT_SIM02 STRATAG_SIM02                   1
## 10 STAMPP_SIM02    STRATAG_SIM02                   1
## # … with 18 more rows

Table of features

The table timing was produced using the empirical dataset.

summary.table <- readr::read_tsv(file = "fst_comparisons_packages_features.tsv", col_types = readr::cols(.default = readr::col_character()))
knitr::kable(summary.table) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
## Warning in kableExtra::kable_styling(., bootstrap_options = c("striped", :
## Please specify format in kable. kableExtra can customize either HTML or LaTeX
## outputs. See https://haozhu233.github.io/kableExtra/ for details.
FEATURES HIERFSTAT STRATAG ASSIGNER DIVERSITY STAMPP GENODIVE SNPRELATE VCFtools
Function pairwise.WCfst popStructTest fst_WC84 diffCalc stamppFst F_st snpgdsFst –weir-fst-pop
W&C 1984 Fst yes yes yes yes yes yes yes yes
Language R R, C R R, C++ R Objective-C R, C++ Perl, C++
Data object object both file object file object file
Format hierfstat gtypes several genepop genlight several SNPRelate GDS VCF
Polymorphic markers yes ? yes ? yes ? yes yes
Tailor common markers yes ? yes ? yes ? ? yes
Output number of markers no no yes no no no yes no
Global fst no yes yes yes no no yes yes
Pairwise fst yes yes yes yes yes yes no* no
Bootstrap individuals no no no yes no no no no
Boostrap loci no yes yes yes yes yes no no
CI no no yes yes yes no no no
P-Value no yes no no yes yes no no
Locus pairwise no no no yes no no no no
Figures no no yes no no no no no
Tibble no yes yes no no no no no
Full upper and lower matrix yes no yes no no yes yes no
Upper matrix no no yes no no no no no
Lower matrix no yes yes yes yes no no no
Combo matrix (Fst + stats) no yes yes no no no no no
Fst markers yes no yes yes no no yes yes
Fis yes no yes yes no no no no
Write no no yes yes no no no yes
Parallelize no yes yes yes yes* yes yes yes
OS (linux, macOS, windows) all all all all all macOS all all
Impacted by missingness no yes no yes yes yes yes yes
Timing (sec) 1234 657 49 43 39 3 0.245 2*

Note:

  • VCFtools: we used only 2 populations and timing is inside VCFtools only, it doesn’t account for manipulations of output.
  • SNPRelate: No out-of-the-box pairwise Fst analysis, but very quick solution is proposed in this vignette.

Conclusion

You could go and look at the 36 pairwise comparisons, but the end of the story is very simple:

  • For Fst, I’m definitely staying away from VCFtools and SNPRelate MeanFst.
  • SNPRelate MeanFst as a systematic downward bias and the weighted estimate is usually a little higher.
  • Empirical and simulated data: assigner and hierfstat have equal values down to a very large number of decimals!
  • The packages show different colors with empirical and simulated data.
    • contrary to empirical data, with simulated data all the packages have the same results.
    • missing data is probably the stats driving those differences.
    • assigner and hierfstat show consistent behaviors in the presence of missing data.
    • high or low Fst in simulated data: doesn’t impact the packages.
    • assigner is different than the others: it transform - Fst to 0.
    • it should be worth exploring the impact of missing data patterns and level of missingness on those Fst calculations…

In the absence of missing data, because overall Fst values are similar, in the end, it’s worth looking at:

  • speed: if this is an issue for you (e.g. if running multiple simulations, assignment analysis, etc).
  • features: lots of differences between packages/software.

Please let me know if I did something wrong or forgot to mention an interesting feature of your favorite package.