Haplotypes file summary — summary

STACKS batch_x.haplotypes.tsv file summary. The output of the function is a summary table for populations with:

assembly artifacts and/or sequencing errors: loci with > 2 alleles. Genotypes with more than 2 alleles are erased before estimating the subsequent statistics.
consensus reads (reads with no variation/polymorphism throughout the dataset).
monomorphic loci (at the population and overall level)
polymorphic loci (at the population and overall level)
haplotypes statistics for the observed and expected homozygosity and heterozygosity (HOM_O, HOM_E, HET_O, HET_E).
Gene Diversity within populations (Hs) and averaged over all populations. Not to confuse with Expected Heterozygosity(HET_E). Here, Hs statistic includes a correction for sampling bias stemming from sampling a limited number of individuals per population (Nei, 1987).
Wright’s inbreeding coefficient (Fis)
Nei's inbreeding coefficient (Gis) that include a correction for sampling bias.
IBDG: a proxy measure of the realized proportion of the genome that is identical by descent
FH measure: an individual-base estimate that is based on the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating (Keller et al., 2011; Kardos et al., 2015).
Pi: the nucleotide diversity (Nei & Li, 1979) measured here consider the consensus reads in the catalog (no variation between population sequences). The read.length argument is used directly in the calculations. To be correctly estimated, the reads obviously need to be of identical size...
PRIVATE_HAPLOTYPES: the number of private haplotypes.

summary_haplotypes(
  data,
  strata = NULL,
  read.length,
  whitelist.markers = NULL,
  blacklist.id = NULL,
  keep.consensus = TRUE,
  pop.levels = NULL,
  pop.labels = NULL,
  artifact.only = TRUE,
  parallel.core = parallel::detectCores() - 1
)

Arguments

data: The 'batch_x.haplotypes.tsv' created by STACKS.
strata: A tab delimited file with 2 columns with header: INDIVIDUALS and STRATA. The STRATA column can be any hierarchical grouping. To create a strata file see individuals2strata. If you have already run stacks on your data, the strata file is similar to a stacks `population map file`, make sure you have the required column names (INDIVIDUALS and STRATA). The strata also works as a whitelist of ids, individuals not in the strata file will be removed from the dataset.
read.length: (number) The length in nucleotide of your reads (e.g. read.length = 100).
whitelist.markers: (optional) A whitelist of loci with a column name 'LOCUS'. The whitelist is located in the global environment or in the directory (e.g. "whitelist.txt"). If a whitelist is used, the files written to the directory will have '.filtered' in the filename. Default: whitelist.markers = NULL.
blacklist.id: (optional) A blacklist with individual ID and a column header 'INDIVIDUALS'. The blacklist is in the global environment or in the directory (with "blacklist.txt"). Default: blacklist.id = NULL.
keep.consensus: (optional) Using whitelist of markers can automatically remove consensus markers from the nucleotide diversity (Pi) calculations. If you know what you are doing, play with this argument to keep consensus markers. Default: keep.consensus = TRUE.
pop.levels: (optional, string) This refers to the levels in a factor. In this case, the id of the pop. Use this argument to have the pop ordered your way instead of the default alphabetical or numerical order. e.g. pop.levels = c("QUE", "ONT", "ALB") instead of the default pop.levels = c("ALB", "ONT", "QUE"). White spaces in population names are replaced by underscore. Default: pop.levels = NULL.
pop.labels: (optional, string) Use this argument to rename/relabel your pop or combine your pop. e.g. To combine "QUE" and "ONT" into a new pop called "NEW": (1) First, define the levels for your pop with pop.levels argument: pop.levels = c("QUE", "ONT", "ALB"). (2) then, use pop.labels argument: pop.levels = c("NEW", "NEW", "ALB"). To rename "QUE" to "TAS": pop.labels = c("TAS", "ONT", "ALB"). Default: pop.labels = NULL. If you find this too complicated, there is also the strata argument that can do the same thing, see below. White spaces in population names are replaced by underscore.
artifact.only: (optional, logical) With default artifact.only = TRUE, the function will stop after generating blacklists of artifacts and consensus loci (no pi, no FH measures). This is faster and useful to run at the beginning of the pipeline. Run again at the end, with blacklists and whitelists to get reliable estimates of gene diverisity, IBDG and FH.
parallel.core: (optional) The number of core for parallel programming during Pi calculations. Default: parallel.core = parallel::detectCores() - 1.

Value

The function returns a list with:

$consensus.pop # if consensus reads are found
$consensus.loci # if consensus reads are found
$artifacts.pop # if artifacts are found
$artifacts.loci # if artifacts are found
$individual.summary: the individual's info for FH and Pi
$summary: the summary statistics per populations and averaged over all pops.
$private.haplotypes: a tibble with LOCUS, POP_ID and private haplotypes
$private.haplotypes.summary: a summary of the number of private haplotypes per populations

Also in the list 3 plots (also written in the folder):

$scatter.plot.Pi.Fh.ind: showing Pi and FH per individuals
$scatter.plot.Pi.Fh.pop: showing Pi and FH per pop
$boxplot.pi: showing the boxplot of Pi per pop
$boxplot.fh: showing the boxplot of FH per pop

use $ to access each #' objects in the list. The function potentially write 4 files in the working directory: blacklist of unique loci with assembly artifacts and/or sequencing errors (per individuals and globally), a blacklist of unique consensus loci and a summary of the haplotype file by population.

References

Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.

Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.

Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America, 76, 5269–5273.

Author

Thierry Gosselin thierrygosselin@icloud.com and Anne-Laure Ferchaud annelaureferchaud@gmail.com

Examples

if (FALSE) { # \dontrun{
# The simplest way to run the function:
sum <- summary_haplotypes(
data = "batch_1.haplotypes.tsv",
strata = "strata_brook_charr.tsv",
read.length = 90)
# if you want pi and fh calculated (ideally on filtered data):
sum <- summary_haplotypes(
data = "batch_1.haplotypes.tsv",
strata = "strata_brook_charr.tsv",
read.length = 90,
artifact.only = FALSE)
} # }