STACKS batch_x.haplotypes.tsv file summary. The output of the function is a summary table for populations with:
assembly artifacts and/or sequencing errors: loci with > 2 alleles. Genotypes with more than 2 alleles are erased before estimating the subsequent statistics.
consensus reads (reads with no variation/polymorphism throughout the dataset).
monomorphic loci (at the population and overall level)
polymorphic loci (at the population and overall level)
haplotype statistics for the observed and expected homozygosity and heterozygosity (HOM_O, HOM_E, HET_O, HET_E).
Gene Diversity within populations (Hs) and averaged over all populations. Not to be confused with Expected Heterozygosity (HET_E). Here, the Hs statistic includes a correction for the sampling bias that stems from sampling a limited number of individuals per population (Nei, 1987).
Wright’s inbreeding coefficient (Fis)
Nei's inbreeding coefficient (Gis), which includes a correction for sampling bias.
IBDG: a proxy measure of the realized proportion of the genome that is identical by descent
FH measure: an individual-based estimate of the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating (Keller et al., 2011; Kardos et al., 2015).
Pi: the nucleotide diversity (Nei & Li, 1979). As measured here, it takes into account the consensus reads in the catalog (reads with no variation between population sequences). The read.length argument is used directly in the calculations, so for Pi to be estimated correctly all reads need to be of identical length.
PRIVATE_HAPLOTYPES: the number of private haplotypes.
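For reference, the textbook forms of the statistics listed above can be written as follows. This is a sketch under stated assumptions, not necessarily the exact estimators coded in the function: the symbols are illustrative, with n the number of individuals sampled in a population, p_i the frequency of haplotype i, H_O and H_E the observed and expected heterozygosity (HET_O, HET_E), O_hom and E_hom the observed and expected number of homozygous genotypes in an individual, and L the number of loci genotyped in that individual.

```latex
% Illustrative notation (assumed here, not taken from the package source).
\begin{align*}
H_E    &= 1 - \sum_i p_i^2
          && \text{expected heterozygosity, no sampling correction} \\
H_S    &= \frac{n}{n-1}\left(1 - \sum_i p_i^2 - \frac{H_O}{2n}\right)
          && \text{gene diversity with sampling correction (Nei, 1987)} \\
F_{IS} &= 1 - \frac{H_O}{H_E}
          && \text{Wright's inbreeding coefficient} \\
G_{IS} &= 1 - \frac{H_O}{H_S}
          && \text{Nei's inbreeding coefficient} \\
F_H    &= \frac{O_{hom} - E_{hom}}{L - E_{hom}}
          && \text{individual excess homozygosity (Keller et al., 2011)}
\end{align*}
```

The only difference between Fis and Gis in this form is the denominator: Gis uses Hs, which carries the n/(n-1) small-sample correction, so the two diverge most when few individuals are sampled per population.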
summary_haplotypes( data, strata = NULL, read.length, whitelist.markers = NULL, blacklist.id = NULL, keep.consensus = TRUE, pop.levels = NULL, pop.labels = NULL, artifact.only = TRUE, parallel.core = parallel::detectCores() - 1 )
Argument | Description |
---|---|
data | The 'batch_x.haplotypes.tsv' file created by STACKS. |
strata | A tab delimited file with 2 columns with header: |
read.length | (number) The length in nucleotides of your reads (e.g. |
whitelist.markers | (optional) A whitelist of loci with a column named 'LOCUS'. The whitelist is located in the global environment or in the directory (e.g. "whitelist.txt"). If a whitelist is used, the files written to the directory will have '.filtered' in the filename. Default: whitelist.markers = NULL. |
blacklist.id | (optional) A blacklist of individual IDs with a column header 'INDIVIDUALS'. The blacklist is in the global environment or in the directory (e.g. "blacklist.txt"). Default: blacklist.id = NULL. |
keep.consensus | (optional) Using a whitelist of markers can automatically remove consensus markers from the nucleotide diversity (Pi) calculations. If you know what you are doing, use this argument to keep consensus markers. Default: keep.consensus = TRUE. |
pop.levels | (optional, string) The levels of the population factor, i.e. the population ids. Use this argument to order the populations your way instead of the default alphabetical or numerical order. e.g. |
pop.labels | (optional, string) Use this argument to rename/relabel or combine your populations. e.g. To combine |
artifact.only | (optional, logical) With default |
parallel.core | (optional) The number of cores used for parallel processing during Pi calculations. Default: parallel.core = parallel::detectCores() - 1. |
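To illustrate why read.length matters for Pi, here is a minimal base R sketch. It assumes that the nucleotide diversity between two haplotypes of a locus is the number of mismatching SNP positions divided by the read length. The helper pairwise_pi is hypothetical and is not a function of the package, whose internal calculation may differ.

```r
# Hypothetical helper (not part of the package): per-site nucleotide
# difference between two haplotypes of the same locus.
# Assumption: haplotypes are strings holding only the SNP positions of a
# read, so every other position of the read is identical between the two.
pairwise_pi <- function(hap1, hap2, read.length) {
  stopifnot(
    nchar(hap1) == nchar(hap2),   # haplotypes must cover the same SNPs
    read.length >= nchar(hap1)    # a read cannot be shorter than its SNPs
  )
  a <- strsplit(hap1, "")[[1]]    # split into single nucleotides
  b <- strsplit(hap2, "")[[1]]
  sum(a != b) / read.length       # mismatches per nucleotide of the read
}

# One mismatch over a 90-nucleotide read
pi_polymorphic <- pairwise_pi("ACG", "ATG", read.length = 90)

# Identical haplotypes contribute zero differences but still count
# read.length nucleotides
pi_identical <- pairwise_pi("ACG", "ACG", read.length = 90)
```

Under this assumption, a wrong read.length rescales every per-locus value, which is why reads of unequal size cannot be summarised with a single value.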
The function returns a list with:
$consensus.pop # if consensus reads are found
$consensus.loci # if consensus reads are found
$artifacts.pop # if artifacts are found
$artifacts.loci # if artifacts are found
$individual.summary: the individual's info for FH and Pi
$summary: the summary statistics per populations and averaged over all pops.
$private.haplotypes: a tibble with LOCUS, POP_ID and private haplotypes
$private.haplotypes.summary: a summary of the number of private haplotypes per populations
The list also contains 4 plots (also written to the folder):
$scatter.plot.Pi.Fh.ind: showing Pi and FH per individuals
$scatter.plot.Pi.Fh.pop: showing Pi and FH per pop
$boxplot.pi: showing the boxplot of Pi per pop
$boxplot.fh: showing the boxplot of FH per pop
Use $ to access each object in the list. The function potentially writes 4 files in the working directory: blacklists of unique loci with assembly artifacts and/or sequencing errors (per individual and globally), a blacklist of unique consensus loci and a summary of the haplotype file by population.
Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.
Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.
Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America, 76, 5269–5273.
Thierry Gosselin thierrygosselin@icloud.com and Anne-Laure Ferchaud annelaureferchaud@gmail.com
if (FALSE) {
# The simplest way to run the function:
sum <- summary_haplotypes(
  data = "batch_1.haplotypes.tsv",
  strata = "strata_brook_charr.tsv",
  read.length = 90)

# if you want pi and fh calculated (ideally on filtered data):
sum <- summary_haplotypes(
  data = "batch_1.haplotypes.tsv",
  strata = "strata_brook_charr.tsv",
  read.length = 90,
  artifact.only = FALSE)
}