Detect heterozygotes outliers and estimate miscall rate — detect_het

Explore departure from H-W equilibrium in bi-allelic RADseq data. Highlight excess of homozygotes present in numeros RADseq studies. The function estimate the genotyping error rate and heterozygote miscall rate. The model focus on heterozygotes being incorrectly called as homozygotes. See details below for more info.

detect_het_outliers(
  data,
  nreps = 2000,
  burn.in = NULL,
  strata = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detect your file format use detect_genomic_format (see example below). Documented in Input genomic datasets of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand it's limit before asking raditor to convert or filter your dataset.

nreps

(integer, optional) The number of MCMC sweeps to do. Default: nreps = 2000.

burn.in

(integer, optional) The number of MCMC burn-in reps. With default, during execution, you will be asked to enter the nuber of burn-in. For this, a plot showing the heterozygote miscall rate for all the MCMC sweeps will be printed. This plot will help pinpoint the number of burn-in. The remaining MCMC sweeps will be used to average the heterozygote miscall rate. e.g. of common value burn.in = 500. With default: burn.in = NULL.

strata

(path or object) The strata file or object. Additional documentation is available in read_strata. Use that function to whitelist/blacklist populations/individuals. Option to set pop.levels/pop.labels is also available.

filename

(optional, character) If !is.null(blacklist.id) || !is.null(pop.select), the modified strata is written by default in the working directory with date and time appended to strata_radiator_filtered, to make the file unique. If you plan on writing more than 1 strata file per minute, use this argument to supply the unique filename. When filename is not NULL, it will also trigger saving the strata to a file. Default: filename = NULL.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

...

(optional) To pass further arguments for fine-tuning the function.

Value

A folder generated automatically with date and time, the file het.summary.tsv contains the summary statistics. The file markers.genotypes.boundaries.pdf is the plot with boundaries. The overall genotyping and heterozygotes miscall rate is writen in the file overall_error_rate.tsv. The function also returns a list inside the global environment with 8 objects:

input the input data, cleaned if filters were used during import.
outlier.summary a list with a tibble and plot of genotypes frequencies and boundaries (also written in the folder).
summary.alt.allele a tibble summarizing the number of markers with:
- no homozygote for the alternate allele (NO_HOM_ALT)
- no heterozygote genotype (NO_HET)
- one homozygote for the alternate allele(ONE_HOM_ALT)
- one heterozygote genotype (ONE_HET)
- one homozygote for the alternate allele only (ONE_HOM_ALT_ONLY)
- one heterozygote genotype only (ONE_HET_ONLY)
- one homozygote for the alternate allele and one heterozygote genotype only (ONE_HOM_ALT_ONE_HET_ONLY)
m.nreps A tibble with the heterozygote miscall rate for each MCMC replicate
overall.genotyping.error.rate The overall genotyping error rate
overall.m The overall heterozygote miscall rate
simmed_genos The simulated genotypes

The statistics are summarized per population and overall, the grouping is found in the last column called POP_ID.

Details

Before using the function:

Don't use raw RADseq data, this function will work best with filtered data
Remove duplicate detect_duplicate_genomes.
Remove mixed samples detect_mixed_genomes.
Look at other filters in radiator package...

During import:

By default the function will keep only polymorphic markers and markers common between all populations. If you supply a tidy data frame or a .rad file, the function skip all the filters, pop selection, etc. It will however scan and remove monomorphic markers automatically.

Keep track of the data:

Use the argument filename to write the imported (and maybe further filtered) tidy genomic data set inside the folder. The filename will be automatically appended .rad to it. This file can be used again directly inside this function and other radiator functions. See read_rad.

Author

Eric Anderson eric.anderson@noaa.gov and Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
het.prob <- radiator::detect_het_outliers(
data = "tuna.vcf", strata = "tuna.strata.tsv", nreps = 2000)
} # }