Filter markers based on Hardy-Weinberg Equilibrium

Testing markers for Hardy-Weinberg proportions is a valuable tool for the analysis and quality control of RADseq datasets. HWE can highligh genotyping errors, presence of null alleles, sequence duplication, copy number variation and other sequencing problems related to read depth. This function is designed for bi-allelic markers and uses the exact test from the package HardyWeinberg. The function is speedy because it uses the C++ code developed by Christopher Chang and also available in PLINK. The p-value generated by the function is the mid p-value. It's computed as half the probability of the current sample + the probabilities of all samples that are more extreme (see references below). Several output are generated to help users filter the data (see details).

sampling sites vs well defined populations: be careful what strata you're investigating and adjust filtering threshold accordinghly for downstream analysis. If you're still at the populations discovery steps, markers under HWD are totally normal.

Prior to HW filtering, I highly recommend removing outlier individuals, filtering coverage and genotype likelihood (see details).

filter_hwe(
  data,
  interactive.filter = TRUE,
  filter.hwe = TRUE,
  strata = NULL,
  hw.pop.threshold = NULL,
  midp.threshold = 4L,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

A tidy data frame object in the global environment or a tidy data frame in wide or long format in the working directory. How to get a tidy data frame ? Look into radiator tidy_genomic_data.

interactive.filter

(optional, logical) Do you want the filtering session to be interactive. Figures of distribution are shown before asking for filtering thresholds. Default: interactive.filter = TRUE.

filter.hwe

(optional, logical) Used inside radiator pipeline. Default: filter.hwe = TRUE.

strata

(optional) The strata file is a tab delimited file with a minimum of 2 columns headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column TARGET_ID is required. Documented on read_dart. Also use the strata read function to blacklist individuals. Default: strata = NULL.

hw.pop.threshold

(integer, optional) Remove markers that have a certain number of pops in Hardy-Weinberg disequilibrium. With default, all populations in dataset need to be in HWD before discarding the marker. Default: hw.pop.threshold = NULL.

midp.threshold

(integer, optional) By default the function generates blacklists/whitelists of markers and filtered tidy datasets for the 5 mid p-value. However, to get a final filtered object associated with the output of the function, user need to choose one of the 5 mid p-value:

1 = 0.05 (*)
2 = 0.001 (**)
3 = 0.0001 (***)
4 = 0.0001 (****)
5 = 0.00001 (*****)

. With default, a very conservative mid p-value threshold is selected. Default: midp.threshold = 4.

filename

(optional) The function uses write.fst, to write the tidy data frame in the folder created in the working directory. The file extension appended to the filename provided is .rad. Default: filename = NULL.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)

Value

A list in the global environment objects:

$path.folder: the path to the folder generated.
$hw.pop.threshold: the number of populations tolerated to be in HWD before blacklisting the markers.
$plot.hwd.thresholds: useful figure that highlight the number of markers blacklisted based on the number of populations in HWD and mid p-value thresholds.
$plot.tern: ternary plot of markers (currently unavailable until ggtern is updated).
$hw.manhattan: manhattan plot of markers in Hardy-Weinberg disequilibrium.
$hwe.pop.sum: a summary tibble with populations, number of markers in total, number of markers monomorphic for the populations, number of markers in Hardy-Weinberg Equilibrium (HWE), number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different mid p-values observed on the data.
$midp.threshold: the mid p-value threshold chosen for the final dataset (next)
$tidy.hw.filtered: the final filtered dataset (oter datasets .rad are generated automatically by the function, check the folder)

Written in the folder:

genotypes.summary.tsv: A tibble with these columns: MARKERS, STRATA, HET, HOM_ALT, HOM_REF, MISSING, N, FREQ_ALT, FREQ_REF, FREQ_HET, FREQ_HOM_REF_O, FREQ_HET_O, FREQ_HOM_ALT_O, FREQ_HOM_REF_E, FREQ_HET_E, FREQ_HOM_ALT_E, N_HOM_REF_EXP, N_HET_EXP, N_HOM_ALT_EXP, HOM_REF_Z_SCORE, HOM_HET_Z_SCORE, HOM_ALT_Z_SCORE, READ_DEPTH
hw.pop.sum.tsv: a summary tibble with populations, number of markers in total, number of markers monomorphic for the populations, number of markers in Hardy-Weinberg Equilibrium (HWE), number of markers in Hardy-Weinberg Dquilibrium (HWD) with all the different mid p-values observed on the data.
hwd.helper.table.tsv: useful tibble that highlight the number of markers blacklisted based on the number of populations in HWD and mid p-value thresholds.
hwd.plot.blacklist.markers.pdf: useful figure that highlight the number of markers blacklisted based on the number of populations in HWD and mid p-value thresholds.
hwe.manhattan.plot.pdf: manhattan plot of markers in Hardy-Weinberg disequilibrium.
tidy.filtered.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.rad: several tidy dataset filtered with different mid p value and populations in HWD thresholds
whitelist.markers.hwe.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several whitelist of markers with different mid p value and populations in HWD thresholds
blacklist.markers.hwd.xxx.mid.p.value.xxx.hw.pop.threshold.tsv: several blacklist of markers with different mid p value and populations in HWD thresholds

Details

Interactive version

The user is asked to look at figures before choosing filter thresholds.

HWE threshold

I recommend starting with a low threshold. Serious genotyping errors will generate extreme p-values (e.g. 1e-50), which are detected by any reasonable configuration of this test, while various life-history caracteristics will deflate/inflate Hardy-Weinberg equilibrium. Consequently, it's dangerous to choose a threshold that filters out too many markers.

strategies: Disk space is cheap! Consequently, the function will automatically generate several blacklists/whitelists of markers and filtered tidy data (in the directory) based on the hw.pop.threshold for 5 groups of mid p-values:

1: MID_P_VALUE <= 0.00001: *****
2: MID_P_VALUE <= 0.0001: ****
3: MID_P_VALUE <= 0.001: ***
4: MID_P_VALUE <= 0.01: **
5: MID_P_VALUE <= 0.05: *

Test the sensitivity of downstream analysis and delete unwanted datasets.

missing data

The mid-p adjustment tends to bring the null rejection rate in line with the nominal p-value, and also reduces the filter's tendency to favor retention of SNPs with missing data (Graffelman and Moreno, 2013, Purcell et al., 2007).

If pattern of missing data is present in the dataset, or when missing data accross markers vary by more than 0.10, you should not apply a single mid-p-value threshold accross markers.

Read depth, pooling lanes/chips and weird pattern of individual heterozygosity

Because of read depth, heterozygote deficiency is usually observed in RADseq data, but if sequencing lanes/chips were combined to generate individuals with more coverage, the situation will likely be the reverse: heterozygote excess. If lanes/chips pooling was used or if highly variable sequencing coverage is observed between individuals and/or markers, there's a couple qc and filtering steps you should do before conducting HWE filtering.

run radiator detect_mixed_genomes and detect_het_outliers to highlight outlier individuals with potential heterozygosity problems and get an idea of the genotyping and heterozygote miscall rate
I recommend normalizing the data before de novo assembly, if this is not possible...
use radiator::filter_rad or a more chirurgical approach to coverage and genotype likelihood with radiator::filter_rad.

Permutation test Hardy-Weinberg equilibrium refers to the statistical independence of alleles within individuals. This independence can also be assessed by permutation test inside the HardyWeinberg package of Jan Graffelman. To filter out markers with genotyping problems the approach provided in this function is enough.

Power test for HWE Based on allele count/frequency and sample size. This function is longer to generate for each markers and is on my todo list to include it in this filter.

Markers under selection and genome scans:

Scared of deleting those precious markers or that the filter might interfere with genome scan analysis/detection ? Don't be. Your markers or analysis is no good if it's done on bad data... Test the sensitivity of your downstream analysis with the datasets generated with the different thresholds.

Note

Hardy-Weinberg assumptions (refresh):

Diploid organisms
Only reproduction sexual occurs
Generations are non-overlapping
Mating is random
Population size is infinitely large
Allele frequencies are equal in the sexes
Migration, Mutation and Selection are negligible

References

Weir, B.S. (1996) Genetic data analysis II. Sinauer Associates, Massachusetts. See Chapter3.

Wigginton, J.E., Cutler, D.J. and Abecasis, G.R. (2005) A note on exact tests of Hardy-Weinberg equilibrium, American Journal of Human Genetics (76) pp. 887-893.

Purcell et al. (2007) PLINK: A Toolset for Whole-Genome Association and Population-Based Linkage Analysis. American Journal of Human Genetics 81(3) pp. 559-575.

Graffelman, J. and Moreno, V. (2013) The mid p-value in exact tests for Hardy-Weinberg equilibrium, Statistical Applications in Genetics and Molecular Biology 12(4) pp. 433-448.

Graffelman J, Jain D, Weir B (2017) A genome-wide study of Hardy-Weinberg equilibrium with next generation sequence data. Human Genetics, 136, 727-741.

Examples

if (FALSE) { # \dontrun{
require(HardyWeinberg)
library(radiator)
# for the interactive version (recommended)
turtle.pop <- radiator::filter_hwe(
   data = "turtle.vcf",
   strata = "turtle.strata.tsv",
   filename = "hwe.turtle"
)
} # }