Highlight outliers individual's observed heterozygosity for a quick diagnostic of mixed samples or poor polymorphism discovery due to DNA quality, sequencing effort, etc.

detect_mixed_genomes(
  data,
  interactive.filter = TRUE,
  detect.mixed.genomes = TRUE,
  ind.heterozygosity.threshold = NULL,
  by.strata = FALSE,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

(4 options) A file or object generated by radiator:

  • tidy data

  • Genomic Data Structure (GDS)

How to get GDS and tidy data ? Look into tidy_genomic_data, read_vcf or tidy_vcf.

interactive.filter

(optional, logical) Do you want the filtering session to be interactive. Figures of distribution are shown before asking for filtering thresholds. Default: interactive.filter = TRUE.

detect.mixed.genomes

(optional, logical) For use inside radiator pipelines. Default: detect.mixed.genomes = TRUE.

ind.heterozygosity.threshold

(string, double, optional) Blacklist individuals based on observed heterozygosity (averaged across markers).

The string contains 2 thresholds values (min and max). The values are proportions (0 to 1), where 0 turns off the min threshold and 1 turns off the max threshold. Individuals with mean observed heterozygosity higher (>) or lower (<) than the thresholds will be blacklisted.

Default: ind.heterozygosity.threshold = NULL will turn off completely the filter and the function will only output the plots and table of heterozygosity.

by.strata

(optional, logical) Can only be used when interactive.filter = TRUE, allows to enter heterozygosity thresholds by strata, instead of overall. This is not recommended to use inside filter_rad or when doing population discovery. This is a good use when dealing with different species. Default: by.strata = FALSE.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

...

(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)

Value

The function returns inside the global environment a list with 5 objects:

  1. the individual's heterozigosity ($individual.heterozygosity) a dataframe containing for each individual, the population id, the number of genotyped markers, the number of missing genotypes (based on the number of markers of the population and overall), the number of markers genotyped as heterozygote and it's proportion based on the number of genotyped markers.

  2. the heterozygosity statistics per populations and overall:$heterozygosity.statistics

  3. the blacklisted individuals if ind.heterozygosity.threshold was selected: $blacklist.ind.het

  4. the boxplot of individual heterozygosity:$individual.heterozygosity.boxplot

  5. the manhattan plot of individual heterozygosity ($individual.heterozygosity.manhattan.plot) contrasted with missingness proportion based on the number of markers (population or overall). The 2 facets will be identical when the dataset as common markers between the populations. The dotted lines are the mean hetegozygosities.

Details

To help discard an individual based on his observed heterozygosity (averaged across markers), use the manhanttan plot to:

  1. contrast the individual with population and overall samples.

  2. visualize the impact of missigness information (based on population or overall number of markers) and the individual observed heterozygosity. The larger the point, the more missing genotypes.

Outlier above average:

  • potentially represent two samples mixed together (action: blacklist), or...

  • a sample with more sequecing effort (point size small): did you merge your replicates fq files ? (action : keep and monitor)

  • a sample with poor sequencing effort (point size large) where the genotyped markers are all heterozygotes, verify this with missingness (action: discard)

In all cases, if there is no bias in the population sequencing effort, the size of the point will usually be "average" based on the population or overall number of markers.

You can visualize individual observed heterozygosity, choose thresholds and then visualize, choose thresholds and filter markers based on observed heterozygosity in one run with: radiator filter_het.

Outlier below average:

  • A point with a size larger than the population or overall average (= lots of missing): the poor polymorphism discovery of the sample is probably the result of bad DNA quality, a bias in sequencing effort, etc. (action: blacklist)

  • A point with a size that looks average (not much missing): this sample requires more attention (action: blacklist) and conduct more tests. e.g. for biallelic data, look for coverage imbalance between ALT/REF allele. At this point you need to distinguish between an artifact of poor polymorphism discovery or a biological reason (highly inbred individual, etc.).

heterozygosity and missing data: If you see a pattern with the heterozygosity and missing data, try changing the genotyping rate required to keep markers and/or individuals.

Note

Thanks to Peter Grewe for very useful comments on previous version of this function.

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) {
#Step1: highlight outlier individuals, the simplest way to run:

data <- radiator::detect_mixed_genomes(data = "wombat_tidy.rad")
# where the .rad file is a tidy data set generated by radiator.

# Or if you don't have a tidy df:

data <- radiator::tidy_genomic_data(
    data = "wombat.vcf",
    strata = "strata.wombat.tsv"
    ) %>%
radiator::detect_mixed_genomes(.)

}