Detect paralogs — detect_paralogs • radiator

The method uses the proportion of heterozygotes and deviations in read ratios described in McKinney et al. 2017 to highlight potential paralogous markers. Additionally, the function produces the HD figure with marker's missingness accounted for.

Note: Violating Assumptions/Prerequisites (see below) can lead to false positive or the absence of detection of paralogous markers.

detect_paralogs(
  data,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

(2 options) A Genomic Data Structure (GDS) file or object generated by radiator.

How to get GDS data ? Look into tidy_genomic_data, read_vcf or tidy_vcf.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)

Value

The function returns the GDS data inside the global environment. In the directory generated with detect_paralogs in it, the function writes a tibble with the summary and the HD plot.

The tibble as 1 line per SNPs/MARKERS and columns are:

VARIANT_ID: the GDS SNP id, to have all the markers metadata use extract_markers_metadata.
NUMBER_HET: the number of heterozygote genotype.
NUMBER_MISSING: the number of missing genotype.
NUMBER_GENOTYPED: the number of individual genotyped for the marker.
TOTAL: the total number of sample for this marker.
DEPTH_REF: the sum of the REF allele depth/coverage.
DEPTH_ALT: the sum of the ALT allele depth/coverage.
MISSING_PROP: Missing genotype proportion.
HET_PROP: the proportion if heterozygote sample.
TOTAL_COUNTS: Total coverage/counts for the marker.
RATIO: the allele ratio.
Z_SCORE: the z-score (deviation in read ratio).

Assumptions/Prerequisites

read the paper: Read Garrett McKinney paper, twice. It's good. Reference is below.
Depth/Coverage information: The function requires read depth information for each alleles (REF/ALT). This info will not be found in genepop or structure file format. You usually need a VCF file for this. stacks, ANGSD, GATK, platypus, ipyrad, DArT, VCFTools, PLINK and freebayes are software producing the required output.
Batch effect: You don't have any batch effect, e.g. with sequencer id, library, sampling sites, tissue preservation, person behind the wet-lab bench conducting the DNA extraction (operating the pipette, etc.).
Normalization: Any form of normalization, before or after sequencing, as the potential to bias paralogs discovery. This include: combining lanes or chips to increase individual coverage, normalizing individuals FASTQ files, and of course normalizing read depth.
Identity-by-Missingness: Absence of artifactual pattern of missingness (see missing visualization)
Low genotyping error rate: see detect_het_outliers and whoa. You can also use: filter_hwe with very low threshold to discard markers with obvious genotyping errors.
Low heterozygosity miscall rate: see detect_het_outliers and whoa.
Absence of pattern of heterozygosity driven by missingness: see detect_mixed_genomes. Don't use this function if the individual heterozygosity is not under control. If your data is not uniform inside this function, don't expect much of detect_paralogs.
Use unfiltered dataset for fun only.
Use this function preferably on filtered data, check filter_rad.
Sex-biased sampling and sex markers will generate false-positive with this function. Use sexy_markers if you have sexe information.

References

McKinney GJ, Waples RK, Seeb LW, Seeb JE (2017) Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping-by-sequencing data from natural populations. Molecular Ecology Resources, 17, 656-669.

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# require(httr)
# using the Chinook VCF dataset in dryad, from McKinney et al. 2017 paper:
writeBin(object = httr::content(x = httr::GET(
url = "https://datadryad.org/bitstream/handle/10255/dryad.123464/
Chinook_sequenceReads.vcf?sequence=1"),
as = "raw"), con = "chinook.vcf")

# Import VCF and run the function
chinook.paralogs <- radiator::read_vcf(data = "chinook.vcf", vcf.stats = FALSE) %>%
    radiator::detect_paralogs(data = .)
} # }