The method uses the proportion of heterozygotes and deviations in read ratios described in McKinney et al. 2017 to highlight potential paralogous markers. Additionally, the function produces the HD figure with marker's missingness accounted for.
Note: Violating Assumptions/Prerequisites (see below) can lead to false positive or the absence of detection of paralogous markers.
detect_paralogs(
data,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
(2 options) A Genomic Data Structure (GDS) file or object generated by radiator.
How to get GDS data ?
Look into tidy_genomic_data
,
read_vcf
or
tidy_vcf
.
(optional) The number of core used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE
.
(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)
The function returns the GDS data inside the global environment.
In the directory generated with detect_paralogs
in it, the function
writes a tibble with the summary and the HD plot.
The tibble as 1 line per SNPs/MARKERS and columns are:
VARIANT_ID: the GDS SNP id, to have all the markers metadata use
extract_markers_metadata
.
NUMBER_HET: the number of heterozygote genotype.
NUMBER_MISSING: the number of missing genotype.
NUMBER_GENOTYPED: the number of individual genotyped for the marker.
TOTAL: the total number of sample for this marker.
DEPTH_REF: the sum of the REF allele depth/coverage.
DEPTH_ALT: the sum of the ALT allele depth/coverage.
MISSING_PROP: Missing genotype proportion.
HET_PROP: the proportion if heterozygote sample.
TOTAL_COUNTS: Total coverage/counts for the marker.
RATIO: the allele ratio.
Z_SCORE: the z-score (deviation in read ratio).
read the paper: Read Garrett McKinney paper, twice. It's good. Reference is below.
Depth/Coverage information: The function requires read depth information for each alleles (REF/ALT). This info will not be found in genepop or structure file format. You usually need a VCF file for this. stacks, ANGSD, GATK, platypus, ipyrad, DArT, VCFTools, PLINK and freebayes are software producing the required output.
Batch effect: You don't have any batch effect, e.g. with sequencer id, library, sampling sites, tissue preservation, person behind the wet-lab bench conducting the DNA extraction (operating the pipette, etc.).
Normalization: Any form of normalization, before or after sequencing, as the potential to bias paralogs discovery. This include: combining lanes or chips to increase individual coverage, normalizing individuals FASTQ files, and of course normalizing read depth.
Identity-by-Missingness: Absence of artifactual pattern of missingness (see missing visualization)
Low genotyping error rate: see detect_het_outliers
and whoa. You can also use:
filter_hwe
with very low threshold to discard markers with
obvious genotyping errors.
Low heterozygosity miscall rate: see detect_het_outliers
and
whoa.
Absence of pattern of heterozygosity driven by missingness:
see detect_mixed_genomes
. Don't use this function if the
individual heterozygosity is not under control. If your data is not uniform
inside this function, don't expect mush of detect_paralogs
.
Use unfiltered dataset for fun only.
Use this function preferably on filtered data, check
filter_rad
.
Sex-biased sampling and sex markers will generate
false-positive with this function.
Use sexy_markers
function before!
McKinney GJ, Waples RK, Seeb LW, Seeb JE (2017) Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping-by-sequencing data from natural populations. Molecular Ecology Resources, 17, 656-669.
if (FALSE) { # \dontrun{
# require(httr)
# using the Chinook VCF dataset in dryad, from McKinney et al. 2017 paper:
writeBin(object = httr::content(x = httr::GET(
url = "https://datadryad.org/bitstream/handle/10255/dryad.123464/
Chinook_sequenceReads.vcf?sequence=1"),
as = "raw"), con = "chinook.vcf")
# Import VCF and run the function
chinook.paralogs <- radiator::read_vcf(data = "chinook.vcf", vcf.stats = FALSE) %>%
radiator::detect_paralogs(data = .)
} # }