R/missing_visualization.R
missing_visualization.Rd
Use this function to visualize patterns of missing data.
Input file: various file formats are supported (see data argument below).
IBM-PCoA: conduct identity-by-missingness analyses using Principal Coordinates Analysis, PCoA (also called Multidimensional Scaling, MDS).
RDA: Redundancy Analysis using the strata provided to test the null hypothesis of no pattern of missingness between strata.
FH measure vs missingness: missingness at the individual level is contrasted against FH, a new measure of IBDg (Keller et al., 2011; Kardos et al., 2015; Hedrick & Garcia-Dorado, 2016). FH is based on the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating. IBDg is a proxy measure of the realized proportion of the genome that is identical by descent. This function uses a modified version of the measure described in Keller et al. (2011) and Kardos et al. (2015). The new measure is population-wise and tailored for RADseq data (see ibdg_fh for details).
Figures and Tables: figures and summary tables of missing information at the marker, individual and population level are generated.
Whitelists: create whitelists of markers based on desired thresholds of missing genotypes.
Blacklists: create blacklists of individuals based on desired thresholds of missing genotypes.
Tidy data: if the filename argument is used, the function also writes the data to the directory in a tidy format (see details).
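The FH idea described above can be illustrated with a minimal base R sketch. It uses the commonly used excess-homozygosity form, FH = (O - E) / (m - E), computed per individual on simulated genotypes. It is not the modified population-wise measure used by this function (see ibdg_fh for that); the toy data and all object names (`geno`, `freq`, `fh`, `missing_prop`) are assumptions made for illustration only.

```r
# Toy genotype matrix (assumption, not radiator output):
# rows = individuals, columns = biallelic markers,
# coded as the number of alternate alleles (0, 1, 2); NA = missing.
set.seed(42)
n_ind <- 20
n_mark <- 200
p_true <- runif(n_mark, 0.1, 0.9)
geno <- sapply(p_true, function(p_alt) rbinom(n_ind, size = 2, prob = p_alt))
geno[sample(length(geno), 400)] <- NA

# Alternate-allele frequency per marker, from the non-missing genotypes.
freq <- colMeans(geno, na.rm = TRUE) / 2

# Expected homozygosity per marker under random mating: p^2 + q^2.
exp_hom_marker <- 1 - 2 * freq * (1 - freq)

# Per individual: markers genotyped, observed and expected homozygotes.
called <- !is.na(geno)
n_called <- rowSums(called)
obs_hom <- rowSums(geno != 1, na.rm = TRUE)
exp_hom <- as.vector((called * 1) %*% exp_hom_marker)

# Excess homozygosity relative to the random-mating expectation.
fh <- (obs_hom - exp_hom) / (n_called - exp_hom)

# Individual missingness, to contrast against FH.
missing_prop <- 1 - n_called / n_mark
```

Plotting `fh` against `missing_prop` gives the kind of contrast this feature refers to: a trend between the two would suggest that missing data is influencing the homozygosity estimates.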
missing_visualization(
  data,
  strata = NULL,
  strata.select = "POP_ID",
  distance.method = "euclidean",
  ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90),
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  write.plot = TRUE,
  ...
)
data | 14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detects
your file format use DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset. |
---|---|
strata | (optional)
The strata file is a tab-delimited file with a minimum of 2 column headers:
|
strata.select | (optional, character) Use this argument to select the column
from the strata file to generate the PCoA-IBM plot. To visualize more than
1 column, use a character vector,
e.g. |
distance.method | (character) The distance measure to be used.
This must be one of "euclidean", "maximum", "manhattan", "canberra",
"binary" or "minkowski". The function uses |
ind.missing.geno.threshold | (string) Percentage of missing genotypes
allowed per individual (to create the blacklists).
Default: |
filename | (optional) Name of the tidy data set, written to the directory created by the function. |
parallel.core | (optional) The number of cores used for parallel
execution during import.
Default: |
write.plot | (optional, logical) When |
... | (optional) Advanced mode that allows passing further arguments for fine-tuning the function. Also used for legacy arguments (see advanced mode or special sections below). |
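To make the role of ind.missing.geno.threshold concrete, here is a minimal base R sketch of how a vector of thresholds could be turned into one blacklist per threshold. The missingness values are hypothetical, and the rule used (blacklist an individual when its missingness is strictly greater than the threshold) is an assumption for illustration, not a statement of how the function decides internally.

```r
# Hypothetical per-individual missingness, in percent (not radiator output).
missing_pct <- c(ind1 = 1.5, ind2 = 4.0, ind3 = 12.0, ind4 = 35.0, ind5 = 72.0)

# Same values as the default shown in the usage section.
thresholds <- c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90)

# One blacklist per threshold.
# Assumed rule: missingness strictly above the threshold is blacklisted.
blacklists <- lapply(
  thresholds,
  function(threshold) names(missing_pct)[missing_pct > threshold]
)
names(blacklists) <- paste0("blacklist_", thresholds)

# Drop thresholds that blacklist nobody.
blacklists <- blacklists[lengths(blacklists) > 0]
```

With these toy values, the strictest threshold (2) flags four individuals, while the 80 and 90 thresholds flag none and are dropped from the list.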
A list is created with several objects: the principal coordinates with eigenvalues of the PCoA, the identity-by-missingness plot, and several summary tables and plots of missing information per individual, population and marker. Blacklisted ids are also included. Whitelists of markers with different levels of missingness are also generated automatically. A heatmap showing the missing values in black and genotypes in grey provides a general overview of the missing data. The heatmap is usually slow to generate, so it is only included as an object in the list and not written to the folder.
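The identity-by-missingness idea behind the principal coordinates can be sketched in base R: code each genotype as missing or not, compute a distance between individuals on that binary matrix, and ordinate the distances. The sketch below uses `stats::dist()` and `stats::cmdscale()` on simulated data; the choice of `cmdscale()` and all object names (`missing`, `strata`, `coords`) are assumptions for illustration and may differ from what the function does internally.

```r
set.seed(1)
n_ind <- 30
n_mark <- 100
strata <- rep(c("POP1", "POP2"), each = n_ind / 2)

# Binary missingness matrix: 1 = missing genotype, 0 = genotyped.
# POP2 is simulated with more missing data than POP1.
miss_rate <- ifelse(strata == "POP1", 0.05, 0.40)
missing <- matrix(
  rbinom(n_ind * n_mark, size = 1, prob = rep(miss_rate, times = n_mark)),
  nrow = n_ind, ncol = n_mark,
  dimnames = list(paste0("ind", seq_len(n_ind)), NULL)
)

# Distance between individuals based on missingness only.
d <- stats::dist(missing, method = "euclidean")

# Classical multidimensional scaling (PCoA) on the first two axes.
pcoa <- stats::cmdscale(d, k = 2, eig = TRUE)

# Principal coordinates with the strata attached, ready for plotting.
coords <- data.frame(
  INDIVIDUALS = rownames(missing),
  POP_ID = strata,
  PCo1 = pcoa$points[, 1],
  PCo2 = pcoa$points[, 2]
)

# Share of the positive eigenvalues carried by the first two axes.
var_explained <- pcoa$eig[1:2] / sum(pcoa$eig[pcoa$eig > 0])
```

Colouring the points in `coords` by `POP_ID` shows whether individuals cluster by stratum on missingness alone, which is the pattern the identity-by-missingness plot is meant to reveal.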
filename
The function uses write.fst to write the tidy data frame in the directory. The file extension appended to the filename provided is .rad. The file is written with the Lightning Fast Serialization of Data Frames for R package. To read the tidy data file back in R use read.fst.
Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English edition. Amsterdam: Elsevier Science BV.
Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.
Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.
Hedrick PW, Garcia-Dorado A (2016) Understanding inbreeding depression, purging, and genetic rescue. Trends in Ecology and Evolution, 31, 940–952. doi:10.1016/j.tree.2016.09.005
Thierry Gosselin thierrygosselin@icloud.com and Eric Archer eric.archer@noaa.gov
if (FALSE) {
# Using a VCF file, the simplest form of the function:
ibm.koala <- missing_visualization(
  data = "batch_1.vcf",
  strata = "population.map.strata.tsv"
)

# To see what's inside the list
names(ibm.koala)

# To view the heatmap:
ibm.koala$heatmap

# To save the heatmap
# move to the appropriate directory
ggplot2::ggsave(
  filename = "heatmap.missing.pdf",
  plot = ibm.koala$heatmap,
  width = 15, height = 20,
  dpi = 600, units = "cm",
  useDingbats = FALSE
)

# To view the IBM analysis plot:
ibm.koala$ibm_plot
}