Visualize missing genotypes in genomic data set — missing

Use this function to visualize patterns of missing data.

Input file: various file formats are supported (see data argument below).
IBM-PCoA: conduct identity-by-missingness analyses using Principal Coordinates Analysis, PCoA (also called Multidimensional Scaling, MDS).
RDA: Redundancy Analysis using the strata provided to test the null hypothesis of no pattern of missingness between strata.
FH measure vs missingness: missingness at the individual level is contrasted against FH, a new measure of IBDg (Keller et al., 2011; Kardos et al., 2015; Hedrick & Garcia-Dorado, 2016). FH is based on the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating. IBDg is a proxy measure of the realized proportion of the genome that is identical by descent. This function uses a modified version of the measure described in Keller et al. (2011) and Kardos et al. (2015). The modified measure is population-wise and tailored for RADseq data (see ibdg_fh for details).
Figures and Tables: figures and summary tables of missing information at the marker, individual and population level are generated.
Whitelists: create whitelists of markers based on desired thresholds of missing genotypes.
Blacklists: create blacklists of individuals based on desired thresholds of missing genotypes.
Tidy data: if the filename argument is used, the function also writes the data to the directory in a tidy format (see details).
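
The identity-by-missingness idea can be illustrated with base R alone. The sketch below is not the function's internal code: it builds a toy genotype matrix (hypothetical individuals and markers), recodes it as missing vs. genotyped, computes pairwise distances with `dist`, and runs a PCoA with `cmdscale`.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = markers.
geno <- matrix(
  1L, nrow = 6, ncol = 20,
  dimnames = list(paste0("IND_", 1:6), paste0("SNP_", 1:20))
)
# Simulated batch effect: the last 3 individuals lack the first 10 markers.
geno[4:6, 1:10] <- NA
# A little individual-specific missingness.
geno[2, 15] <- NA
geno[5, 18] <- NA

# Identity-by-missingness matrix: 1 = missing, 0 = genotyped.
ibm <- is.na(geno) * 1L

# Pairwise distances between individuals, based on missingness only.
d <- stats::dist(ibm, method = "euclidean")

# PCoA (classical MDS) on the distance matrix, keeping 2 axes.
pcoa <- stats::cmdscale(d, k = 2, eig = TRUE)

# Individuals sharing the same missingness pattern cluster on axis 1.
pcoa$points
```

If missingness were random with respect to the grouping, the individuals would not separate along the first axes; here the simulated batch effect splits them into two clusters.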

missing_visualization(
  data,
  strata = NULL,
  strata.select = "POP_ID",
  distance.method = "euclidean",
  ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90),
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  write.plot = TRUE,
  ...
)

Arguments

data	14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use `detect_genomic_format` (see example below). Documented in Input genomic datasets of `tidy_genomic_data`. DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.
strata	(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: `INDIVIDUALS` and `STRATA`. If a `strata` file is specified, the strata argument takes precedence over the population groupings (`POP_ID`) used internally. The `STRATA` column can be any hierarchical grouping. For the `missing_visualization` function, use additional columns in the strata file to store metadata in which you want to look for patterns of missingness, e.g. lanes, chips, sequencers, etc. Note that the `STRATA` column must contain different values for the function to work. Default: `strata = NULL`.
strata.select	(optional, character) Use this argument to select the column from the strata file used to generate the PCoA-IBM plot. To visualize more than 1 column, use a character vector, e.g. `strata.select = c("POP_ID", "LANES", "SEQUENCER", "WATERSHED")` to test 4 grouping columns inside the `strata` file. Default: `strata.select = "POP_ID"`.
distance.method	(character) The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". The function uses `dist`. Default: `distance.method = "euclidean"`.
ind.missing.geno.threshold	(numeric) Percentage of missing genotypes allowed per individual (used to create the blacklists). Default: `ind.missing.geno.threshold = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90)`.
filename	(optional) Name of the tidy data set, written to the directory created by the function.
parallel.core	(optional) The number of cores used for parallel execution during import. Default: `parallel.core = parallel::detectCores() - 1`.
write.plot	(optional, logical) When `write.plot = TRUE`, the function writes the plots to the directory it creates, except the heatmap, which takes longer to generate. To save the heatmap, do it manually following the example below. Default: `write.plot = TRUE`.
...	(optional) Advanced mode that allows passing further arguments to fine-tune the function. Also used for legacy arguments (see advanced mode or special sections below).
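
As an illustration of the strata file described above, the following sketch writes a small tab-delimited file with the two required columns plus one extra metadata column. The individuals, populations, lane names and file location are hypothetical.

```r
# Hypothetical strata: 2 required columns plus one metadata column (LANES).
strata <- data.frame(
  INDIVIDUALS = paste0("IND_", 1:6),
  STRATA = rep(c("POP_A", "POP_B"), each = 3),
  LANES = rep(c("LANE_1", "LANE_2", "LANE_3"), times = 2),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with headers, no quotes and no row names.
strata_file <- file.path(tempdir(), "population.map.strata.tsv")
write.table(strata, file = strata_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and check that STRATA holds more than one distinct value.
strata_check <- read.delim(strata_file, stringsAsFactors = FALSE)
stopifnot(length(unique(strata_check$STRATA)) > 1)
```

A file like this could then be passed through `strata`, with `strata.select = c("POP_ID", "LANES")` to look for missingness patterns by population and by sequencing lane.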

Value

A list is created with several objects: the principal coordinates with eigenvalues of the PCoA, the identity-by-missingness plot, and several summary tables and plots of missing information per individual, population and marker. Blacklisted ids are also included. Whitelists of markers with different levels of missingness are also generated automatically. A heatmap showing the missing values in black and genotypes in grey provides a general overview of the missing data. The heatmap is usually slow to generate, so it is only included as an object in the list and not written to the folder.
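
To make the threshold-based blacklists concrete, here is a minimal base R sketch. It is an assumption about the general logic, not the function's actual code: the list names and the strict `>` comparison are hypothetical choices.

```r
# Toy missingness matrix (hypothetical): 1 = missing, 0 = genotyped.
ibm <- matrix(
  0L, nrow = 5, ncol = 20,
  dimnames = list(paste0("IND_", 1:5), paste0("SNP_", 1:20))
)
ibm[2, 1] <- 1L      # 5% missing
ibm[3, 1:3] <- 1L    # 15% missing
ibm[4, 1:10] <- 1L   # 50% missing
ibm[5, ] <- 1L       # 100% missing

# Percentage of missing genotypes per individual.
pct_missing <- 100 * rowSums(ibm) / ncol(ibm)

# One blacklist per threshold: individuals whose missingness exceeds it.
# Assumption: strict ">" comparison; names are illustrative only.
thresholds <- c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90)
blacklists <- lapply(
  thresholds,
  function(threshold) names(pct_missing)[pct_missing > threshold]
)
names(blacklists) <- paste0("blacklist.id.missing.", thresholds)

blacklists[["blacklist.id.missing.10"]]
```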

Details

filename

The function uses write.fst to write the tidy data frame to the directory. The file extension appended to the filename provided is .rad. The file is written with the fst package (Lightning Fast Serialization of Data Frames for R). To read the tidy data file back into R, use read.fst.
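
A minimal round-trip sketch with the fst package is shown below. The tidy columns and the file name are hypothetical, and the code only runs if fst is installed.

```r
# Hypothetical tidy data frame: one row per marker x individual.
tidy <- data.frame(
  MARKERS = c("SNP_1", "SNP_1", "SNP_2"),
  INDIVIDUALS = c("IND_1", "IND_2", "IND_1"),
  GT = c("001001", "001002", NA),
  stringsAsFactors = FALSE
)

# Illustrative file name carrying the .rad extension.
rad_file <- file.path(tempdir(), "koala.tidy.rad")

if (requireNamespace("fst", quietly = TRUE)) {
  # Write the data frame in fst format, then read it back.
  fst::write.fst(tidy, rad_file)
  tidy_back <- fst::read.fst(rad_file)
  head(tidy_back)
}
```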

References

Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English edition. Amsterdam: Elsevier Science BV.

Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.

Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.

Hedrick PW, Garcia-Dorado A (2016) Understanding inbreeding depression, purging, and genetic rescue. Trends in Ecology and Evolution, 31, 940–952. doi:10.1016/j.tree.2016.09.005

Author

Thierry Gosselin thierrygosselin@icloud.com and Eric Archer eric.archer@noaa.gov

Examples

if (FALSE) {
# Using a VCF file, the simplest form of the function:
ibm.koala <- missing_visualization(
  data = "batch_1.vcf",
  strata = "population.map.strata.tsv"
)

# To see what's inside the list
names(ibm.koala)

# To view the heatmap:
ibm.koala$heatmap

# To save the heatmap
# (move to the appropriate directory first)
ggplot2::ggsave(
  filename = "heatmap.missing.pdf",
  plot = ibm.koala$heatmap,
  width = 15, height = 20,
  dpi = 600, units = "cm", useDingbats = FALSE
)

# To view the IBM analysis plot:
ibm.koala$ibm_plot
}