Use this function to record the pattern of missing data, coded 0/1. The pattern can also be randomized based on dataset attributes/covariates. This is useful to reproduce the missingness on a simulated dataset with the same number of individuals, populations and markers, or to assess the accuracy of imputation algorithms. A vignette that leverages this function is under construction.

memorize_missing(data, strata = NULL, randomize = NULL, filename = NULL)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or haplotypes, made population-ready with the strata argument), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see example below). The formats are documented in the section Input genomic datasets of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes from a VCF file with no genotypes (only genotype likelihoods: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter it.

strata

(optional/required) Required for VCF and haplotype files, optional for the other supported formats. See the documentation of tidy_genomic_data for more info. Default: strata = NULL.

randomize

(optional, string) To randomize the missingness of specific attributes. Available options: "markers", "populations", "individuals" and "overall". Several options can be combined in a character vector, e.g. randomize = c("markers", "populations", "individuals", "overall"). Default: randomize = NULL, which keeps only the original missingness pattern.
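The sketch below illustrates, in base R, one plausible reading of what randomizing a missingness pattern can mean. It is not grur's implementation: the toy data, the seed, and the shuffling logic are assumptions made for illustration only. Only the column names follow the Value section.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# Hypothetical missingness memory: 0 = missing, 1 = genotyped.
memory <- data.frame(
  INDIVIDUALS = rep(c("ind1", "ind2", "ind3"), each = 4),
  MARKERS = rep(c("m1", "m2", "m3", "m4"), times = 3),
  MISSING_ORIGINAL = c(1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Assumed meaning of "overall": permute the 0/1 values across all rows.
memory$MISSING_OVERALL_MIX <- sample(memory$MISSING_ORIGINAL)

# Assumed meaning of "individuals": each individual receives the whole
# per-marker pattern of another (randomly drawn) individual.
ids <- unique(memory$INDIVIDUALS)
donor <- setNames(sample(ids), ids)  # names = recipient, values = donor
idx <- match(
  paste(donor[memory$INDIVIDUALS], memory$MARKERS),
  paste(memory$INDIVIDUALS, memory$MARKERS)
)
memory$MISSING_INDIVIDUALS_MIX <- memory$MISSING_ORIGINAL[idx]
```

Under these assumptions both shuffles are permutations, so the total number of missing genotypes is unchanged; only its distribution across rows differs.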

filename

(optional) The name of the file (extension not necessary) written to the working directory and containing the missing info. Default: filename = NULL, the missing info is in the global environment only.

grur takes advantage of the lightweight and speedy file reading/writing package fst (Lightning Fast Serialization of Data Frames for R) to write the data frame to the working directory. This file can be used inside the generate_missing function (coming soon).
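As a rough illustration of the fst round trip described above, the following sketch writes a small data frame and reads it back with fst::write_fst and fst::read_fst. The toy data, the temporary path, and the ".fst" extension are assumptions. The file name and extension grur actually uses are not stated here.

```r
# Hypothetical missingness memory: 0 = missing, 1 = genotyped.
memory <- data.frame(
  POP_ID = c("popA", "popA", "popB", "popB"),
  INDIVIDUALS = c("ind1", "ind1", "ind2", "ind2"),
  MARKERS = c("m1", "m2", "m1", "m2"),
  MISSING_ORIGINAL = c(1L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary directory keeps the sketch side-effect free.
path <- file.path(tempdir(), "missing.memory.fst")

# Only run the round trip when the fst package is installed.
if (requireNamespace("fst", quietly = TRUE)) {
  fst::write_fst(memory, path)
  restored <- fst::read_fst(path)
  stopifnot(identical(restored$MISSING_ORIGINAL, memory$MISSING_ORIGINAL))
}
```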

Value

A tidy dataframe in the global environment with columns: POP_ID, INDIVIDUALS, MARKERS, and in the subsequent columns, the missingness info coded 0 for missing and 1 for genotyped. Depending on the value chosen for the argument randomize, the columns are:

  • MISSING_ORIGINAL: for the original missing pattern (always present)

  • MISSING_MARKERS_MIX: for the missing pattern randomized by markers (optional)

  • MISSING_POP_MIX: for the missing pattern randomized by populations (optional)

  • MISSING_INDIVIDUALS_MIX: for the missing pattern randomized by individuals (optional)

  • MISSING_OVERALL_MIX: for the missing pattern randomized overall (optional)
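To make the 0/1 coding concrete, here is a minimal base R sketch that builds a data frame with the same layout from a toy tidy dataset. The genotype column GT and all values are hypothetical. The sketch only mirrors the structure described above and does not call grur.

```r
# Toy tidy dataset: 4 individuals x 3 markers (hypothetical values).
tidy <- expand.grid(
  MARKERS = c("m1", "m2", "m3"),
  INDIVIDUALS = c("ind1", "ind2", "ind3", "ind4"),
  stringsAsFactors = FALSE
)
tidy$POP_ID <- ifelse(tidy$INDIVIDUALS %in% c("ind1", "ind2"), "popA", "popB")

# GT is a hypothetical genotype column; NA marks a missing genotype.
tidy$GT <- c(
  "001001", NA, "001002",
  "002002", "001001", NA,
  NA, "001002", "001001",
  "001001", "002002", "001002"
)

# Same layout as the returned value: 0 = missing, 1 = genotyped.
missing_memory <- data.frame(
  POP_ID = tidy$POP_ID,
  INDIVIDUALS = tidy$INDIVIDUALS,
  MARKERS = tidy$MARKERS,
  MISSING_ORIGINAL = as.integer(!is.na(tidy$GT)),
  stringsAsFactors = FALSE
)

# Proportion of missing genotypes per individual.
prop_missing <- tapply(
  missing_memory$MISSING_ORIGINAL == 0L,
  missing_memory$INDIVIDUALS,
  mean
)
```

Because genotyped entries are coded 1, summing a column counts genotyped calls, and comparing against 0 isolates the missing ones.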

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) {
  missing.memory <- memorize_missing(
    data = "batch_1.vcf",
    strata = "population.map.strata.tsv",
    randomize = "populations",
    filename = "missing.memory.panda"
  )
}