Memorize missingness pattern and randomize attributes

Use this function to record the pattern of missing data, coded 0 (missing) or 1 (genotyped). The pattern can be randomized based on dataset attributes/covariates. This is useful to generate missingness in a simulated dataset with the same number of individuals, populations and markers, or to analyze the accuracy of imputation algorithms. A vignette that leverages this function is under construction.

memorize_missing(data, strata = NULL, randomize = NULL, filename = NULL)

Arguments

data	14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use `detect_genomic_format`. Documented in Input genomic datasets of `tidy_genomic_data`. DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.
strata	(optional/required) Required for VCF and haplotype files, optional for the other supported formats. See documentation of `tidy_genomic_data` for more info. Default: `strata = NULL`.
randomize	(optional, string) To randomize the missingness of specific attributes. Available options: `"markers"`, `"populations"`, `"individuals"` and `"overall"`. All options can be selected together in a character vector: `randomize = c("markers", "populations", "individuals", "overall")`. Default: `randomize = NULL` will only keep the original missingness pattern.
filename	(optional) The name of the file (extension not necessary) written to the working directory and containing the missing info. Default: `filename = NULL`, the missing info is kept in the global environment only. `grur` takes advantage of the lightweight and speedy file reading/writing package `fst` (Lightning Fast Serialization of Data Frames for R) to write the data frame to the working directory. This file can be used inside the `generate_missing` function (coming soon).
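
To illustrate the kind of file `filename` refers to, here is a minimal sketch of an `fst` round trip on a small data frame. It only shows the generic `fst::write_fst()` and `fst::read_fst()` calls; the toy data, the temporary path and the `.fst` extension are assumptions made for illustration, and the exact file name that `grur` writes is not specified here.

```r
# Hypothetical missingness table using the column names documented for this function.
missing.info <- data.frame(
  POP_ID = c("pop1", "pop1", "pop2", "pop2"),
  INDIVIDUALS = c("ind1", "ind2", "ind3", "ind4"),
  MARKERS = "snp1",
  MISSING_ORIGINAL = c(1L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# The round trip only runs when the fst package is installed.
if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")        # assumed extension, temporary location
  fst::write_fst(missing.info, path)        # serialize the data frame to disk
  restored <- fst::read_fst(path)           # read it back as a data.frame
  print(restored)
}
```

Reading the file back this way gives a regular data frame that can be inspected or joined to other data.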

Value

A tidy dataframe in the global environment with columns: POP_ID, INDIVIDUALS, MARKERS, and in the subsequent columns, the missingness info coded 0 for missing and 1 for genotyped. Depending on the value chosen for the argument randomize, the columns are:

MISSING_ORIGINAL: for the original missing pattern (always present)
MISSING_MARKERS_MIX: for the missing pattern randomized by markers (optional)
MISSING_POP_MIX: for the missing pattern randomized by populations (optional)
MISSING_INDIVIDUALS_MIX: for the missing pattern randomized by individuals (optional)
MISSING_OVERALL_MIX: for the missing pattern randomized overall (optional)
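
The coding and the randomized columns listed above can be sketched on a toy dataset in base R. This is not the implementation used by `memorize_missing`: the toy data, the `GT` column and the `shuffle()` helper are hypothetical, and the sketch assumes that "randomized by markers" means permuting the 0/1 values within each marker, while "randomized overall" means permuting them across the whole dataset.

```r
set.seed(42)

# Toy long-format genotype table (hypothetical); NA marks a missing genotype.
toy <- expand.grid(
  INDIVIDUALS = paste0("ind", 1:4),
  MARKERS = paste0("snp", 1:3),
  stringsAsFactors = FALSE
)
toy$POP_ID <- ifelse(toy$INDIVIDUALS %in% c("ind1", "ind2"), "pop1", "pop2")
toy$GT <- c("001001", NA, "001002", "002002",
            NA, NA, "001001", "001002",
            "002002", "001001", NA, "001001")

# Code missingness as documented: 0 = missing, 1 = genotyped.
toy$MISSING_ORIGINAL <- as.integer(!is.na(toy$GT))

# Permute a vector by index, which avoids the length-one pitfall of sample().
shuffle <- function(x) x[sample.int(length(x))]

# Assumed interpretation: shuffle the pattern within each marker,
# which preserves the number of missing genotypes per marker.
toy$MISSING_MARKERS_MIX <- ave(toy$MISSING_ORIGINAL, toy$MARKERS, FUN = shuffle)

# Assumed interpretation: shuffle the pattern across the whole dataset,
# which preserves only the overall number of missing genotypes.
toy$MISSING_OVERALL_MIX <- shuffle(toy$MISSING_ORIGINAL)

print(toy[, c("POP_ID", "INDIVIDUALS", "MARKERS",
              "MISSING_ORIGINAL", "MISSING_MARKERS_MIX", "MISSING_OVERALL_MIX")])
```

Under these assumptions, each randomized column keeps the same amount of missing data as `MISSING_ORIGINAL` and only changes where the missing genotypes fall.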

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) {
# Keep the original missingness pattern and add one randomized by populations
missing.memory <- memorize_missing(
  data = "batch_1.vcf",
  strata = "population.map.strata.tsv",
  randomize = "populations",
  filename = "missing.memory.panda"
)
}