Transform common genomic dataset formats into a tidy data frame

Transform a genomic dataset produced by a massively parallel sequencing pipeline (e.g. GBS/RADseq, SNP chip, DArT, etc.) into a tidy format. Blacklists, whitelists and several filtering options are available to prune the dataset. Several arguments are available to make your data population-wise and to easily rename the pop id. Used internally in radiator and assigner, and might be of interest for users.

tidy_genomic_data(
  data,
  strata = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see the example below). Documented in the section Input genomic datasets of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.

strata

(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column, TARGET_ID, is required. Documented in read_dart. The strata file can also be used to blacklist individuals when it is read with read_strata. Default: strata = NULL.

filename

(optional) The function uses write.fst to write the tidy data frame to the working directory. The file extension appended to the filename provided is .rad. With the default, filename = NULL, the tidy data frame is returned in the global environment only (i.e. not written to the working directory).

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) To pass further arguments for fine-tuning the function.
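The strata file described for the strata argument can be sketched with base R alone. This is a minimal illustration, not radiator's own tooling: the sample names, population labels and file name are hypothetical, and only the two required column headers (INDIVIDUALS and STRATA) come from the documentation above.

```r
# Hypothetical sample names and populations, for illustration only.
strata.df <- data.frame(
  INDIVIDUALS = c("ind_001", "ind_002", "ind_003", "ind_004"),
  STRATA = c("pop_A", "pop_A", "pop_B", "pop_B"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no quotes and no row names.
# The file name is an assumption; a temporary directory keeps the sketch tidy.
strata.file <- file.path(tempdir(), "strata.example.tsv")
write.table(
  strata.df,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE,
  col.names = TRUE
)

# Read the file back with base R to confirm the two required column headers.
strata.check <- read.delim(strata.file, stringsAsFactors = FALSE)
print(colnames(strata.check))
```

The resulting path could then be supplied as strata = strata.file. For DArT data, a third TARGET_ID column would be needed, as noted in the strata argument description.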

Value

The output in your global environment is a tidy data frame. If filename is provided, the tidy data frame is also written to the working directory with the file extension .rad. The file is written with the fst package (Lightning Fast Serialization of Data Frames for R). To read the file back into R, use read.fst.
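As a rough sketch of the round trip described above, the following writes a small data frame with fst::write.fst under a .rad extension and reads it back with fst::read.fst. The data frame is a hypothetical stand-in for a tidy dataset (its column names and values are illustrative, not radiator output), and the block assumes the fst package is installed.

```r
# Assumes the fst package is installed; the block is skipped otherwise.
if (requireNamespace("fst", quietly = TRUE)) {
  # Hypothetical stand-in for a tidy data frame; columns are illustrative.
  tidy.example <- data.frame(
    MARKERS = c("locus_1", "locus_1", "locus_2", "locus_2"),
    INDIVIDUALS = c("ind_001", "ind_002", "ind_001", "ind_002"),
    GT = c("001001", "001002", "002002", "001002"),
    stringsAsFactors = FALSE
  )

  # Write with a .rad extension, mirroring the extension the filename
  # argument is documented to append.
  rad.file <- file.path(tempdir(), "tidy.example.rad")
  fst::write.fst(tidy.example, path = rad.file)

  # Read the file back into a data frame and inspect its structure.
  tidy.back <- fst::read.fst(rad.file)
  str(tidy.back)
}
```

In practice, the path given to read.fst would be the .rad file that tidy_genomic_data wrote to the working directory.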

Input genomic datasets

VCF files must end with .vcf: documented in tidy_vcf.
PLINK files must end with .tped or .bed: documented in tidy_plink.
genind object from adegenet: documented in tidy_genind.
genlight object from adegenet: documented in tidy_genlight.
gtypes object from strataG: documented in tidy_gtypes.
dart data from DArT: documented in read_dart.
genepop file must end with .gen: documented in tidy_genepop.
fstat file must end with .dat: documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"). To make the haplotype file population ready, you need the strata argument.
Data frames: documented in tidy_wide.

Advanced mode

The dots-dots-dots argument (...) allows several additional arguments to be passed for fine-tuning the function:

vcf.metadata (optional, logical or string). Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical). Default: vcf.stats = TRUE. Documented in tidy_vcf.
whitelist.markers (optional, path or object) To keep only markers in a whitelist. Default whitelist.markers = NULL. Documented in read_whitelist.
blacklist.id (optional) Default: blacklist.id = NULL. Ideally, managed in the strata file. Documented in read_strata and read_blacklist_id.
filter.common.markers (optional, logical). Default: filter.common.markers = TRUE. Documented in filter_common_markers.
filter.monomorphic (optional, logical) Should the monomorphic markers present in the dataset be filtered out? Default: filter.monomorphic = TRUE. Documented in filter_monomorphic.
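The arguments listed above are supplied by name alongside the regular ones. The sketch below reuses the file names from the Examples section and only calls radiator when the package is installed and both files exist in the working directory. The helper variable can.run is an assumption made for illustration, and switching off filter.monomorphic is just one possible choice. As in the Examples section, VCF input may also need SeqArray to be available.

```r
# File names are taken from the Examples section of this page.
vcf.file <- "populations.snps.vcf"
strata.file <- "strata.treefrog.tsv"

# Hypothetical guard: run the call only when radiator and both files exist.
can.run <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(vcf.file) &&
  file.exists(strata.file)

if (can.run) {
  tidy.vcf <- radiator::tidy_genomic_data(
    data = vcf.file,
    strata = strata.file,
    # The arguments below travel through `...`.
    filter.common.markers = TRUE, # documented default, shown explicitly
    filter.monomorphic = FALSE,   # keep monomorphic markers (default is TRUE)
    vcf.stats = TRUE              # documented default, shown explicitly
  )
} else {
  message("radiator or the example files are not available; call skipped.")
}
```

Spelling out the defaults is optional. It is done here only to make clear which settings are in effect.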

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# To verify that radiator detects your file as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")

# Using a VCF file as input
library(SeqArray)
tidy.vcf <- tidy_genomic_data(
   data = "populations.snps.vcf", strata = "strata.treefrog.tsv",
   whitelist.markers = "whitelist.vcf.txt")
} # }