R/radiator_tidy.R
tidy_genomic_data.Rd
Transform a genomic data set produced by a massively parallel sequencing pipeline (e.g. GBS/RADseq, SNP chip, DArT, etc.) into a tidy format. Blacklists, whitelists and several filtering options are available to prune the dataset. Several arguments are available to make your data population-wise and to easily rename the pop id. Used internally in radiator and assigner and might be of interest for users.
tidy_genomic_data(
data,
strata = NULL,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detects
your file format use detect_genomic_format (see example below).
Documented in Input genomic datasets of tidy_genomic_data.
DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.
(optional)
The strata file is a tab-delimited file with a minimum of 2 column headers:
INDIVIDUALS and STRATA. Documented in read_strata.
DArT data: a third column TARGET_ID is required.
Documented in read_dart. Also use the strata read function to
blacklist individuals.
Default: strata = NULL.
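As a minimal sketch of the layout described above, the following base R code writes a tab-delimited strata file with the two required column headers and reads it back to check the header. The sample IDs, strata names and file name are hypothetical, and the file is written to a temporary directory; this only illustrates the file layout, not how read_strata parses it.

```r
# Hypothetical individuals and strata, for illustration only
strata.df <- data.frame(
  INDIVIDUALS = c("ind_01", "ind_02", "ind_03", "ind_04"),
  STRATA = c("pop_A", "pop_A", "pop_B", "pop_B"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no quotes or row names
strata.file <- file.path(tempdir(), "strata.example.tsv")
write.table(
  strata.df,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read it back to confirm the two required column headers are present
strata.check <- read.delim(strata.file, stringsAsFactors = FALSE)
names(strata.check)
```

The resulting path could then be supplied as the strata argument. For DArT data, a TARGET_ID column would also be needed, as noted above.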
(optional) The function uses write.fst,
to write the tidy data frame in
the working directory. The file extension appended to
the filename provided is .rad.
With default: filename = NULL, the tidy data frame is
in the global environment only (i.e. not written in the working directory...).
(optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1.
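The default leaves one core free for the rest of the system. A small sketch of computing a safe value yourself is shown below; the fallback to a single core is an assumption added here for machines where the core count cannot be determined or where only one core is available, and is not something the function is documented to do.

```r
# detectCores() can return NA on some platforms
n.cores <- parallel::detectCores()
if (is.na(n.cores)) n.cores <- 1L

# Leave one core free, but never go below 1
parallel.core <- max(1L, n.cores - 1L)
parallel.core
```

The value can then be passed explicitly, e.g. parallel.core = parallel.core.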
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE.
(optional) To pass further arguments for fine-tuning the function.
The output in your global environment is a tidy data frame.
If filename is provided, the tidy data frame is also
written in the working directory with file extension .rad.
The file is written with the
Lightning Fast Serialization of Data Frames for R package.
To read the file back in R use read.fst.
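The round trip described above can be sketched with the fst package directly. This is an illustration only: the toy data frame, its column names and the file name are hypothetical, and the code assumes fst is installed (it is skipped otherwise). It does not reproduce what tidy_genomic_data writes, only the write.fst / read.fst pair on a file with the .rad extension.

```r
if (requireNamespace("fst", quietly = TRUE)) {
  # Hypothetical toy data frame standing in for a tidy dataset
  toy <- data.frame(
    MARKERS = c("m1", "m1", "m2", "m2"),
    INDIVIDUALS = c("ind_01", "ind_02", "ind_01", "ind_02"),
    GT = c("001001", "001002", "002002", "001002"),
    stringsAsFactors = FALSE
  )

  # Write to a temporary file carrying the .rad extension
  rad.file <- file.path(tempdir(), "toy.example.rad")
  fst::write.fst(toy, rad.file)

  # Read the file back into a data frame
  toy.back <- fst::read.fst(rad.file)
  str(toy.back)
}
```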
VCF files must end with .vcf: documented in tidy_vcf
PLINK files must end with .tped or .bed: documented in tidy_plink
genind object from
adegenet:
documented in tidy_genind.
genlight object from
adegenet:
documented in tidy_genlight.
gtypes object from
strataG:
documented in tidy_gtypes.
genepop file must end with .gen, documented in tidy_genepop.
fstat file must end with .dat, documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv").
To make the haplotype file population ready, you need the strata argument.
Data frames: documented in tidy_wide
The dots-dots-dots ... allows passing several arguments for fine-tuning the function:
vcf.metadata (optional, logical or string).
Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical).
Default: vcf.stats = TRUE.
Documented in tidy_vcf.
whitelist.markers (optional, path or object) To keep only markers in a whitelist.
Default: whitelist.markers = NULL.
Documented in read_whitelist.
blacklist.id (optional) Default: blacklist.id = NULL.
Ideally, managed in the strata file.
Documented in read_strata and read_blacklist_id.
filter.common.markers (optional, logical).
Default: filter.common.markers = TRUE.
Documented in filter_common_markers.
filter.monomorphic (optional, logical) Should the monomorphic
markers present in the dataset be filtered out?
Default: filter.monomorphic = TRUE.
Documented in filter_monomorphic.
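To make the idea of a monomorphic marker concrete, here is a conceptual base R sketch on a toy long-format data frame. It is not radiator's implementation: the column names (MARKERS, INDIVIDUALS, A1, A2) and the one-allele-per-column coding are assumptions made for this illustration. A marker is treated as monomorphic when only one distinct allele is observed across all individuals.

```r
# Hypothetical toy data: 3 markers genotyped in 3 individuals,
# with the two alleles of each diploid genotype in columns A1 and A2
toy <- data.frame(
  MARKERS = rep(c("m1", "m2", "m3"), each = 3),
  INDIVIDUALS = rep(c("ind_01", "ind_02", "ind_03"), times = 3),
  A1 = c("A", "A", "A",  "A", "G", "A",  "C", "C", "C"),
  A2 = c("A", "A", "A",  "G", "G", "A",  "T", "T", "T"),
  stringsAsFactors = FALSE
)

# Count the distinct non-missing alleles observed for each marker
n.alleles <- tapply(
  c(toy$A1, toy$A2),
  rep(toy$MARKERS, times = 2),
  function(x) length(unique(x[!is.na(x)]))
)

# Markers with more than one allele are polymorphic and are kept
polymorphic <- names(n.alleles)[n.alleles > 1]
toy.filtered <- toy[toy$MARKERS %in% polymorphic, ]
polymorphic
```

In this toy example m1 carries a single allele and is dropped, while m3 is kept even though every individual has the same heterozygous genotype, because two alleles are present.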
if (FALSE) { # \dontrun{
  # To verify that your file is detected by radiator as the correct format:
  radiator::detect_genomic_format(data = "populations.snps.vcf")

  # Using a VCF file as input
  require(SeqArray)
  tidy.vcf <- tidy_genomic_data(
    data = "populations.snps.vcf",
    strata = "strata.treefrog.tsv",
    whitelist.markers = "whitelist.vcf.txt"
  )
} # }