R/radiator_tidy.R
tidy_genomic_data.Rd
Transform a genomic dataset produced by a massively parallel sequencing pipeline (e.g. GBS/RADseq, SNP chip, DArT, etc.) into a tidy format. Blacklists, whitelists and several filtering options are available to prune the dataset. Several arguments are available to make your data population-wise and to easily rename the pop id. Used internally in radiator and assigner, and might be of interest to users.
tidy_genomic_data(
  data,
  strata = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)
14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see example below). Documented in Input genomic datasets of tidy_genomic_data.
DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.
(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column, TARGET_ID, is required. Documented in read_dart. The strata read function can also be used to blacklist individuals.
Default: strata = NULL.
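As a minimal sketch of the layout described above, the base R snippet below builds a two-column strata table and writes it as a tab-delimited file. The sample IDs, population names and the file name strata.example.tsv are made up for illustration; only the INDIVIDUALS and STRATA headers come from the description above.

```r
# Hypothetical sample IDs and populations, for illustration only.
strata.df <- data.frame(
  INDIVIDUALS = c("ind-001", "ind-002", "ind-003", "ind-004"),
  STRATA = c("pop1", "pop1", "pop2", "pop2"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with the two required column headers.
strata.file <- file.path(tempdir(), "strata.example.tsv")
write.table(
  x = strata.df,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read it back to check the headers before passing the path to `strata`.
check <- read.delim(strata.file, stringsAsFactors = FALSE)
names(check)
```

The resulting path could then be supplied through the strata argument. For DArT data, a TARGET_ID column would have to be added as well.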
(optional) The function uses write.fst to write the tidy data frame in the working directory. The file extension appended to the filename provided is .rad.
With the default, filename = NULL, the tidy data frame is in the global environment only (i.e. not written in the working directory).
(optional) The number of cores used for parallel execution during import.
Default: parallel.core = parallel::detectCores() - 1.
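The default expression can be evaluated on its own to see what it yields on a given machine. The sketch below adds a guard for machines where the core count is unknown or equal to 1. That guard is an assumption of this example, not something the function is documented to do.

```r
# detectCores() can return NA on some platforms and 1 on single-core
# machines, so guard the subtraction (assumption of this sketch).
available.cores <- parallel::detectCores()
if (is.na(available.cores)) available.cores <- 1L

# Leave one core free, but never go below 1.
n.cores <- max(1L, available.cores - 1L)
n.cores
```

The value of n.cores could then be passed as parallel.core.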
(optional, logical) When verbose = TRUE, the function is a little more chatty during execution.
Default: verbose = TRUE.
(optional) To pass further arguments for fine-tuning the function.
The output in your global environment is a tidy data frame. If filename is provided, the tidy data frame is also written in the working directory with the file extension .rad. The file is written with the Lightning Fast Serialization of Data Frames for R package (fst). To read the file back in R, use read.fst.
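The round trip can be sketched with a toy data frame, assuming the fst package is installed. The columns and the file name toy.rad are hypothetical and do not reproduce the real tidy output. The sketch uses write_fst and read_fst, the current names of the fst writer and reader.

```r
# Toy stand-in for a tidy data frame; the columns are hypothetical.
toy.tidy <- data.frame(
  INDIVIDUALS = c("ind-001", "ind-002"),
  STRATA = c("pop1", "pop2"),
  stringsAsFactors = FALSE
)

# fst accepts any file path, so a .rad extension can be used here.
rad.file <- file.path(tempdir(), "toy.rad")

if (requireNamespace("fst", quietly = TRUE)) {
  # Write the data frame, then read it back into a new object.
  fst::write_fst(toy.tidy, rad.file)
  toy.back <- fst::read_fst(rad.file)
  str(toy.back)
}
```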
VCF files must end with .vcf: documented in tidy_vcf.
PLINK files must end with .tped or .bed: documented in tidy_plink.
genind object from adegenet: documented in tidy_genind.
genlight object from adegenet: documented in tidy_genlight.
gtypes object from strataG: documented in tidy_gtypes.
genepop file must end with .gen: documented in tidy_genepop.
fstat file must end with .dat: documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"). To make the haplotype file population ready, you need the strata argument.
Data frames: documented in tidy_wide.
The dots-dots-dots (...) argument lets you pass several arguments for fine-tuning the function:
vcf.metadata (optional, logical or string). Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical). Default: vcf.stats = TRUE. Documented in tidy_vcf.
whitelist.markers (optional, path or object) To keep only markers in a whitelist. Default: whitelist.markers = NULL. Documented in read_whitelist.
blacklist.id (optional) Default: blacklist.id = NULL. Ideally, managed in the strata file. Documented in read_strata and read_blacklist_id.
filter.common.markers (optional, logical). Default: filter.common.markers = TRUE. Documented in filter_common_markers.
filter.monomorphic (optional, logical) Should the monomorphic markers present in the dataset be filtered out? Default: filter.monomorphic = TRUE. Documented in filter_monomorphic.
if (FALSE) { # \dontrun{
# To verify that your file is detected by radiator as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")

# Using a VCF file as input
require(SeqArray)
tidy.vcf <- radiator::tidy_genomic_data(
  data = "populations.snps.vcf",
  strata = "strata.treefrog.tsv",
  whitelist.markers = "whitelist.vcf.txt"
)
} # }