This function tidies a VCF file.
It is used internally in radiator and may also be of interest to users.
It is highly recommended to use filter_rad
to reduce
the number of markers. Advanced options below are also available
to manipulate and prune the dataset with blacklists and whitelists, along
with several other filtering options.
tidy_vcf(
data,
strata = NULL,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = FALSE,
...
)
(VCF file, character string) The VCF SNPs are biallelic or haplotypes.
To make the VCF population-ready, the argument strata
is required.
(optional)
The strata file is a tab-delimited file with at least 2 column headers:
INDIVIDUALS
and STRATA
. Documented in read_strata
.
DArT data: a third column TARGET_ID
is required.
Documented in read_dart
. Also use the strata read function to
blacklist individuals.
Default: strata = NULL
.
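As a minimal sketch of the file layout described above, the base R code below writes a two-column, tab-delimited strata file and reads it back. The individual IDs, population labels and temporary file path are invented for illustration; in practice the path of such a file would be supplied to the strata argument.

```r
# Toy strata table: one row per individual (IDs and populations are invented)
strata <- data.frame(
  INDIVIDUALS = c("ind-01", "ind-02", "ind-03", "ind-04"),
  STRATA = c("pop1", "pop1", "pop2", "pop2"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no quotes and no row names
strata.file <- tempfile(pattern = "strata_", fileext = ".tsv")
write.table(
  x = strata,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read it back to confirm the two required column headers are present
strata.check <- read.delim(file = strata.file, stringsAsFactors = FALSE)
print(colnames(strata.check))
```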
(optional) The function uses write.fst
,
to write the tidy data frame in
the working directory. The file extension appended to
the filename
provided is .rad
.
With default: filename = NULL
, the tidy data frame is
in the global environment only (i.e. not written in the working directory...).
(optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
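The default expression evaluates to 0 on a single-core machine, and parallel::detectCores() can return NA when the number of cores is unknown. The sketch below shows one defensive way to compute a value that could be passed to parallel.core; the variable name n.cores is illustrative only.

```r
# Leave one core free, but never go below 1 (detectCores() can return NA)
n.cores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)
print(n.cores)
```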
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = FALSE
.
(optional) To pass further arguments for fine-tuning the tidying (read below).
The output in your global environment is a tidy data frame. The GDS file generated is written in the working directory under the name given during function execution.
PLINK: radiator fills the LOCUS
column of PLINK VCFs with
a unique integer based on the CHROM
column
(as.integer(factor(x = CHROM))
).
The COL
column is filled with 1L for lack of better information on this.
Not what you need? Open an issue on GitHub to make a request.
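The PLINK rule above can be tried on a toy vector. This sketch uses invented CHROM values and only the expression quoted in the text; factor levels are sorted before being mapped to integers, so the integer reflects the sorted order of the unique CHROM values, not their order of appearance.

```r
# Toy CHROM values (invented for illustration)
CHROM <- c("B", "A", "B", "C")

# Unique integer per CHROM value, as in the expression quoted above
LOCUS <- as.integer(factor(x = CHROM))

# COL is a constant 1L for every marker
COL <- rep(1L, length(CHROM))

print(data.frame(CHROM, LOCUS, COL))
# LOCUS is 2, 1, 2, 3
```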
ipyrad: the pattern locus_
in the CHROM
column
is removed and used. The COL
column is filled with the same value as
POS
.
GATK: Some VCFs have an ID
column filled with .
,
the LOCUS information is all contained along the linkage group in the
CHROM
column. To make it work with
radiator,
the ID
column is filled with the POS
column info.
GATK VCFs with a mix of multi- and bi-allelic markers won't generate VCF stats.
platypus: Some VCF files don't have the ID column filled with values; the same thing is done as for the GATK VCF files above.
freebayes: Some VCF files don't have the ID column filled with values; the same thing is done as for the GATK VCF files above.
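The handling described for GATK, platypus and freebayes files can be sketched in base R as follows. The data frame and its values are invented, and this is an illustration of the rule only, not radiator's internal code.

```r
# Toy VCF columns (invented values); "." marks an empty ID
vcf <- data.frame(
  CHROM = c("LG1", "LG1", "LG2"),
  POS = c(101L, 250L, 42L),
  ID = c(".", ".", "."),
  stringsAsFactors = FALSE
)

# Replace empty IDs with the position, kept as character like the ID column
vcf$ID <- ifelse(vcf$ID == ".", as.character(vcf$POS), vcf$ID)
print(vcf)
```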
stacks: with de novo approaches, the CHROM column is filled with "1", the LOCUS column corresponds to the CHROM section in the stacks VCF, and the COL column is POS - 1. With a reference genome, the ID column in the stacks VCF is separated into "LOCUS", "COL" and "STRANDS".
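A rough sketch of both stacks cases, using invented values. The ":" separator assumed for the ID column in the reference-genome case is an assumption made for this example and should be checked against the actual VCF.

```r
# De novo: toy stacks VCF columns (invented values)
vcf <- data.frame(
  CHROM = c("12", "12", "57"),
  POS = c(35L, 80L, 12L),
  stringsAsFactors = FALSE
)
tidy <- data.frame(
  CHROM = "1",          # constant for de novo data
  LOCUS = vcf$CHROM,    # locus taken from the VCF CHROM column
  COL = vcf$POS - 1L,   # column is POS - 1
  POS = vcf$POS,
  stringsAsFactors = FALSE
)
print(tidy)

# Reference genome: split ID into three parts (":" separator is assumed)
id <- c("7:30:+", "7:45:+", "9:12:-")
parts <- do.call(rbind, strsplit(id, split = ":", fixed = TRUE))
colnames(parts) <- c("LOCUS", "COL", "STRANDS")
print(parts)
```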
stacks problem: the current version has some intrinsic problems with missing allele depth info. During the tidying process, a message will highlight the number of genotypes impacted by the problem. When possible, the problem is corrected by adding the read depth info into the allele depth field.
The arguments below are not available using code completion (e.g. with TAB), consequently any misspelling will generate an error or be ignored.
dots-dots-dots ... argument names and values are reported and written
in the working directory when internal = FALSE
and verbose = TRUE
.
General arguments:
path.folder
: to write output in a specific path
(used internally in radiator). Default: path.folder = getwd()
.
If the supplied directory doesn't exist, it's created.
internal:
(optional, logical)
Default (internal = FALSE
). A folder is generated to write the files.
random.seed
: (integer, optional) For reproducibility, set an integer
that will be used inside code that uses randomness. With the default,
a random number is generated, printed and written in the directory.
Default: random.seed = NULL
.
parameters
It's a parameter file where radiator outputs the results of
filtering. Used internally.
Default: parameters = NULL
.
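The random.seed behavior described above (use the supplied integer, otherwise draw one and report it) can be sketched in base R. The helper resolve_seed() is hypothetical and is not part of radiator; it only illustrates why recording the seed makes a run repeatable.

```r
# Hypothetical helper, not radiator's internal code
resolve_seed <- function(random.seed = NULL) {
  if (is.null(random.seed)) {
    # Draw a seed at random and report it so the run can be repeated later
    random.seed <- sample.int(n = .Machine$integer.max, size = 1L)
    message("Random seed: ", random.seed)
  }
  set.seed(random.seed)
  invisible(as.integer(random.seed))
}

# The same seed gives the same random draws
seed.used <- resolve_seed(random.seed = 123L)
first.draw <- sample.int(n = 100L, size = 3L)
resolve_seed(random.seed = seed.used)
second.draw <- sample.int(n = 100L, size = 3L)
print(identical(first.draw, second.draw))
# TRUE
```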
tidying arguments/behavior:
tidy.vcf:
(optional, logical)
Default: tidy.vcf = TRUE
. But you can always stop the process after
the creation of the GDS file (equivalent of running read_vcf
).
tidy.check:
(optional, logical)
Default: tidy.check = TRUE
. By default, the number of markers just before
tidying is checked. Tidying a VCF file with more than 20000 markers is
sub-optimal:
a computer with lots of RAM is required
it's very slow to generate
it's very slow to run code afterwards
for most non-model species this number of markers is not realistic...
Consequently, the function execution is suspended and users are asked whether they want to continue with the tidying or stop and keep the GDS file/object.
This behavior can be annoying for users who know what they are doing. To turn it off,
use: tidy.check = FALSE
.
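The decision behind tidy.check can be pictured as a simple threshold test. The helper below is hypothetical (its name and its threshold argument are not part of radiator); it only mirrors the documented rule that the pause is triggered above 20000 markers when the check is enabled.

```r
# Hypothetical helper mirroring the documented rule; not radiator's code
tidy_check_needed <- function(n.markers, tidy.check = TRUE, threshold = 20000L) {
  # Pause only when the check is enabled and the marker count exceeds the limit
  isTRUE(tidy.check) && n.markers > threshold
}

tidy_check_needed(n.markers = 150000L)                      # TRUE
tidy_check_needed(n.markers = 150000L, tidy.check = FALSE)  # FALSE
tidy_check_needed(n.markers = 8000L)                        # FALSE
```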
calibrate.alleles:
(optional, logical)
Default: calibrate.alleles = FALSE
.
Documented in calibrate_alleles
.
vcf.stats:
(optional, logical) Generates individual and
marker statistics helpful for filtering.
These are very fast to generate and because computational
cost is minimal, even for huge VCFs, the default is vcf.stats = TRUE
.
vcf.metadata:
(optional, logical or character string)
With vcf.metadata = FALSE
, only the genotypes are kept (GT field)
in the tidy dataset.
With vcf.metadata = TRUE
,
all the metadata contained in the FORMAT
field will be kept in
the tidy data file. radiator is currently keeping and cleaning these metadata:
"DP", "AD", "GL", "PL", "GQ", "HQ", "GOF", "NR", "NV", "CATG"
.
e.g. if you only want AD and PL, use vcf.metadata = c("AD", "PL")
.
Need other metadata? Submit a request on GitHub.
Default: vcf.metadata = TRUE
.
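The three accepted forms of vcf.metadata (FALSE, TRUE, or a character vector) can be read as a selection over FORMAT fields. The helper resolve_metadata() below is hypothetical and simplified: it treats TRUE as "every field in the list quoted above", which is an assumption, and it is not radiator's internal logic.

```r
# FORMAT fields listed in the documentation above
supported.fields <- c("DP", "AD", "GL", "PL", "GQ", "HQ", "GOF", "NR", "NV", "CATG")

# Hypothetical helper, not radiator's internal code
resolve_metadata <- function(vcf.metadata = TRUE, supported = supported.fields) {
  if (is.logical(vcf.metadata)) {
    # TRUE keeps every listed field, FALSE keeps genotypes (GT) only
    extra <- if (isTRUE(vcf.metadata)) supported else character(0)
  } else {
    # A character vector keeps only the requested fields that are listed
    extra <- intersect(vcf.metadata, supported)
  }
  c("GT", extra)
}

resolve_metadata(vcf.metadata = FALSE)          # "GT"
resolve_metadata(vcf.metadata = c("AD", "PL"))  # "GT" "AD" "PL"
```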
Filtering arguments:
blacklist.id:
(optional, character)
Default (blacklist.id = NULL
).
Documented in tidy_genomic_data
.
filter.strands
: (optional, character)
Default (filter.strands = "blacklist"
).
Documented in read_vcf
.
whitelist.markers:
(optional, path)
Default: whitelist.markers = NULL
.
Documented in filter_whitelist
.
filter.individuals.missing
: (double)
Default: filter.individuals.missing = NULL
.
Documented in filter_individuals
.
filter.monomorphic
: (logical)
Default: filter.monomorphic = TRUE
.
Documented in filter_monomorphic
.
Required package: UpSetR
.
filter.common.markers
: (logical)
Default: filter.common.markers = TRUE
.
Documented in filter_common_markers
.
Required package: UpSetR
.
filter.ma
: (integer)
Default: filter.ma = NULL
.
Documented in filter_ma
.
filter.coverage
: (logical)
Default: filter.coverage = NULL
.
Documented in filter_coverage
.
filter.genotyping
: (integer)
Default: filter.genotyping = NULL
.
Documented in filter_genotyping
.
filter.snp.position.read:
(optional, character, integer)
Default: filter.snp.position.read = NULL
.
Documented in filter_snp_position_read
.
filter.snp.number:
(optional, character, integer)
Default: filter.snp.number = NULL
.
Documented in filter_snp_number
.
filter.short.ld
: (optional, character)
Default: filter.short.ld = NULL
.
Documented in filter_ld
.
filter.long.ld
: (optional, character)
Default: filter.long.ld = NULL
.
Documented in filter_ld
.
Required package: SNPRelate
.
long.ld.missing
: Documented in filter_ld
.
Default: long.ld.missing = FALSE
.
ld.method
: Documented in filter_ld
.
Default: ld.method = "r2"
.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
if (FALSE) { # \dontrun{
# very basic with built-in defaults (not recommended):
prep.data <- radiator::tidy_vcf(data = "populations.snps.vcf")
# Using more arguments and filters (recommended):
tidy.data <- radiator::tidy_vcf(
  data = "populations.snps.vcf",
  strata = "strata_salamander.tsv",
  filter.individuals.missing = "outlier",
  filter.ma = 4,
  filter.genotyping = 0.1,
  filter.snp.position.read = "outliers",
  filter.short.ld = "mac",
  path.folder = "salamander/prep_data",
  verbose = TRUE)
} # }