The function reads VCF files for radiator and
generate a connection SeqArray
GDS object/file of class SeqVarGDSClass
(Zheng et al. 2017).
The Genomic Data Structure (GDS) file format is detailed in
gdsfmt.
The function as an advance mode that allows various filtering arguments to be used. Handy to prune the dataset based on various QCs but also to remove artifacts, duplicated information and thus lower the overall noise in the VCF.
Used internally in radiator and might be of interest for users.
read_vcf(
data,
strata = NULL,
filename = NULL,
vcf.stats = TRUE,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
(path, character) The VCF markers are biallelic SNPs or haplotypes.
To make the VCF population-ready, you have to use strata
argument.
GATK, platypus, samtools and ipyrad VCFs:
Some VCFs have an ID
column filled with .
,
the LOCUS information is all contained along the linkage group in the
CHROM
column. Consequently, the short read locus information is unknown.
To make it work with
radiator,
the ID
column is filled with the POS
column info.
stacks VCFs: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".
DArT VCFs: CHROM with .
are replaced by denovo
.
POS
with NA
are replaced by 50.
COL
: the position of the SNP on the read is extracted from the
LOCUS
as any other DArT data.
LOCUS
: are the first group of digits the rest of the info is discarded
and LOCUS
is then joined with POS
by a _
.
POS
is replaced by COL
, (col info is duplicated).
This is necessary to make the lines in the VCF unique.
(optional)
The strata file is a tab delimited file with a minimum of 2 columns headers:
INDIVIDUALS
and STRATA
. Documented in read_strata
.
DArT data: a third column TARGET_ID
is required.
Documented on read_dart
. Also use the strata read function to
blacklist individuals.
Default: strata = NULL
.
(optional) The file name of the Genomic Data Structure (GDS) file.
radiator will append .gds.rad
to the filename.
If the filename chosen exists in the working directory,
the default radiator_datetime.gds
is chosen.
Default: filename = NULL
.
(optional, logical) Generates individuals and markers
statistics helpful for filtering.
These are very fast to generate and because computational
cost is minimal, even for huge VCFs, the default is vcf.stats = TRUE
.
Individual's missing genotype proportion, averaged heterozygosity,
total coverage, mean genotype coverage and marker's metadata
along count for ref and alt alleles and mean
coverage is generated and written in the working directory.
Default: vcf.stats = TRUE
.
(optional) The number of core used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE
.
(optional) To pass further arguments for fine-tuning the function.
The function returns a GDS object.
A vcf file of 35 GB with ~4 millions SNPs take about ~7 min with 8 CPU. A vcf file of 21 GB with ~2 millions SNPs take about ~5 min with 7 CPU.
After the file is generated, you can close your computer and come back to it a months later and it's now a matter of sec to open a connection. See example below.
markers heterozygosity and inbreeding coefficient:
The heterozygosity statistics (het obs and Fis) presented in this function
are calculated globally across strata, without the possibility to filter out
markers. We think this filtering for these statistics are best
addressed in filter_het
, filter_fis
and
filter_hwe
, after outlier individuals and
markers (for other stats) are blacklisted. Why ? The reason is that an excess of
heterozygotes and negative Fis values are not a bad thing per se.
Genomic regions under balancing selection may contain such markers and
statistics.
PLINK: radiator fills the LOCUS
column of PLINK VCFs with
a unique integer based on the CHROM
column
(as.integer(factor(x = CHROM))
).
The COL
column is filled with 1L for lack of bettern info on this.
Not what you need ? Open an issue on GitHub for a request.
ipyrad: the pattern locus_
in the CHROM
column
is removed and used. The COL
column is filled with the same value as
POS
.
GATK: Some VCF have an ID
column filled with .
,
the LOCUS information is all contained along the linkage group in the
CHROM
column. To make it work with
radiator,
the ID
column is filled with the POS
column info.
platypus: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above.
freebayes: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. samtools: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. stacks: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".
stacks problem: current version as some intrinsic problem with missing allele depth info, during the tidying process a message will highlight the number of genotypes impacted by the problem. When possible, the problem is corrected by adding the read depth info into the allele depth field.
dots-dots-dots ... allows to pass several arguments for fine-tuning the function:
whitelist.markers
: detailed in filter_whitelist
.
blacklist.id
: detailed in tidy_genomic_data
.
pop.select
: detailed in tidy_genomic_data
.
pop.levels
: detailed in tidy_genomic_data
.
pop.labels
: detailed in tidy_genomic_data
.
filter.strands
: (optional, character) Filter duplicate SNPs
found on different strands (+/-), options are:
"keep.both", "best.stats", "blacklist"
. "keep.both"
: does nothing
and duplicated markers are kept (not recommended, but here for testing purposes),
"best.stats"
: will keep only one, based on the best statistics
(MAC and missingness). "blacklist"
: discard all duplicated markers.
Default (filter.strands = "blacklist"
).
filter.common.markers
: (logical) Default: filter.common.markers = TRUE
.
Documented in filter_common_markers
.
filter.ma
: (integer) Default: filter.ma = NULL
.
To blacklist markers below a specific Minor Allele Count, Frequency or Depth (calculated overall/global).
Documented in filter_ma
.
filter.coverage
: (optional, string)
Default: filter.coverage = NULL
. To blacklist markers based on mean coverage.
Documented in filter_coverage
.
filter.genotyping
: (integer) To blacklist markers with too
many missing data. e.g. filter.genotyping = 0.1
, will only keep
markers with missing rate <= to 10 percent.
Documented in filter_genotyping
.
filter.snp.position.read
: 3 options available, "outliers", "q75", "iqr"
.
This argument will blacklist markers based on it's position on the read.
filter.snp.position.read = "outliers"
, will remove markers based
on outlier statistics of the position on the reads. e.g. if majority of SNPs
are found between 10 and 90 pb, and very few above and below, those markers are
discarded. Use this function argument with dataset with problematic assembly,
or de novo assembly with undocumented or poorly selected
mismatch threshold.
filter.short.ld
: Described in filter_ld
filter.long.ld
: Described in filter_ld
long.ld.missing
: Described in filter_ld
ld.method
: (optional, character) The values available are
"composite"
, for LD composite measure, "r"
for R coefficient
(by EM algorithm assuming HWE, it could be negative), "r2"
for r^2,
"dprime"
for D',
"corr"
for correlation coefficient. The method corr and composite are
equivalent when SNPs are coded based on the presence of the alternate allele
(0, 1, 2
).
Default: ld.method = "r2"
.
filter.individuals.missing
: (double) Use this argument to
blacklist individuals with too many missing data.
e.g. filter.individuals.missing = 0.7
, will remove individuals with >
0.7 or 70
with some dataset.
markers.info
: (character) With default: markers.info = NULL
,
all the variable in the vcf INFO field are imported.
To import only DP (the SNP total read depth) and AF (the SNP allele frequency),
use markers.info = c("DP", "AF")
.
Using, markers.info = character(0)
will not import INFO variables.
path.folder
: to write ouput in a specific path
(used internally in radiator). Default: path.folder = getwd()
.
If the supplied directory doesn't exist, it's created.
random.seed
: (integer, optional) For reproducibility, set an integer
that will be used inside codes that uses randomness. With default,
a random number is generated, printed and written in the directory.
Default: random.seed = NULL
.
subsample.markers.stats
: By default, when no filters are
requested and that the number of markers is > 200K,
0.2 of markers are randomly selected to generate the
statistics (individuals and markers). This is an all-around and
reliable number.
In doubt, overwrite this value by using 1 (all markers selected) and
expect a small computational cost.
Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
if (FALSE) { # \dontrun{
# require(SeqArray)
# with built-in defaults:
data <- radiator::read_vcf(data = "populations.snps.vcf")
# Because no strata file is provided, the individuals are assumed to be in
# the same group.
# If you need to filter the VCF I recommend using ?radiator::filter_rad
# But if you're certain about the filters thresholds, additional arguments are available:
data <- radiator::read_vcf(
data = "populations.snps.vcf",
strata = "strata_salamander.tsv",
path.folder = "salamander",
filter.individuals.missing = "outliers",
filter.common.markers = TRUE,
filter.strands = "blacklist",
filter.ma = 4,
filter.genotyping = 0.3,
filter.snp.position.read = "outliers",
filter.short.ld = "mac",
filter.long.ld = NULL,
verbose = TRUE
)
# In a new R session, to re-open the GDS file/connection:
data <- radiator::read_rad(data = "radiator_20200911@0748.gds")
} # }