Read VCF files and write a GDS file

The function reads VCF files for radiator and generate a connection SeqArray GDS object/file of class SeqVarGDSClass (Zheng et al. 2017). The Genomic Data Structure (GDS) file format is detailed in gdsfmt.

The function as an advance mode that allows various filtering arguments to be used. Handy to prune the dataset based on various QCs but also to remove artifacts, duplicated information and thus lower the overall noise in the VCF.

Used internally in radiator and might be of interest for users.

read_vcf(
  data,
  strata = NULL,
  filename = NULL,
  vcf.stats = TRUE,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

(path, character) The VCF markers are biallelic SNPs or haplotypes. To make the VCF population-ready, you have to use strata argument.

GATK, platypus, samtools and ipyrad VCFs: Some VCFs have an ID column filled with ., the LOCUS information is all contained along the linkage group in the CHROM column. Consequently, the short read locus information is unknown. To make it work with radiator, the ID column is filled with the POS column info.
stacks VCFs: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".
DArT VCFs: CHROM with . are replaced by denovo. POS with NA are replaced by 50. COL: the position of the SNP on the read is extracted from the LOCUS as any other DArT data. LOCUS: are the first group of digits the rest of the info is discarded and LOCUS is then joined with POS by a _. POS is replaced by COL, (col info is duplicated). This is necessary to make the lines in the VCF unique.

strata

(optional) The strata file is a tab delimited file with a minimum of 2 columns headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column TARGET_ID is required. Documented on read_dart. Also use the strata read function to blacklist individuals. Default: strata = NULL.

filename

(optional) The file name of the Genomic Data Structure (GDS) file. radiator will append .gds.rad to the filename. If the filename chosen exists in the working directory, the default radiator_datetime.gds is chosen. Default: filename = NULL.

vcf.stats

(optional, logical) Generates individuals and markers statistics helpful for filtering. These are very fast to generate and because computational cost is minimal, even for huge VCFs, the default is vcf.stats = TRUE. Individual's missing genotype proportion, averaged heterozygosity, total coverage, mean genotype coverage and marker's metadata along count for ref and alt alleles and mean coverage is generated and written in the working directory. Default: vcf.stats = TRUE.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) To pass further arguments for fine-tuning the function.

Value

The function returns a GDS object.

Details

A vcf file of 35 GB with ~4 millions SNPs take about ~7 min with 8 CPU. A vcf file of 21 GB with ~2 millions SNPs take about ~5 min with 7 CPU.

After the file is generated, you can close your computer and come back to it a months later and it's now a matter of sec to open a connection. See example below.

markers heterozygosity and inbreeding coefficient: The heterozygosity statistics (het obs and Fis) presented in this function are calculated globally across strata, without the possibility to filter out markers. We think this filtering for these statistics are best addressed in filter_het, filter_fis and filter_hwe, after outlier individuals and markers (for other stats) are blacklisted. Why ? The reason is that an excess of heterozygotes and negative Fis values are not a bad thing per se. Genomic regions under balancing selection may contain such markers and statistics.

VCF file format

PLINK: radiator fills the LOCUS column of PLINK VCFs with a unique integer based on the CHROM column (as.integer(factor(x = CHROM))). The COL column is filled with 1L for lack of bettern info on this. Not what you need ? Open an issue on GitHub for a request.

ipyrad: the pattern locus_ in the CHROM column is removed and used. The COL column is filled with the same value as POS.

GATK: Some VCF have an ID column filled with ., the LOCUS information is all contained along the linkage group in the CHROM column. To make it work with radiator, the ID column is filled with the POS column info.

platypus: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above.

freebayes: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. samtools: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. stacks: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".

stacks problem: current version as some intrinsic problem with missing allele depth info, during the tidying process a message will highlight the number of genotypes impacted by the problem. When possible, the problem is corrected by adding the read depth info into the allele depth field.

Advance mode

dots-dots-dots ... allows to pass several arguments for fine-tuning the function:

whitelist.markers: detailed in filter_whitelist.
blacklist.id: detailed in tidy_genomic_data.
pop.select: detailed in tidy_genomic_data.
pop.levels: detailed in tidy_genomic_data.
pop.labels: detailed in tidy_genomic_data.
filter.strands: (optional, character) Filter duplicate SNPs found on different strands (+/-), options are: "keep.both", "best.stats", "blacklist". "keep.both": does nothing and duplicated markers are kept (not recommended, but here for testing purposes), "best.stats": will keep only one, based on the best statistics (MAC and missingness). "blacklist": discard all duplicated markers. Default (filter.strands = "blacklist").
filter.common.markers: (logical) Default: filter.common.markers = TRUE. Documented in filter_common_markers.
filter.ma: (integer) Default: filter.ma = NULL. To blacklist markers below a specific Minor Allele Count, Frequency or Depth (calculated overall/global). Documented in filter_ma.
filter.coverage: (optional, string) Default: filter.coverage = NULL. To blacklist markers based on mean coverage. Documented in filter_coverage.
filter.genotyping: (integer) To blacklist markers with too many missing data. e.g. filter.genotyping = 0.1, will only keep markers with missing rate <= to 10 percent. Documented in filter_genotyping.
filter.snp.position.read: 3 options available, "outliers", "q75", "iqr". This argument will blacklist markers based on it's position on the read. filter.snp.position.read = "outliers", will remove markers based on outlier statistics of the position on the reads. e.g. if majority of SNPs are found between 10 and 90 pb, and very few above and below, those markers are discarded. Use this function argument with dataset with problematic assembly, or de novo assembly with undocumented or poorly selected mismatch threshold.
filter.short.ld: Described in filter_ld
filter.long.ld: Described in filter_ld
long.ld.missing: Described in filter_ld
ld.method: (optional, character) The values available are "composite", for LD composite measure, "r" for R coefficient (by EM algorithm assuming HWE, it could be negative), "r2" for r^2, "dprime" for D', "corr" for correlation coefficient. The method corr and composite are equivalent when SNPs are coded based on the presence of the alternate allele (0, 1, 2). Default: ld.method = "r2".
filter.individuals.missing: (double) Use this argument to blacklist individuals with too many missing data. e.g. filter.individuals.missing = 0.7, will remove individuals with > 0.7 or 70 with some dataset.
markers.info: (character) With default: markers.info = NULL, all the variable in the vcf INFO field are imported. To import only DP (the SNP total read depth) and AF (the SNP allele frequency), use markers.info = c("DP", "AF"). Using, markers.info = character(0) will not import INFO variables.
path.folder: to write ouput in a specific path (used internally in radiator). Default: path.folder = getwd(). If the supplied directory doesn't exist, it's created.
random.seed: (integer, optional) For reproducibility, set an integer that will be used inside codes that uses randomness. With default, a random number is generated, printed and written in the appropriate directory. Default: random.seed = NULL.
subsample.markers.stats: By default, when no filters are requested and that the number of markers is > 200K, 0.2 of markers are randomly selected to generate the statistics (individuals and markers). This is an all-around and reliable number. In doubt, overwrite this value by using 1 (all markers selected) and expect a small computational cost.

References

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics.

Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# require(SeqArray)
# with built-in defaults:

data <- radiator::read_vcf(data = "populations.snps.vcf")

# Because no strata file is provided, the individuals are assumed to be in
# the same group.

# If you need to filter the VCF I recommend using ?radiator::filter_rad

# But if you're certain about the filters thresholds, additional arguments are available:

data <- radiator::read_vcf(
    data = "populations.snps.vcf",
    strata = "strata_salamander.tsv",
    path.folder = "salamander",
    filter.individuals.missing = "outliers",
    filter.common.markers = TRUE,
    filter.strands = "blacklist",
    filter.ma = 4,
    filter.genotyping = 0.3,
    filter.snp.position.read = "outliers",
    filter.short.ld = "mac",
    filter.long.ld = NULL,
    verbose = TRUE
    )

# In a new R session, to re-open the GDS file/connection:

data <- radiator::read_rad(data = "radiator_20200911@0748.gds")
} # }