Conversion tool among several genomic formats

The arguments in the genomic_converter function were tailored for the reality of GBS/RADseq data while maintaining a reproducible workflow.

Input file: 14 diploid file formats are supported (see data argument below).
Filters: see Advance mode section below for ways to use blacklist and whitelist related arguments. For best results with unfiltered datasets, use filter_rad (genomic_converter is included in that function!).
Imputations: deprecated module no longer available in genomic_converter (see Life cycle section below).
Parallel: some parts of the function are designed to run on multiple CPUs.
Output: 29 output file formats are supported (see output argument below).

genomic_converter(
  data,
  strata = NULL,
  output = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see example below). Documented in Input genomic datasets of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.

strata

(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column, TARGET_ID, is required. Documented in read_dart. The strata file can also be used to blacklist individuals (see read_strata). Default: strata = NULL.
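As an illustration of the layout described above, here is a minimal base R sketch that writes a two-column, tab-delimited strata file and reads it back. The sample IDs, population labels and file name are hypothetical; only the INDIVIDUALS and STRATA headers come from this page.

```r
# Hypothetical sample IDs and population labels, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("ind-001", "ind-002", "ind-003", "ind-004"),
  STRATA = c("pop1", "pop1", "pop2", "pop2"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, no row names and no quotes.
strata.file <- file.path(tempdir(), "example.strata.tsv")
write.table(
  strata,
  file = strata.file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE,
  col.names = TRUE
)

# Read the file back to confirm the two required column headers are present.
strata.check <- read.delim(strata.file, stringsAsFactors = FALSE)
stopifnot(all(c("INDIVIDUALS", "STRATA") %in% names(strata.check)))
```

The resulting path could then be passed as the strata argument. For DArT data, a TARGET_ID column would need to be added as described above.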

output

29 genomic data formats can be exported: tidy (by default), genepop, genind, genlight, vcf (for file format version, see details below), plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG), bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray, snprelate, maverick, genepopedit, rubias, hapmap and dadi. Use a character vector, e.g. output = c("genind", "genepop", "structure"), to have the preferred output formats generated. With the default, only the tidy format is generated.

Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference).

Default: output = NULL.

filename

(optional) The filename prefix for the object in the global environment or the working directory. Default: filename = NULL. A default name will be used, customized with the output file(s) selected.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.
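The default leaves one core free. If you set the value yourself, a defensive variant is sketched below. It assumes you want at least one core even when detection fails, since parallel::detectCores() can return NA on some platforms. The guard is an illustration, not radiator's internal logic.

```r
# Number of cores reported by the system; may be NA on some platforms.
n.detected <- parallel::detectCores()

# Fall back to a single core when detection fails.
if (is.na(n.detected)) n.detected <- 1L

# Leave one core free, but never go below one.
parallel.core <- max(1L, n.detected - 1L)
parallel.core
```

The resulting value can be supplied as parallel.core = parallel.core in the function call.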

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) To pass further arguments for fine-tuning the function.

Value

The function returns an object (list). The content of the object can be listed with names(object); use $ to isolate a specific object (see examples). Some output formats will write the output file in the working directory. The tidy genomic data frame is generated automatically.

Input genomic datasets

GDS file or object, must end with .gds or .rad: documented in read_vcf
VCF files must end with .vcf: documented in tidy_vcf
PLINK files must end with .tped or .bed: documented in tidy_plink
genind object from adegenet: documented in tidy_genind.
genlight object from adegenet: documented in tidy_genlight.
gtypes object from strataG: documented in tidy_gtypes.
dart data from DArT: documented in read_dart.
genepop file must end with .gen, documented in tidy_genepop.
fstat file must end with .dat, documented in tidy_fstat.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"). To make the haplotype file population ready, you need the strata argument.
Data frames: documented in tidy_wide.
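The file-extension rules listed above can be summarised as a small lookup. The helper below, doc_for_file, is hypothetical and written only for this illustration. It is not part of radiator and does not replace detect_genomic_format. It only maps an extension to the function named in the list.

```r
# Hypothetical helper: map a file extension to the radiator function
# that documents the corresponding input format (per the list above).
doc_for_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  lookup <- c(
    gds = "read_vcf",
    rad = "read_vcf",
    vcf = "tidy_vcf",
    tped = "tidy_plink",
    bed = "tidy_plink",
    gen = "tidy_genepop",
    dat = "tidy_fstat"
  )
  if (ext %in% names(lookup)) lookup[[ext]] else NA_character_
}

doc_for_file("populations.snps.vcf")
doc_for_file("batch_1.haplotypes.tsv") # not extension-based, returns NA
```

Formats that are R objects (genind, genlight, gtypes), DArT files and STACKS haplotype files are not identified by extension, so the sketch returns NA for them.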

Advance mode

dots-dots-dots ... allows passing several arguments for fine-tuning the function:

path.folder: use this argument to specify an output folder. Default: path.folder = "radiator_genomic_converter".
vcf.metadata (optional, logical or string). Default: vcf.metadata = TRUE. Documented in tidy_vcf.
vcf.stats (optional, logical). Default: vcf.stats = TRUE. Documented in tidy_vcf.
whitelist.markers (optional) Default whitelist.markers = NULL. Documented in read_whitelist.
filter.common.markers (optional, logical). Default: filter.common.markers = TRUE. By default, only common markers are kept in the dataset. Documented in filter_common_markers.
filter.monomorphic (optional, logical). Default: filter.monomorphic = TRUE. By default, only polymorphic markers across strata are kept in the dataset. Documented in filter_monomorphic.
Individuals to blacklist? Use the strata file for this. Documented in read_strata.
keep.allele.names argument used when tidying genind object. Documented in tidy_genind. Default: keep.allele.names = FALSE.
blacklist.id (optional) Default: blacklist.id = NULL. Ideally, managed in the strata file. Documented in read_strata and read_blacklist_id.
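A sketch of how the arguments listed above might be passed through the dots. The argument names and defaults are taken from this page, and the file names are the ones used in the Examples section. The call is guarded so it only runs when radiator is installed and both files exist.

```r
# Collect the main arguments and the fine-tuning arguments in one list.
args <- list(
  data = "populations.snps.vcf",
  strata = "snowcrab.strata.tsv",
  output = c("genind", "genepop"),
  path.folder = "radiator_genomic_converter",
  vcf.metadata = TRUE,
  filter.common.markers = TRUE,
  filter.monomorphic = TRUE
)

# Run the conversion only when the package and the input files are available.
can.run <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(args$data) &&
  file.exists(args$strata)

if (can.run) {
  res <- do.call(radiator::genomic_converter, args)
  names(res)
}
```

Building the list first makes it easy to record the exact settings used for a run, in keeping with the reproducible workflow mentioned at the top of the page.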

Life cycle

Map-independent imputation of missing genotypes is available in my other R package called grur.

Use grur to:

Visualize your missing data: before imputing your genotypes, visualize your missing data. Several visual tools are available inside grur to help you decide the best strategy after.
Optimize: use grur imputation module and other functions to optimize the imputations of your dataset. You need to test arguments. Failing to conduct tests and adjust imputations arguments will generate artifacts and/or exacerbate bias. Using defaults is not optional here...
genomic_converter: use the output argument inside grur imputation module to generate the required formats.

VCF file format version

If you need a different VCF file format version than the current one, just change the version inside the newly created VCF; that should do the trick. For more information, see the Variant Call Format specifications.
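One way to make that change, sketched in base R. The helper set_vcf_version and the demo file are hypothetical. The sketch assumes the version is declared on the ##fileformat= header line, as in the VCF specification. It only rewrites the declared version and does not check that the file body conforms to that version.

```r
# Hypothetical helper: rewrite the ##fileformat header line of a VCF file.
set_vcf_version <- function(vcf.in, vcf.out, version = "VCFv4.2") {
  lines <- readLines(vcf.in)
  is.fileformat <- grepl("^##fileformat=", lines)
  if (!any(is.fileformat)) stop("No ##fileformat line found in: ", vcf.in)
  lines[is.fileformat] <- paste0("##fileformat=", version)
  writeLines(lines, vcf.out)
  invisible(vcf.out)
}

# Demo on a tiny, header-only file written to a temporary location.
vcf.in <- tempfile(fileext = ".vcf")
writeLines(
  c(
    "##fileformat=VCFv4.3",
    "##source=example",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
  ),
  vcf.in
)
vcf.out <- tempfile(fileext = ".vcf")
set_vcf_version(vcf.in, vcf.out, version = "VCFv4.2")
readLines(vcf.out, n = 1)
```

readLines loads the whole file in memory, so for very large VCFs a text editor or a streaming tool may be more practical.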

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24, 1403-1405.

Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics, 27, 3070-3071.

Lamy T, Legendre P, Chancerelle Y, Siu G, Claudet J (2015) Understanding the Spatio-Temporal Response of Coral Reef Fish Communities to Natural Disturbances: Insights from Beta-Diversity Decomposition. PLoS ONE, 10, e0138696.

Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007; 81: 559–575. doi:10.1086/519795

Goudet, J. (1995) FSTAT (Version 1.2): A computer program to calculate F- statistics. Journal of Heredity, 86, 485-486.

Goudet, J. (2005) hierfstat, a package for r to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5, 184-186.

Eric Archer, Paula Adams and Brita Schneiders (2016). strataG: Summaries and Population Structure Analyses of Genetic Data. R package version 1.0.5. https://CRAN.R-project.org/package=strataG

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28: 3326-3328. doi:10.1093/bioinformatics/bts606

Foll, M and OE Gaggiotti (2008) A genome scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective. Genetics 180: 977-993

Foll M, Fischer MC, Heckel G and L Excoffier (2010) Estimating population structure from AFLP amplification intensity. Molecular Ecology 19: 4638-4647

Fischer MC, Foll M, Excoffier L and G Heckel (2011) Enhanced AFLP genome scans detect local adaptation in high-altitude populations of a small rodent (Microtus arvalis). Molecular Ecology 20: 1450-1462

Malinsky M, Trucchi E, Lawson D, Falush D (2018) RADpainter and fineRADstructure: population inference from RADseq data. bioRxiv, 057711.

Pew J, Muir PH, Wang J, Frasier TR (2015) related: an R package for analysing pairwise relatedness from codominant molecular markers. Molecular Ecology Resources, 15, 557-561.

Raj A, Stephens M, Pritchard JK (2014) fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Datasets. Genetics, 197, 573-589.

Verity R, Nichols RA (2016) Estimating the Number of Subpopulations (K) in Structured Populations. Genetics, 203, 1827-1839.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics.

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# To verify that your file is detected by radiator as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")

# The simplest form of the function:
library(strataG) # required for the gtypes format
snowcrab <- radiator::genomic_converter(
  data = "populations.snps.vcf",
  strata = "snowcrab.strata.tsv",
  output = c("genlight", "genepop", "gtypes")
)

# List the content of the object created:
names(snowcrab)

# Isolate the genlight object (without imputation):
genlight <- snowcrab$genlight
} # }