The arguments in the genomic_converter
function were tailored for the
reality of GBS/RADseq data while maintaining a reproducible workflow.
Input file: 14 diploid file formats are supported
(see data
argument below).
Filters: see the Advanced mode section below for ways to
use blacklist- and whitelist-related arguments.
For best results with unfiltered datasets, use filter_rad
(genomic_converter
is included in that function!).
Imputations: deprecated module no longer available in genomic_converter (see Life cycle section below).
Parallel: some parts of the function are designed to be conducted on multiple CPUs.
Output: 29 output file formats are supported (see output
argument below).
genomic_converter(
data,
strata = NULL,
output = NULL,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detects
your file format, use detect_genomic_format
(see example below).
Documented in Input genomic datasets of tidy_genomic_data
.
DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.
(optional)
The strata file is a tab-delimited file with a minimum of 2 column headers:
INDIVIDUALS
and STRATA
. Documented in read_strata
.
DArT data: a third column TARGET_ID
is required.
Documented in read_dart
. Also use the strata read function to
blacklist individuals.
Default: strata = NULL
.
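A minimal sketch of building such a strata file with base R. The sample IDs, group names, and file name are hypothetical, chosen only for illustration; only the two column headers come from the description above.

```r
# Hypothetical sample IDs and groupings, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("IND-01", "IND-02", "IND-03", "IND-04"),
  STRATA = c("POP_A", "POP_A", "POP_B", "POP_B"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with the two required column headers.
strata_file <- file.path(tempdir(), "example.strata.tsv")
write.table(strata, file = strata_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout before passing it to `strata`.
strata_check <- read.delim(strata_file, stringsAsFactors = FALSE)
```

For DArT data, a third TARGET_ID column would be added to the same data frame before writing.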
29 genomic data formats can be exported: tidy (by default),
genepop, genind, genlight, vcf (for file format version, see details below),
plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG),
bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray,
snprelate, maverick, genepopedit, rubias, hapmap and dadi.
Use a character string,
e.g. output = c("genind", "genepop", "structure")
, to have preferred
output formats generated. With default, only the tidy format is generated.
Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference).
Default: output = NULL
.
(optional) The filename prefix for the object in the global environment
or the working directory. Default: filename = NULL
. A default name will be used,
customized with the output file(s) selected.
(optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
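The default can be reproduced defensively before calling the function. This is a sketch, not part of radiator: `safe_cores` is a hypothetical variable name, and the fallback to a single core is an assumption for machines where the core count cannot be detected.

```r
# detectCores() can return NA on some platforms, so fall back to 1 core.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free for the system, but never go below 1.
safe_cores <- max(1L, n_cores - 1L)
```

The result can then be passed as parallel.core = safe_cores.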
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE
.
(optional) To pass further arguments for fine-tuning the function.
The function returns an object (list). The content of the object
can be listed with names(object)
; use $
to isolate a specific
object (see examples). Some output formats will write the output file in the
working directory. The tidy genomic data frame is generated automatically.
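The access pattern can be tried on a mock list without running a conversion. In this sketch, `converted` is a hand-built stand-in for the returned object, and the element names `tidy.data` and `genlight` are assumptions; check names() on your own result for the real ones.

```r
# Mock stand-in for the list returned by the function (not real output).
converted <- list(
  tidy.data = data.frame(
    MARKERS = c("M1", "M2"),
    INDIVIDUALS = c("IND-01", "IND-01"),
    GT = c("001002", "001001"),
    stringsAsFactors = FALSE
  ),
  genlight = "placeholder for a genlight object"
)

# List what the object contains.
names(converted)

# Isolate one element with `$`.
tidy <- converted$tidy.data
```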
GDS file or object, must end with .gds
or .rad
:
documented in read_vcf
VCF files must end with .vcf
: documented in tidy_vcf
PLINK files must end with .tped
or .bed
: documented in tidy_plink
genind object from
adegenet:
documented in tidy_genind
.
genlight object from
adegenet:
documented in tidy_genlight
.
gtypes object from
strataG:
documented in tidy_gtypes
.
genepop file must end with .gen
, documented in tidy_genepop
.
fstat file must end with .dat
, documented in tidy_fstat
.
haplotype file created in STACKS (e.g. data = "batch_1.haplotypes.tsv"
).
To make the haplotype file population ready, you need the strata
argument.
Data frames: documented in tidy_wide
.
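The file-extension rules above can be checked before a long run. The helper below, `guess_input_type`, is hypothetical and not part of radiator; it only mirrors the extensions listed in this section, and detect_genomic_format remains the authoritative check.

```r
# Hypothetical helper: map a file extension to the input type listed above.
guess_input_type <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    gds = ,
    rad = "GDS",
    vcf = "VCF",
    tped = ,
    bed = "PLINK",
    gen = "genepop",
    dat = "fstat",
    "unknown" # any other extension
  )
}

guess_input_type("populations.snps.vcf")
guess_input_type("mydata.bed")
```

Objects already in the R session (genind, genlight, gtypes, data frames) have no extension and fall outside this check.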
dots-dots-dots ... allows passing several arguments for fine-tuning the function:
path.folder: use this argument to specify an output folder.
Default: path.folder = "radiator_genomic_converter"
.
vcf.metadata
(optional, logical or string).
Default: vcf.metadata = TRUE
. Documented in tidy_vcf
.
vcf.stats
(optional, logical).
Default: vcf.stats = TRUE
.
Documented in tidy_vcf
.
whitelist.markers
(optional) Default whitelist.markers = NULL
.
Documented in read_whitelist
.
filter.common.markers
(optional, logical).
Default: filter.common.markers = TRUE
.
By default, only common markers are kept in the dataset.
Documented in filter_common_markers
.
filter.monomorphic
(logical, optional)
Default: filter.monomorphic = TRUE
.
By default, only polymorphic markers across strata are kept in the dataset.
Documented in filter_monomorphic
.
Individuals to blacklist? Use the strata file for this.
Documented in read_strata
.
keep.allele.names
argument used when tidying genind object.
Documented in tidy_genind
.
Default: keep.allele.names = FALSE
.
blacklist.id
(optional) Default: blacklist.id = NULL
.
Ideally, managed in the strata file.
Documented in read_strata
and read_blacklist_id
.
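A sketch of passing some of these arguments through the dots. The file names are the ones used in the example at the end of this page and are assumed to exist in the working directory; the call is guarded so that nothing runs when radiator or the files are missing.

```r
# Regular arguments plus fine-tuning arguments forwarded through `...`.
converter_args <- list(
  data = "populations.snps.vcf",
  strata = "snowcrab.strata.tsv",
  output = c("genind", "genepop"),
  path.folder = "radiator_genomic_converter",
  filter.common.markers = TRUE,
  filter.monomorphic = TRUE
)

# Only run the conversion when the package and both input files are available.
can_run <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(converter_args$data) &&
  file.exists(converter_args$strata)

if (can_run) {
  converted <- do.call(radiator::genomic_converter, converter_args)
}
```

Keeping the arguments in a named list makes it easy to record the exact settings of a run alongside its output.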
Map-independent imputation of missing genotypes is available in my other R package called grur.
Use grur to:
Visualize your missing data: before imputing your genotypes, visualize your missing data. Several visual tools are available inside grur to help you decide the best strategy after.
Optimize: use the grur imputation module and other functions to optimize the imputations of your dataset. You need to test arguments. Failing to conduct tests and adjust imputation arguments will generate artifacts and/or exacerbate bias. Using defaults is not an option here.
genomic_converter: use the output argument inside grur imputation module to generate the required formats.
If you need a different VCF file format version than the current one, change the version inside the newly created VCF; that should do the trick. For more information, see the Variant Call Format specifications.
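A minimal sketch of that header edit with base R. The file content and the version strings are made up for illustration. Only the ##fileformat line is rewritten; the records themselves are not converted, and the whole file is read into memory, which may not suit very large VCFs.

```r
# Create a tiny stand-in VCF (hypothetical content, for illustration only).
vcf_file <- file.path(tempdir(), "example.vcf")
writeLines(c(
  "##fileformat=VCFv4.3",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "1\t100\tsnp1\tA\tG\t.\tPASS\t."
), vcf_file)

# Replace the version declared in the header line, leaving the rest untouched.
vcf_lines <- readLines(vcf_file)
is_version <- grepl("^##fileformat=", vcf_lines)
vcf_lines[is_version] <- "##fileformat=VCFv4.2"
writeLines(vcf_lines, vcf_file)
```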
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.
Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24, 1403-1405.
Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics, 27, 3070-3071.
Lamy T, Legendre P, Chancerelle Y, Siu G, Claudet J (2015) Understanding the Spatio-Temporal Response of Coral Reef Fish Communities to Natural Disturbances: Insights from Beta-Diversity Decomposition. PLoS ONE, 10, e0138696.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559-575. doi:10.1086/519795
Goudet, J. (1995) FSTAT (Version 1.2): A computer program to calculate F-statistics. Journal of Heredity, 86, 485-486.
Goudet, J. (2005) hierfstat, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5, 184-186.
Eric Archer, Paula Adams and Brita Schneiders (2016). strataG: Summaries and Population Structure Analyses of Genetic Data. R package version 1.0.5. https://CRAN.R-project.org/package=strataG
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28, 3326-3328. doi:10.1093/bioinformatics/bts606
Foll, M and OE Gaggiotti (2008) A genome scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective. Genetics 180: 977-993
Foll M, Fischer MC, Heckel G and L Excoffier (2010) Estimating population structure from AFLP amplification intensity. Molecular Ecology 19: 4638-4647
Fischer MC, Foll M, Excoffier L and G Heckel (2011) Enhanced AFLP genome scans detect local adaptation in high-altitude populations of a small rodent (Microtus arvalis). Molecular Ecology 20: 1450-1462
Malinsky M, Trucchi E, Lawson D, Falush D (2018) RADpainter and fineRADstructure: population inference from RADseq data. bioRxiv, 057711.
Pew J, Muir PH, Wang J, Frasier TR (2015) related: an R package for analysing pairwise relatedness from codominant molecular markers. Molecular Ecology Resources, 15, 557-561.
Raj A, Stephens M, Pritchard JK (2014) fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Datasets. Genetics, 197, 573-589.
Verity R, Nichols RA (2016) Estimating the Number of Subpopulations (K) in Structured Populations. Genetics, 203, 1827-1839.
Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics.
beta.div
is available on Pierre Legendre's website: http://adn.biol.umontreal.ca/~numericalecology/Rcode/
if (FALSE) { # \dontrun{
# To verify that your file is detected by radiator as the correct format:
radiator::detect_genomic_format(data = "populations.snps.vcf")
# The simplest form of the function:
require(strataG) # for the gtypes format...
snowcrab <- genomic_converter(
data = "populations.snps.vcf", strata = "snowcrab.strata.tsv",
output = c("genlight", "genepop", "gtypes"))
# Get the content of the object created using:
names(snowcrab)
# To isolate the genlight object (without imputation):
genlight <- snowcrab$genlight
} # }