Designed for RADseq data, this is radiator's integrated pipeline that links
several of radiator's filter_ functions.
Rapidly get an idea of what you can and cannot do with your dataset.
Novices, start with this one!
filter_rad(
  data,
  strata = NULL,
  interactive.filter = TRUE,
  output = NULL,
  filename = NULL,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
)
data: 14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detects
your file format, use detect_genomic_format (see example below).
Documented in the Input genomic datasets section of tidy_genomic_data.
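A format check before running the pipeline could look like the sketch below. It assumes radiator is installed and that a file named data.shark.vcf (the hypothetical file name used in the examples) sits in the working directory; the guard turns the snippet into a no-op otherwise.

```r
# Hypothetical input file, borrowed from the examples section.
vcf_file <- "data.shark.vcf"

# Only run the check when radiator is installed and the file exists.
if (requireNamespace("radiator", quietly = TRUE) && file.exists(vcf_file)) {
  detected <- radiator::detect_genomic_format(data = vcf_file)
  print(detected)
} else {
  message("radiator or ", vcf_file, " not available; skipping the format check.")
}
```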
DArT and VCF data: radiator was not meant to generate alleles and genotypes from a VCF file that has no genotypes (only genotype likelihoods: GL or PL). Nor can radiator magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter it.
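One way to look at the first few lines of a VCF is sketched below in base R. The tiny VCF written to a temporary file is made up for illustration; with real data, point vcf_file at your own file. The check relies only on the VCF convention that FORMAT is the 9th tab-separated column and that GT marks called genotypes.

```r
# Write a tiny, made-up VCF to a temporary file (illustration only).
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2",
  "1\t100\tsnp1\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t0/0:9",
  "1\t250\tsnp2\tC\tT\t.\tPASS\t.\tGT:DP\t1/1:7\t0/1:15"
)
vcf_file <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_file)

# Read the first lines and drop the header lines starting with "#".
first_lines <- readLines(vcf_file, n = 50)
records <- first_lines[!startsWith(first_lines, "#")]

# The FORMAT field is the 9th tab-separated column of a VCF record.
fields <- strsplit(records, "\t", fixed = TRUE)
format_field <- vapply(fields, function(x) x[9], character(1))
format_keys <- strsplit(format_field, ":", fixed = TRUE)

# TRUE when every inspected record carries called genotypes (GT).
has_gt <- all(vapply(format_keys, function(k) "GT" %in% k, logical(1)))
has_gt
```

If has_gt is FALSE and only GL or PL keys are present, the file holds likelihoods rather than genotypes.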
strata: (optional) The strata file is a tab-delimited file with a minimum of
2 column headers: INDIVIDUALS and STRATA. Documented in read_strata.
DArT data: a third column, TARGET_ID, is required. Documented in read_dart.
Also use the strata read function to blacklist individuals.
Default: strata = NULL.
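A minimal strata file with the two required headers could be built as sketched below. The individual and strata names, and the file name strata.example.tsv, are hypothetical; only base R is used.

```r
# Hypothetical sample names and sampling sites, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("ind1", "ind2", "ind3", "ind4"),
  STRATA = c("north", "north", "south", "south"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with the two required column headers.
strata_file <- file.path(tempdir(), "strata.example.tsv")
write.table(strata, file = strata_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout.
check <- read.delim(strata_file, stringsAsFactors = FALSE)
names(check)
```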
interactive.filter: (optional, logical) Should the filtering session be
interactive? Distribution figures are shown before the function asks for
filtering thresholds.
Default: interactive.filter = TRUE.
output: 29 genomic data formats can be exported: tidy (by default),
genepop, genind, genlight, vcf (for file format version, see details below),
plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG),
bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray,
snprelate, maverick, genepopedit, rubias, hapmap and dadi.
Use a character vector, e.g. output = c("genind", "genepop", "structure"),
to have the preferred output formats generated. With the default, only the
tidy format is generated.
Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference).
Default: output = NULL.
filename: (optional) The filename prefix for the object in the global
environment or the working directory. Default: filename = NULL. A default
name will be used, customized with the output file(s) selected.
verbose: (optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE.
parallel.core: (optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1.
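The default expression evaluates to 0 on a single-core machine, and parallel::detectCores() may return NA on some platforms. Whether radiator guards against this internally is not stated here, so the sketch below shows one way to compute a safe value yourself before passing it as parallel.core; the variable names are illustrative.

```r
# The parallel package ships with base R.
n_detected <- parallel::detectCores()

# detectCores() can return NA, and subtracting 1 gives 0 on a single-core
# machine, so fall back to 1 core in both cases.
n_cores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)
n_cores
```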
...: (optional) To pass further arguments for fine-tuning the function.
The function returns an object (list). The content of the object
can be listed with names(object); use $ to isolate a specific
object (see examples). Some output formats will write the output file in the
working directory. The tidy genomic data frame is generated automatically.
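Inspecting the returned list could look like the sketch below. A mock list stands in for the real return value so the snippet runs without radiator; the element names tidy.data and genind are hypothetical and may differ from what filter_rad actually returns.

```r
# Mock object standing in for the list returned by filter_rad().
# The element names are hypothetical, chosen for illustration only.
shark <- list(
  tidy.data = data.frame(
    MARKERS = c("m1", "m2"),
    INDIVIDUALS = c("ind1", "ind1"),
    stringsAsFactors = FALSE
  ),
  genind = "placeholder for a genind object"
)

# List the content of the object.
names(shark)

# Isolate one element with $.
tidy <- shark$tidy.data
nrow(tidy)
```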
radiator:
the GDS file ending with .gds.rad. To open in R, use read_rad.
filters_parameters.tsv: file containing all the parameters and values of the filtering process.
radiator_filter_rad_args.tsv: the function calls and values, for reproducibility.
radiator_tidy_dart_metadata.rad: for DArT data, this is the markers metadata file. To open in R, use read_rad.
random.seed: for reproducibility.
filter_dart_reproducibility: filters markers based on reproducibility (DArT data only). Described in filter_dart_reproducibility.
blacklist.dart.reproducibility.tsv: blacklisted markers.
whitelist.dart.reproducibility.tsv: whitelisted markers.
dart_reproducibility_boxplot.pdf: the boxplot of reproducibility.
dart_reproducibility_stats.tsv: summary statistics associated with reproducibility.
dart.reproducibility.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
dart.reproducibility.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
radiator_filter_dart_reproducibility_args.tsv: the function calls and values.
filter_monomorphic: removes monomorphic markers. Described in filter_monomorphic.
blacklist.monomorphic.markers.tsv: blacklisted markers.
whitelist.polymorphic.markers: whitelisted markers.
radiator_filter_monomorphic_args: the function calls and values.
filter_common_markers: keeps only markers in common between strata. Described in filter_common_markers.
common.markers.upsetrplot.pdf: the UpSetR plot highlighting the number of markers in common between strata.
blacklist.common.markers.tsv: blacklisted markers.
whitelist.common.markers.tsv: whitelisted markers.
radiator_filter_common_markers_args.tsv: the function calls and values.
filter_individuals: blacklists individuals based on missingness and/or heterozygosity and/or total coverage. Described in filter_individuals.
individuals.qc.pdf: several boxplots showing different individual quality metrics.
individuals.qc.stats.tsv: a tibble with each individual's proportion of missing genotypes, heterozygosity, total and mean coverage.
individuals.qc.stats.summary.tsv: summary statistics across individuals.
blacklist.individuals.missing.tsv: blacklisted individuals based on missingness (genotyping rate of individuals).
radiator_filter_individuals_args.tsv: the function calls and values.
The function automatically removes monomorphic markers if individuals are removed.
filter_ma: removes/blacklists markers based on Minor/Alternate Allele Count (MAC), Frequency (MAF) or Depth (MAD). Described in filter_ma.
distribution.mac.global.pdf: distribution of overall MAC.
ma.boxplot.pdf: boxplot of the MAC.
ma.global.tsv: a tibble with the global MAC and MAF.
mac.markers.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
mac.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
ma.summary.stats.tsv: MAC summary statistics.
blacklist.markers.ma.tsv: blacklisted markers.
whitelist.markers.ma.tsv: whitelisted markers.
radiator_filter_ma_args.tsv: the function calls and values.
filter_coverage: removes/blacklists markers based on mean coverage information. Described in filter_coverage.
markers_metadata.tsv: all the marker coverage metadata available.
markers_metadata_stats.tsv: summary statistics of coverage information.
markers_qc.pdf: several coverage boxplots (total coverage, mean coverage, etc.).
coverage.low.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
coverage.low.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
coverage.high.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
coverage.high.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
blacklist.markers.coverage.tsv: blacklisted markers.
whitelist.markers.coverage.tsv: whitelisted markers.
radiator_filter_coverage_args.tsv: the function calls and values.
filter_genotyping: removes/blacklists markers based on genotyping/call rate. Described in filter_genotyping.
markers_qc.pdf: the missing genotypes boxplot.
markers_metadata.tsv: the proportion of missing genotypes per marker, along with other stats.
markers_metadata_stats.tsv: the summary statistics of marker genotyping rate.
markers.genotyping.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
genotyping.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted (overall).
markers.pop.missing.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted (strata).
blacklist.markers.genotyping.tsv: blacklisted markers.
whitelist.markers.genotyping.tsv: whitelisted markers.
radiator_filter_genotyping_args.tsv: the function calls and values.
filter_snp_position_read: removes markers/SNPs based on their position on the read. Described in filter_snp_position_read.
snp.position.read.boxplot.pdf: boxplot of SNP position on the read.
snp.position.read.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
snp.position.read.distribution.pdf: distribution of the SNP position on the read.
blacklist.markers.snp.position.read.tsv: blacklisted markers.
whitelist.markers.snp.position.read.tsv: whitelisted markers.
radiator_filter_snp_position_read_args.tsv: the function calls and values.
filter_snp_number: removes outlier markers with too many SNPs per locus/read. Described in filter_snp_number.
markers_metadata.tsv: metadata associated with the number of SNP/locus.
snp_per_locus.pdf: boxplot of number of SNP/locus.
snp_per_locus_distribution.pdf: distribution of the number of SNP/locus.
snp.per.locus.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
blacklist.snp.per.locus.tsv: blacklisted markers.
whitelist.snp.per.locus.tsv: whitelisted markers.
radiator_filter_snp_number_args.tsv: the function calls and values.
filter_ld: SNP short and long distance linkage disequilibrium pruning. Described in filter_ld.
short.ld.locus.stats.tsv: number of loci with more than 1, 2, 3 SNPs.
blacklist.short.ld.tsv: blacklisted markers.
whitelist.short.ld.tsv: whitelisted markers.
blacklist.long.ld.tsv: blacklisted markers.
whitelist.long.ld.tsv: whitelisted markers.
radiator_filter_ld_args.tsv: the function calls and values.
detect_mixed_genomes: highlights outlier individuals based on observed heterozygosity. Described in detect_mixed_genomes.
individuals.qc.stats.tsv: each individual's heterozygosity and missingness proportion.
individuals.qc.stats.summary.tsv: heterozygosity summary statistics.
heterozygosity.statistics.tsv: heterozygosity summary statistics per strata and overall.
individual.heterozygosity.manhattan.plot.pdf: manhattan plot highlighting potential patterns of heterozygosity and missingness.
individual.heterozygosity.boxplot.pdf: boxplot highlighting potential patterns of heterozygosity and missingness.
blacklist.ind.het.tsv: blacklisted individuals.
radiator_detect_mixed_genomes_args.tsv: the function calls and values.
The function automatically removes monomorphic markers if individuals are removed.
detect_duplicate_genomes: highlights potential duplicate individuals. Described in detect_duplicate_genomes.
genotyped.statistics.tsv: genotyping statistics used in the analysis.
individuals.pairwise.dist.tsv: pairwise distance measures.
individuals.pairwise.distance.stats.tsv: summary statistics of pairwise distance measures.
individuals.qc.stats.tsv: individuals summary produced by default.
individuals.qc.stats.summary.tsv: individuals summary produced by default.
manhattan.plot.distance.pdf: manhattan plot highlighting duplicates.
violin.plot.distance.pdf: boxplot highlighting duplicates.
blacklist.id.similar.tsv: blacklisted individuals.
radiator_detect_duplicate_genomes_args.tsv: the function calls and values.
filter_hwe: removes/blacklists markers based on Hardy-Weinberg Equilibrium. Described in filter_hwe.
genotypes.summary.tsv: summary of genotypes groups.
hwd.helper.table.tsv: number of markers blacklisted based on thresholds.
hw.pop.sum.tsv: summary of hwe/hwd.
hwd.plot.blacklist.markers.pdf: helper plot.
hwe.manhattan.plot.pdf: overview of markers in HWD.
hwe.ternary.plots.missing.data.pdf: overview of markers in HWD.
blacklisted and whitelisted markers based on thresholds.
radiator_filter_hwe_args.tsv: the function calls and values.
filtered:
.rad: this file is the tidy data set. To open in R, use read_rad.
strata.filtered.tsv: the filtered strata file.
blacklist.id.tsv: blacklisted individuals (total).
blacklist.markers.tsv: blacklisted markers.
whitelist.markers.tsv: whitelisted markers.
markers_qc.pdf: markers qc after filters.
individuals.qc.pdf: individuals qc after filters.
individuals.qc.stats.tsv: individuals qc stats.
individuals.qc.stats.summary.tsv: individuals qc stats summary.
markers_metadata.tsv: markers final metadata.
markers_metadata_stats.tsv: markers metadata stats.
radiator_genind_.RData and radiator_stockr.RData: the genind and stockr files generated by the output argument.
Ideally, use the function in interactive mode and forget about this section. It is intended for advanced users who want to run in batch mode with pre-defined arguments. The available arguments are listed below; read the function documentation associated with each one. Most of them reside in separate modules that can be explored separately. The dots-dots-dots (...) allows passing several arguments for fine-tuning the function:
filter.reproducibility: detailed in filter_dart_reproducibility.
filter.monomorphic: detailed in filter_monomorphic.
filter.common.markers: detailed in filter_common_markers.
whitelist.markers: detailed in filter_whitelist.
filter.individuals.missing: detailed in filter_individuals.
filter.individuals.heterozygosity: detailed in filter_individuals.
filter.individuals.coverage.total: detailed in filter_individuals.
filter.ma: detailed in filter_ma.
filter.coverage: detailed in filter_coverage.
filter.genotyping: described in filter_genotyping.
filter.snp.position.read: described in filter_snp_position_read.
filter.snp.number: described in filter_snp_number.
filter.short.ld: described in filter_ld.
filter.long.ld: described in filter_ld.
long.ld.missing: described in filter_ld.
ld.method: described in filter_ld.
detect.mixed.genomes: described in detect_mixed_genomes.
detect.duplicate.genomes: described in detect_duplicate_genomes.
dup.threshold: described in detect_duplicate_genomes.
filter.hwe: described in filter_hwe.
hw.pop.threshold: described in filter_hwe.
midp.threshold: described in filter_hwe.
ind.heterozygosity.threshold: described in detect_mixed_genomes.
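A batch-mode call with pre-defined arguments might be assembled as sketched below. The argument names come from the list above, but the values (TRUE for both filters) and the input file names are assumptions for illustration; check each module's documentation for the values it actually accepts. The call only runs when radiator is installed and the hypothetical input files exist.

```r
# Pre-defined arguments for a non-interactive run.
# Values are illustrative assumptions, not recommended thresholds.
batch_args <- list(
  data = "data.shark.vcf",        # hypothetical input file
  strata = "strata.shark.tsv",    # hypothetical strata file
  interactive.filter = FALSE,     # batch mode: no prompts
  filter.monomorphic = TRUE,      # passed through ...
  filter.common.markers = TRUE    # passed through ...
)

inputs_ok <- file.exists(batch_args$data) && file.exists(batch_args$strata)

if (requireNamespace("radiator", quietly = TRUE) && inputs_ok) {
  shark <- do.call(radiator::filter_rad, batch_args)
} else {
  message("radiator or the input files are not available; nothing was run.")
}
```

Keeping the arguments in a list makes it easy to save them alongside the results and to rerun the same configuration later.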
if (FALSE) { # \dontrun{
# Some of the packages used in this function are not installed by default.
# Install the required packages:
radiator_pkg_install()

# Very simple:
shark <- radiator::filter_rad(
  data = "data.shark.vcf",
  strata = "strata.shark.tsv")

# With filename and output:
shark <- radiator::filter_rad(
  data = "data.shark.vcf",
  strata = "strata.shark.tsv",
  output = "genind",
  filename = "shark")
} # }