ONE FUNCTION TO RULE THEM ALL — filter

Designed for RADseq data, this is radiator's integrated pipeline: it links several of radiator's filter_ functions. Use it to quickly get an idea of what you can and cannot do with your dataset. Novices, start with this one!

filter_rad(
  data,
  strata = NULL,
  interactive.filter = TRUE,
  output = NULL,
  filename = NULL,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

14 options for input (diploid data only): VCFs (SNPs or Haplotypes, to make the vcf population ready), plink (tped, bed), stacks haplotype file, genind (library(adegenet)), genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT, and a data frame in long/tidy or wide format. To verify that radiator detects your file format, use detect_genomic_format (see example below). Documented in Input genomic datasets of tidy_genomic_data.

DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand its limits before asking radiator to convert or filter your dataset.

strata

(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column, TARGET_ID, is required. Documented in read_dart. Also use the strata read function to blacklist individuals. Default: strata = NULL.
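
As an illustration of the layout described above, here is a minimal base R sketch that writes a tab-delimited strata file with the two required column headers. The sample IDs, population names, and file name are hypothetical, and the sketch does not cover the extra TARGET_ID column needed for DArT data.

```r
# Hypothetical individuals and strata, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("IND-001", "IND-002", "IND-003", "IND-004"),
  STRATA = c("POP1", "POP1", "POP2", "POP2"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with the INDIVIDUALS and STRATA headers.
strata_file <- file.path(tempdir(), "strata.example.tsv")
write.table(
  strata,
  file = strata_file,
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Read the file back to confirm the column headers and number of rows.
check <- read.delim(strata_file, stringsAsFactors = FALSE)
names(check)
nrow(check)
```

The resulting file path can then be passed to the strata argument.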

interactive.filter

(optional, logical) Should the filtering session be interactive? Distribution figures are shown before you are asked for filtering thresholds. Default: interactive.filter = TRUE.

output

29 genomic data formats can be exported: tidy (by default), genepop, genind, genlight, vcf (for file format version, see details below), plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG), bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray, snprelate, maverick, genepopedit, rubias, hapmap and dadi. Use a character vector, e.g. output = c("genind", "genepop", "structure"), to have preferred output formats generated. With the default, only the tidy format is generated.

Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference).

Default: output = NULL.

filename

(optional) The filename prefix for the object in the global environment or the working directory. Default: filename = NULL, in which case a default name is used, customized with the output file(s) selected.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.
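
The default expression evaluates to 0 on a single-core machine, and parallel::detectCores() can return NA on some platforms. The sketch below shows one way to compute a safe value before passing it to parallel.core. The variable name n_cores is hypothetical, and this guard is a suggestion, not something the function is documented to require.

```r
# Detect the number of cores; this may be NA on some platforms.
detected <- parallel::detectCores()

# Leave one core free, but never go below 1, and fall back to 1
# when the number of cores cannot be determined.
n_cores <- max(1L, detected - 1L, na.rm = TRUE)
n_cores
```

The value can then be supplied as parallel.core = n_cores.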

...

(optional) To pass further arguments for fine-tuning the function.

Value

The function returns an object (list). The content of the object can be listed with names(object); use $ to isolate a specific element (see examples). Some output formats will write the output file in the working directory. The tidy genomic data frame is generated automatically.
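
The access pattern described above can be sketched with a stand-in list. The element names used here (tidy.data, blacklist.id) are hypothetical placeholders and are not taken from the function's actual output; run names() on your own result to see what it contains.

```r
# Stand-in for the list returned by filter_rad().
# Element names are hypothetical, for illustration only.
res <- list(
  tidy.data = data.frame(
    MARKERS = c("m1", "m2"),
    INDIVIDUALS = c("IND-001", "IND-001"),
    stringsAsFactors = FALSE
  ),
  blacklist.id = character(0)
)

# List the content of the object.
names(res)

# Isolate one element with $.
tidy <- res$tidy.data
nrow(tidy)
```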

Filtering steps and folders generated

radiator:
- the GDS file ending with .gds.rad. To open in R use read_rad.
- filters_parameters.tsv: file containing all the parameters and values of the filtering process.
- radiator_filter_rad_args.tsv: the function calls and values, for reproducibility.
- radiator_tidy_dart_metadata.rad: for DArT data, this is the markers metadata file. To open in R use read_rad.
- random.seed: for reproducibility.
filter_dart_reproducibility: filter reproducibility of markers (DArT data only). Described in filter_dart_reproducibility.
- blacklist.dart.reproducibility.tsv: blacklisted markers.
- whitelist.dart.reproducibility.tsv: whitelisted markers.
- dart_reproducibility_boxplot.pdf: the boxplot of reproducibility.
- dart_reproducibility_stats.tsv: reproducibility associated summary statistics.
- dart.reproducibility.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- dart.reproducibility.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
- radiator_filter_dart_reproducibility_args.tsv: the function calls and values.
filter_monomorphic: removes monomorphic markers. Described in filter_monomorphic.
- blacklist.monomorphic.markers.tsv: blacklisted markers.
- whitelist.polymorphic.markers: whitelisted markers.
- radiator_filter_monomorphic_args: the function calls and values.
filter_common_markers: keep only markers in common between strata. Described in filter_common_markers.
- common.markers.upsetrplot.pdf: the UpSetR plot highlighting the number of markers in common between strata.
- blacklist.common.markers.tsv: blacklisted markers.
- whitelist.common.markers.tsv: whitelisted markers.
- radiator_filter_common_markers_args.tsv: the function calls and values.
filter_individuals: blacklist individuals based on missingness and/or heterozygosity and/or total coverage. Described in filter_individuals.
- individuals.qc.pdf: several boxplots showing different individual quality metrics.
- individuals.qc.stats.tsv: tibble with individual's proportion of missing genotypes, heterozygosity, total and mean coverage.
- individuals.qc.stats.summary.tsv: individual's summary statistics.
- blacklist.individuals.missing.tsv: blacklisted individuals based on missingness (genotyping rate of individuals).
- radiator_filter_individuals_args.tsv: the function calls and values.
The function will automatically remove monomorphic markers if individuals are removed.
filter_ma: remove/blacklist markers based on Minor/Alternate Allele Count (MAC), Frequency (MAF) or Depth (MAD). Described in filter_ma.
- distribution.mac.global.pdf: distribution of overall MAC.
- ma.boxplot.pdf: boxplot of the MAC.
- ma.global.tsv: a tibble with the global MAC and MAF.
- mac.markers.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- mac.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
- ma.summary.stats.tsv: MAC summary statistics.
- blacklist.markers.ma.tsv: blacklisted markers.
- whitelist.markers.ma.tsv: whitelisted markers.
- radiator_filter_ma_args.tsv: the function calls and values.
filter_coverage: remove/blacklist markers based on mean coverage information. Described in filter_coverage.
- markers_metadata.tsv: all the markers coverage metadata available.
- markers_metadata_stats.tsv: summary statistics of coverage information.
- markers_qc.pdf: several coverage boxplots (total coverage, mean coverage, etc.)
- coverage.low.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- coverage.low.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
- coverage.high.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- coverage.high.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
- blacklist.markers.coverage.tsv: blacklisted markers.
- whitelist.markers.coverage.tsv: whitelisted markers.
- radiator_filter_coverage_args.tsv: the function calls and values.
filter_genotyping: remove/blacklist markers based on genotyping/call rate. Described in filter_genotyping.
- markers_qc.pdf: the missing genotypes boxplot.
- markers_metadata.tsv: the missing proportion per markers along other stats.
- markers_metadata_stats.tsv: the summary statistics of markers genotyping rate.
- markers.genotyping.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- genotyping.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted (overall).
- markers.pop.missing.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted (strata).
- blacklist.markers.genotyping.tsv: blacklisted markers.
- whitelist.markers.genotyping.tsv: whitelisted markers.
- radiator_filter_genotyping_args.tsv: the function calls and values.
filter_snp_position_read: removes markers/SNPs based on their position on the read. Described in filter_snp_position_read.
- snp.position.read.boxplot.pdf: boxplot of SNP position on the read.
- snp.position.read.helper.table.tsv: a tibble with the impact of thresholds on the number of markers blacklisted and whitelisted.
- snp.position.read.distribution.pdf: distribution of the SNP position on the read.
- blacklist.markers.snp.position.read.tsv: blacklisted markers.
- whitelist.markers.snp.position.read.tsv: whitelisted markers.
- radiator_filter_snp_position_read_args.tsv: the function calls and values.
filter_snp_number: removes outlier markers with too many SNPs per locus/read. Described in filter_snp_number.
- markers_metadata.tsv: metadata associated with the number of SNP/locus.
- snp_per_locus.pdf: boxplot of number of SNP/locus.
- snp_per_locus_distribution.pdf: distribution of the number of SNP/locus.
- snp.per.locus.helper.plot.pdf: the helper plot showing the impact of thresholds on the number of markers blacklisted and whitelisted.
- blacklist.snp.per.locus.tsv: blacklisted markers.
- whitelist.snp.per.locus.tsv: whitelisted markers.
- radiator_filter_snp_number_args.tsv: the function calls and values.
filter_ld: SNP short and long distance linkage disequilibrium pruning. Described in filter_ld.
- short.ld.locus.stats.tsv: number of locus with more than 1, 2, 3 SNPs.
- blacklist.short.ld.tsv: blacklisted markers.
- whitelist.short.ld.tsv: whitelisted markers.
- blacklist.long.ld.tsv: blacklisted markers.
- whitelist.long.ld.tsv: whitelisted markers.
- radiator_filter_ld_args.tsv: the function calls and values.
detect_mixed_genomes: highlights outliers in individuals' observed heterozygosity. Described in detect_mixed_genomes.
- individuals.qc.stats.tsv: individual's heterozygosity and missingness proportion.
- individuals.qc.stats.summary.tsv: heterozygosity summary statistics.
- heterozygosity.statistics.tsv: heterozygosity summary statistics per strata and overall.
- individual.heterozygosity.manhattan.plot.pdf: manhattan plot highlighting potential patterns of heterozygosity and missingness.
- individual.heterozygosity.boxplot.pdf: boxplot highlighting potential patterns of heterozygosity and missingness.
- blacklist.ind.het.tsv: blacklisted individuals.
- radiator_detect_mixed_genomes_args.tsv: the function calls and values.
The function will automatically remove monomorphic markers if individuals are removed.
detect_duplicate_genomes: highlights potential duplicate individuals. Described in detect_duplicate_genomes.
- genotyped.statistics.tsv: genotyping statistics used in the analysis.
- individuals.pairwise.dist.tsv: pairwise distances measures.
- individuals.pairwise.distance.stats.tsv: summary statistics of pairwise distances measures.
- individuals.qc.stats.tsv: individuals summary produced by default.
- individuals.qc.stats.summary.tsv: individuals summary produced by default.
- manhattan.plot.distance.pdf: manhattan plot highlighting duplicates.
- violin.plot.distance.pdf: violin plot highlighting duplicates.
- blacklist.id.similar.tsv: blacklisted individuals.
- radiator_detect_duplicate_genomes_args.tsv: the function calls and values.
filter_hwe: remove/blacklist markers based on Hardy-Weinberg Equilibrium. Described in filter_hwe.
- genotypes.summary.tsv: summary of genotypes groups.
- hwd.helper.table.tsv: number of markers blacklisted based on thresholds.
- hw.pop.sum.tsv: summary of hwe/hwd.
- hwd.plot.blacklist.markers.pdf: helper plot.
- hwe.manhattan.plot.pdf: overview of markers in HWD.
- hwe.ternary.plots.missing.data.pdf: overview of markers in HWD.
- blacklisted and whitelisted markers based on thresholds.
- radiator_filter_hwe_args.tsv: the function calls and values.
filtered:
- .rad file is the tidy data set. To open in R use read_rad.
- strata.filtered.tsv: the filtered strata file.
- blacklist.id.tsv: blacklisted individuals (total).
- blacklist.markers.tsv: blacklisted markers.
- whitelist.markers.tsv: whitelisted markers.
- markers_qc.pdf: markers qc after filters.
- individuals.qc.pdf: individuals qc after filters.
- individuals.qc.stats.tsv: individuals qc stats.
- individuals.qc.stats.summary.tsv: individuals qc stats summary.
- markers_metadata.tsv: markers final metadata.
- markers_metadata_stats.tsv: markers metadata stats.
- radiator_genind_.RData and radiator_stockr.RData, the genind and stockr files generated by the output argument.

Advanced mode

Ideally, use the function in interactive mode and forget about this section. It is meant for advanced users who want to run in batch mode with pre-defined arguments. The available arguments are listed below; read the function documentation associated with each one. Most of them reside in separate modules that can be explored separately. The dots (...) allow passing several arguments for fine-tuning the function:

filter.reproducibility: detailed in filter_dart_reproducibility.
filter.monomorphic: detailed in filter_monomorphic.
filter.common.markers: detailed in filter_common_markers.
whitelist.markers: detailed in filter_whitelist.
filter.individuals.missing: detailed in filter_individuals.
filter.individuals.heterozygosity: detailed in filter_individuals.
filter.individuals.coverage.total: detailed in filter_individuals.
filter.ma: detailed in filter_ma.
filter.coverage: detailed in filter_coverage.
filter.genotyping: described in filter_genotyping.
filter.snp.position.read: described in filter_snp_position_read.
filter.snp.number: described in filter_snp_number.
filter.short.ld: described in filter_ld.
filter.long.ld: described in filter_ld.
long.ld.missing: described in filter_ld.
ld.method: described in filter_ld.
detect.mixed.genomes: described in detect_mixed_genomes.
ind.heterozygosity.threshold: described in detect_mixed_genomes.
detect.duplicate.genomes: described in detect_duplicate_genomes.
dup.threshold: described in detect_duplicate_genomes.
filter.hwe: described in filter_hwe.
hw.pop.threshold: described in filter_hwe.
midp.threshold: described in filter_hwe.
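
A batch-mode call can be sketched by collecting the arguments in a list and turning the interactive session off. This is only a sketch under stated assumptions: the file names are reused from the examples, the variable name batch_args is hypothetical, and the values given to filter.monomorphic and filter.common.markers are assumed to be logical switches. Check the documentation of each module for the accepted values before relying on them.

```r
# Arguments for a non-interactive run (values are assumptions, see above).
batch_args <- list(
  data = "data.shark.vcf",
  strata = "strata.shark.tsv",
  interactive.filter = FALSE,    # batch mode: no prompts for thresholds
  filter.monomorphic = TRUE,     # assumed logical, passed through ...
  filter.common.markers = TRUE,  # assumed logical, passed through ...
  filename = "shark"
)

# Only call filter_rad when radiator is installed and the input files exist.
can_run <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(batch_args$data) &&
  file.exists(batch_args$strata)

if (can_run) {
  shark <- do.call(radiator::filter_rad, batch_args)
} else {
  message("radiator or the input files are not available; nothing was run.")
}
```

Keeping the arguments in a list makes it easy to save them alongside the results and to rerun the same filtering on another dataset.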

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# Some of the packages used in this function are not installed by default.
# Install the required packages:
radiator_pkg_install()

# Very simple:
shark <- radiator::filter_rad(
    data = "data.shark.vcf",
    strata = "strata.shark.tsv")


# With filename and output
shark <- radiator::filter_rad(
    data = "data.shark.vcf",
    strata = "strata.shark.tsv",
    output = "genind",
    filename = "shark")
} # }