This filter removes outlier markers with too many SNPs per locus/read. The data must include SNP and locus information (e.g. from a VCF file). A higher than "normal" SNP number is usually the result of assembly artifacts or bad assembly parameters. This filter is population-agnostic, but still requires a strata file if a VCF file is used as input.

Filter targets: Markers

Statistics: The number of SNPs per locus.

filter_snp_number(
  data,
  strata = NULL,
  interactive.filter = TRUE,
  filter.snp.number = NULL,
  filename = NULL,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

(4 options) A file or object generated by radiator:

  • tidy data

  • Genomic Data Structure (GDS)

How to get GDS and tidy data? See tidy_genomic_data, read_vcf or tidy_vcf.

strata

(path or object) The strata file or object. Additional documentation is available in read_strata. Use that function to whitelist/blacklist populations/individuals. Option to set pop.levels/pop.labels is also available.

interactive.filter

(optional, logical) Should the filtering session be interactive? Figures of the distribution are shown before the function asks for filtering thresholds. Default: interactive.filter = TRUE.

filter.snp.number

(integer) The maximum number of SNPs allowed per locus. This is best decided after viewing the figures. If the argument is set to 2, loci with 3 or more SNPs will be blacklisted. Default: filter.snp.number = NULL.
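The threshold rule can be illustrated with a minimal base R sketch. It is not radiator's implementation: the toy data frame and its LOCUS and POS column names are assumptions made for illustration only.

```r
# Toy marker table; the LOCUS and POS column names are assumptions.
markers <- data.frame(
  LOCUS = c("L1", "L1", "L2", "L3", "L3", "L3", "L4", "L4", "L4", "L4"),
  POS   = c(12, 40, 7, 3, 18, 55, 5, 9, 21, 60),
  stringsAsFactors = FALSE
)

filter.snp.number <- 2L

# Count the number of SNPs observed on each locus
snp.per.locus <- table(markers$LOCUS)

# Loci with more SNPs than the threshold are blacklisted
blacklisted.loci <- names(snp.per.locus)[as.vector(snp.per.locus) > filter.snp.number]

# Split the markers into whitelist and blacklist
whitelist <- markers[!markers$LOCUS %in% blacklisted.loci, ]
blacklist <- markers[markers$LOCUS %in% blacklisted.loci, ]
```

With a threshold of 2, the loci carrying 3 and 4 SNPs (L3 and L4 in this toy data) end up in the blacklist, while L1 and L2 are kept.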

filename

(optional) Name of the filtered tidy data frame file written to the working directory (ending with .tsv). Default: filename = NULL.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) Advanced mode that allows passing further arguments to fine-tune the function. Also used for legacy arguments (see details or special section).

Value

A list in the global environment with 6 objects:

  1. $snp.number.markers

  2. $number.snp.reads.plot

  3. $whitelist.markers

  4. $tidy.filtered.snp.number

  5. $blacklist.markers

  6. $filters.parameters

Each object can be extracted from the list into a separate object, as shown in the example below.

Details

Interactive version

There are 2 steps in the interactive version to visualize and filter the data based on the number of SNPs on the read/locus:

Step 1. SNP number per read/locus visualization

Step 2. Choose the filtering thresholds
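The logic behind the two steps can be mimicked outside the function with a small base R sketch. It only illustrates the idea and does not reproduce the figures or prompts of the interactive session. The SNP counts per locus are made up for the example.

```r
# Made-up number of SNPs per locus (hypothetical values)
snp.per.locus <- c(L1 = 2L, L2 = 1L, L3 = 3L, L4 = 4L, L5 = 1L, L6 = 2L)

# Step 1: distribution of the number of SNPs per locus
snp.distribution <- table(snp.per.locus)

# Step 2: number of loci kept under each candidate threshold
thresholds <- sort(unique(snp.per.locus))
loci.kept <- vapply(
  thresholds,
  function(th) sum(snp.per.locus <= th),
  integer(1)
)
summary.table <- data.frame(THRESHOLD = thresholds, LOCI_KEPT = loci.kept)
```

Reading summary.table row by row shows how many loci survive as the threshold is relaxed, which is the kind of information that helps when picking a value for filter.snp.number.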

Examples

if (FALSE) {
# Blacklist loci with more than 4 SNPs
turtle.outlier.snp.number <- radiator::filter_snp_number(
  data = "turtle.vcf",
  strata = "turtle.strata.tsv",
  filter.snp.number = 4,
  filename = "tidy.data.turtle.tsv"
)

# Isolate the filtered tidy data from the list:
tidy.data <- turtle.outlier.snp.number$tidy.filtered.snp.number

# Inside the same list, to isolate the blacklisted markers:
blacklist <- turtle.outlier.snp.number$blacklist.markers
}