Compute pairwise genome similarity or distance between individuals to highligh potential duplicate individuals

The function can compute two methods to highligh potential duplicate individuals.

distance between individuals and/or
pairwise genome similarity

detect_duplicate_genomes(
  data,
  interactive.filter = TRUE,
  detect.duplicate.genomes = TRUE,
  dup.threshold = 0,
  distance.method = "manhattan",
  genome = FALSE,
  threshold.common.markers = NULL,
  blacklist.duplicates = FALSE,
  parallel.core = parallel::detectCores() - 1,
  verbose = TRUE,
  ...
)

Arguments

data

(4 options) A file or object generated by radiator:

tidy data
Genomic Data Structure (GDS)

How to get GDS and tidy data ? Look into tidy_genomic_data, read_vcf or tidy_vcf.

interactive.filter

(optional, logical) Do you want the filtering session to be interactive. Figures of distribution are shown before asking for filtering thresholds. Default: interactive.filter = TRUE.

detect.duplicate.genomes

(optional, logical) For use inside radiator pipelines. Default: detect.duplicate.genomes = TRUE.

dup.threshold

Default: dup.threshold = 0. Turn off filter.

distance.method

(character) Depending on input data, 2 different methods are used (give similar results):

gds data The calculation is fast it's SNPRelate::snpgdsIBS under the hood.
tidy data The distance measure uses amap::Dist. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary".

Using distance.method = NULL will not run this method. Default: distance.method = "manhattan". This is very fast compared to the genome similarity method. It uses allele counts and the codes are tailored for biallelic and multiallelic markers.

genome

(logical) Computes pairwise genome similarity in parallel. The proportion of the shared genotypes is averaged across shared markers between each pairwise comparison. This method makes filtering easier because the threshold is more intuitive with the plots produced, but it's much longer to run, even in parallel, so better to run overnight. Default: genome = FALSE.

threshold.common.markers

(double, optional) When using the pairwise genome similarity approach (genome = TRUE), using a threshold will filter the pairwise comparison before generating the results. This is usefull if the number of pairwise comparisons (n*(n-1)/2) is very large and allows to reduce the number of false positive, when missing data is involved. e.g. 2 samples might have 100 30 Default: threshold.common.markers = NULL.

blacklist.duplicates

(optional, logical) With blacklist.duplicates = TRUE, after visualization, the user is asked to enter a threshold to filter out duplicates. With default, blacklist.duplicates = FALSE, the function as no interaction with user.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

...

(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)

Value

A list with potentially 8 objects:

$distance : results of the distance method.
$distance.stats: Summary statistics of the distance method.
$pairwise.genome.similarity: results of the genome method.
$genome.stats: Summary statistics of the genome method.
$violin.plot.distance: violin plot showing the distribution of pairwise distances.
$manhattan.plot.distance: same info different visual with manhattan plot.
$violin.plot.genome: violin plot showing the distribution of pairwise genome similarities.
$manhattan.plot.genome: same info different visual with manhattan plot.
$blacklist.id.similar: blacklisted duplicates.

Saved in the working directory: individuals.pairwise.dist.tsv, individuals.pairwise.distance.stats.tsv, individuals.pairwise.genome.similarity.tsv, individuals.pairwise.genome.stats.tsv, blackliste.id.similar.tsv, blacklist.pairs.threshold.tsv

Details

Strategically, run the default first (distance.method, no genome)

distance.method argument is fast, but...

you don't know if the observed comparison (close or distant) is influenced by missing values/the number of markers in common between the pair compared. This is something that needs to be considered. Be suspicious of a distant outlier from the same pop pairwise comparison, and similarly, be suspicious of a close outlier from a different pop pairwise comparisons.

If there is no outlier, don't bother running the function again with (genome = TRUE).

Relative distance

Is the normalized distance for your dataset (not calculated by strata). For each individual, it's the distance divided by the maximum distance observed. The range is limited between 0 and 1. Closer to 0 = the more similar and closer to 1, the more distant.

genome = TRUE

The function will run slower, but... If you see outliers with the first run, take the time to run the function with genome = TRUE. Because this option is much better at detecting duplicated individuals and it also shows the impact of missingness or the number of shared markers between comparisons.

Your outlier duo could well be the result of one of the individual having an extremely low number of genotypes...

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples

if (FALSE) { # \dontrun{
# First run and simplest way (if you have the tidy tibble):
dup <- radiator::detect_duplicate_genomes(data = "wombat_tidy.rad")

# This will use by default:
# distance.method = "manhattan"
# genome = FALSE
# parallel.core = all my CPUs - 1

# If you need a tidy tibble: use one of radiator \code{tidy_} function or
# \code{radiator::tidy_genomic_data}


# To view the manhattan plot:
dup$manhattan.plot.distance

# to view the data stats
dup.data.stats <- dup$distance.stats

# to view the data
dup.data <- dup$distance

# Based on the look of the distribution using both manhattan and boxplot,
# I can filter the dataset to highlight potential duplicates.

# To run the distance (with euclidean distance instead of the default manhattan,
# and also carry the second analysis (with the genome method):
dup <- radiator::tidy_genomic_data(
    data = wombat_tidy_object,
    strata = "wombat.strata.tsv",
    vcf.metadata = FALSE
) %>%
radiator::detect_duplicate_genomes(
    data = .,
    distance.method = "euclidean",
    genome = TRUE
    )

# to view the data of the genome data
dup.data <- dup$pairwise.genome.similarity

# Based on the look of the distribution using both manhattan and boxplot,
# I can filter the dataset based on 98% of identical genotype proportion,
# to highlight potential duplicates:
dup.filtered <- dplyr::filter(.data = dup.data, PROP_IDENTICAL > 0.98)

# Get the list of duplicates id
dup.list.names <- data.frame(INDIVIDUALS = unique(c(dup.filtered$ID1, dup.filtered$ID2)))
} # }