R/detect_duplicate_genomes.R
detect_duplicate_genomes.Rd
The function can compute two methods to highligh potential duplicate individuals.
distance between individuals and/or
pairwise genome similarity
detect_duplicate_genomes(
data,
interactive.filter = TRUE,
detect.duplicate.genomes = TRUE,
dup.threshold = 0,
distance.method = "manhattan",
genome = FALSE,
threshold.common.markers = NULL,
blacklist.duplicates = FALSE,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
(4 options) A file or object generated by radiator:
tidy data
Genomic Data Structure (GDS)
How to get GDS and tidy data ?
Look into tidy_genomic_data
,
read_vcf
or
tidy_vcf
.
(optional, logical) Do you want the filtering session to
be interactive. Figures of distribution are shown before asking for filtering
thresholds.
Default: interactive.filter = TRUE
.
(optional, logical) For use inside radiator pipelines.
Default: detect.duplicate.genomes = TRUE
.
Default: dup.threshold = 0
. Turn off filter.
(character) Depending on input data, 2 different methods are used (give similar results):
gds data The calculation is fast it's SNPRelate::snpgdsIBS
under
the hood.
tidy data The distance measure uses amap::Dist
.
This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary".
Using distance.method = NULL
will not run this method.
Default: distance.method = "manhattan"
. This is very fast
compared to the genome similarity method. It uses allele counts and the codes
are tailored for biallelic and multiallelic markers.
(logical) Computes pairwise genome similarity in parallel.
The proportion of the shared genotypes is averaged across shared markers between
each pairwise comparison. This method makes filtering easier because the
threshold is more intuitive with the plots produced, but it's much longer
to run, even in parallel, so better to run overnight.
Default: genome = FALSE
.
(double, optional) When using the
pairwise genome similarity approach (genome = TRUE
), using a threshold
will filter the pairwise comparison before generating the results. This is
usefull if the number of pairwise comparisons (n*(n-1)/2) is very large and
allows to reduce the number of false positive, when missing data is involved.
e.g. 2 samples might have 100
30
Default: threshold.common.markers = NULL
.
(optional, logical)
With blacklist.duplicates = TRUE
, after visualization,
the user is asked to enter a threshold to filter out duplicates.
With default, blacklist.duplicates = FALSE
, the function as no
interaction with user.
(optional) The number of core used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE
.
(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)
A list with potentially 8 objects:
$distance
: results of the distance method.
$distance.stats
: Summary statistics of the distance method.
$pairwise.genome.similarity
: results of the genome method.
$genome.stats
: Summary statistics of the genome method.
$violin.plot.distance
: violin plot showing the distribution of pairwise distances.
$manhattan.plot.distance
: same info different visual with manhattan plot.
$violin.plot.genome
: violin plot showing the distribution of pairwise genome similarities.
$manhattan.plot.genome
: same info different visual with manhattan plot.
$blacklist.id.similar
: blacklisted duplicates.
Saved in the working directory: individuals.pairwise.dist.tsv, individuals.pairwise.distance.stats.tsv, individuals.pairwise.genome.similarity.tsv, individuals.pairwise.genome.stats.tsv, blackliste.id.similar.tsv, blacklist.pairs.threshold.tsv
Strategically, run the default first (distance.method
,
no genome
)
distance.method
argument is fast, but...
you don't know if the observed comparison (close or distant) is influenced by missing values/the number of markers in common between the pair compared. This is something that needs to be considered. Be suspicious of a distant outlier from the same pop pairwise comparison, and similarly, be suspicious of a close outlier from a different pop pairwise comparisons.
If there is no outlier, don't bother running the function again with
(genome = TRUE
).
Relative distance
Is the normalized distance for your dataset (not calculated by strata). For each individual, it's the distance divided by the maximum distance observed. The range is limited between 0 and 1. Closer to 0 = the more similar and closer to 1, the more distant.
genome = TRUE
The function will run slower, but...
If you see outliers with the first run, take the time to run the function
with genome = TRUE
. Because this option is much better at detecting
duplicated individuals and it also shows the impact of missingness
or the number of shared markers between comparisons.
Your outlier duo could well be the result of one of the individual having an extremely low number of genotypes...
if (FALSE) { # \dontrun{
# First run and simplest way (if you have the tidy tibble):
dup <- radiator::detect_duplicate_genomes(data = "wombat_tidy.rad")
# This will use by default:
# distance.method = "manhattan"
# genome = FALSE
# parallel.core = all my CPUs - 1
# If you need a tidy tibble: use one of radiator \code{tidy_} function or
# \code{radiator::tidy_genomic_data}
# To view the manhattan plot:
dup$manhattan.plot.distance
# to view the data stats
dup.data.stats <- dup$distance.stats
# to view the data
dup.data <- dup$distance
# Based on the look of the distribution using both manhattan and boxplot,
# I can filter the dataset to highlight potential duplicates.
# To run the distance (with euclidean distance instead of the default manhattan,
# and also carry the second analysis (with the genome method):
dup <- radiator::tidy_genomic_data(
data = wombat_tidy_object,
strata = "wombat.strata.tsv",
vcf.metadata = FALSE
) %>%
radiator::detect_duplicate_genomes(
data = .,
distance.method = "euclidean",
genome = TRUE
)
# to view the data of the genome data
dup.data <- dup$pairwise.genome.similarity
# Based on the look of the distribution using both manhattan and boxplot,
# I can filter the dataset based on 98% of identical genotype proportion,
# to highlight potential duplicates:
dup.filtered <- dplyr::filter(.data = dup.data, PROP_IDENTICAL > 0.98)
# Get the list of duplicates id
dup.list.names <- data.frame(INDIVIDUALS = unique(c(dup.filtered$ID1, dup.filtered$ID2)))
} # }