R/filter_ld.R
filter_ld.Rd
SNP short and long distance linkage disequilibrium pruning.
What sets radiator's LD pruning apart is its set of arguments tailored to RADseq data:
minimize short distance linkage disequilibrium (LD): 5 values are available for the filter.short.ld argument (see below).
reduce long distance LD: long distance LD pruning is usually advised to avoid capturing the variance caused by LD in PCA. Using the argument filter.long.ld with values between 0.7 and 0.9 is a good starting point. Ideally, you want to visualize the LD before choosing a threshold.
Strategically, run the function with the filter.long.ld argument, then refilter the data using the outlier statistic generated by the function (printed on the figure in the output) and long.ld.missing = TRUE. This advanced argument will choose the best SNP based on missing data statistics, instead of choosing one SNP at random (see details).
This function is used internally in radiator and might be of interest for users.
filter_ld(
data,
interactive.filter = TRUE,
filter.short.ld = "mac",
filter.long.ld = NULL,
parallel.core = parallel::detectCores() - 1,
filename = NULL,
verbose = TRUE,
...
)
(4 options) A file or object generated by radiator:
tidy data
Genomic Data Structure (GDS)
How to get GDS and tidy data? Look into tidy_genomic_data, read_vcf or tidy_vcf.
(optional, logical) Do you want the filtering session to be interactive? Figures of the distributions are shown before the function asks for filtering thresholds.
Default: interactive.filter = TRUE.
(character) 5 options (default: filter.short.ld = "mac"):
filter.short.ld = "random" for a random selection of 1 SNP on the read,
filter.short.ld = "first" for the first SNP on the read,
filter.short.ld = "last" for the last SNP on the read,
filter.short.ld = "middle" for loci with > 2 SNPs/read: one SNP is selected at random between the first and the last SNP on the read. If the locus has <= 2 SNPs on the read, the first one is selected. Note that for this option, the numbers are reported.
filter.short.ld = "mac" will select the SNP on the locus with the maximum global Minor Allele Count (MAC).
Using filter.short.ld = NULL skips this filter.
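To make the "mac" option concrete, here is a minimal base R sketch of the idea: for each locus (read), keep the SNP with the largest global MAC. It is an illustration only, not radiator's internal code; the toy table, its column names (locus, snp, mac) and the helper keep_max_mac() are hypothetical.

```r
# Toy marker table (hypothetical): two loci, each SNP with a global MAC.
markers <- data.frame(
  locus = c("loc1", "loc1", "loc1", "loc2", "loc2"),
  snp   = c("loc1_12", "loc1_45", "loc1_80", "loc2_07", "loc2_66"),
  mac   = c(4L, 19L, 7L, 11L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep, for each locus, the SNP with the maximum MAC.
# which.max() returns the first maximum, so ties go to the first SNP listed.
keep_max_mac <- function(x) {
  picked <- lapply(split(x, x$locus), function(d) d[which.max(d$mac), ])
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out
}

whitelist <- keep_max_mac(markers)
print(whitelist)
```

The other options differ only in the selection rule applied within each locus (first row, last row, or a random row).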
(optional, double) The threshold used to prune SNPs based on long distance linkage disequilibrium. The value of filter.long.ld is compared with the absolute value of the LD measurement.
Default: filter.long.ld = NULL.
(optional) The number of cores used for parallel execution during import.
Default: parallel.core = parallel::detectCores() - 1.
(optional, character) File name prefix for files written in the working directory.
Default: filename = NULL.
(optional, logical) When verbose = TRUE the function is a little more chatty during execution.
Default: verbose = TRUE.
(optional) Advanced mode that allows passing further arguments to fine-tune the function. Also used for legacy arguments (see details or special section).
A list in the global environment, with these objects:
$ld.summary: tibble with LD statistics used for the boxplot
$ld.boxplot: box plot of LD values
$whitelist.ld: whitelist of markers kept after filtering for LD. The argument filter.long.ld must be used to generate the whitelist.
$blacklist.ld: blacklist of markers pruned during the filtering for LD. The argument filter.long.ld must be used to generate the blacklist.
$data: The filtered tidy dataset.
$gds: the path to the GDS file.
The function requires SNPRelate (see example below on how to install).
Advanced mode, using dots-dots-dots (...)
maf.data: (path) this argument is no longer supported. It's a small cost in time in favour of making sure the MAC/MAF fits the actual data.
long.ld.missing: (logical) With long.ld.missing = TRUE, the function first generates long distance LD values between markers along the same chromosome or scaffold with SNPRelate::snpgdsLDMat. Based on the LD threshold (filter.long.ld), SNPs in LD are then pruned based on missingness, e.g. if 4 SNPs are in LD, the 1 SNP selected in the end is based on genotyping rate/missingness. If this statistic is equal between the SNPs in LD, 1 SNP is chosen randomly.
Using missingness adds extra computational time. To speed up the analysis when missingness between markers is not an issue, use long.ld.missing = FALSE. The function will then use SNPRelate::snpgdsLDpruning to prune the dataset, and SNPs in LD are selected randomly.
Default: long.ld.missing = FALSE.
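The selection rule described for long.ld.missing = TRUE can be sketched in a few lines of base R: among a group of SNPs in LD, keep the one with the least missing data and break ties at random. This is only an illustration under assumed inputs, not the code radiator runs; the ld_group table, its columns and the helper pick_by_missingness() are hypothetical.

```r
# Hypothetical group of 4 SNPs found in LD above the threshold,
# with the proportion of missing genotypes for each.
ld_group <- data.frame(
  snp = c("snpA", "snpB", "snpC", "snpD"),
  missing_prop = c(0.12, 0.03, 0.03, 0.20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the SNP with the lowest missingness.
# If several SNPs share the lowest value, one of them is drawn at random.
pick_by_missingness <- function(x) {
  best <- x$snp[x$missing_prop == min(x$missing_prop)]
  if (length(best) > 1L) best <- sample(best, size = 1L)
  best
}

set.seed(42)  # only to make the random tie-break reproducible
kept <- pick_by_missingness(ld_group)
blacklisted <- setdiff(ld_group$snp, kept)
print(kept)
print(blacklisted)
```

In this toy group snpB and snpC tie on missingness, so the kept SNP is one of those two and the remaining three SNPs end up on the blacklist.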
ld.method: (optional, character) The values available are "composite" for LD composite measure, "r" for R coefficient (by EM algorithm assuming HWE, it could be negative), "r2" for r^2, "dprime" for D', and "corr" for correlation coefficient. The methods "corr" and "composite" are equivalent when SNPs are coded based on the presence of the alternate allele (0, 1, 2).
Default: ld.method = "r2".
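As a rough illustration of what a correlation-based LD value looks like on genotypes coded 0, 1, 2, the base R sketch below squares the pairwise Pearson correlation between SNP dosage vectors and flags pairs at or above a threshold. It corresponds to the idea behind "corr", not to the EM-based "r" or "r2" estimates computed by SNPRelate; the genotype matrix and the 0.8 threshold are made up for the example.

```r
# Hypothetical genotypes for 8 individuals at 3 SNPs, coded as the
# number of alternate alleles (0, 1, 2).
geno <- cbind(
  snp1 = c(0, 1, 2, 1, 0, 2, 1, 0),
  snp2 = c(0, 1, 2, 1, 0, 2, 1, 1),
  snp3 = c(2, 0, 1, 2, 1, 0, 0, 1)
)

# Squared Pearson correlation between dosage vectors, for every SNP pair.
# Missing genotypes (NA), if any, are dropped pair by pair.
r2 <- cor(geno, use = "pairwise.complete.obs")^2

# Flag each pair once (upper triangle) when it reaches the threshold.
threshold <- 0.8
in_ld <- which(r2 >= threshold & upper.tri(r2), arr.ind = TRUE)
print(round(r2, 3))
print(in_ld)
```

Squaring removes the sign of the correlation, which is why a single positive threshold is enough to catch both positively and negatively correlated SNP pairs.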
ld.figures: (logical) Generate long distance LD statistics and figures.
Default: ld.figures = TRUE.
path.folder: to write output in a specific path (used internally in radiator). Default: path.folder = getwd(). If the supplied directory doesn't exist, it's created.
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 28: 3326-3328. doi:10.1093/bioinformatics/bts606
if (FALSE) { # \dontrun{
# To install SNPRelate (only needed once):
install.packages("BiocManager")
BiocManager::install("SNPRelate")
require(SNPRelate)
library(radiator)
data <- radiator::read_vcf(data = "my.vcf", strata = "my.strata.tsv", verbose = TRUE)
# short distance LD, no long distance LD:
check.short.ld <- radiator::filter_ld(data = data, filter.short.ld = "mac")
# short distance LD and long distance LD:
pruned.ld <- radiator::filter_ld(
data = data,
filter.short.ld = "mac",
filter.long.ld = 0.8)
# short distance LD and long distance LD, incorporating missing data:
pruned.ld <- radiator::filter_ld(
data = data, # a GDS object generated by radiator
filter.short.ld = "mac",
filter.long.ld = 0.8,
long.ld.missing = TRUE)
} # }