SNP short and long distance linkage disequilibrium pruning.

What sets radiator LD pruning apart is its arguments tailored to RADseq data:

• minimize short distance linkage disequilibrium (LD): 5 values are available for the filter.short.ld argument (see below).

• reduce long distance LD: long distance LD pruning is usually advised to avoid capturing the variance due to LD in PCA.

Using the argument filter.long.ld with values between 0.7 and 0.9 is a good starting point. Ideally, you want to visualize the LD before choosing a threshold.

Strategically, run the function with the filter.long.ld argument first, then refilter the data using the outlier statistic generated by the function (printed on the figure in the output) together with long.ld.missing = TRUE. This advanced argument chooses the best SNP based on missing data statistics, instead of choosing one SNP at random (see details).

This function is used internally in radiator and might be of interest for users.

filter_ld(
data,
interactive.filter = TRUE,
filter.short.ld = "mac",
filter.long.ld = NULL,
parallel.core = parallel::detectCores() - 1,
filename = NULL,
verbose = TRUE,
...
)

## Arguments

• data: (4 options) A file or object generated by radiator: tidy data, Genomic Data Structure (GDS). How to get GDS and tidy data? Look into tidy_genomic_data, read_vcf or tidy_vcf.

• interactive.filter: (optional, logical) Do you want the filtering session to be interactive? Figures of distribution are shown before asking for filtering thresholds. Default: interactive.filter = TRUE.

• filter.short.ld: (character) 5 options (default: filter.short.ld = "mac"). filter.short.ld = "random" selects 1 SNP at random on the read. filter.short.ld = "first" selects the first SNP on the read. filter.short.ld = "last" selects the last SNP on the read. filter.short.ld = "middle", for loci with > 2 SNPs/read, selects at random one SNP between the first and the last SNP on the read; if the locus has <= 2 SNPs on the read, the first one is selected. Note that for this last option, the numbers are reported. filter.short.ld = "mac" selects the SNP on the locus with the maximum global Minor Allele Count (MAC). Using filter.short.ld = NULL skips this filter.

• filter.long.ld: (optional, double) The threshold used to prune SNPs based on long distance linkage disequilibrium. The threshold is expressed as the absolute value of the LD measurement. Default: filter.long.ld = NULL.

• parallel.core: (optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

• filename: (optional, character) File name prefix for files written in the working directory. Default: filename = NULL.

• verbose: (optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

• ...: (optional) Advanced mode that allows passing further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section).
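
To make the filter.short.ld options more concrete, here is a minimal base R sketch of what keeping one SNP per locus could look like for the "mac", "first" and "last" options. It is only an illustration on a toy table, not radiator's internal code: the data frame `snps`, its columns (`LOCUS`, `POS`, `MAC`) and the helper `keep_max_mac()` are hypothetical names, and rows are assumed to be sorted by position within each locus.

```r
# Toy SNP table (hypothetical): one row per SNP, with the locus (read)
# it sits on, its position on the read and its global minor allele count.
snps <- data.frame(
  LOCUS = c("L1", "L1", "L1", "L2", "L3", "L3"),
  POS   = c(12, 45, 80, 33, 5, 61),
  MAC   = c(4, 19, 7, 10, 3, 3),
  stringsAsFactors = FALSE
)

# "mac"-like selection: for each locus, keep the SNP with the highest MAC.
# which.max() returns the first maximum, so ties fall back to the first SNP
# (an assumption of this sketch, not a documented radiator behaviour).
keep_max_mac <- function(df) {
  idx <- tapply(
    seq_len(nrow(df)), df$LOCUS,
    function(i) i[which.max(df$MAC[i])]
  )
  df[sort(as.integer(idx)), , drop = FALSE]
}

pruned_mac <- keep_max_mac(snps)

# "first"- and "last"-like selections, for comparison.
pruned_first <- snps[!duplicated(snps$LOCUS), ]
pruned_last  <- snps[!duplicated(snps$LOCUS, fromLast = TRUE), ]
```

In this toy example each of the three loci ends up with a single SNP, and only the choice of which SNP differs between options.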

## Value

A list in the global environment, with these objects:

1. $ld.summary: tibble with LD statistics used for the boxplot.
2. $ld.boxplot: box plot of LD values.
3. $whitelist.ld: whitelist of markers kept after filtering for LD. The argument filter.long.ld must be used to generate the whitelist.
4. $blacklist.ld: blacklist of markers pruned during the filtering for LD. The argument filter.long.ld must be used to generate the blacklist.
5. $data: the filtered tidy dataset.
6. $gds: the path to the GDS file.

## Details

The function requires SNPRelate (see example below on how to install).

• maf.data (path) this argument is no longer supported. It's a small cost in time in favour of making sure the MAC/MAF fits the actual data.

• long.ld.missing: (logical) With long.ld.missing = TRUE, the function first generates long distance LD values between markers along the same chromosome or scaffold with SNPRelate::snpgdsLDMat. Based on the LD threshold (filter.long.ld), SNPs in LD are then pruned based on missingness, e.g. if 4 SNPs are in LD, the 1 SNP selected in the end is based on genotyping rate/missingness. If this statistic is equal between the SNPs in LD, 1 SNP is chosen randomly.

Using missingness adds extra computational time. To speed up the analysis when missingness between markers is not an issue, use long.ld.missing = FALSE. The function will then use SNPRelate::snpgdsLDpruning to prune the dataset, and SNPs in LD are selected randomly. Default: long.ld.missing = FALSE.

• ld.method: (optional, character) The values available are "composite" for the LD composite measure, "r" for the R coefficient (by EM algorithm assuming HWE, it could be negative), "r2" for r^2, "dprime" for D' and "corr" for the correlation coefficient. The methods "corr" and "composite" are equivalent when SNPs are coded based on the presence of the alternate allele (0, 1, 2). Default: ld.method = "r2".

• ld.figures: (logical) Generate long distance LD statistics and figures. Default: ld.figures = TRUE

• path.folder: (path) to write output in a specific path (used internally in radiator). Default: path.folder = getwd(). If the supplied directory doesn't exist, it's created.
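
The idea behind long.ld.missing = TRUE (keep the SNP with the best genotyping rate among SNPs in LD, with a random choice on ties) can be sketched in base R on a toy example. This is a simplified greedy version written for illustration only; it is an assumption about one reasonable way to do it, not the algorithm radiator or SNPRelate actually run. The matrix `r2`, the vector `missing_prop` and the helper `prune_by_missingness()` are hypothetical.

```r
set.seed(42)  # only so that the random tie-break is reproducible

# Toy pairwise r^2 matrix for 5 SNPs on one chromosome (hypothetical values).
markers <- paste0("snp", 1:5)
r2 <- matrix(0.1, nrow = 5, ncol = 5, dimnames = list(markers, markers))
diag(r2) <- 1
r2["snp1", "snp2"] <- r2["snp2", "snp1"] <- 0.95
r2["snp2", "snp3"] <- r2["snp3", "snp2"] <- 0.85
r2["snp1", "snp3"] <- r2["snp3", "snp1"] <- 0.90

# Proportion of missing genotypes per SNP (hypothetical values).
missing_prop <- c(snp1 = 0.20, snp2 = 0.05, snp3 = 0.10,
                  snp4 = 0.30, snp5 = 0.00)

prune_by_missingness <- function(ld, miss, threshold = 0.8) {
  # Visit SNPs from least to most missing; ties are broken at random.
  visit <- names(miss)[order(miss, runif(length(miss)))]
  kept <- character(0)
  for (m in visit) {
    # Keep the SNP only if it is not in LD with any SNP already kept.
    if (all(abs(ld[m, kept]) <= threshold)) kept <- c(kept, m)
  }
  kept
}

whitelist <- prune_by_missingness(r2, missing_prop, threshold = 0.8)
blacklist <- setdiff(markers, whitelist)
```

Here snp1, snp2 and snp3 are all in LD above the 0.8 threshold, and snp2 is the one kept because it has the lowest proportion of missing genotypes.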

## References

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 28: 3326-3328. doi:10.1093/bioinformatics/bts606

## Author

Thierry Gosselin thierrygosselin@icloud.com

## Examples

if (FALSE) {
# To install SNPRelate (from Bioconductor):
install.packages("BiocManager")
BiocManager::install("SNPRelate")
require(SNPRelate)

# short distance LD, no long distance LD:
check.short.ld <- radiator::filter_ld(data = data, filter.short.ld = "mac")

# short distance LD and long distance LD:
check.long.ld <- radiator::filter_ld(
data = data,
filter.short.ld = "mac",
filter.long.ld = 0.8)

# short distance LD and long distance LD, incorporating missing data: