The goal of this module is to provide a simple solution to a complicated problem: missing genotypes in RADseq genomic datasets. This function performs map-independent imputations of missing genotypes.
Key features:
Imputation methods: several machine learning algorithms: Random forests (predictive and on-the-fly-imputations); Extreme gradient tree boosting (XGBoost); Fast, distributed, high-performance gradient boosting (LightGBM). The module also allows comparing these algorithms with the classic Strawman imputation (~ max/mean/mode: the most frequently observed, i.e. non-missing, genotype is used).
Hierarchical level: Imputations conducted by strata or globally.
SNP linkage/Haplotypes: Correlation among SNPs is accounted for during rf and tree boosting imputations, i.e. the imputations are automatically conducted by haplotypes when marker meta-information is available (chromosome, locus and position, usually taken from VCF files). The alternative considers all markers independent, and imputation is conducted by SNPs/markers.
Genotype likelihood (GL): The GL info is detected automatically (GL column in the FORMAT field of VCF files). Genotypes with higher likelihood will have higher probability during bootstrap sampling of trees in Random forests (under development).
Predictive mean matching: random forest in prediction mode uses a fast k-nearest neighbor (KNN) searching algorithm (see argument documentation and details below).
Optimized for speed: There's a tutorial and a vignette detailing the procedure to install the packages from source to enable OpenMP. Depending on the algorithm used, a progress bar is sometimes available to see if you have time for a coffee break!
Before using this function to populate the original dataset with synthetic data, grur assumes that you:
filter your data: please check out filter_rad.
reduce correlations: machine learning algorithms will work better and faster if correlations are reduced. Check your dataset for short- and long-range LD with filter_rad and filter_ld.
check patterns of individual heterozygosity: if you have individual heterozygosity patterns and/or correlation of individual heterozygosity with missingness, you might want to skip imputation and go back to the wet-lab. Please check out detect_mixed_genomes and detect_het_outliers.
check patterns of missingness: look for patterns of missingness with missing_visualization to better tailor the arguments inside this function and explore the reasons for their presence (see vignette).
default arguments: use for testing only. Please, please, please, don't use defaults for publications! A sketch of this pre-imputation workflow is shown below.
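A minimal sketch of these pre-imputation checks is given below. The file names are placeholders and the argument names (data, strata) are assumptions; check each function's documentation for the exact interface.

# 1. Filter the data, including short- and long-range LD
# (radiator::filter_rad and radiator::filter_ld)
library(radiator)
library(grur)
dat <- radiator::filter_rad(data = "batch_1.vcf", strata = "strata.tsv")

# 2. Inspect individual heterozygosity and heterozygosity outliers
grur::detect_mixed_genomes(data = dat)
grur::detect_het_outliers(data = dat)

# 3. Explore patterns of missingness before deciding to impute
grur::missing_visualization(data = dat, strata = "strata.tsv")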
grur_imputations(
  data,
  output = NULL,
  strata = NULL,
  imputation.method = NULL,
  hierarchical.levels = "strata",
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  filename = NULL,
  ...
)
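For instance, a basic call following the usage above might look like this (a sketch only; the file names are placeholders, and remember not to rely on default arguments for publications):

# Hypothetical call: impute by strata with random forest
# on-the-fly-imputations (file names are placeholders).
imputed <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  hierarchical.levels = "strata",
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  filename = "turtles.imputed"
)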
Arguments:
data
(4 options) A file or object generated by radiator. How to get GDS and tidy data? Look into radiator's documentation.
output
29 genomic data formats can be exported: tidy (by default), genepop, genind, genlight, vcf (for file format version, see details below), plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG), bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray, snprelate, maverick, genepopedit, rubias, hapmap and dadi. Use a character string, e.g. output = "genind". Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference). Default: output = NULL (tidy data).
strata
(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Default: strata = NULL.
imputation.method
(character, optional) Method used for the map-independent imputations of missing genotypes (see details for more info): Strawman imputation ("max"), random forest on-the-fly-imputations ("rf"), random forest prediction ("rf_pred"), extreme gradient tree boosting (XGBoost) or LightGBM ("lightgbm"). Default: imputation.method = NULL.
hierarchical.levels
(character, optional) Imputations conducted by "strata" or globally ("global"). Note that imputing genotypes globally in conjunction with imputation.method = "max" will likely create bias (see details). Default: hierarchical.levels = "strata".
verbose
(optional, logical) When verbose = TRUE, the function is a bit more chatty during execution. Default: verbose = TRUE.
parallel.core
(optional, integer) The number of cores used for parallel processing. For some algorithms, markers are imputed in parallel and strata are processed sequentially. Default: parallel.core = parallel::detectCores() - 1.
filename
(optional, character) The name of the file written to the working directory; the appropriate file extension is appended to the filename provided. With the default, the imputed tidy data frame is only returned in the global environment. Default: filename = NULL.
...
(optional) Advanced mode that allows passing further arguments for fine-tuning the function. Also used for legacy arguments (see advanced mode or special sections below).
The output in your global environment is the imputed tidy data frame. If filename is provided, the imputed tidy data frame is also written to the working directory.
Predictive mean matching:
Random Forests already behave like a nearest neighbor classifier with an adaptive metric. Now we have the option to conduct predictive mean matching (PMM) on top of the prediction-based missing value imputation. PMM tries to raise the variance in the resulting conditional distributions to a realistic level. The closest k predicted values are identified by a fast k-nearest neighbour approach wrapped in the package missRanger. The returned value corresponds to the mean value, as sketched below.
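To make the idea concrete, here is a toy base R sketch of the PMM step. This is not grur's or missRanger's internal code; missRanger uses an optimized KNN search.

# Toy illustration of predictive mean matching (PMM) with k neighbors.
pmm_impute <- function(pred_missing, pred_observed, obs_values, k = 2) {
  sapply(pred_missing, function(p) {
    # indices of the k observed predictions closest to the prediction
    # made for the missing genotype
    nn <- order(abs(pred_observed - p))[seq_len(k)]
    # returned value corresponds to the mean of the k candidates
    mean(obs_values[nn])
  })
}

# Example: predictions for observed genotypes and their true values
pred_obs <- c(0.1, 0.4, 0.9, 1.6, 1.9)
obs_gt   <- c(0,   0,   1,   2,   2)
pmm_impute(pred_missing = c(0.3, 1.8), pred_observed = pred_obs,
           obs_values = obs_gt, k = 2)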
haplotype/SNP approach:
The haplotype approach is automatically used when marker meta-information is detected (chromosome/CHROM, locus/ID and SNP/POS columns, usually from a VCF file). Imputation of missing genotypes from SNPs on the same locus or the same RADseq read is undertaken simultaneously to account for the correlation of the linked SNPs. When one or more SNPs on the same read/haplotype are missing, the haplotype is deleted; consequently, imputation might result in different genotypes for those SNPs that were not missing. This approach is much safer than potentially creating weird chimeras during haplotype imputations. Alternatively, a SNP approach is used, and the SNPs are considered independent. Imputation of genotypes is then conducted for each marker separately.
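For illustration, a hypothetical tidy data snippet carrying the meta-information that triggers the haplotype approach might look like the sketch below. The column names and genotype coding follow the radiator tidy format and are assumptions here, not a specification.

# SNPs sharing the same CHROM/LOCUS are imputed together (haplotype
# approach); without CHROM/LOCUS/POS, each SNP is imputed independently.
tidy.snippet <- tibble::tibble(
  INDIVIDUALS = c("IND_01", "IND_01", "IND_02", "IND_02"),
  STRATA      = c("POP_A",  "POP_A",  "POP_B",  "POP_B"),
  CHROM       = c("un",     "un",     "un",     "un"),
  LOCUS       = c("2245",   "2245",   "2245",   "2245"),
  POS         = c("45",     "78",     "45",     "78"),
  GT          = c("001001", "002002", "001002", "002002")
)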
Imputing globally or by strata?
The hierarchical.levels = "global" argument will act differently depending on the imputation.method selected.
Strawman imputations (~ max/mean/mode) considerations:
Using imputation.method = "max" with hierarchical.levels = "global" will likely create bias.
Example 1 (unbalanced sample size): Consider 2 populations evolving more by drift than selection: pop1 (n = 36) and pop2 (n = 50). You'll likely have a few polymorphic markers where pop1 and pop2 are monomorphic for different alleles (pop1 is fixed for the minor/ALT allele and pop2 is fixed for the major/REF allele). Missing genotypes in pop1, imputed with the most common filling technique in the literature (mean/mode/max), will result in pop1 having individuals with the REF allele. Not something you want... unless your population membership is not 100% certain (e.g. you might have migrants or wrong assignations), in which case you still don't want to impute with imputation.method = "max" (see alternative below).
Example 2 (balanced sample size): pop1 (n = 100) and pop2 (n = 100). For a particular marker, pop1 has 85 individuals genotyped and pop2 has 100. Again, if the populations are fixed for different alleles (pop1 = ALT and pop2 = REF), you will end up having the REF allele in your pop1. Not something you want... unless your population membership is not 100% certain (e.g. you might have migrants or wrong assignations), in which case you still don't want to impute with imputation.method = "max" (see alternative below). A toy illustration of this bias is shown below.
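The bias described in these examples can be reproduced with a few lines of base R (a toy illustration only, not grur code):

# Toy version of Example 1: pop1 fixed for ALT, pop2 fixed for REF.
# A global mode ("max") imputation fills pop1's missing genotypes with REF.
pop <- c(rep("pop1", 36), rep("pop2", 50))
gt  <- c(rep("ALT/ALT", 36), rep("REF/REF", 50))
gt[1:5] <- NA  # 5 missing genotypes in pop1

global.mode <- names(which.max(table(gt)))                 # "REF/REF" (50 vs 31)
strata.mode <- names(which.max(table(gt[pop == "pop1"])))  # "ALT/ALT"

gt.global <- replace(gt, is.na(gt), global.mode)  # pop1 now carries REF alleles
gt.strata <- replace(gt, is.na(gt), strata.mode)  # stays ALT within pop1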
Random Forests imputations:
Random Forests use machine learning and you can take this into account while choosing argument values. Uncertain of the groupings? Use random forests with hierarchical.levels = "global". Random forests will account for the potential linkage and correlation between markers and genotypes to make the best imputations available. This can potentially result in genotypes for a certain population/marker combination with new groupings (e.g. a new allele). This is much more accurate and not the same thing as imputation.method = "max", because the imputed genotype is validated by considering all the other variables (all the other markers genotyped for the individual). Test the option and report bugs if you find one.
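A hypothetical call for that situation (file names are placeholders):

# Groupings uncertain, so impute globally with random forest and let
# the algorithm exploit marker correlations across the whole dataset.
imputed.global <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  hierarchical.levels = "global"
)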
random forest with on-the-fly-imputation (rf): the technique is described in Tang and Ishwaran (2017). Non-missing genotypes are used for the split-statistics. Daughter node membership assignment uses random non-missing genotypes from the in-bag data. Missing genotypes are imputed at terminal nodes using the maximal class rule with out-of-bag non-missing genotypes. Example of computation time: 1500 individuals, 20 000 markers and 2 strata take around 10 hours on 6 CPUs to complete the imputations.
random forest as a prediction problem (rf_pred): markers with missing genotypes are imputed one at a time. The fitted forest is used to predict the missing genotypes. Missingness in the response variables is incorporated as attributes for growing the forest.
boosting trees: the prediction method is used for both XGBoost and LightGBM.
LightGBM and XGBoost parameter optimization: see the dedicated parameter sections below.
Reference genome or linkage map available? Numerous map-dependent approaches are available and more appropriate; please search the literature (references).
What's the simple imputation message when running the function?
Before conducting the imputations by strata with the machine learning algorithms, the data is first screened for markers that are monomorphic WITHIN the strata/populations. For those cases it's clear what the missing genotypes should be, so the imputation is very simple: missing genotypes are imputed with the only genotype found for that particular stratum/population. There's no need for a fancy method here. The small cost in time is worth it, because the model inside the machine learning algorithms will benefit from having a more complete and reliable genotype matrix. A simplified sketch of this pre-screening is shown below.
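A simplified sketch of that pre-screening step, assuming a tidy data frame tidy.data with STRATA, MARKERS and a character GT column where missing genotypes are coded NA; this is not grur's internal code.

# Markers monomorphic within a stratum have their missing genotypes
# filled with the only genotype observed in that stratum.
library(dplyr)

simple.impute <- tidy.data %>%
  group_by(STRATA, MARKERS) %>%
  mutate(
    n.geno = n_distinct(GT[!is.na(GT)]),
    GT = if_else(is.na(GT) & n.geno == 1, unique(GT[!is.na(GT)])[1], GT)
  ) %>%
  ungroup() %>%
  select(-n.geno)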
dots-dots-dots ... allows passing several arguments for fine-tuning the imputations:
subsample.markers (optional, integer) To speed up computation and rapidly test the function's arguments (e.g. using 200 markers). Default: subsample.markers = NULL.
random.seed (integer, optional) For reproducibility, set an integer that will be used to initialize the random generator. With the default, a random number is generated. Currently not implemented for XGBoost and LightGBM. Default: random.seed = NULL.
pmm (integer, optional) Predictive mean matching used in conjunction with random forest prediction and LightGBM (imputation.method = "rf_pred" or imputation.method = "lightgbm"). Number of candidate non-missing values to sample from during the predictive mean matching step. A fast k-nearest neighbor searching algorithm is used with this approach. pmm = 2 will use 2 neighbors. Default: pmm = 0, which avoids this step.
cpu.boost (optional, integer) Number of cores for XGBoost and LightGBM. For the best speed, set this to the number of real CPU cores, not the number of threads. Most CPUs use hyper-threading to generate 2 threads per CPU core. Be aware that the task manager or any CPU monitoring tool might report cores not being fully utilized; this is normal. Default: cpu.boost = parallel::detectCores() / 2. A hypothetical example combining these fine-tuning arguments is shown below.
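A hypothetical quick test combining these fine-tuning arguments (file names are placeholders):

# Subsample 200 markers, fix the seed and add predictive mean matching
# with 2 neighbors (rf_pred supports pmm).
test.run <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf_pred",
  subsample.markers = 200,
  random.seed = 42,
  pmm = 2
)

# For the boosting methods, cpu.boost controls the number of physical
# cores used by XGBoost/LightGBM.
boost.run <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "lightgbm",
  cpu.boost = 4
)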
eXtreme Gradient Boosting tree method: eta, gamma, max_depth, min_child_weight, subsample, colsample_bytree, num_parallel_tree, nrounds, save_name and early_stopping_rounds. Documented in xgboost.
LightGBM: boosting, objective, learning_rate, feature_fraction, bagging_fraction, bagging_freq, max_depth, min_data_in_leaf, num_leaves, early_stopping_rounds and nrounds. Documented in LightGBM.
iteration.subsample: the number of iterations used to train the model with new subsamples. Default: iteration.subsample = 2.
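A hypothetical LightGBM fine-tuning call; the parameter values below are placeholders only (see the LightGBM documentation for guidance on choosing them):

# Pass LightGBM parameters through ... along with iteration.subsample.
lgbm.tuned <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "lightgbm",
  learning_rate = 0.05,
  num_leaves = 31,
  bagging_fraction = 0.8,
  bagging_freq = 5,
  nrounds = 100,
  iteration.subsample = 2
)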
Random forests methods: num.tree, nodesize, nsplit and nimpute. Documented in randomForestSRC.
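A hypothetical random forest fine-tuning call with the randomForestSRC parameters (placeholder values only):

# Pass random forest parameters through ... for the "rf" method.
rf.tuned <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  num.tree = 100,
  nsplit = 10,
  nimpute = 3
)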
Deprecated arguments:
hierarchical.levels = "populations": update your code with "strata".
imputations.group: now replaced by hierarchical.levels.
impute: no longer available. Imputing with the impute = "allele" option was wrong because it was using F1 genotypes for imputations. Now imputation is only conducted at the genotype level.
iteration.rf: no longer used. This argument is now available inside the ... for on-the-fly-imputations (see details). The default is now set to 10 iterations.
split.number: automatically set.
A sketch of updating a legacy call is shown below.
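A sketch of updating a legacy call (the old argument names are shown as comments; dat is a placeholder object):

# Old (deprecated):
# grur_imputations(data = dat, imputations.group = "populations")
# New:
imputed <- grur_imputations(
  data = dat,
  strata = "strata.tsv",
  imputation.method = "rf",
  hierarchical.levels = "strata"
)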
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Chen T, Guestrin C. (2016). XGBoost: A scalable tree boosting system. doi:10.1145/2939672.2939785
Tang F, Ishwaran H. (2017). Random Forest Missing Data Algorithms. arXiv preprint, 1-24.
The package ranger (see Wright and Ziegler, 2016) provides a fast C++ version of the original implementation of rf from Breiman (2001).
The package randomForestSRC (see Tang and Ishwaran, 2017) provides a fast on-the-fly imputation method.
The package XGBoost (Chen and Guestrin, 2016) provides a fast C++ implementation for the extreme gradient tree boosting algorithm.
The package LightGBM provides a fast, distributed, high-performance implementation of gradient tree boosting.
Thierry Gosselin thierrygosselin@icloud.com