The goal of this module is to provide a simple solution to a complicated problem: missing genotypes in RADseq genomic datasets. This function performs map-independent imputations of missing genotypes.
Key features:
Imputation methods: several machine learning algorithms: Random forests (predictive and on-the-fly-imputations); Extreme gradient tree boosting (XGBoost); Fast, distributed, high-performance gradient boosting (LightGBM). The module also allows comparing these algorithms with the classic Strawman imputation (~ max/mean/mode: the most frequently observed, i.e. non-missing, genotype is used).
Hierarchical level: Imputations conducted by strata or globally.
SNP linkage/Haplotypes: Correlation among SNPs is accounted for during rf and tree boosting imputations, i.e. the imputations are automatically conducted by haplotypes when marker meta-information is available (chromosome, locus and position, usually taken from VCF files). The alternative considers all markers independent, and imputation is conducted by SNPs/markers.
Genotype likelihood (GL): The GL info is detected automatically (GL column in the FORMAT field of VCF files). Genotypes with higher likelihood will have higher probability during bootstrap sampling of trees in Random forests (under development).
Predictive mean matching: random forest in prediction mode uses a fast k-nearest neighbor (KNN) searching algorithm (see argument documentation and details below).
Optimized for speed: There's a tutorial and a vignette detailing the procedure to install the packages from source to enable OpenMP. Depending on the algorithm used, a progress bar is sometimes available to see if you have time for a coffee break!
Before using this function to populate the original dataset with synthetic data, grur assumes that you:
filter your data: please check out filter_rad.
reduce correlations: machine learning algorithms will work better and faster if correlations are reduced. Check your dataset for short- and long-range LD with filter_rad and filter_ld.
check patterns of individual heterozygosity: if you have individual heterozygosity patterns and/or correlation of individual heterozygosity with missingness, you might want to skip imputation and go back to the wet-lab. Please check out detect_mixed_genomes and detect_het_outliers.
check patterns of missingness: look for patterns of missingness with missing_visualization to better tailor the arguments inside this function and explore the reasons for their presence (see vignette).
default arguments: use for testing only. Please, please, please, don't use defaults for publications! A sketch of this pre-imputation workflow is shown below.
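A minimal sketch of these pre-imputation checks is given below. The file names are placeholders and the argument names (data, strata) are assumptions; check each function's documentation for the exact interface.

# 1. Filter the data, including short- and long-range LD
# (radiator::filter_rad and radiator::filter_ld)
library(radiator)
library(grur)
dat <- radiator::filter_rad(data = "batch_1.vcf", strata = "strata.tsv")

# 2. Inspect individual heterozygosity and heterozygosity outliers
grur::detect_mixed_genomes(data = dat)
grur::detect_het_outliers(data = dat)

# 3. Explore patterns of missingness before deciding to impute
grur::missing_visualization(data = dat, strata = "strata.tsv")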
grur_imputations(
  data,
  output = NULL,
  strata = NULL,
  imputation.method = NULL,
  hierarchical.levels = "strata",
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  filename = NULL,
  ...
)
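For instance, a basic call following the usage above might look like this (a sketch only; the file names are placeholders, and remember not to rely on default arguments for publications):

# Hypothetical call: impute by strata with random forest
# on-the-fly-imputations (file names are placeholders).
imputed <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  hierarchical.levels = "strata",
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  filename = "turtles.imputed"
)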
Arguments:
data
(4 options) A file or object generated by radiator. How to get GDS and tidy data? Look into radiator's documentation.
output
29 genomic data formats can be exported: tidy (by default), genepop, genind, genlight, vcf (for file format version, see details below), plink, structure, faststructure, arlequin, hierfstat, gtypes (strataG), bayescan, betadiv, pcadapt, hzar, fineradstructure, related, seqarray, snprelate, maverick, genepopedit, rubias, hapmap and dadi. Use a character string, e.g. output = "genind". Make sure to read the particularities of each format; some might require extra columns in the strata file. You can find the info in the corresponding write_ functions of radiator (reference). Default: output = NULL (tidy data).
strata
(optional) The strata file is a tab-delimited file with a minimum of 2 column headers: INDIVIDUALS and STRATA. Default: strata = NULL.
imputation.method
(character, optional) Method used for the map-independent imputations of missing genotypes (see details for more info): Strawman imputation ("max"), random forest on-the-fly-imputations ("rf"), random forest prediction ("rf_pred"), extreme gradient tree boosting (XGBoost) or LightGBM ("lightgbm"). Default: imputation.method = NULL.
hierarchical.levels
(character, optional) Imputations conducted by "strata" or globally ("global"). Note that imputing genotypes globally in conjunction with imputation.method = "max" will likely create bias (see details). Default: hierarchical.levels = "strata".
verbose
(optional, logical) When verbose = TRUE, the function is a bit more chatty during execution. Default: verbose = TRUE.
parallel.core
(optional, integer) The number of cores used for parallel processing. For some algorithms, markers are imputed in parallel and strata are processed sequentially. Default: parallel.core = parallel::detectCores() - 1.
filename
(optional, character) The name of the file written to the working directory; the appropriate file extension is appended to the filename provided. With the default, the imputed tidy data frame is only returned in the global environment. Default: filename = NULL.
...
(optional) Advanced mode that allows passing further arguments for fine-tuning the function. Also used for legacy arguments (see advanced mode or special sections below).
The output in your global environment is the imputed tidy data frame. If filename is provided, the imputed tidy data frame is also written to the working directory.
Predictive mean matching:
Random Forests already behave like a nearest neighbor classifier with an adaptive metric. Now we have the option to conduct predictive mean matching (PMM) on top of the prediction-based missing value imputation. PMM tries to raise the variance in the resulting conditional distributions to a realistic level. The closest k predicted values are identified by a fast k-nearest neighbour approach wrapped in the package missRanger. The returned value corresponds to the mean value, as sketched below.
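To make the idea concrete, here is a toy base R sketch of the PMM step. This is not grur's or missRanger's internal code; missRanger uses an optimized KNN search.

# Toy illustration of predictive mean matching (PMM) with k neighbors.
pmm_impute <- function(pred_missing, pred_observed, obs_values, k = 2) {
  sapply(pred_missing, function(p) {
    # indices of the k observed predictions closest to the prediction
    # made for the missing genotype
    nn <- order(abs(pred_observed - p))[seq_len(k)]
    # returned value corresponds to the mean of the k candidates
    mean(obs_values[nn])
  })
}

# Example: predictions for observed genotypes and their true values
pred_obs <- c(0.1, 0.4, 0.9, 1.6, 1.9)
obs_gt   <- c(0,   0,   1,   2,   2)
pmm_impute(pred_missing = c(0.3, 1.8), pred_observed = pred_obs,
           obs_values = obs_gt, k = 2)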
haplotype/SNP approach:
The haplotype approach is automatically used when marker meta-information is detected (chromosome/CHROM, locus/ID and SNP/POS columns, usually from a VCF file). Imputation of missing genotypes from SNPs on the same locus or the same RADseq read is undertaken simultaneously to account for the correlation of the linked SNPs. When one or more SNPs on the same read/haplotype are missing, the haplotype is deleted; consequently, imputation might result in different genotypes for those SNPs that were not missing. This approach is much safer than potentially creating weird chimeras during haplotype imputations. Alternatively, a SNP approach is used, and the SNPs are considered independent. Imputation of genotypes is then conducted for each marker separately.
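For illustration, a hypothetical tidy data snippet carrying the meta-information that triggers the haplotype approach might look like the sketch below. The column names and genotype coding follow the radiator tidy format and are assumptions here, not a specification.

# SNPs sharing the same CHROM/LOCUS are imputed together (haplotype
# approach); without CHROM/LOCUS/POS, each SNP is imputed independently.
tidy.snippet <- tibble::tibble(
  INDIVIDUALS = c("IND_01", "IND_01", "IND_02", "IND_02"),
  STRATA      = c("POP_A",  "POP_A",  "POP_B",  "POP_B"),
  CHROM       = c("un",     "un",     "un",     "un"),
  LOCUS       = c("2245",   "2245",   "2245",   "2245"),
  POS         = c("45",     "78",     "45",     "78"),
  GT          = c("001001", "002002", "001002", "002002")
)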
Imputing globally or by strata?
The hierarchical.levels = "global" argument will act differently depending on the imputation.method selected.
Strawman imputations (~ max/mean/mode) considerations:
Using imputation.method = "max" with hierarchical.levels = "global" will likely create bias.
Example 1 (unbalanced sample size): Consider 2 populations evolving more by drift than selection: pop1 (n = 36) and pop2 (n = 50). You'll likely have a few polymorphic markers where pop1 and pop2 are monomorphic for different alleles (pop1 is fixed for the minor/ALT allele and pop2 is fixed for the major/REF allele). Missing genotypes in pop1, imputed with the most common filling technique in the literature (mean/mode/max), will result in pop1 having individuals with the REF allele. Not something you want... unless your population membership is not 100% certain (e.g. you might have migrants or wrong assignations), in which case you still don't want to impute with imputation.method = "max" (see alternative below).
Example 2 (balanced sample size): pop1 (n = 100) and pop2 (n = 100). For a particular marker, pop1 has 85 individuals genotyped and pop2 has 100. Again, if the populations are fixed for different alleles (pop1 = ALT and pop2 = REF), you will end up having the REF allele in your pop1. Not something you want... unless your population membership is not 100% certain (e.g. you might have migrants or wrong assignations), in which case you still don't want to impute with imputation.method = "max" (see alternative below). A toy illustration of this bias is shown below.
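The bias described in these examples can be reproduced with a few lines of base R (a toy illustration only, not grur code):

# Toy version of Example 1: pop1 fixed for ALT, pop2 fixed for REF.
# A global mode ("max") imputation fills pop1's missing genotypes with REF.
pop <- c(rep("pop1", 36), rep("pop2", 50))
gt  <- c(rep("ALT/ALT", 36), rep("REF/REF", 50))
gt[1:5] <- NA  # 5 missing genotypes in pop1

global.mode <- names(which.max(table(gt)))                 # "REF/REF" (50 vs 31)
strata.mode <- names(which.max(table(gt[pop == "pop1"])))  # "ALT/ALT"

gt.global <- replace(gt, is.na(gt), global.mode)  # pop1 now carries REF alleles
gt.strata <- replace(gt, is.na(gt), strata.mode)  # stays ALT within pop1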
Random Forests imputations:
Random Forests use machine learning and you can take this into account while choosing argument values. Uncertain of the groupings? Use random forests with hierarchical.levels = "global". Random forests will account for the potential linkage and correlation between markers and genotypes to make the best imputations available. This can potentially result in genotypes for a certain population/marker combination with new groupings (e.g. a new allele). This is much more accurate and not the same thing as imputation.method = "max", because the imputed genotype is validated by considering all the other variables (all the other markers genotyped for the individual). Test the option and report bugs if you find one.
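A hypothetical call for that situation (file names are placeholders):

# Groupings uncertain, so impute globally with random forest and let
# the algorithm exploit marker correlations across the whole dataset.
imputed.global <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  hierarchical.levels = "global"
)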
random forest with on-the-fly-imputation (rf): the technique is described in Tang and Ishwaran (2017). Non-missing genotypes are used for the split-statistics. Daughter node membership assignment uses random non-missing genotypes from the in-bag data. Missing genotypes are imputed at terminal nodes using the maximal class rule with out-of-bag non-missing genotypes. Example of computation time: 1500 individuals, 20 000 markers and 2 strata take around 10 hours on 6 CPUs to complete the imputations.
random forest as a prediction problem (rf_pred): markers with missing genotypes are imputed one at a time. The fitted forest is used to predict the missing genotypes. Missingness in the response variables is incorporated as attributes for growing the forest.
boosting trees: the prediction method is used for both XGBoost and LightGBM.
LightGBM and XGBoost parameter optimization: see the dedicated parameter sections below.
Reference genome or linkage map available? Numerous map-dependent approaches are available and more appropriate; please search the literature (references).
What's the simple imputation message when running the function?
Before conducting the imputations by strata with the machine learning algorithms, the data is first screened for markers that are monomorphic WITHIN the strata/populations. For those cases it's clear what the missing genotypes should be, so the imputation is very simple: missing genotypes are imputed with the only genotype found for that particular stratum/population. There's no need for a fancy method here. The small cost in time is worth it, because the model inside the machine learning algorithms will benefit from having a more complete and reliable genotype matrix. A simplified sketch of this pre-screening is shown below.
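A simplified sketch of that pre-screening step, assuming a tidy data frame tidy.data with STRATA, MARKERS and a character GT column where missing genotypes are coded NA; this is not grur's internal code.

# Markers monomorphic within a stratum have their missing genotypes
# filled with the only genotype observed in that stratum.
library(dplyr)

simple.impute <- tidy.data %>%
  group_by(STRATA, MARKERS) %>%
  mutate(
    n.geno = n_distinct(GT[!is.na(GT)]),
    GT = if_else(is.na(GT) & n.geno == 1, unique(GT[!is.na(GT)])[1], GT)
  ) %>%
  ungroup() %>%
  select(-n.geno)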
dots-dots-dots ... allows passing several arguments for fine-tuning the imputations:
subsample.markers (optional, integer) To speed up computation and rapidly test the function's arguments (e.g. using 200 markers). Default: subsample.markers = NULL.
random.seed (integer, optional) For reproducibility, set an integer that will be used to initialize the random generator. With the default, a random number is generated. Currently not implemented for XGBoost and LightGBM. Default: random.seed = NULL.
pmm (integer, optional) Predictive mean matching used in conjunction with random forest prediction and LightGBM (imputation.method = "rf_pred" or imputation.method = "lightgbm"). Number of candidate non-missing values to sample from during the predictive mean matching step. A fast k-nearest neighbor searching algorithm is used with this approach. pmm = 2 will use 2 neighbors. Default: pmm = 0, which avoids this step.
cpu.boost (optional, integer) Number of cores for XGBoost and LightGBM. For the best speed, set this to the number of real CPU cores, not the number of threads. Most CPUs use hyper-threading to generate 2 threads per CPU core. Be aware that the task manager or any CPU monitoring tool might report cores not being fully utilized; this is normal. Default: cpu.boost = parallel::detectCores() / 2. A hypothetical example combining these fine-tuning arguments is shown below.
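A hypothetical quick test combining these fine-tuning arguments (file names are placeholders):

# Subsample 200 markers, fix the seed and add predictive mean matching
# with 2 neighbors (rf_pred supports pmm).
test.run <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf_pred",
  subsample.markers = 200,
  random.seed = 42,
  pmm = 2
)

# For the boosting methods, cpu.boost controls the number of physical
# cores used by XGBoost/LightGBM.
boost.run <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "lightgbm",
  cpu.boost = 4
)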
eXtreme Gradient Boosting tree method: eta, gamma, max_depth, min_child_weight, subsample, colsample_bytree, num_parallel_tree, nrounds, save_name and early_stopping_rounds. Documented in xgboost.
LightGBM: boosting, objective, learning_rate, feature_fraction, bagging_fraction, bagging_freq, max_depth, min_data_in_leaf, num_leaves, early_stopping_rounds and nrounds. Documented in LightGBM.
iteration.subsample: the number of iterations used to train the model with new subsamples. Default: iteration.subsample = 2.
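A hypothetical LightGBM fine-tuning call; the parameter values below are placeholders only (see the LightGBM documentation for guidance on choosing them):

# Pass LightGBM parameters through ... along with iteration.subsample.
lgbm.tuned <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "lightgbm",
  learning_rate = 0.05,
  num_leaves = 31,
  bagging_fraction = 0.8,
  bagging_freq = 5,
  nrounds = 100,
  iteration.subsample = 2
)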
Random forests methods: num.tree, nodesize, nsplit and nimpute. Documented in randomForestSRC.
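A hypothetical random forest fine-tuning call with the randomForestSRC parameters (placeholder values only):

# Pass random forest parameters through ... for the "rf" method.
rf.tuned <- grur_imputations(
  data = "turtles.tidy.tsv",
  strata = "strata.turtles.tsv",
  imputation.method = "rf",
  num.tree = 100,
  nsplit = 10,
  nimpute = 3
)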
Deprecated arguments:
hierarchical.levels = "populations": update your code with "strata".
imputations.group: now replaced by hierarchical.levels.
impute: no longer available. Imputing with the impute = "allele" option was wrong because it was using F1 genotypes for imputations. Now imputation is only conducted at the genotype level.
iteration.rf: no longer used. This argument is now available inside the ... for on-the-fly-imputations (see details). The default is now set to 10 iterations.
split.number: automatically set.
A sketch of updating a legacy call is shown below.
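A sketch of updating a legacy call (the old argument names are shown as comments; dat is a placeholder object):

# Old (deprecated):
# grur_imputations(data = dat, imputations.group = "populations")
# New:
imputed <- grur_imputations(
  data = dat,
  strata = "strata.tsv",
  imputation.method = "rf",
  hierarchical.levels = "strata"
)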
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. http://arxiv.org/abs/1508.04409.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Chen T, Guestrin C. (2016). XGBoost: A scalable tree boosting system. doi:10.1145/2939672.2939785
Tang F, Ishwaran H. (2017). Random Forest Missing Data Algorithms. arXiv preprint, 1-24.
The package ranger (see Wright and Ziegler, 2016) provides a fast C++ version of the original implementation of rf from Breiman (2001).
The package randomForestSRC (see Tang and Ishwaran, 2017) provides a fast on-the-fly imputation method.
The package XGBoost (Chen and Guestrin, 2016) provides a fast C++ implementation for the extreme gradient tree boosting algorithm.
The package LightGBM provides a fast, distributed, high-performance implementation of gradient tree boosting.
Thierry Gosselin thierrygosselin@icloud.com