vignettes/vignette_missing_data_analysis.Rmd
vignette_missing_data_analysis.Rmd
Learn how to visualize missing genotypes in your genomic dataset with the function grur::missing_visualization
(time = 15 min).
The function missing_visualization
in grur uses various genomic input files and conduct identity-by-missingness analyses (IBM) using Principal Coordinates Analysis (PCoA), also called Multidimensional Scaling (MDS) and RDA (Redundancy Analysis) to highlight missing data patterns. Figures and summary tables of missing information at the marker, individual and population level are generated. Below, the simplest form of the function to get results. More options are available, please see the function documentation.
Clean your desk
Follow the instruction to install grur
Load the required libraries:
Set your working directory, e.g.:
setwd("~/Documents/test_missing_visualization_vignette")
Note: running codes in chunks inside R Notebook might cause problem, run it outside in the console (the default here).
Dataset: in this example, we use the data in Ferchaud and Hansen (2015 and 2016) paper. The code below gets the vcf from Dryad directly. But you can skip the step if it’s already in the folder.
writeBin(httr::content(httr::GET("http://datadryad.org/bitstream/handle/10255/dryad.97237/sticklebacks_Danish.vcf?sequence=1"), "raw"), "stickleback_ferchaud_2015.vcf")
With a vcf you also need a strata file (indicating population groupings)
INDIVIDUALS
and STRATA
.STRATA
column and the remaining columns can be any hierarchical groupings you like.??radiator::read_strata
).population map file
. Just make sure you have the required column names: INDIVIDUALS
and STRATA
.to download the strata for this example: strata link
This is the simplest way to run the function:
ibm <- grur::missing_visualization( data = "stickleback_ferchaud_2015.vcf", strata = "strata.stickleback.tsv")
The function does a few automatic filters: * Monomorphic markers are removed * Only common markers between strata are kept for the analysis * Individuals and markers statistics are generated automatically
A new object ibm
was created in your global environment. It’s a list and to view it’s content use:
names(ibm)
Lots of info in there… Lets focus on just a few. A folder is also created automatically. The function generates by default a large object (list):
ibm
and written to the working directory (defaults from 10% to 70%, that automatically stops if no individuals at the threshold is found, can be changed with ind.missing.geno.threshold
argument).$
.To view the IBM-PCoA plot made with POP_ID grouping:
ibm$ibm.plots$ibm.strata.POP_ID
The dark green bubble from KIB it’s an individual with almost all of his genotypes missing. This one skip the radar of the authors ;)
The heatmap showing missingness:
heatmap <- ibm$heatmap heatmap
The vertical black line highlight the problem in the vcf with the individual missing almost all it’s genotypes.
View the table with summary of missing genotypes per individuals:
table.ind <- ibm$missing.genotypes.ind table.ind
To view the distribution
ibm$missing.genotypes.ind.plots
ibm$missing.genotypes.ind.histo
Show the helper figure showing how many individuals could potentially be blacklisted based on % on genotypes.
ibm$ind.genotyped.helper.plot
All these figures are combined in the folder…
To view the distribution of missingness per markers
ibm$missing.genotypes.markers.combined.plots
Other figures are created, explore the list of objects and folder. Read the doc.
To view the distribution of FH and missing genotypes per individuals
FH : is an identity by descent genomic (IBDg) measure based on the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating).
To learn more about it see (Keller et al., 2011; Kardos et al., 2015; Hedrick & Garcia-Dorado, 2016)
ibm$missing.genotypes.ind.fh.combined.plots
This is weird figure is caused by the outlier individual. To remove this individual, re-run missing_visualization
with the argument blacklist.id
and one of the several blacklists written to the working directory (e.g. blacklist.id.missing.70.tsv
).
Explore the rest by yourself!
Do you see patterns in your plots that provides insight about the relationships that missing values might have with other variables (inspired from r4ds).
If you see a pattern, ask yourself:
Is the pattern due to coincidence (i.e. random chance)?
Could you describe the relationship in the pattern ?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?
Do you think the pattern observed in the data could impact the clustering analysis?
strata.select
to select columns from the strata file to generate PCoA-IBM plots. If you have several columns to test e.g. library, sequencer, sequencing lanes, sampling sites, populations, use inside a string to get all of them!distance.method
to explore other distance metric used by the function dist
radiator::filter_rad
.whitelist.markers
.Ferchaud A, Hansen MM (2016) The impact of selection, gene flow and demographic history on heterogeneous genomic divergence: threespine sticklebacks in divergent environments. Molecular Ecology 25(1): 238–259. http://dx.doi.org/10.1111/mec.13399
Ferchaud A, Hansen MM (2015) Data from: The impact of selection, gene flow and demographic history on heterogeneous genomic divergence: threespine sticklebacks in divergent environments. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.kp11q
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Purcell S, Neale B, Todd-Brown K et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.
Keller MC, Visscher PM, Goddard ME (2011) Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics, 189, 237–249.
Kardos M, Luikart G, Allendorf FW (2015) Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity, 115, 63–72.
Hedrick PW, Garcia-Dorado A. (2016) Understanding Inbreeding Depression, Purging, and Genetic Rescue. Trends in Ecology and Evolution. 2016;31: 940-952. doi:10.1016/j.tree.2016.09.005