The arguments in the assignment_ngs
function were tailored for the
reality of RADseq data for assignment analysis while
maintaining a reproducible workflow.
Assignment assumptions are listed in the section below.
Input file: various file format are supported (see
data
argument below).Cross-Validations: Markers can be randomly selected for a classic LOO (Leave-One-Out) assignment or chosen based on ranked Fst for a thl (Training, Holdout, Leave-one-out) assignment analysis.
Assignment analysis: conducted in gsi_sim, a tool for doing and simulating genetic stock identification and developed by Eric C. Anderson, or adegenet, an R package developed by Thibaul Jombart.
Parallel: The assignment can be conduncted on multiple CPUs. The R GUI is unstable with this functions, I recommend using RStudio.
Usage
assignment_ngs(
data,
strata = NULL,
pop.levels = NULL,
assignment.analysis = c("gsim_sim", "adegenet"),
markers.sampling = c("ranked", "random"),
marker.number = "all",
thl = 1,
iteration.method = 10,
subsample = NULL,
iteration.subsample = 1,
verbose = TRUE,
parallel.core = parallel::detectCores() - 1,
...
)
Arguments
- data
Several input format are accepted. assigner uses radiator
tidy_genomic_data
module to import the data. See function documentation for more details.- strata
(optional) The strata file is a tab delimited file with a minimum of 2 columns headers:
INDIVIDUALS
andSTRATA
. Documented inread_strata
. DArT data: a third columnTARGET_ID
is required. Documented onread_dart
. Also use the strata read function to blacklist individuals. Default:strata = NULL
.- pop.levels
(optional, string) This refers to the levels in a factor. In this case, the id of the pop. Use this argument to have the pop ordered your way instead of the default alphabetical or numerical order. e.g.
pop.levels = c("QUE", "ONT", "ALB")
instead of the defaultpop.levels = c("ALB", "ONT", "QUE")
. White spaces in population names are replaced by underscore. Default:pop.levels = NULL
.- assignment.analysis
(character) Assignment analysis conducted with
assignment.analysis = "gsi_sim"
orassignment.analysis = "adegenet"
. See Details section below for installing gsi_sim.- markers.sampling
(character) 2 options for markers selection:
markers.sampling == "random"
the subset of markers are selected randomly, this is the classic Leave-One-Out (LOO) assignment.markers.sampling == "ranked"
the subset of markers are first ranked based on an overall decreasing Fst values. The Fst is computed withfst_WC84
function, that uses a fast implementation of Weir and Cockerham 1984 Fst/Theta equations. This selection method is used during Training-Holdout-Leave One Out (thl) assignment. How many markers are selected is controlled with argumentthl
.
- marker.number
(Integer or string of number or "all") The assignment analysis can use all your markers (default) or a subset of your markers. e.g. To test 500, 1000, 2000 and all the markers:
marker.number = c(500, 1000, 2000, "all")
. To use only 500 makersmarker.number = 500
. How those markers are sampled is determined with the argumentmarkers.sampling
, next. Default =marker.number = "all"
.- thl
(character, integer, proportion) For
markers.sampling = "ranked"
only. Several options are available:thl = 1
: Only 1 individual sample is used as holdout. This individual is not participating in the markers ranking. For each marker number, the analysis will be repeated with all the indiviuals in the data set (e.g. 500 individuals, 500 times 500, 1000, 2000 markers). This is the default.proportion
: e.g.thl = 0.15
, 15 percent of the individuals, in each strata, are chosen randomly as holdout individuals.thl = "all"
: all individuals are used for ranking (not good) and the argumentiteration.method = 1
is set by default. This is for testing only.
Different lists of holdout individuals are generated when the argument
iteration.method
(bootstrap) is used.- iteration.method
(integer) With random markers selection method, the iterations argument = the number of iterations to repeat marker resampling. Default:
iteration.method = 10
.With
marker.number = c(500, 1000)
and default iterations setting, 500 markers will be randomly chosen 10 times and 1000 markers will be randomly chosen 10 times.Notes: If all the markers are used, with
marker.number = "all"
or in a series of marker number groupingsmarker.number = c(200, 500, "all")
, the number of iteration is automatically set to 1. The remaining groupings are unaffected.With ranked makers selection method, using
thl = 1
, the analysis will be repeated for each individuals in the data set for everymarker.number
selected. With a proportion argumentthl = 0.15
, 15 percent of individuals in each populations are chosen randomly as holdout individuals and this process is reapeated the number of times chosen by theiteration.method
value.- subsample
(Integer or Character, optional) This argument subsample individuals. With
subsample = 36
, 36 individuals in each populations are chosen randomly to represent the dataset. This integer as to be smaller than the population with min sample size, if higher, the minimum sample size found in the data will be used as default. In doubt, usesubsample = "min"
, the function will use the smallest population sample size found in the data. The number of times this process is repeated is controlled by the argumentiteration.subsample
. Default:subsample = NULL
(no subsampling).- iteration.subsample
(Integer) The number of iterations to repeat subsampling of individuals. With
subsample = 20
anditeration.subsample = 10
, 20 individuals/populations will be randomly chosen 10 times. Default:iteration.subsample = 1
.- verbose
(optional, logical) When
verbose = TRUE
the function is a little more chatty during execution. Default:verbose = TRUE
.- parallel.core
(optional) The number of core used for parallel execution during import. Default:
parallel.core = parallel::detectCores() - 1
.- ...
(optional) To pass further argument for fine-tuning the function (see advanced section below).
Value
Depending on arguments selected, several folders and files are written:
01_radiator_tidy_genomic
this is the result of importing the data with radiator import module tidy_genomic_data.assigner_assignment_ngs_args_date@time.tsv
: For reproducibility, the function call, arguments and values used along the default arguments.assignment.plot.pdf
: The assignment figure.assignment.results.summary.stats.tsv
: tibble of summarized assignment statistics.assignment.ranked.results.summary.stats.all.subsamples.tsv
: When subsampling is used, this file contains the raw results of all subsample before summarizing.
THL: Training, Holdout, Leave-one-out:
Intermediate files are written, you can inspect specific iterations and/or subsample:
assignment.ranked.results.iterations.raw.tsv
: tibble with raw intermediate results of all iterations.assignment.ranked.results.iterations.summary.tsv
: tibble with intermediate summary over iterations.holdout.individuals.tsv
: tibble with the holdout individuals and associated iteration and random seed number.
LOO: Leave-one-out:
Intermediate files are written, you can inspect specific iterations and/or subsample:
assignment.random.results.iterations.raw.tsv
: tibble with raw intermediate results of all iterations.markers.random.tsv
: tibble with the random markers selected for each iteration with associated random seed number.
Folders
The names have the different iterations i
starting with assignment_
i contains:
assignment_
i.tsv
: the assignment result, for the iteration.fst.ranked_
i.tsv
: for THL method, the ranked Fst per markers, for the iteration.gsi_sim_seeds
: thegsi_sim
random seeds when this program is used, for the iteration.
The output in your global environment is a list. To view the assignment results
$assignment
to view the ggplot2 figure $assignment.plot
.
See example below.
Details
Using gsi_sim:
assignment_ngs
assumes that the command line version of
gsi_sim
is properly installed into file.path(system.file(package = "assigner"), "bin", "gsi_sim")
.
Things are set up so that it will try running gsi_sim, and if it does not find it, the
program will throw an error and ask the user to run install_gsi_sim
which will do its best to put a usable copy of gsi_sim where it is needed.
To do so, you must be connected to the internet. If that doesn't work, you will
need to compile the program yourself, or get it yourself, and the manually copy
it to file.path(system.file(package = "assigner"), "bin", "gsi_sim")
.
To compile gsi_sim, follow the
instruction here: https://github.com/eriqande/gsi_sim.
Advance mode
Ideally, forget about this section. For advance users, through dots-dots-dots ... you can pass several arguments for fine-tuning the function:
adegenet.dapc.opt
(optional, character) Argument available only when using:assignment.analysis = "adegenet"
withmarkers.sampling == "random"
.The assignment using dapc can use the optimized alpha score
adegenet.dapc.opt == "optim.a.score"
or cross-validationadegenet.dapc.opt == "xval"
for stability of group membership probabilities. For fine tuning the trade-off between power of discrimination and over-fitting. See adegenet documentation for more details.adegenet.dapc.opt == "xval"
doesn't work with missing data, so it's only available with full dataset or imputed dataset. With non imputed data or the default:adegenet.dapc.opt == "optim.a.score"
.adegenet.n.rep
: (optional, integer) Whenadegenet.dapc.opt == "xval"
, the number of replicates to be carried out at each level of PC retention. Default:adegenet.n.rep = 30
. See adegenet documentation for more details.adegenet.training
: (optional, numeric) Whenadegenet.dapc.opt == "xval"
, the proportion of data (individuals) to be used for the training set. Default:adegenet.training = 0.9
, if all groups have >= 10 members. Otherwise, training.set scales automatically to the largest proportion that still ensures all groups will be present in both training and validation sets. See adegenet documentation for more details.folder
: (optional) The name of the folder created in the working directory to save the files/results. Default:folder = NULL
will create the folder for you with data and time.filename
: (optional) The name of the file written to the directory. Use the extension ".txt" at the end. Several info will be appended to the name of the file. Defaultassignment_data.txt
.keep.gsi.files
: (logical, optional) With the default, the intermediate input and output gsi_sim files will be deleted from the directory when finished processing. I you decide to keep the files, withkeep.gsi.files = TRUE
, remember to allocate a large chunk of the disk space for the analysis. Default:keep.gsi.files = FALSE
random.seed
: (integer, optional) For reproducibility, set an integer that will be used inside function that requires randomness. With default, a random number is generated and printed in the appropriate output. Default:random.seed = NULL
.full.y.range
: (logical, optional) By default the y axis print min and max values are determied automatically from the data. It might be more usefull to see the full range of the scale from 0 to 100, in this case usefull.y.range = TRUE
. This can also be modified after running the analysis. See example below. Default:full.y.range = FALSE
.
Assumptions
Individuals QC: Bad individual QC will bias the assignment results.
remove duplicates samples: when found within the same strata, duplicates generate a false population signal, when they are found between strata (yes, I've seen it), it's generating noise and the core population signal is diluted.
remove individual with outlier heterozygosity: unchecked, outlier individuals based on heterozygosity will generate false population signal when the sample as lower heterozygosity and match against several strata (week assignment) when the sample as higher heterozygosity.
remove individuals with too many missing: these individuals will exacerbate or dilute the signal, depending on correlation with heterozygosity and presence of pattern of missingness.
Markers QC: Bad markers QC will bias the assignment results.
low MAC: improper Minor Allele Count filtering generate noise. The LOO and THL methods, both removes samples during model construction, if MAC is too low, the population core signature is greatly impacted at each iteration.
Linkage disequilibrium: remove highly linked markers.
HWE: remove markers in very strong Hardy-Weinberg disequilibrium likely artefactual and/or genotyping errors.
Strata: bad K selection will result in poor assingment results.
filtered data: Don't expect to filter your data here. radiator was designed for this, and
filter_rad
will handle the issues mentioned above. By default, the function will only remove monomorphic markers and keep markers in common between strata.
Life cycle
Map-independent imputation of missing genotype is avaible in my other R package called grur.
Use grur to :
Visualize your missing data: before imputing your genotypes, visualize your missing data. Several visual tools are available inside grur to help you decide the best strategy after.
Optimize: use grur imputation module and other functions to optimize the imputations of your dataset. You need to test arguments. Failing to conduct tests and adjust imputations arguments will generate artifacts and/or exacerbate bias. Using defaults is not optional here...
genomic_converter: use the output argument inside grur imputation module to generate the required formats for assigner (e.g. a tidy dataset)
Deprecated arguments:
sampling.method
: renamedmarkers.sampling
.
References
Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.
Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.
Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.
Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010:11: 94. doi:10.1186/1471-2156-11-94
Jombart T, Ahmed I. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011:27: 3070–3071. doi:10.1093/bioinformatics/btr521
Author
Thierry Gosselin thierrygosselin@icloud.com and Eric C. Anderson
Examples
if (FALSE) { # \dontrun{
assignment.treefrog <- assignment_ngs(
data = "batch_1.vcf",
strata = "strata.treefrog.tsv",
assignment.analysis = "gsi_sim",
marker.number = c(500, 5000, "all"),
markers.sampling = "ranked",
thl = 0.3
)
# To create a dataframe with the assignment results:
assignment <- assignment.treefrog$assignment
# To plot the assignment using ggplot2 and facet
fig <- assignment.treefrog$assignment.plot
# To view the full range of y values = Assignment success(%):
fig + ggplot2::scale_y_continuous(limits = c(0,100))
# Or use the ... argument: full.y.range = TRUE
} # }