Assignment analysis tailored for RADseq data

The arguments in the assignment_ngs function were tailored for the reality of RADseq data for assignment analysis while maintaining a reproducible workflow. Assignment assumptions are listed in the section below.

Input file: various file format are supported (see data argument below).
Cross-Validations: Markers can be randomly selected for a classic LOO (Leave-One-Out) assignment or chosen based on ranked Fst for a thl (Training, Holdout, Leave-one-out) assignment analysis.
Assignment analysis: conducted in gsi_sim, a tool for doing and simulating genetic stock identification and developed by Eric C. Anderson, or adegenet, an R package developed by Thibaul Jombart.
Parallel: The assignment can be conduncted on multiple CPUs. The R GUI is unstable with this functions, I recommend using RStudio.

Usage

assignment_ngs(
  data,
  strata = NULL,
  pop.levels = NULL,
  assignment.analysis = c("gsim_sim", "adegenet"),
  markers.sampling = c("ranked", "random"),
  marker.number = "all",
  thl = 1,
  iteration.method = 10,
  subsample = NULL,
  iteration.subsample = 1,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

Several input format are accepted. assigner uses radiator tidy_genomic_data module to import the data. See function documentation for more details.

strata

(optional) The strata file is a tab delimited file with a minimum of 2 columns headers: INDIVIDUALS and STRATA. Documented in read_strata. DArT data: a third column TARGET_ID is required. Documented on read_dart. Also use the strata read function to blacklist individuals. Default: strata = NULL.

pop.levels

(optional, string) This refers to the levels in a factor. In this case, the id of the pop. Use this argument to have the pop ordered your way instead of the default alphabetical or numerical order. e.g. pop.levels = c("QUE", "ONT", "ALB") instead of the default pop.levels = c("ALB", "ONT", "QUE"). White spaces in population names are replaced by underscore. Default: pop.levels = NULL.

assignment.analysis

(character) Assignment analysis conducted with assignment.analysis = "gsi_sim" or assignment.analysis = "adegenet". See Details section below for installing gsi_sim.

markers.sampling

(character) 2 options for markers selection:

markers.sampling == "random" the subset of markers are selected randomly, this is the classic Leave-One-Out (LOO) assignment.
markers.sampling == "ranked" the subset of markers are first ranked based on an overall decreasing Fst values. The Fst is computed with fst_WC84 function, that uses a fast implementation of Weir and Cockerham 1984 Fst/Theta equations. This selection method is used during Training-Holdout-Leave One Out (thl) assignment. How many markers are selected is controlled with argument thl.

marker.number

(Integer or string of number or "all") The assignment analysis can use all your markers (default) or a subset of your markers. e.g. To test 500, 1000, 2000 and all the markers: marker.number = c(500, 1000, 2000, "all"). To use only 500 makers marker.number = 500. How those markers are sampled is determined with the argument markers.sampling, next. Default = marker.number = "all".

thl

(character, integer, proportion) For markers.sampling = "ranked" only. Several options are available:

thl = 1: Only 1 individual sample is used as holdout. This individual is not participating in the markers ranking. For each marker number, the analysis will be repeated with all the indiviuals in the data set (e.g. 500 individuals, 500 times 500, 1000, 2000 markers). This is the default.
proportion: e.g. thl = 0.15, 15 percent of the individuals, in each strata, are chosen randomly as holdout individuals.
thl = "all": all individuals are used for ranking (not good) and the argument iteration.method = 1 is set by default. This is for testing only.

Different lists of holdout individuals are generated when the argument iteration.method (bootstrap) is used.

iteration.method

(integer) With random markers selection method, the iterations argument = the number of iterations to repeat marker resampling. Default: iteration.method = 10.

With marker.number = c(500, 1000) and default iterations setting, 500 markers will be randomly chosen 10 times and 1000 markers will be randomly chosen 10 times.

Notes: If all the markers are used, with marker.number = "all" or in a series of marker number groupings marker.number = c(200, 500, "all"), the number of iteration is automatically set to 1. The remaining groupings are unaffected.

With ranked makers selection method, using thl = 1, the analysis will be repeated for each individuals in the data set for every marker.number selected. With a proportion argument thl = 0.15, 15 percent of individuals in each populations are chosen randomly as holdout individuals and this process is reapeated the number of times chosen by the iteration.method value.

subsample

(Integer or Character, optional) This argument subsample individuals. With subsample = 36, 36 individuals in each populations are chosen randomly to represent the dataset. This integer as to be smaller than the population with min sample size, if higher, the minimum sample size found in the data will be used as default. In doubt, use subsample = "min", the function will use the smallest population sample size found in the data. The number of times this process is repeated is controlled by the argument iteration.subsample. Default: subsample = NULL (no subsampling).

iteration.subsample

(Integer) The number of iterations to repeat subsampling of individuals. With subsample = 20 and iteration.subsample = 10, 20 individuals/populations will be randomly chosen 10 times. Default: iteration.subsample = 1.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

parallel.core

(optional) The number of core used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

...

(optional) To pass further argument for fine-tuning the function (see advanced section below).

Value

Depending on arguments selected, several folders and files are written:

01_radiator_tidy_genomic this is the result of importing the data with radiator import module tidy_genomic_data.
assigner_assignment_ngs_args_date@time.tsv: For reproducibility, the function call, arguments and values used along the default arguments.
assignment.plot.pdf: The assignment figure.
assignment.results.summary.stats.tsv: tibble of summarized assignment statistics.
assignment.ranked.results.summary.stats.all.subsamples.tsv: When subsampling is used, this file contains the raw results of all subsample before summarizing.

THL: Training, Holdout, Leave-one-out:

Intermediate files are written, you can inspect specific iterations and/or subsample:

assignment.ranked.results.iterations.raw.tsv: tibble with raw intermediate results of all iterations.
assignment.ranked.results.iterations.summary.tsv: tibble with intermediate summary over iterations.
holdout.individuals.tsv: tibble with the holdout individuals and associated iteration and random seed number.

LOO: Leave-one-out:

Intermediate files are written, you can inspect specific iterations and/or subsample:

assignment.random.results.iterations.raw.tsv: tibble with raw intermediate results of all iterations.
markers.random.tsv: tibble with the random markers selected for each iteration with associated random seed number.

Folders

The names have the different iterations i starting with assignment_i contains:

assignment_i.tsv: the assignment result, for the iteration.
fst.ranked_i.tsv: for THL method, the ranked Fst per markers, for the iteration.
gsi_sim_seeds: the gsi_sim random seeds when this program is used, for the iteration.

The output in your global environment is a list. To view the assignment results $assignment to view the ggplot2 figure $assignment.plot. See example below.

Details

Using gsi_sim:

assignment_ngs assumes that the command line version of gsi_sim is properly installed into file.path(system.file(package = "assigner"), "bin", "gsi_sim"). Things are set up so that it will try running gsi_sim, and if it does not find it, the program will throw an error and ask the user to run install_gsi_sim which will do its best to put a usable copy of gsi_sim where it is needed. To do so, you must be connected to the internet. If that doesn't work, you will need to compile the program yourself, or get it yourself, and the manually copy it to file.path(system.file(package = "assigner"), "bin", "gsi_sim"). To compile gsi_sim, follow the instruction here: https://github.com/eriqande/gsi_sim.

Advance mode

Ideally, forget about this section. For advance users, through dots-dots-dots ... you can pass several arguments for fine-tuning the function:

adegenet.dapc.opt (optional, character) Argument available only when using: assignment.analysis = "adegenet" with markers.sampling == "random".

The assignment using dapc can use the optimized alpha score adegenet.dapc.opt == "optim.a.score" or cross-validation adegenet.dapc.opt == "xval" for stability of group membership probabilities. For fine tuning the trade-off between power of discrimination and over-fitting. See adegenet documentation for more details. adegenet.dapc.opt == "xval" doesn't work with missing data, so it's only available with full dataset or imputed dataset. With non imputed data or the default: adegenet.dapc.opt == "optim.a.score".
adegenet.n.rep: (optional, integer) When adegenet.dapc.opt == "xval", the number of replicates to be carried out at each level of PC retention. Default: adegenet.n.rep = 30. See adegenet documentation for more details.
adegenet.training: (optional, numeric) When adegenet.dapc.opt == "xval", the proportion of data (individuals) to be used for the training set. Default: adegenet.training = 0.9, if all groups have >= 10 members. Otherwise, training.set scales automatically to the largest proportion that still ensures all groups will be present in both training and validation sets. See adegenet documentation for more details.
folder: (optional) The name of the folder created in the working directory to save the files/results. Default: folder = NULL will create the folder for you with data and time.
filename: (optional) The name of the file written to the directory. Use the extension ".txt" at the end. Several info will be appended to the name of the file. Default assignment_data.txt.
keep.gsi.files: (logical, optional) With the default, the intermediate input and output gsi_sim files will be deleted from the directory when finished processing. I you decide to keep the files, with keep.gsi.files = TRUE, remember to allocate a large chunk of the disk space for the analysis. Default: keep.gsi.files = FALSE
random.seed: (integer, optional) For reproducibility, set an integer that will be used inside function that requires randomness. With default, a random number is generated and printed in the appropriate output. Default: random.seed = NULL.
full.y.range: (logical, optional) By default the y axis print min and max values are determied automatically from the data. It might be more usefull to see the full range of the scale from 0 to 100, in this case use full.y.range = TRUE. This can also be modified after running the analysis. See example below. Default: full.y.range = FALSE.

Assumptions

Individuals QC: Bad individual QC will bias the assignment results.
- remove duplicates samples: when found within the same strata, duplicates generate a false population signal, when they are found between strata (yes, I've seen it), it's generating noise and the core population signal is diluted.
- remove individual with outlier heterozygosity: unchecked, outlier individuals based on heterozygosity will generate false population signal when the sample as lower heterozygosity and match against several strata (week assignment) when the sample as higher heterozygosity.
- remove individuals with too many missing: these individuals will exacerbate or dilute the signal, depending on correlation with heterozygosity and presence of pattern of missingness.
Markers QC: Bad markers QC will bias the assignment results.
- low MAC: improper Minor Allele Count filtering generate noise. The LOO and THL methods, both removes samples during model construction, if MAC is too low, the population core signature is greatly impacted at each iteration.
- Linkage disequilibrium: remove highly linked markers.
- HWE: remove markers in very strong Hardy-Weinberg disequilibrium likely artefactual and/or genotyping errors.
Strata: bad K selection will result in poor assingment results.
filtered data: Don't expect to filter your data here. radiator was designed for this, and filter_rad will handle the issues mentioned above. By default, the function will only remove monomorphic markers and keep markers in common between strata.

Life cycle

Map-independent imputation of missing genotype is avaible in my other R package called grur.

Use grur to :

Visualize your missing data: before imputing your genotypes, visualize your missing data. Several visual tools are available inside grur to help you decide the best strategy after.
Optimize: use grur imputation module and other functions to optimize the imputations of your dataset. You need to test arguments. Failing to conduct tests and adjust imputations arguments will generate artifacts and/or exacerbate bias. Using defaults is not optional here...
genomic_converter: use the output argument inside grur imputation module to generate the required formats for assigner (e.g. a tidy dataset)

Deprecated arguments:

sampling.method: renamed markers.sampling.

References

Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.

Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.

Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.

Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010:11: 94. doi:10.1186/1471-2156-11-94

Jombart T, Ahmed I. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011:27: 3070–3071. doi:10.1093/bioinformatics/btr521

Author

Thierry Gosselin thierrygosselin@icloud.com and Eric C. Anderson

Examples

if (FALSE) { # \dontrun{
assignment.treefrog <- assignment_ngs(
 data = "batch_1.vcf",
 strata = "strata.treefrog.tsv",
 assignment.analysis = "gsi_sim",
 marker.number = c(500, 5000, "all"),
 markers.sampling = "ranked",
 thl = 0.3
 )
 # To create a dataframe with the assignment results:
   assignment <- assignment.treefrog$assignment
 # To plot the assignment using ggplot2 and facet
   fig <- assignment.treefrog$assignment.plot
 # To view the full range of y values = Assignment success(%):
   fig + ggplot2::scale_y_continuous(limits = c(0,100))

 # Or use the ... argument: full.y.range = TRUE
} # }