Run STACKS gstacks module inside R!

run_gstacks(
  denovo = TRUE,
  P = "06_ustacks_2_gstacks",
  M = "02_project_info/population.map.tsv2bam.tsv",
  I = NULL,
  B = NULL,
  O = NULL,
  parallel.core = parallel::detectCores() - 1,
  model = "marukilow",
  var.alpha = 0.01,
  gt.alpha = 0.05,
  rm.pcr.duplicates = FALSE,
  rm.unpaired.reads = FALSE,
  ignore.pe.reads = TRUE,
  unpaired = TRUE,
  kmer.length = 31,
  max.debruijn.reads = 1000,
  min.kmer.cov = 2,
  write.alignments = FALSE,
  min.mapq = 10,
  max.clipped = 0.2,
  max.insert.len = 1000,
  details = FALSE,
  phasing.cooccurrences.thr.range = c(1, 2),
  phasing.dont.prune.hets = FALSE,
  h = FALSE
)

Arguments

denovo

(logical) de novo vs reference-based mode. Default: denovo = TRUE.

P

(path, character) De novo mode. Path to the directory containing STACKS files. Default: P = "06_ustacks_2_gstacks". Inside the folder, you should have:

  • the catalog files: starting with batch_ and ending with .alleles.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam;

  • 5 files for each samples: The sample name is the prefix for the files ending with: .alleles.tsv.gz, .models.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam. Those files are created in the ustacks, sstacks, cxstacks, tsv2bam modules

.

M

(path, character) Path to a population map giving the list of samples.

I

(character, path) Reference-based mode. Input directory containing BAM files. Default: I = NULL.

B

(character, path) Reference-based mode. Path to input BAM files. The input BAM file(s) must be sorted by coordinate. With B, records must be assigned to samples using BAM "reads groups" (gstacks uses the ID/identifier and SM/sample name fields). Read groups must be consistent if repeated different files. With I, read groups are unneeded and ignored. Please refer to the gstacks manual page for information. stacks also provide information about how to generate such a BAM file with Samtools or Sambamba, and examples. Default: B = NULL.

O

(character, path) Path to output directory. stackr by default will output in the input directory. Default: O = NULL (same as input).

parallel.core

(integer) Enable parallel execution with the number of threads. Default: parallel.core = parallel::detectCores() - 1.

model

(character) The model to use to call variants and genotypes; one of "marukilow", "marukihigh", or "snp". See ref for more details on algorithms. Default: model = "marukilow".

var.alpha

(double) Alpha threshold for discovering SNPs. Default: var.alpha = 0.01.

gt.alpha

(double) Alpha threshold for calling genotypes. Default: gt.alpha = 0.05.

rm.pcr.duplicates

(logical) remove all but one set of read pairs of the same sample that have the same insert length (implies rm.unpaired.reads = TRUE.). Default: rm.pcr.duplicates = FALSE.

rm.unpaired.reads

(logical) Discard unpaired reads. Default: rm.unpaired.reads = FALSE.

ignore.pe.reads

(logical) With default the function will ignore paired-end reads even if present in the input. Default: ignore.pe.reads = TRUE.

unpaired

(logical) Ignore read pairing (only for paired-end GBS; treat READ2's as if they were READ1's). Default: unpaired = TRUE.

kmer.length

(integer) De novo mode. kmer length for the de Bruijn graph. For expert. Default: kmer.length = 31.

max.debruijn.reads

(integer) Maximum number of reads to use in the de Bruijn graph. For expert. Default: max.debruijn.reads = 1000.

min.kmer.cov

(integer) De novo mode. Minimum coverage to consider a kmer. For expert. Default: min.kmer.cov = 2.

write.alignments

(logical) Save read alignments (heavy BAM files). Default: write.alignments = FALSE.

min.mapq

(double) Reference-based mode. Minimum PHRED-scaled mapping quality to consider a read. Default: min.mapq = 10.

max.clipped

(double) Reference-based mode. Maximum soft-clipping level, in fraction of read length. Default: max.clipped = 0.20.

max.insert.len

(integer) Reference-based mode. Maximum allowed sequencing insert length Default: max.insert.len = 1000.

details

(logical) With default the function will write a more detailed output. Default: details = FALSE.

phasing.cooccurrences.thr.range

(integer) range of edge coverage thresholds to iterate over when building the graph of allele cooccurrences for SNP phasing. Default: phasing.cooccurrences.thr.range = c(1,2).

phasing.dont.prune.hets

(logical) Don't try to ignore dubious heterozygote genotypes during phasing. By default, during phasing, dubious het are ignored. Default: phasing.dont.prune.hets = FALSE.

h

Display this help messsage. Default: h = FALSE

Value

tsv2bam returns a set of .matches.bam files.

The function run_gstacks returns a list with the number of individuals, the batch ID number, a summary data frame and a plot containing:

  1. INDIVIDUALS: the sample id

  2. ALL_LOCUS: the total number of locus for the individual (shown in subplot A)

  3. LOCUS: the number of locus with a one-to-one relationship (shown in subplot B) with the catalog

  4. MATCH_PERCENT: the percentage of locus with a one-to-one relationship with the catalog (shown in subplot C)

    Addtionally, the function returns a batch_X.catalog.bam file that was generated by merging all the individual BAM files in parallel.

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Maruki T, Lynch M (2017) Genotype Calling from Population-Genomic Sequencing Data. G3, 7, 1393-1404.

See also

Examples

if (FALSE) { # The simplest form of the function with De novo data: bam.sum <- stackr::run_gstacks() # that's it ! # This will use, by default, the same population map used in run_tsv2bam }