run_gstacks(
denovo = TRUE,
P = "06_ustacks_2_gstacks",
M = "02_project_info/population.map.tsv2bam.tsv",
I = NULL,
B = NULL,
O = NULL,
parallel.core = parallel::detectCores() - 1,
model = "marukilow",
var.alpha = 0.01,
gt.alpha = 0.05,
rm.pcr.duplicates = FALSE,
rm.unpaired.reads = FALSE,
ignore.pe.reads = TRUE,
unpaired = TRUE,
kmer.length = 31,
max.debruijn.reads = 1000,
min.kmer.cov = 2,
write.alignments = FALSE,
min.mapq = 10,
max.clipped = 0.2,
max.insert.len = 1000,
details = FALSE,
phasing.cooccurrences.thr.range = c(1, 2),
phasing.dont.prune.hets = FALSE,
h = FALSE
)
(logical) de novo vs reference-based mode.
Default: denovo = TRUE
.
(path, character) De novo mode.
Path to the directory containing STACKS files.
Default: P = "06_ustacks_2_gstacks"
.
Inside the folder, you should have:
the catalog files: starting with batch_
and ending with
.alleles.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam
;
5 files for each samples: The sample name is the prefix for
the files ending with:
.alleles.tsv.gz, .models.tsv.gz, .snps.tsv.gz, .tags.tsv.gz and .bam
.
Those files are created in the
ustacks,
sstacks,
cxstacks,
tsv2bam
modules
.
(path, character) Path to a population map giving the list of samples.
(character, path) Reference-based mode.
Input directory containing BAM files.
Default: I = NULL
.
(character, path) Reference-based mode. Path to input BAM files.
The input BAM file(s) must be sorted by coordinate.
With B
, records must be assigned to samples using BAM "reads groups"
(gstacks uses the ID/identifier and SM/sample name fields). Read groups
must be consistent if repeated different files. With I
,
read groups are unneeded and ignored.
Please refer to the gstacks manual page for information. stacks also provide
information about how to
generate such a BAM file with Samtools or Sambamba, and examples.
Default: B = NULL
.
(character, path) Path to output directory. stackr
by default
will output in the input directory.
Default: O = NULL
(same as input).
(integer) Enable parallel execution with the number of threads.
Default: parallel.core = parallel::detectCores() - 1
.
(character) The model to use to call variants and genotypes;
one of "marukilow"
, "marukihigh"
, or "snp"
.
See ref for more details on algorithms.
Default: model = "marukilow"
.
(double) Alpha threshold for discovering SNPs.
Default: var.alpha = 0.01
.
(double) Alpha threshold for calling genotypes.
Default: gt.alpha = 0.05
.
(logical) remove all but one set of read pairs of
the same sample that have the same insert length
(implies rm.unpaired.reads = TRUE
.).
Default: rm.pcr.duplicates = FALSE
.
(logical) Discard unpaired reads.
Default: rm.unpaired.reads = FALSE
.
(logical) With default the function will
ignore paired-end reads even if present in the input.
Default: ignore.pe.reads = TRUE
.
(logical) Ignore read pairing (only for paired-end GBS;
treat READ2's as if they were READ1's).
Default: unpaired = TRUE
.
(integer) De novo mode.
kmer length for the de Bruijn graph. For expert.
Default: kmer.length = 31
.
(integer) Maximum number of reads to use in the
de Bruijn graph. For expert.
Default: max.debruijn.reads = 1000
.
(integer) De novo mode.
Minimum coverage to consider a kmer. For expert.
Default: min.kmer.cov = 2
.
(logical) Save read alignments (heavy BAM files).
Default: write.alignments = FALSE
.
(double) Reference-based mode.
Minimum PHRED-scaled mapping quality to consider a read.
Default: min.mapq = 10
.
(double) Reference-based mode.
Maximum soft-clipping level, in fraction of read length.
Default: max.clipped = 0.20
.
(integer) Reference-based mode.
Maximum allowed sequencing insert length
Default: max.insert.len = 1000
.
(logical) With default the function will write a more detailed
output.
Default: details = FALSE
.
(integer) range of edge coverage thresholds to
iterate over when building the graph of allele cooccurrences for
SNP phasing.
Default: phasing.cooccurrences.thr.range = c(1,2)
.
(logical) Don't try to ignore dubious heterozygote
genotypes during phasing. By default, during phasing, dubious het are ignored.
Default: phasing.dont.prune.hets = FALSE
.
Display this help messsage.
Default: h = FALSE
tsv2bam
returns a set of .matches.bam
files.
The function run_gstacks
returns a list with the number of individuals, the batch ID number,
a summary data frame and a plot containing:
INDIVIDUALS: the sample id
ALL_LOCUS: the total number of locus for the individual (shown in subplot A)
LOCUS: the number of locus with a one-to-one relationship (shown in subplot B) with the catalog
MATCH_PERCENT: the percentage of locus with a one-to-one relationship with the catalog (shown in subplot C)
Addtionally, the function returns a batch_X.catalog.bam file that was generated by merging all the individual BAM files in parallel.
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.
Maruki T, Lynch M (2017) Genotype Calling from Population-Genomic Sequencing Data. G3, 7, 1393-1404.
if (FALSE) { # \dontrun{
# The simplest form of the function with De novo data:
bam.sum <- stackr::run_gstacks() # that's it !
# This will use, by default, the same population map used in run_tsv2bam
} # }