Summarize STACKS pstacks files

This function reads the output of STACKS pstacks folder and summarise the information.

The information shown can help to choose individuals to include/exclude from the catalog, the next step in STACKS pipeline (see example).

summary_pstacks(
  pstacks.folder,
  parallel.core = parallel::detectCores() - 1,
  verbose = FALSE
)

Arguments

pstacks.folder: (character). The path to the pstacks output files.
parallel.core: (integer, optional) The number of core used for parallel execution. Default: parallel::detectCores() - 1.
verbose: (logical, optional) Make the function a little more chatty during execution. Default: verbose = FALSE.

Value

The function returns a summary (data frame) containing:

INDIVIDUALS: the sample id
LOCUS_NUMBER: the number of locus
BLACKLIST_PSTACKS: the number of locus blacklisted by pstacks
FOR_CATALOG: the number of locus available to generate the catalog
BLACKLIST_ARTIFACT: the number of artifact genotypes (> 2 alleles, see details)
FILTERED: the number of locus after artifacts are removed
HOMOZYGOSITY: the number of homozygous genotypes
HETEROZYGOSITY: the number of heterozygous genotypes
MEAN_NUMBER_SNP_LOCUS: the mean number of SNP/locus (excluding the artifact locus)
MAX_NUMBER_SNP_LOCUS: the max number of SNP/locus observed for this individual (excluding the artifact locus)
NUMBER_LOCUS_4SNP: the number of locus with 4 or more SNP/locus (excluding the artifact locus)

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Examples

if (FALSE) { # \dontrun{
pstacks.summary <- stackr::summary_pstacks(
pstacks.folder = "output_pstacks", verbose = TRUE)

# To get the number of snp/locus for a specific individual:
snp.info <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, SNP_LOCUS) %>%
tidyr::unnest(.)

#Similarly, for the blacklisted locus (artifactual):
blacklist <- pstacks.summary %>%
dplyr::filter(INDIVIDUALS == "ID2") %>%
dplyr::select(INDIVIDUALS, BLACKLIST_ARTIFACT) %>%
tidyr::unnest(.)

# Decide individuals to include in catalog:

# Make the pstacks.summary dataframe population wise by including pop info.
# For this we need a strata file ... it's similar to a population map, but with headers.
# It's a 2 columns files with INDIVIDUALS and STRATA as column header.
strata <- readr::read_tsv("strata.octopus.tsv", col_types = "cc") %>%
dplyr::rename(POP_ID = STRATA)

# join the dataframes
individuals.catalog <- dplyr::left_join(pstacks.summary, strata, by = "INDIVIDUALS") %>%
dplyr::arrange(desc(FILTERED))

# look at the FILTERED column
hist(individuals.catalog$FILTERED)
boxplot(individuals.catalog$FILTERED)
mean(individuals.catalog$FILTERED)

# lets say we want to filter out individuals below 45000 markers
individuals.catalog.filtered <- individuals.catalog %>%
dplyr::filter(FILTERED > 45000)

# number of ind per pop
dplyr::select(individuals.catalog.filtered, INDIVIDUALS, POP_ID) %>%
dplyr::group_by(POP_ID) %>% dplyr::tally(.)

# choose 20% of individuals per pop to represent the catalog

individuals.catalog.select <- individuals.catalog.filtered %>%
dplyr::group_by(POP_ID) %>%
dplyr::sample_frac(tbl = ., size = 0.2, replace = FALSE) %>%
dplyr::ungroup(.) %>%
dplyr::arrange(desc(FILTERED)) %>%
dplyr::distinct(INDIVIDUALS, POP_ID)

# Using desc (from large to lower number of markers) ensure that
# the catalog is populated with samples with the highest number of loci first.
# This speed up the process in cstacks!


# Create a file with individuals and pop to use in
# cstacks or run_cstacks

readr::write_tsv(
x = individuals.catalog.select,
file = "population.map.octopus.samples.catalog.tsv",
col_names = FALSE) # stacks doesn't want column header
} # }

Arguments

Value

References

See also

Examples