This function reads the output of STACKS pstacks folder and summarise the information.

The information shown can help to choose individuals to include/exclude from the catalog, the next step in STACKS pipeline (see example).

summary_pstacks(
  pstacks.folder,
  parallel.core = parallel::detectCores() - 1,
  verbose = FALSE
)

Arguments

pstacks.folder

(character). The path to the pstacks output files.

parallel.core

(integer, optional) The number of core used for parallel execution. Default: parallel::detectCores() - 1.

verbose

(logical, optional) Make the function a little more chatty during execution. Default: verbose = FALSE.

Value

The function returns a summary (data frame) containing:

  1. INDIVIDUALS: the sample id

  2. LOCUS_NUMBER: the number of locus

  3. BLACKLIST_PSTACKS: the number of locus blacklisted by pstacks

  4. FOR_CATALOG: the number of locus available to generate the catalog

  5. BLACKLIST_ARTIFACT: the number of artifact genotypes (> 2 alleles, see details)

  6. FILTERED: the number of locus after artifacts are removed

  7. HOMOZYGOSITY: the number of homozygous genotypes

  8. HETEROZYGOSITY: the number of heterozygous genotypes

  9. MEAN_NUMBER_SNP_LOCUS: the mean number of SNP/locus (excluding the artifact locus)

  10. MAX_NUMBER_SNP_LOCUS: the max number of SNP/locus observed for this individual (excluding the artifact locus)

  11. NUMBER_LOCUS_4SNP: the number of locus with 4 or more SNP/locus (excluding the artifact locus)

References

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

See also

Examples

if (FALSE) { pstacks.summary <- stackr::summary_pstacks( pstacks.folder = "output_pstacks", verbose = TRUE) # To get the number of snp/locus for a specific individual: snp.info <- pstacks.summary %>% dplyr::filter(INDIVIDUALS == "ID2") %>% dplyr::select(INDIVIDUALS, SNP_LOCUS) %>% tidyr::unnest(.) #Similarly, for the blacklisted locus (artifactual): blacklist <- pstacks.summary %>% dplyr::filter(INDIVIDUALS == "ID2") %>% dplyr::select(INDIVIDUALS, BLACKLIST_ARTIFACT) %>% tidyr::unnest(.) # Decide individuals to include in catalog: # Make the pstacks.summary dataframe population wise by including pop info. # For this we need a strata file ... it's similar to a population map, but with headers. # It's a 2 columns files with INDIVIDUALS and STRATA as column header. strata <- readr::read_tsv("strata.octopus.tsv", col_types = "cc") %>% dplyr::rename(POP_ID = STRATA) # join the dataframes individuals.catalog <- dplyr::left_join(pstacks.summary, strata, by = "INDIVIDUALS") %>% dplyr::arrange(desc(FILTERED)) # look at the FILTERED column hist(individuals.catalog$FILTERED) boxplot(individuals.catalog$FILTERED) mean(individuals.catalog$FILTERED) # lets say we want to filter out individuals below 45000 markers individuals.catalog.filtered <- individuals.catalog %>% dplyr::filter(FILTERED > 45000) # number of ind per pop dplyr::select(individuals.catalog.filtered, INDIVIDUALS, POP_ID) %>% dplyr::group_by(POP_ID) %>% dplyr::tally(.) # choose 20% of individuals per pop to represent the catalog individuals.catalog.select <- individuals.catalog.filtered %>% dplyr::group_by(POP_ID) %>% dplyr::sample_frac(tbl = ., size = 0.2, replace = FALSE) %>% dplyr::ungroup(.) %>% dplyr::arrange(desc(FILTERED)) %>% dplyr::distinct(INDIVIDUALS, POP_ID) # Using desc (from large to lower number of markers) ensure that # the catalog is populated with samples with the highest number of loci first. # This speed up the process in cstacks! # Create a file with individuals and pop to use in # cstacks or run_cstacks readr::write_tsv( x = individuals.catalog.select, path = "population.map.octopus.samples.catalog.tsv", col_names = FALSE) # stacks doesn't want column header }