This function reads the output of STACKS pstacks folder and summarise the information.
The information shown can help to choose individuals to include/exclude from the catalog, the next step in STACKS pipeline (see example).
summary_pstacks( pstacks.folder, parallel.core = parallel::detectCores() - 1, verbose = FALSE )
pstacks.folder | (character). The path to the pstacks output files. |
---|---|
parallel.core | (integer, optional) The number of core used for parallel
execution.
Default: |
verbose | (logical, optional) Make the function a little more chatty during
execution.
Default: |
The function returns a summary (data frame) containing:
INDIVIDUALS: the sample id
LOCUS_NUMBER: the number of locus
BLACKLIST_PSTACKS: the number of locus blacklisted by pstacks
FOR_CATALOG: the number of locus available to generate the catalog
BLACKLIST_ARTIFACT: the number of artifact genotypes (> 2 alleles, see details)
FILTERED: the number of locus after artifacts are removed
HOMOZYGOSITY: the number of homozygous genotypes
HETEROZYGOSITY: the number of heterozygous genotypes
MEAN_NUMBER_SNP_LOCUS: the mean number of SNP/locus (excluding the artifact locus)
MAX_NUMBER_SNP_LOCUS: the max number of SNP/locus observed for this individual (excluding the artifact locus)
NUMBER_LOCUS_4SNP: the number of locus with 4 or more SNP/locus (excluding the artifact locus)
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.
if (FALSE) { pstacks.summary <- stackr::summary_pstacks( pstacks.folder = "output_pstacks", verbose = TRUE) # To get the number of snp/locus for a specific individual: snp.info <- pstacks.summary %>% dplyr::filter(INDIVIDUALS == "ID2") %>% dplyr::select(INDIVIDUALS, SNP_LOCUS) %>% tidyr::unnest(.) #Similarly, for the blacklisted locus (artifactual): blacklist <- pstacks.summary %>% dplyr::filter(INDIVIDUALS == "ID2") %>% dplyr::select(INDIVIDUALS, BLACKLIST_ARTIFACT) %>% tidyr::unnest(.) # Decide individuals to include in catalog: # Make the pstacks.summary dataframe population wise by including pop info. # For this we need a strata file ... it's similar to a population map, but with headers. # It's a 2 columns files with INDIVIDUALS and STRATA as column header. strata <- readr::read_tsv("strata.octopus.tsv", col_types = "cc") %>% dplyr::rename(POP_ID = STRATA) # join the dataframes individuals.catalog <- dplyr::left_join(pstacks.summary, strata, by = "INDIVIDUALS") %>% dplyr::arrange(desc(FILTERED)) # look at the FILTERED column hist(individuals.catalog$FILTERED) boxplot(individuals.catalog$FILTERED) mean(individuals.catalog$FILTERED) # lets say we want to filter out individuals below 45000 markers individuals.catalog.filtered <- individuals.catalog %>% dplyr::filter(FILTERED > 45000) # number of ind per pop dplyr::select(individuals.catalog.filtered, INDIVIDUALS, POP_ID) %>% dplyr::group_by(POP_ID) %>% dplyr::tally(.) # choose 20% of individuals per pop to represent the catalog individuals.catalog.select <- individuals.catalog.filtered %>% dplyr::group_by(POP_ID) %>% dplyr::sample_frac(tbl = ., size = 0.2, replace = FALSE) %>% dplyr::ungroup(.) %>% dplyr::arrange(desc(FILTERED)) %>% dplyr::distinct(INDIVIDUALS, POP_ID) # Using desc (from large to lower number of markers) ensure that # the catalog is populated with samples with the highest number of loci first. # This speed up the process in cstacks! # Create a file with individuals and pop to use in # cstacks or run_cstacks readr::write_tsv( x = individuals.catalog.select, path = "population.map.octopus.samples.catalog.tsv", col_names = FALSE) # stacks doesn't want column header }