The purpose of this vignette is to show users how to use stackr to run the stacks pipeline inside R/RStudio.
Run the stacks modules with functions like `run_ustacks` and stackr will produce tables and figures automatically.

Differences with the stacks command line:

- Parallel execution: stacks modules use `-p`, `-t` or `--threads`, and depending on the module the meaning changes, which can be confusing or generate problems for first-time users. Inside stackr, parallel execution is always requested through the argument `parallel.core`.
- Dashes (`-`): inside stackr these are replaced by function arguments with a point/dot (`.`). Example: the stacks `process_radtags` argument `--adapter-1` becomes the argument `adapter.1` inside stackr's `run_process_radtags`.

The names of stackr functions follow the stacks modules; below, the similarity:

- `process_radtags` → `run_process_radtags`
- `ustacks` → `run_ustacks`
- `cstacks` → `run_cstacks`
- `sstacks` → `run_sstacks`
- `tsv2bam` → `run_tsv2bam`
- `gstacks` → `run_gstacks`
- `populations` → `run_populations`
There are tons of other workflows; below are just a few examples:
Here, I'm going to use RStudio, because I really got to like it and I'm very efficient in it. But it's by no means essential.
Set your working directory inside RStudio using the `setwd` function and the appropriate path, as in the example below.
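A minimal example (the path is hypothetical, point it to your own project folder):

```r
setwd("~/Documents/turtle_project")  # hypothetical path: use your own
```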
Generate the folders for your project
To get the documentation on the function we're about to use: `?stackr::build_stackr_workflow_dir`.
Let’s start a project on turtles…
stackr::build_stackr_workflow_dir(main.folder.name = "turtle")
# stackr::build_stackr_workflow_dir(main.folder.name = "turtle", date = FALSE) #if you don't want the date appended
By default, the date and time are appended to the main folder name, and inside, these folders will be generated:
01_scripts
02_project_info
03_sequencing_lanes
04_process_radtags
05_clustering_mismatches
06_ustacks_2_gstacks
07_populations
08_stacks_results
09_log_files
10_filters
For now let's just say that you put the scripts and other `.R` files associated with this project inside the first folder: `01_scripts`.
Inside RStudio generate a new project inside the directory just created:
You can add an alias/shortcut for the project on your computer desktop.
Inside MS Excel or other software that works with tabular data, prepare your project info file and save it with the file extension .tsv
inside the second folder: 02_project_info
. For example, the file `project.info.turtle.tsv` will contain a minimum of 3 columns for single-end and 4 columns for paired-end sequencing. The number of lines equals the number of samples: 1 sample per line. The project info file is sample oriented.
But before generating and filling the project info file, consider the names of your samples…
You already have, somewhere on your computer, barcodes and sample names. Now is the time to reconsider how you named your samples. This is not trivial: researchers lose a lot of time because of bad naming schemes. REALLY.
My philosophy:
Include useful metadata associated with your sample, separated by a hyphen (`-`). Why a hyphen instead of a point (`.`) or an underscore (`_`)? Imagine you print your file on real paper, or that you have to transcribe the sample names by hand. A hyphen, when printed, generates far less confusion because it is easily distinguished on paper (lined or not). An underscore or a point is easily confused with a space.
Below is an example using this metadata: `SPECIES`, `STRATA` (could be location; I don't know the populations yet), `YEAR` (of sampling), `SEX`, `ID` (with sufficient digits):
STU-BUR-2019-F-0001
STU-BUR-2019-F-0002
STU-BUR-2019-F-0003
STU-BUR-2019-F-0004
STU-BUR-2019-M-0005
STU-BUR-2017-F-0006
STU-GRA-2017-M-0007 # you could start over from 0001 with different strata, it doesn't matter
STU-GRA-2018-M-0008
STU-GRA-2019-M-0009
STU-GRA-2019-F-0010
STU-GRA-2019-F-0011
Special considerations:
Ok now the project info file…
Columns prerequisites (single-end):

- `LANES`: the name of the sequencing file, e.g. `H6CY7BBXY_3_254.fastq.gz`
- `BARCODES`: the barcode associated with the sample.
- `INDIVIDUALS`: a more appealing sample name (more on this below).

Generate an empty file like this:
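A minimal sketch (the file name is just an example, adapt it to your project):

```r
# create an empty project info file with the 3 required columns (single-end)
readr::write_tsv(
  x = tibble::tibble(
    LANES = character(0),
    BARCODES = character(0),
    INDIVIDUALS = character(0)
  ),
  path = "02_project_info/project.info.turtle.tsv"
)
```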
Columns prerequisites (paired-end):

- `BARCODES`: the barcode associated with the sample.
- `INDIVIDUALS`: a more appealing sample name (more on this below).
- `FORWARD`: the name of the forward sequencing file, e.g. `H5CY5BBXY_2_25_1.fastq.gz`
- `REVERSE`: the name of the reverse sequencing file, e.g. `H5CY5BBXY_2_25_2.fastq.gz`

Generate an empty file like this:
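Same idea for paired-end data (file name again just an example):

```r
# create an empty project info file with the 4 required columns (paired-end)
readr::write_tsv(
  x = tibble::tibble(
    BARCODES = character(0),
    INDIVIDUALS = character(0),
    FORWARD = character(0),
    REVERSE = character(0)
  ),
  path = "02_project_info/project.info.turtle.tsv"
)
```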
If you have several Illumina lanes or Ion Torrent chips, just write them down along with the samples and barcodes associated with them, in the same file.
2 options:
stackr will see that you have, for different barcode/lane combos, IDs that have the exact same name. What will happen?

- `-1` and `-2` will be added to the sample names.
- During `run_process_radtags`, the replicates will be combined and the new file will have `-R` at the end. The individual replicates are kept separate as well.

Example:
You have a sample, `STU-BUR-2019-F-0001`, in 2 separate wells on the plate, or even on different lanes, with different barcodes. At the end of `run_process_radtags` you'll have:
STU-BUR-2019-F-0001-1
STU-BUR-2019-F-0001-2
STU-BUR-2019-F-0001-R (for the combined replicates)

Transfer the sequencing files inside the folder: `03_sequencing_lanes`. Contrary to the pipeline I used to contribute to and follow, don't modify the files: stackr, like stacks, handles the different types.
I highly recommend running the sequencing files (entire lane or chip) through software dedicated to QC. If you have bad sequencing, there's no need to go further in the pipeline…
Reducing the noise in an individual FQ file could really help speed up your analysis. I use it when I have more than 200 samples. Nothing new here:
I prefer Daniel Ilut's way of visualizing the information.
The function stackr::read_depth_plot
requires 1 FQ file (after stacks process_radtags or stackr::run_process_radtags
) and is very fast to run. It will highlight read coverage groups of distinct reads within a sample. When I say distinct reads, I mean reads that are exactly the same, with no mismatches. The distinct reads are putative alleles.
4 read coverage groups are shown:
distinct reads with low coverage (in red): these reads are likely sequencing errors or uninformative polymorphisms (shared only by a few samples).
distinct reads at the target coverage (in green).
distinct reads with high coverage > 1 read depth (in yellow): those are legitimate alleles with high coverage.
distinct and unique reads with high coverage (in orange): this is where you’ll find paralogous sequences, transposable elements and other high copy number variants.
To run it on your data:
read.plot <- stackr::read_depth_plot(
fq.file = "04_process_radtags/my-fq.fq.gz",
min.coverage.fig = 7L, #default
parallel.core = parallel::detectCores() - 1 #default
)
The function will generate the plot shown above and write the figure in the working directory, using the sample name: my-fq_hap_read_depth.png
.
Should I run the function on all samples?

Please don't. There is no use in running this on the entire dataset; it's going to look fairly similar. Run it on a couple of samples. Based on the distribution of the number of reads, choose 1 sample below average, 1 average and 1 above average, and compare the figures produced (see the sketch below).
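A minimal sketch of that selection, assuming the updated project info file has the `RETAINED` column (number of retained reads) used later in this vignette and a `FQ_FILES` column with the file names (adjust the column and file names to your project):

```r
# a sketch: pick the samples with the lowest, middle and highest read counts
info <- readr::read_tsv("02_project_info/project.info.turtle.tsv")
info <- dplyr::arrange(info, RETAINED)
picks <- info$FQ_FILES[c(1L, ceiling(nrow(info) / 2), nrow(info))]
# run read_depth_plot on each of the 3 samples and compare the figures:
purrr::walk(
  .x = file.path("04_process_radtags", picks),
  .f = stackr::read_depth_plot
)
```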
Removing the noise in FQ files is usually a hot topic… Personally, the less I do the better, and that's definitely the case for small projects: I don't think the noise is a problem for stacks. For large projects, it's something else, and certainly a huge bottleneck. When you generate a de novo assembly for more than 200 samples and build a catalog with all the noise, cleaning the data can represent weeks of computation time shaved off your project… really.
You don’t have to believe what I just wrote, test it yourself:
1. Clean the FQ file
Take the sample you just used for the read coverage groups above and clean it with `stackr::clean_fq`. It's very fast to run:
clean.fq.name <- stackr::clean_fq(
fq.file = "04_process_radtags/my-fq.fq.gz",
min.coverage.threshold = 2,
remove.unique.reads = TRUE, #default
write.blacklist.fasta = TRUE,
write.blacklist = FALSE, #default
parallel.core = parallel::detectCores() - 1, #default
verbose = TRUE #default
)
This will generate:
my-fq_cleaned.fq.gz # the cleaned fq file
my-fq_blacklisted_reads.fasta.gz # a fasta file with blacklisted sequences...
2. de novo assembly
Run stacks ustacks or even better, stay in stackr and do the next step below: the mismatch threshold series (using stackr::run_ustacks
). For this, use the raw and clean FQ files and compare the output…
In all my projects, having less reads (the cleaned FQ file) means:

- a similar optimal mismatch threshold (`-M` in stacks ustacks) for the cleaned or the raw FQ files.
- a faster `stackr::run_cstacks` step.

Why keep the blacklisted sequences (the fasta file)?
You could use the `Biostrings::countPattern` function to look for TEs in those sequences ;) A hedged sketch is shown below.
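This is a minimal sketch, not stackr code: the motif is a made-up example, and `vcountPattern()` (the vectorized version of `countPattern()`) is used because the fasta file holds many reads.

```r
# screen blacklisted reads for a (hypothetical) TE-like motif
library(Biostrings)
blacklisted <- readDNAStringSet("04_process_radtags/my-fq_blacklisted_reads.fasta.gz")
hits <- vcountPattern("TTAACCCTAACCCTAA", blacklisted, max.mismatch = 1)
sum(hits > 0)  # number of blacklisted reads containing the motif
```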
3. Filtering the entire dataset
If you see a big difference and are convinced that filtering is the way to go, you can easily do it with this piece of code:
fq.to.clean <- list.files(path = "04_process_radtags", pattern = "fq.gz$", full.names = TRUE) # full paths so clean_fq can find the files
names(fq.to.clean) <- basename(fq.to.clean)
clean.all <- purrr::map_df(
.x = fq.to.clean,
.f = stackr::clean_fq,
min.coverage.threshold = 5,
remove.unique.reads = TRUE, #default
write.blacklist.fasta = TRUE,
write.blacklist = FALSE, #default
parallel.core = 12
)
Transfer the cleaned files or the raw ones into a separate folder/sub-folder to keep things tidy, e.g. with the sketch below.
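A small sketch (the sub-folder name is just a suggestion):

```r
# move the cleaned FQ files into their own sub-folder
dir.create("04_process_radtags/cleaned", showWarnings = FALSE)
cleaned <- list.files("04_process_radtags", pattern = "_cleaned.fq.gz$", full.names = TRUE)
file.rename(cleaned, file.path("04_process_radtags/cleaned", basename(cleaned)))
```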
Another hot topic, up for debate…
After the stacks step called process_radtags, in stackr run_process_radtags
, looking at the distribution of the number of reads per sample, you might be getting one of these distributions:
In an ideal world, you want the uniform distribution. That translates into good field and wet-lab techniques, and it results in similar sequencing information across your samples: you're about to compare apples with apples. :)
If you have a skewed (left or right) or a normal distribution, you'll be comparing apples with… citrus. The samples with fewer reads will have fewer loci and usually fewer heterozygous loci. Similarly, the samples with a higher number of reads will have more loci and more heterozygous loci.
What's the likely cause for these?

Ok, but what's the problem with that? Potentially these problems:

Solutions? Here are the basic steps:
I'm not a fan: when doing it right from the start is impossible, I prefer to do the normalization after the demultiplexing step… after `stackr::run_process_radtags`. The prerequisite is to have enough sequencing material (reads) for all my samples.
The tool? The `stackr::normalize_reads` function.
Rarefaction of fastq files by sub-sampling the reads before de novo assembly or alignment is not new:
The steps? Read the function documentation and look at the example:
# To run this function, Bioconductor's ShortRead package is necessary:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ShortRead")
# Using OpenMP threads
nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
# using defaults:
stackr::normalize_reads(path.samples = "~/corals")
# customizing the function:
stackr::normalize_reads(
project.info = "project.info.corals.tsv",
path.samples = "04_process_radtags",
sample.reads = 2000000,
number.replicates = 3,
random.seed = 3,
parallel.core = 12
)
My observations using rarefaction:

- a similar optimal mismatch threshold (`-M` in stacks ustacks) for the rarefied samples.

It doesn't bother you to throw away all the valuable information that samples with high read numbers have?
Absolutely not!
This is the de novo assembly step in stacks, the module is called ustacks.
Function documentation: ?stackr::run_ustacks
(read it twice).
Difference with stacks ustacks?

- The `-i` argument (the unique integer ID for each sample) is managed automatically.
- A summary of the ustacks output is produced by `stackr::run_ustacks` automatically ;) see `?summary_ustacks`.

If you don't know the optimal number of mismatches between your individual reads, this part is the most important step of de novo assembly.
When using paired-end data, use forward reads for the analysis.
Select one sample to do the test. For this, you could use a sample with average file size or use the stats produced directly by the output of run_process_radtags
and available inside the updated project info file saved by the function.
info.stats <- readr::read_tsv(file = "02_project_info/project.info.20191121@1606.tsv") %>%
  dplyr::filter(REPLICATES == "1")
m <- mean(info.stats$RETAINED)
s <- stats::sd(info.stats$RETAINED)
read.min <- m - s
read.max <- m + s
sample.select <- info.stats %>%
dplyr::filter(RETAINED >= read.min & RETAINED <= read.max) %>%
dplyr::sample_n(size = 1)
sample.select$FQ_FILES_F
# [1] "BET-LAM-60044-1682200-1.1.fq.gz"
Make a copy of the file inside the folder `05_clustering_mismatches`:
file.copy(
from = file.path("04_process_radtags", sample.select$FQ_FILES_F),
to = file.path("05_clustering_mismatches/", sample.select$FQ_FILES_F)
)
Run the mismatch clustering series:
mismatch <- stackr::run_ustacks(
mismatch.testing = TRUE,
f = "05_clustering_mismatches",
M = 1:5,
model.type = "bounded")
This will generate 5 folders inside 05_clustering_mismatches
that will have the mismatch number and the associated ustacks output files in them.
A figure similar to this one:
To understand this figure, I suggest you read these 2 excellent papers:
Ilut D, Nydam M, Hare M (2014). Defining Loci in Restriction-Based Reduced Representation Genomic Data from Non-model Species: Sources of Bias and Diagnostics for Optimal Clustering. BioMed Research International 2014: 1-9. https://dx.doi.org/10.1155/2014/675158
Harvey M, Judy C, Seeholzer G, Maley J (2015). Similarity thresholds used in short read assembly reduce the comparability of population histories across species. PeerJ 3: e895. https://dx.doi.org/10.7717/peerj.895
The function also produces a tibble, `mismatches.summary.tsv`, with several columns and rows summarizing the ustacks output logs for the different mismatch values.
This is the data used to make the figure above.
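To inspect it yourself (a minimal sketch; adjust the path if the file was written inside `05_clustering_mismatches`):

```r
# read the mismatch summary back and take a quick look
mismatches.summary <- readr::read_tsv("mismatches.summary.tsv")
dplyr::glimpse(mismatches.summary)
```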
Now that you have decided the mismatch threshold, corresponding to the M
argument inside stackr::run_ustacks
, run the function on all samples.
Depending on project size, you might want to look at the alternative to ustacks and cstacks: run_radproc.
u <- stackr::run_ustacks(
project.info = "02_project_info/project.info.20191121@1606.tsv",
M = 3,
model.type = "bounded",
parallel.core = 12
)
# careful with the defaults...
When using paired-end data, the function uses the forward reads for the assembly. The reverse reads are used later in the pipeline.
Output:
The folder 06_ustacks_2_gstacks
will be populated with 3 files per sample, ending with:
alleles.tsv.gz
tags.tsv.gz
snps.tsv.gz
These files are documented in the stacks documentation.
The folder 08_stacks_results
will have a summary file: summary_ustacks_date@time.tsv
.
The folder 09_log_files
has the associated individual log files.
This is the catalog, or pseudo reference genome, for the de novo part of stacks; the module is called cstacks. It's probably the longest part of the pipeline for large sample-size projects. It's also the part where a crashing computer or cluster has the biggest drawbacks and tests your bioinformatic patience.
Depending on project size, you might want to look at the alternative to ustacks and cstacks: run_radproc.
Function documentation: ?stackr::run_cstacks
(read it twice).
Difference with stacks cstacks?

- If the run is interrupted, you don't have to start from scratch. How? `cstacks` starts again with the previous catalog and builds a new one with the new samples next on the list.

For now, I highly discourage subsampling to build the catalog: please use all the samples in the project.
Start by generating a population map:
pop.map.catalog <- u$project.info %>%
dplyr::distinct(INDIVIDUALS_REP) %>%
dplyr::mutate(POP_ID = 1L) %>%
readr::write_tsv(
x = .,
path = "02_project_info/population.map.turtle.catalog.tsv",
col_names = FALSE # stacks doesn't want column headers
)
Build the catalog:
c <- stackr::run_cstacks(
P = "06_ustacks_2_gstacks",
M = "02_project_info/population.map.turtle.catalog.tsv",
n = 3,
parallel.core = 32
)
Who’s in the catalog?
Use the function `stackr::extract_catalog_sql_id` to verify the SQL IDs of the samples used for the catalog construction.
ids <- stackr::extract_catalog_sql_id(x = "06_ustacks_2_gstacks/catalog.tags.tsv.gz")
You have tried the noise reduction step and/or the normalization step and it's still way too long?

The RADProc approach developed by Praveen Nadukkalam Ravindran (Ravindran et al. 2019) is a viable alternative, and it's compatible with stacks: you replace the de novo and catalog steps by RADProc. For a project I did, it was a question of 15 days to run stacks ustacks/cstacks versus 1.5 days to run RADProc + the rest of the stacks pipeline… no kidding!
Limitations
The current version works with stacks 2 (sstacks and after).
macOS installation instructions:
cd Downloads
curl -O ftp://ftp.gnu.org/gnu/m4/m4-1.4.18.tar.gz
tar -xzvf m4*
cd m4*
./configure ac_cv_type_struct_sched_param=yes && make -j4 && sudo make install
cd ..
sudo rm -R m*
sudo chmod -R 777 RADProc-master
cd RADProc-master/Install/
autoreconf --install
sh ./configure
make && sudo make install
Read the documentation…
To run in command line
RADProc -t "gzfastq" -f "04_process_radtags" -o "06_ustacks_2_gstacks" -M 4 -m 3 -n 4 -p 8 -x 3 -S 80 -D 7 2>&1 | tee radproc_date@time.log
To run within stackr
r <- stackr::run_radproc(M = 3, m = 4, n = 3)
Function documentation: ?stackr::run_radproc
(read it twice).
The output of that command:

- catalog files: `catalog.tags.tsv`, `catalog.snps.tsv`, `catalog.alleles.tsv`
- individual sample files ending with: `tags.tsv`, `snps.tsv`, `alleles.tsv`

That's the sstacks part in stacks. It's where the output from ustacks (the individual files) is matched against the catalog.
Function documentation: ?stackr::run_sstacks
(read it twice).
Difference with stacks sstacks?
s <- stackr::run_sstacks(parallel.core = 12)
That’s the tsv2bam part in stacks. The tsv2bam program will transpose data so that it is oriented by locus, instead of by sample. This allows downstream improvements (e.g. Bayesian SNP calling).
If paired-end reads are available, the tsv2bam
program will pull in the set of paired-end reads that are associated with each single-end locus that was assembled de novo.
Difference with stacks tsv2bam?

- `SAMtools`: no need to install and run it separately, it's all integrated inside `run_tsv2bam`.
- `run_tsv2bam` uses `SAMtools` to merge, in parallel, the BAM sample files into a unique BAM catalog file.

Function documentation: `?stackr::run_tsv2bam`
(read it twice).
# example with paired-end data:
bam.sum <- stackr::run_tsv2bam(
M = "02_project_info/population.map.turtle.catalog.tsv",
R = "04_process_radtags/",
parallel.core = 12
)
For some reason, if you used paired-end data, this step won't work for now on macOS. It's not a stackr problem, it's a stacks problem. Please use Linux… If you don't have access to a Linux computer, try using the cloud for this step (see the documentation).
Function documentation: ?stackr::run_gstacks
(read it twice).
g <- stackr::run_gstacks(
M = "02_project_info/population.map.turtle.catalog.tsv",
parallel.core = 12
)
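# with paired-end data, you can also remove PCR duplicates and unpaired reads: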
g <- stackr::run_gstacks(
M = "02_project_info/population.map.turtle.catalog.tsv",
rm.pcr.duplicates = TRUE,
rm.unpaired.reads = TRUE,
unpaired = FALSE,
ignore.pe.reads = FALSE,
parallel.core = 12
)
For this step, I like to keep it simple and get a VCF file. Filters are best handled in my package radiator.
Function documentation: ?stackr::run_populations
(read it twice).
p <- stackr::run_populations(
M = "02_project_info/population.map.turtle.catalog.tsv",
verbose = TRUE,
r = 0.1,
R = 0.1,
p = 1,
parallel.core = 1)
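The VCF written by populations can then be filtered in radiator. A hedged sketch, assuming radiator's `filter_rad` function, a hypothetical strata file (a tsv with `INDIVIDUALS` and `STRATA` columns), and that the VCF landed in `07_populations` (check `?radiator::filter_rad` and your output paths):

```r
# a sketch, not from this vignette: interactive filtering of the populations VCF
filtered <- radiator::filter_rad(
  data = "07_populations/populations.snps.vcf",  # adjust to where run_populations wrote it
  strata = "02_project_info/strata.turtle.tsv"   # hypothetical strata file
)
```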
Tool: my package radiator
Tools for visualization and imputations: grur
Tool: assigner
Ilut D, Nydam M, Hare M (2014). Defining Loci in Restriction-Based Reduced Representation Genomic Data from Non-model Species: Sources of Bias and Diagnostics for Optimal Clustering. BioMed Research International 2014: 1-9. https://dx.doi.org/10.1155/2014/675158

Rochette N, Rivera-Colón A, Catchen J (2019). Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. Molecular Ecology. https://dx.doi.org/10.1111/mec.15253

Harvey M, Judy C, Seeholzer G, Maley J (2015). Similarity thresholds used in short read assembly reduce the comparability of population histories across species. PeerJ 3: e895. https://dx.doi.org/10.7717/peerj.895

Ravindran P, Bentzen P, Bradbury I, Beiko R (2019). RADProc: A computationally efficient de novo locus assembler for population studies using RADseq data. Molecular Ecology Resources 19(1): 272-282. https://dx.doi.org/10.1111/1755-0998.12954

Schloss PD, et al. (2009). Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 75(23): 7537-7541.

Sanders HL (1968). Marine benthic diversity: A comparative study. The American Naturalist 102: 243-282.

Hale et al. (2009), on the relative merits of normalization and rarefaction in gene discovery in sturgeons. BMC Genomics 10: 203.

Kalinowski ST (2004). Counting Alleles with Rarefaction. Conservation Genetics 5: 539-543.

Puncher et al. (2018). Molecular Ecology Resources, 44, 678.