Rarefaction of reads samples — normalize

Rarefaction of fastq files by sub-sampling the reads before de novo assembly or alignment. This normalization (standardization, or sample-size correction) step lets you check whether some statistics increase with the number of reads (e.g. the number of heterozygous markers). It is an easy way to disentangle biological signal from artifacts caused by varying read numbers across samples.
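
To make the idea of sub-sampling concrete, here is a minimal base-R sketch of drawing a fixed number of reads from a fastq file held in memory. It is not the code used by normalize_reads: the helper subsample_fastq and the toy reads are hypothetical, and the sketch assumes the usual 4-lines-per-record fastq layout.

```r
# Hypothetical helper, not part of stackr: sub-sample reads from fastq lines.
# Assumes the standard 4-lines-per-record layout (header, sequence, "+", quality).
subsample_fastq <- function(fq.lines, sample.reads, random.seed = NULL) {
  stopifnot(length(fq.lines) %% 4L == 0L)
  n.reads <- length(fq.lines) %/% 4L
  if (!is.null(random.seed)) set.seed(random.seed)
  # Draw reads without replacement; keep everything if the file is already small.
  keep <- sort(sample.int(n.reads, size = min(sample.reads, n.reads)))
  # Expand each selected read index to its 4 line positions.
  idx <- rep((keep - 1L) * 4L, each = 4L) + rep(1:4, times = length(keep))
  fq.lines[idx]
}

# Toy input: 5 made-up reads, interleaved into fastq line order.
toy.fq <- as.vector(rbind(
  paste0("@read", 1:5),
  c("ACGT", "TTGA", "GGCC", "ATAT", "CGCG"),
  "+",
  "IIII"
))

sub.fq <- subsample_fastq(toy.fq, sample.reads = 3, random.seed = 42)
length(sub.fq) / 4  # 3 reads kept
```

Sampling without replacement means every sub-sampled file holds distinct reads from the original, so samples are compared at the same sequencing depth.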

normalize_reads(
  project.info = NULL,
  fq.files,
  sample.reads = 1e+06,
  number.replicates = 5,
  random.seed = NULL,
  parallel.core = parallel::detectCores() - 1
)

Arguments

project.info: (character, path, optional) When using the stackr pipeline, a project info file is created. This file provides all the info and stats generated by stacks and stackr. The project info file will be updated with the new samples, and _normalize will be appended to its filename. The file should end with .tsv. If no project.info file is provided, the function has to count the reads in the fastq files itself, which takes longer. Default: project.info = NULL.
fq.files: (character, path) Path of folder containing the samples to normalize.
sample.reads: (integer) The number of reads to pick randomly. Default: sample.reads = 1000000.
number.replicates: (integer) The number of replicate samples to generate for each original sample. With the default, if 20 samples are in the folder, 100 new samples will be generated. Default: number.replicates = 5.
random.seed: (integer, optional) For reproducibility, set an integer that will be used inside function that requires randomness. With default, a random number is generated and printed in the appropriate output. Default: random.seed = NULL.
parallel.core: (optional) The number of cores to use for parallel processing. The samples to normalize are processed sequentially, and the replicates of each sample are generated in parallel. By default, parallel.core = parallel::detectCores() - 1. This number is adjusted automatically to the number of replicates.
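
The interplay of number.replicates and random.seed can be sketched as follows. This is an illustration only, not the function's internal code: draw_replicates is a hypothetical helper, and the read counts are made up. It reproduces the 20-sample arithmetic given above and shows that a fixed seed yields the same random draws on every run.

```r
# Illustrative only: how many replicate files the settings imply, and how a
# fixed seed makes the random draws repeatable. All names are hypothetical.
n.samples <- 20L
number.replicates <- 5L
n.new.samples <- n.samples * number.replicates  # 100 new samples

draw_replicates <- function(n.reads, sample.reads, number.replicates, random.seed) {
  set.seed(random.seed)
  # One independent draw (without replacement) of read indices per replicate.
  lapply(seq_len(number.replicates), function(i) sample.int(n.reads, sample.reads))
}

run1 <- draw_replicates(n.reads = 1000, sample.reads = 100,
                        number.replicates = 5, random.seed = 3)
run2 <- draw_replicates(n.reads = 1000, sample.reads = 100,
                        number.replicates = 5, random.seed = 3)
identical(run1, run2)  # TRUE: same seed, same reads selected
```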

Value

fastq files with "-1", "-2", "..." appended to the original name. If a project info file was provided, the info for the new replicate samples is integrated into the file. The modified project info file will have _normalize appended to the original filename.

Examples

if (FALSE) { # \dontrun{
library(stackr)
# To run this function, the Bioconductor package ShortRead is required:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ShortRead")
# Using OpenMP threads
nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
# using defaults:
stackr::normalize_reads(fq.files = "~/corals")

# customizing the function:
stackr::normalize_reads(
   project.info = "project.info.corals.tsv",
   fq.files = "~/corals",
   sample.reads = 2000000,
   number.replicates = 5,
   random.seed = 3,
   parallel.core = 5)

# You then need to run the stackr functions: run_ustacks, run_sstacks, run_tsv2bam, run_gstacks, run_populations
# or the equivalent steps if you have a reference genome.
} # }