Read and tidy DArT output files.

Used internally in radiator and might be of interest to users. The function generates a GDS object/file and, optionally, a tidy dataset from DArT files. Read the Details section to understand why it's better than dartR.

read_dart(
  data,
  strata,
  filename = NULL,
  tidy.dart = FALSE,
  calibrate.alleles = TRUE,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1,
  ...
)

Arguments

data

(file) Six file formats used by DArT are recognized by radiator. Don't modify the DArT file; to make changes (e.g. to sample names), use the strata file/argument below. The function can import files ending with .csv or .tsv.

1row: Genotypes are in 1 row and coded (0, 1, 2, -). 0 for 2 reference alleles REF/REF, 1 for 2 alternate alleles ALT/ALT, 2 for heterozygote REF/ALT, - for missing.
2rows: No genotypes. It's absence/presence, 0/1, of the REF and ALT alleles. Sometimes called binary format.
counts: No genotypes. It's counts/read depth for the REF and ALT alleles. Sometimes just called count data. This should be the preferred file format, because DArT outputs the coverage (read depth for each genotype).
silico.dart: SilicoDArT data. No genotypes, no REF or ALT alleles. It's a file coded as absence/presence, 0/1, for the presence of the sequence in the clone id.
silico.dart.counts: SilicoDArT data. No genotypes, no REF or ALT alleles. It's a file coded as absence/presence, with counts for the presence of the sequence in the clone id.
dart.vcf: For DArT VCFs, please use read_vcf.
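The 1row coding above is easy to misread, because code 1 is the ALT homozygote and code 2 is the heterozygote. The following base R sketch shows the mapping from 1row codes to the dosage of the alternate allele (the GT_BIN convention) and to VCF-style genotypes. The helper recode_1row is hypothetical and written for illustration only; it is not radiator's internal code.

```r
# Hypothetical helper, not part of radiator: recode DArT 1row genotype codes.
# 1row coding: "0" = REF/REF, "1" = ALT/ALT, "2" = REF/ALT, "-" = missing.
recode_1row <- function(x) {
  x <- as.character(x)
  # Dosage of the ALT allele: REF/REF = 0, REF/ALT = 1, ALT/ALT = 2.
  alt_dosage <- c("0" = 0L, "1" = 2L, "2" = 1L)
  # Same genotypes written in VCF style.
  vcf_style <- c("0" = "0/0", "1" = "1/1", "2" = "0/1")
  # Codes absent from the lookup tables (e.g. "-") come back as NA.
  gt_bin <- unname(alt_dosage[x])
  gt_vcf <- unname(vcf_style[x])
  gt_vcf[is.na(gt_vcf)] <- "./."
  data.frame(
    DART_1ROW = x,
    GT_BIN = gt_bin,
    GT_VCF = gt_vcf,
    stringsAsFactors = FALSE
  )
}

codes <- c("0", "1", "2", "-")
recoded <- recode_1row(codes)
print(recoded)
```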

If you encounter a problem, send me your data so that I can update the function.

strata

A tab-delimited file or object with a minimum of 3 column headers:

TARGET_ID: generated by DArT, it's made of integers. For 1row and 2rows the TARGET_ID is usually the sample name submitted to DArT.

If this is new to you, take a look at the function extract_dart_target_id; it's easier than opening your DArT file in MS Excel. It extracts the TARGET_ID values from your DArT file, which you can then use in MS Excel to build the remaining columns.
INDIVIDUALS: This is the time and place to name your samples correctly.
STRATA: refers to any grouping of individuals. Usually, your sampling sites or populations. Keep it simple, 3 or 4 letters, like TAS for Tasmania, etc.

Silico DArT data is currently used to detect sex markers, so the STRATA column should be filled with sex information: M or F.
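As an alternative to a spreadsheet, the strata file can be assembled in base R. The sketch below builds the three required columns and writes them as a tab-delimited file. The target ids, sample names and file name are made up for illustration; in practice the TARGET_ID values come from your own DArT file.

```r
# Hypothetical TARGET_ID values; in practice take them from your DArT file.
target_id <- c("1001", "1002", "1003", "1004")

strata <- data.frame(
  TARGET_ID = target_id,
  # Made-up sample names and short population codes.
  INDIVIDUALS = c("TAS-001", "TAS-002", "VIC-001", "VIC-002"),
  STRATA = c("TAS", "TAS", "VIC", "VIC"),
  stringsAsFactors = FALSE
)

# Basic sanity checks: ids and sample names must be unique.
stopifnot(anyDuplicated(strata$TARGET_ID) == 0)
stopifnot(anyDuplicated(strata$INDIVIDUALS) == 0)

# Write a tab-delimited file (here in a temporary folder).
strata_file <- file.path(tempdir(), "my_strata.tsv")
write.table(strata, strata_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the headers survived the round trip.
strata_check <- read.delim(strata_file, colClasses = "character")
print(strata_check)
```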

filename

(optional) The function uses write.fst to write the tidy data frame to the working directory. The file extension appended to the filename provided is .rad. With the default, filename = NULL, the tidy data frame is kept in the global environment only (i.e. not written to the working directory).

tidy.dart

(logical, optional) Generate a tidy dataset. Default: tidy.dart = FALSE.

calibrate.alleles

(optional, logical) Default: calibrate.alleles = TRUE. Documented in calibrate_alleles.

verbose

(optional, logical) When verbose = TRUE the function is a little more chatty during execution. Default: verbose = TRUE.

parallel.core

(optional) The number of cores used for parallel execution during import. Default: parallel.core = parallel::detectCores() - 1.

...

(optional) To pass further arguments for fine-tuning the function.

Value

A radiator GDS file and a tidy data frame whose columns depend on the DArT file type.

silico.dart: A tibble with 5 columns: CLONE_ID, SEQUENCE, VALUE, INDIVIDUALS, STRATA. This object is also saved in the directory (file ending with .rad).

Common to 1row, 2rows and counts: A GDS file is automatically generated. To have a tidy tibble, the argument tidy.dart = TRUE must be used.

VARIANT_ID: generated by radiator; an integer identifying each marker.
MARKERS: generated by radiator; corresponds to CHROM + LOCUS + POS separated by 2 underscores.
CHROM: the chromosome info, for de novo: CHROM_1.
LOCUS: the locus info.
POS: the SNP id on the LOCUS.
COL: the position of the SNP on the short read.
REF: the reference allele.
ALT: the alternate allele.
INDIVIDUALS: the sample name.
STRATA/POP_ID: population id of the sample.
GT_BIN: the genotype based on the number of alternate alleles in the genotype (the count/dosage of the alternate allele). 0, 1, 2, NA.
REP_AVG: the reproducibility average, output specific of DArT.
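To make the VARIANT_ID and MARKERS columns concrete, here is a small base R sketch that assembles them from CHROM, LOCUS and POS. It assumes that "2 underscores" means the two-character separator `__`, and the locus and position values are invented. This is an illustration, not radiator's own code.

```r
# Invented marker metadata; CHROM_1 is the placeholder used for de novo data.
markers_meta <- data.frame(
  CHROM = c("CHROM_1", "CHROM_1"),
  LOCUS = c("100045", "100046"),
  POS = c("12", "33"),
  stringsAsFactors = FALSE
)

# MARKERS: CHROM + LOCUS + POS joined with "__" (assumed separator).
markers_meta$MARKERS <- paste(
  markers_meta$CHROM, markers_meta$LOCUS, markers_meta$POS,
  sep = "__"
)

# VARIANT_ID: one integer per marker, in row order.
markers_meta$VARIANT_ID <- seq_len(nrow(markers_meta))

print(markers_meta)
```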

Other columns potentially in the tidy tibble:

GT: the genotype in 6 digit format à la genepop.
GT_VCF: the genotype in VCF format: 0/0, 0/1, 1/1, ./.
GT_VCF_NUC: the genotype in VCF format, but keeping the nucleotide information. A/A, A/T, T/T, ./.
AVG_COUNT_REF: the coverage for the reference allele, output specific of DArT.
AVG_COUNT_SNP: the coverage for the alternate allele, output specific of DArT.
READ_DEPTH: the number of reads used for the genotype (count data).
ALLELE_REF_DEPTH: the number of reads of the reference allele (count data).
ALLELE_ALT_DEPTH: the number of reads of the alternate allele (count data).

Written in the working directory:

The radiator GDS file
The DArT metadata information
The tidy DArT data
The strata file associated with this tidy dataset
The allele dictionary is a tibble with columns: MARKERS, CHROM, LOCUS, POS, REF, ALT.

Details

More details on what is happening under the hood when you import the DArT file in R:

The DArT file is imported
- DArT files should not be modified.
- A lot of import problems originate from file modifications, so a couple of common checks are done.
- The format (1row, 2rows, counts, silico) is interpreted.
- The number of target ids is checked.
The strata file is imported:
- This is the file that needs modifications.
- This is where you change bad sample names.
- Remove target ids (blacklist samples that you no longer want).
- Change the order of the populations/sampling sites.
- Alternatively, you can use the pop.levels argument (see dots-dots-dots ... in the advanced mode section below).
The target ids between the DArT and strata files are verified and the files are merged.
The data is inspected for duplicated names.
DArT changed column names in their files over the years, so we tidy things:
- colnames in camelcase are changed to snakecase
- ALLELE_SEQUENCE is changed to SEQUENCE
- TRIMMED_SEQUENCE is changed to SEQUENCE
- CLUSTER_CONSENSUS_SEQUENCE is changed to SEQUENCE
- Genomic metadata are named and/or renamed based on the Variant Call Format Specification: CHROM, LOCUS, POS, COL, REF, ALT
With this function you have the option to tidy the DArT file:
- What is tidy data? See the explanation in R for Data Science.
- It takes longer and you need more memory, but if you can afford it, it's better for inspection and visualisation.
- Or you can wait until the data is filtered and then generate a tidy dataset with tidy_genomic_data.
REF and ALT alleles are re-calibrated with calibrate_alleles:
- This is not optional
- It takes longer than just reading the file like other software and packages, but it's better.
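Two of the steps above, comparing target ids between the DArT and strata files and harmonising column names, can be sketched in base R. Everything below is illustrative: the ids are made up, the helper tidy_dart_colnames is hypothetical, and upper-case snake case is assumed because the column names listed in this page are upper case. It is not radiator's internal implementation.

```r
# Made-up target ids from a DArT file header and from a strata file.
dart_ids <- c("1001", "1002", "1003")
strata_ids <- c("1001", "1002", "1004")

# Ids present in one file but not in the other, and duplicated ids.
missing_in_strata <- setdiff(dart_ids, strata_ids)
missing_in_dart <- setdiff(strata_ids, dart_ids)
duplicated_ids <- strata_ids[duplicated(strata_ids)]

# Hypothetical helper: camelCase -> upper-case snake case, then merge the
# three sequence column names into SEQUENCE.
tidy_dart_colnames <- function(x) {
  # Insert "_" between a lower-case letter or digit and the upper-case
  # letter that follows it, then upper-case everything.
  x <- toupper(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
  sequence_names <- c(
    "ALLELE_SEQUENCE", "TRIMMED_SEQUENCE", "CLUSTER_CONSENSUS_SEQUENCE"
  )
  x[x %in% sequence_names] <- "SEQUENCE"
  x
}

# Example column names, used here only for illustration.
dart_cols <- c("CloneID", "AlleleSequence", "RepAvg", "AvgCountRef")
tidy_cols <- tidy_dart_colnames(dart_cols)

print(missing_in_strata)
print(missing_in_dart)
print(tidy_cols)
```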

Advanced mode

dots-dots-dots ... allows passing several arguments for fine-tuning the function:

whitelist.markers: detailed in filter_whitelist. Default: whitelist.markers = NULL.
missing.memory: (optional, path) This argument allows erasing genotypes that have bad statistics. It's the path to a .rad file that contains 3 columns: MARKERS, INDIVIDUALS, ERASE. The file is produced by several radiator functions. For DArT data, filter_rad generates the file. Default: missing.memory = NULL. Currently not used.
path.folder: (optional, path) To write output in a specific folder. Default: path.folder = NULL. The working directory is used.
pop.levels: detailed in tidy_genomic_data.

Author

Thierry Gosselin thierrygosselin@icloud.com

Examples
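A minimal sketch of a call, using only the arguments shown in the usage above. The file names and the filename value are hypothetical placeholders, and the call is guarded so the snippet does nothing when radiator or the input files are not available.

```r
# Hypothetical input files; replace with your own DArT and strata files.
dart_file <- "my_dart_file.csv"
strata_file <- "my_strata.tsv"

# Arguments taken from the usage section; the values are illustrative.
dart_args <- list(
  data = dart_file,
  strata = strata_file,
  filename = "my_project",  # hypothetical name for the written .rad file
  tidy.dart = TRUE,         # also generate the tidy dataset
  verbose = TRUE,
  parallel.core = 1L
)

# Only run the import when the package and both files are present.
can_run <- requireNamespace("radiator", quietly = TRUE) &&
  file.exists(dart_file) &&
  file.exists(strata_file)

if (can_run) {
  dart_data <- do.call(radiator::read_dart, dart_args)
} else {
  message("Skipping import: radiator or the input files are not available.")
}
```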