Used internally in radiator and might be of interest to users. The function generates a GDS object/file and, optionally, a tidy dataset from DArT files.
read_dart(
data,
strata,
filename = NULL,
tidy.dart = FALSE,
verbose = TRUE,
parallel.core = parallel::detectCores() - 1,
...
)
One of the DArT output files. Six formats used by DArT are recognized by radiator:
1row
: Genotypes are in 1 row and coded (0, 1, 2, -).
0 for 2 reference alleles REF/REF, 1 for 2 alternate alleles ALT/ALT,
2 for heterozygote REF/ALT, - for missing.
2rows
: No genotypes. It's absence/presence, 0/1, of the REF and ALT alleles.
Sometimes called binary format.
counts
: No genotypes. It's counts/read depth for the REF and ALT alleles.
Sometimes just called count data.
silico.dart
: SilicoDArT data. No genotypes, no REF or ALT alleles.
It's a file coded as absence/presence, 0/1, for the presence of sequence in
the clone id.
silico.dart.counts
: SilicoDArT data. No genotypes, no REF or ALT alleles.
It's a file coded as absence/presence, with counts for the presence of sequence in
the clone id.
dart.vcf
: For DArT VCFs, please use read_vcf.
Depending on the number of markers, these formats will be recoded similarly to VCF files (dosage of the alternate allele, see details).
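The recoding from the 1row codes to the dosage of the alternate allele can be sketched in base R as below. This is only an illustration of the mapping described above, not radiator's internal code; the helper name recode_dart_1row is hypothetical.

```r
# Hypothetical helper (not part of radiator): recode DArT 1row calls
# into the dosage of the alternate allele (0, 1, 2, NA).
recode_dart_1row <- function(x) {
  x <- as.character(x)
  # DArT 1row code -> number of ALT alleles:
  # "0" = REF/REF -> 0, "1" = ALT/ALT -> 2, "2" = REF/ALT -> 1
  dosage <- c("0" = 0L, "1" = 2L, "2" = 1L)
  # Any other value, including the missing code "-", becomes NA
  unname(dosage[x])
}

calls <- c("0", "1", "2", "-")
alt.dosage <- recode_dart_1row(calls)
```

Note that the DArT codes 1 and 2 swap meaning after recoding: a DArT 1 (ALT/ALT) becomes a dosage of 2, and a DArT 2 (heterozygote) becomes a dosage of 1.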
The function can import .csv or .tsv files.
If you encounter a problem, send me your data so that I can update the function.
A tab delimited file or object with 3 columns.
The column headers are: TARGET_ID, INDIVIDUALS and STRATA.
Note: the column STRATA refers to any grouping of individuals.
You need to make sure that the column TARGET_ID matches the id used by DArT.
With the counts format, the TARGET_ID is a series of integers.
With 1row and 2rows, the TARGET_ID is actually the sample name submitted to DArT.
The columns INDIVIDUALS and STRATA will be kept in the tidy data.
Only individuals in the strata file are kept in the tidy data, i.e. the strata file
is also used as a whitelist of individuals/strata.
Silico DArT data is currently used to detect sex markers, so the STRATA
column should be filled with sex information: M or F.
See example on how to extract the TARGET_ID of your DArT file.
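A minimal base R sketch of building such a strata file is shown below. The sample names, population labels and file name are invented for illustration, and the last step only mimics the whitelist behaviour described above; it is not how radiator implements it.

```r
# Invented example values: 3 samples in 2 groups
strata <- data.frame(
  TARGET_ID = c("sample_01", "sample_02", "sample_03"),
  INDIVIDUALS = c("IND-001", "IND-002", "IND-003"),
  STRATA = c("POP_A", "POP_A", "POP_B"),
  stringsAsFactors = FALSE
)

# Write a tab delimited file with the 3 expected column headers
strata.file <- file.path(tempdir(), "example.strata.tsv")
write.table(strata, file = strata.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to check the headers
check <- read.delim(strata.file, stringsAsFactors = FALSE)

# The strata acts as a whitelist: ids absent from it are dropped
dart.ids <- c("sample_01", "sample_02", "sample_03", "sample_04")
kept <- dart.ids[dart.ids %in% check$TARGET_ID]
```

Here sample_04 is in the (pretend) DArT file but not in the strata, so it would not appear in the tidy data.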
(optional) The function uses write.fst to write the tidy data frame in
the working directory. The file extension appended to the filename provided is .rad.
With default: filename = NULL, the tidy data frame is
in the global environment only (i.e. not written in the working directory).
(logical, optional) Generate a tidy dataset.
Default: tidy.dart = FALSE.
(optional, logical) When verbose = TRUE
the function is a little more chatty during execution.
Default: verbose = TRUE.
(optional) The number of cores used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1.
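On a single-core machine the default expression evaluates to 0, and parallel::detectCores() can return NA on some platforms. How radiator handles those cases internally is not documented here, so the guard below is only a sketch of one way to compute a safe value before passing it explicitly.

```r
# Sketch: compute a value for parallel.core that is never below 1
n.cores <- parallel::detectCores()
if (is.na(n.cores)) n.cores <- 1L      # detectCores() may return NA
parallel.core <- max(1L, n.cores - 1L) # leave 1 core free when possible
```

The resulting parallel.core can then be supplied to the argument of the same name.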
(optional) To pass further arguments for fine-tuning the function.
A radiator GDS file and tidy dataframe with several columns depending on DArT file:
silico.dart:
A tibble with 5 columns: CLONE_ID, SEQUENCE, VALUE, INDIVIDUALS, STRATA.
This object is also saved in the directory (file ending with .rad).
Common to 1row, 2rows and counts: A GDS file is automatically generated.
To have a tidy tibble, the argument tidy.dart = TRUE must be used.
VARIANT_ID: generated by radiator and corresponds to the markers as integers.
MARKERS: generated by radiator and corresponds to CHROM + LOCUS + POS separated by 2 underscores.
CHROM: the chromosome info, for de novo: CHROM_1.
LOCUS: the locus info.
POS: the SNP id on the LOCUS.
COL: the position of the SNP on the short read.
REF: the reference allele.
ALT: the alternate allele.
INDIVIDUALS: the sample name.
STRATA/POP_ID: population id of the sample.
GT_BIN: the genotype based on the number of alternate alleles in the genotype
(the count/dosage of the alternate allele): 0, 1, 2, NA.
REP_AVG: the reproducibility average, output specific of DArT.
Other columns potentially in the tidy tibble:
GT: the genotype in 6 digit format à la genepop.
GT_VCF: the genotype in VCF format: 0/0, 0/1, 1/1, ./.
GT_VCF_NUC: the genotype in VCF format, but keeping the nucleotide information:
A/A, A/T, T/T, ./.
AVG_COUNT_REF: the coverage for the reference allele, output specific of DArT.
AVG_COUNT_SNP: the coverage for the alternate allele, output specific of DArT.
READ_DEPTH: the number of reads used for the genotype (count data).
ALLELE_REF_DEPTH: the number of reads of the reference allele (count data).
ALLELE_ALT_DEPTH: the number of reads of the alternate allele (count data).
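To make the relationship between some of these columns concrete, the base R sketch below builds a toy tidy table and derives MARKERS, GT_VCF and GT_VCF_NUC from the other columns. The values are invented and the derivations only follow the column descriptions above; they are not radiator's own code.

```r
# Toy tidy data: 2 markers x 2 individuals (invented values)
tidy <- data.frame(
  CHROM = "CHROM_1",
  LOCUS = c("1001", "1001", "1002", "1002"),
  POS = c("12", "12", "45", "45"),
  REF = c("A", "A", "G", "G"),
  ALT = c("T", "T", "C", "C"),
  INDIVIDUALS = c("IND-001", "IND-002", "IND-001", "IND-002"),
  GT_BIN = c(0L, 1L, 2L, NA),
  stringsAsFactors = FALSE
)

# MARKERS: CHROM + LOCUS + POS separated by 2 underscores
tidy$MARKERS <- paste(tidy$CHROM, tidy$LOCUS, tidy$POS, sep = "__")

# GT_VCF from the ALT dosage: 0 -> 0/0, 1 -> 0/1, 2 -> 1/1, NA -> ./.
gt.vcf <- c("0/0", "0/1", "1/1")
tidy$GT_VCF <- gt.vcf[tidy$GT_BIN + 1L]
tidy$GT_VCF[is.na(tidy$GT_BIN)] <- "./."

# GT_VCF_NUC: same genotype, written with the nucleotides
first.allele <- ifelse(tidy$GT_BIN == 2L, tidy$ALT, tidy$REF)
second.allele <- ifelse(tidy$GT_BIN == 0L, tidy$REF, tidy$ALT)
tidy$GT_VCF_NUC <- ifelse(
  is.na(tidy$GT_BIN), "./.",
  paste(first.allele, second.allele, sep = "/")
)
```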
Written in the working directory:
The radiator GDS file
The DArT metadata information
The tidy DArT data
The strata file associated with this tidy dataset
The allele dictionary is a tibble with columns:
MARKERS, CHROM, LOCUS, POS, REF, ALT.
dots-dots-dots ... allows passing several arguments for fine-tuning the function:
whitelist.markers
: detailed in filter_whitelist.
Default: whitelist.markers = NULL.
missing.memory
: (optional, path)
This argument allows erasing genotypes that have bad statistics.
It's the path to a .rad file that contains 3 columns:
MARKERS, INDIVIDUALS, ERASE. The file is produced by several radiator
functions. For DArT data, filter_rad generates the file.
Default: missing.memory = NULL. Currently not used.
path.folder
: (optional, path) To write output in a specific folder.
Default: path.folder = NULL. The working directory is used.
pop.levels
: detailed in tidy_genomic_data.
if (FALSE) { # \dontrun{
clownfish.dart.tidy <- radiator::read_dart(
data = "clownfish.dart.csv",
strata = "clownfish.strata.tsv"
)
} # }