This function identifies sex-linked markers putatively located on
heterogametic or homogametic chromosomes and re-assign the sex in a dataset
according to Y- or W-linked markers.
The function work best in: DArT silico (counts) >
DArT counts or RADseq with allele read depth > DArT silico (genotypes) >
RADseq (genotypes) and DArT (1-row, 2-rows genotypes).
sexy_markers(
data,
silicodata = NULL,
strata = NULL,
boost.analysis = FALSE,
coverage.thresholds = 1,
filters = TRUE,
interactive.filter = TRUE,
folder.name = NULL,
parallel.core = parallel::detectCores() - 1,
...
)
(object or file) DArT file .csv or .tsv
, VCF file .vcf
,
GDS file or object (.gds
).
See read_dart
or read_vcf
for more details.
(optional, file) A silico (dominant marker) DArT file .csv or .tsv
.
This can be count or genotyped data. Note that both data
and
silicodata
can be used at the same time.
Default: silicodata = NULL
.
(file) A tab delimited file with a minimum of
2 columns (INDIVIDUALS, STRATA
) for VCF files and 3 columns for DArT files
(TARGET_ID, INDIVIDUALS, STRATA
).
TARGET_ID:
it's the column header of the DArT file.
STRATA:
is the grouping column, here the sex info.
3 values: M
for male, F
for female and U
for unknown.
Anything else is converted to U
.
INDIVIDUALS:
Is how you want your samples to be named.
Default: strata = NULL
, the function will look for sex info in the
tidy data or GDS data (individuals.meta section).
You can easily build the strata file by starting with the output of these
functions: extract_dart_target_id
and extract_individuals_vcf
(optional, logical) This method uses machine learning
approaches to find sex markers and re-assign samples in sex group.
The approach is currently under construction.
Default: boost.analysis = FALSE
.
(optional, integer) The minimum coverage required
to call a marker absent. For silico genotype data this must be < 1.
Default: coverage.thresholds = 1
.
(optional, logical) When filters = TRUE
, the data will
be filtered for monomorphic loci, missingness of individuals and heterozygosity of individuals.
CAUTION: we advice to use these filter, since not filtering or filtering too
stringently will results in false positive or false negative detections.
Default: filters = TRUE
.
(optional, logical) When interactive.filter = TRUE
the function will ask for your input to define thresholds. If interactive.filter = FALSE
the function expects additional arguments (see Advance mode).
Default: interactive.filter = TRUE
.
(optional,character) Name of the folder to store the results.
Default: folder.name = NULL
. The name sexy_markers_datetime will be generated.
(optional) The number of core used for parallel
execution during import.
Default: parallel.core = parallel::detectCores() - 1
.
(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section)
The created object contains:
A list with (1) the summarised SNP data per sex and (2) the summarised silico data per sex. This should allow you to re-create the various plots.
A vector with the names of the sex-linked marker. One vector for each sex method.
A dataframe with a summary of the sex-linked markers and their sequence (if available).
This function takes DArT and RAD-type data to find markers that have a specific
pattern that is linked to sex.
The function hypothesizes the presence of sex-chromosomes in your
species/population. The tests are designed to identify markers that are
located on putative heterogametic (Y or W) or homogametic (X or Z) chromosomes.
Note: Violating Assumptions or Prerequisites (see below) can lead to
false positive or the absence of detection of sex-linked markers.
Genetic Sex Determination System over a e.g. environmental-sex-determination system.
Genome coverage: restriction sites randomly spread throughout the whole genome.
Mutations: Processes such as sex-specific mutations in the restriction sites could lead to false positive results.
Deletions/duplications: Processes such as sex-specific deletions or duplications could lead to false positive results.
Homology: The existence of homologous sequences between the homogametic and heterogametic chromosomes could lead to false negative results.
Absence of population signal
Sample size: Ideally, the data must have enough individuals (n > 100).
Batch effect: Sex should be randomized on lanes/chips during sequencing.
Sex ratio: Dataset with equal ratio work best.
Genotyping rate: for DArT data, if the minimum call rate is
> 0.5 ask DArT to lower their filtering threshold.
RADseq data, lower markers missingness thresholds during filtering
(e.g. stacks r
and p
).
Identity-by-Missingness: Absence of artifactual pattern of missingness (see missing visualization)
Low genotyping error rate: see detect_het_outliers
and whoa.
Low heterozygosity miscall rate: see detect_het_outliers
and
whoa.
Absence of pattern of heterozygosity driven by missingness:
see detect_mixed_genomes
.
Absence of paralogous sequences in the data.
Heterogametic sex-markers:
Presence/Absence method: To identify markers on Y or W chromosomes, we look at the presence or absence of a marker between females and males. More specifically, if a marker is always present in males but never in females, they are putatively located in the Y-chromosome; and vice versa for the W-linked markers.
Homogametic sex-markers: We have two different methods to identify markers on X or Z chromosomes:
Heterozygosity method: By looking at the heterozygosity of a marker between sexes, we can identify markers that are always homozygous in one sex (e.g. males for an XY system), while exhibiting an intermediate range of heterozygosity in the other sex (0.1 - 0.5).
Coverage method: If the data includes count information, this function will look for markers that have double the number of counts for either of the sexes. For example if an XY/XX system is present, females are expected to have double the number of counts for markers on the X chromosome.
dots-dots-dots ... allows to pass several arguments for fine-tuning the function:
species
: To give your figures some meanings.
Default species = NULL
.
population
: To give your figures some meanings.
Default species = NULL
.
tau
: The quantile used in regression to distinguish homogametic markers
with the heterozygosity method. See rq
.
Default tau = 0.03
.
mis.threshold.data
: Threshold to filter the SNP data on missingness.
Only if interactive.filter = FALSE
.
mis.threshold.silicodata
: Threshold to filter the silico data on
missingness. No default. Only if interactive.filter = FALSE
.
threshold.y.markers
: Threshold to select heterogametic sex-linked
markers from the SNP data with the presence/absence method.No default.
Only if interactive.filter = FALSE
.
threshold.x.markers.qr
: Threshold to select homogametic sex-linked
markers from the SNP data with the heterozygosity method. No default.
Only if interactive.filter = FALSE
.
zoom.data
: Threshold to subset the F/M ratio of mean SNP coverage.
Used to improve the histogram resolution to select a better threshold.x.markers.RD
threshold. No default. Only if interactive.filter = FALSE
.
threshold.x.markers.RD
: Threshold to select homogametic sex-linked
markers from the SNP data with the coverage method.No default.
Only if interactive.filter = FALSE
.
zoom.silicodata
: Threshold to subset the F/M ratio of mean silico coverage.
Used to improve the histogram resolution to select a better threshold.x.markers.RD.silico
threshold. No default. Only if interactive.filter = FALSE
.
threshold.x.markers.RD.silico
: Threshold to select homogametic sex-linked
markers from the silico data with the coverage method. No default.
Only if interactive.filter = FALSE
.
sex.id.input
: (integer) sex.id.input = c(1, 2 or 3)
to recalculate the sex based on (1) 'visual', (2) 'genetic SNP' or (3) 'genetic SILICO' sexID.
No default. Only if interactive.filter = FALSE
.
het.qr.input
: (integer) het.qr.input = c(1 or 2)
to plot the heterozygosity residual plot for (1) X-linked markers (heterozygous for females),
or (2) Z-linked markers (heterozygous for males). No default.
Only if interactive.filter = FALSE
.
Machine Learning approaches (Random Forest and Extreme Gradient Boosting Trees) are currently been tested. They usually show a lower discovery rate but tend to perform better with new samples.
Eric Anderson's whoa package.
if (FALSE) { # \dontrun{
# The minimum
sex.markers <- radiator::sexy_markers(
data = "shark.dart.data.csv",
strata = "shark.strata.tsv")
# This will use the default: interactive version and a list is created and to view the sex markers
} # }