Read/import and tidy genomic data frames. If the data is in wide format, the function will gather it into long format. Used internally in radiator and assigner, and might be of interest to users.

tidy_wide(data, import.metadata = FALSE)

Arguments

data

A file in the working directory or object in the global environment in wide or long (tidy) formats. See details for more info.

How to get a tidy data frame? See radiator tidy_genomic_data.

import.metadata

(optional, logical) With import.metadata = TRUE, the metadata (anything other than the genotype) will be imported, for the long format exclusively. Default: import.metadata = FALSE, no metadata.

Value

A tidy data frame in the global environment.

Details

Input data:

To discriminate the long from the wide format, the function radiator tidy_wide searches for MARKERS in the column names (TRUE = long format). The data frame is tab delimited.

Wide format: The wide format cannot store metadata info. It starts with these 2 id columns: INDIVIDUALS and STRATA (which refers to any grouping of individuals). The remaining columns are the markers, one column per marker, storing the genotypes.
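The format check and the gathering step described above can be sketched in base R. This is only an illustration under stated assumptions, not radiator's implementation: `is_long_format()` and `gather_wide()` are hypothetical helper names, and the toy marker names and genotypes are made up.

```r
# Hypothetical helper: TRUE when a MARKERS column is present (long format).
is_long_format <- function(df) "MARKERS" %in% colnames(df)

# Toy wide-format data: 2 id columns, then one column per marker.
wide <- data.frame(
  INDIVIDUALS = c("ind1", "ind2"),
  STRATA = c("pop1", "pop2"),
  locus_1 = c("001002", "001001"),
  locus_2 = c("111333", "333333"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the marker columns into MARKERS/GT pairs.
gather_wide <- function(df) {
  id.cols <- c("INDIVIDUALS", "STRATA")
  marker.cols <- setdiff(colnames(df), id.cols)
  long <- do.call(rbind, lapply(marker.cols, function(m) {
    data.frame(
      df[id.cols],      # id columns are repeated for every marker
      MARKERS = m,      # marker name, recycled over the individuals
      GT = df[[m]],     # genotypes stored in that marker's column
      stringsAsFactors = FALSE
    )
  }))
  rownames(long) <- NULL
  long
}

# Gather only when the data is not already long.
long <- if (is_long_format(wide)) wide else gather_wide(wide)
```

The result has one row per individual and marker, with the 4 columns INDIVIDUALS, STRATA, MARKERS and GT.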

Long/Tidy format: The long format is considered to be a tidy data frame and can store metadata info (e.g. from a VCF, see radiator tidy_genomic_data). A minimum of 4 columns is required in the long format: INDIVIDUALS, STRATA, MARKERS and GT for the genotypes. The remaining columns are considered metadata info.
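As a rough sketch of how the required and metadata columns relate in the long format, the base R snippet below checks for the 4 required columns and treats everything else as metadata. The `READ_DEPTH` column and the variable names are hypothetical, and the way `import.metadata` is applied here is an assumption for illustration only.

```r
required.cols <- c("INDIVIDUALS", "STRATA", "MARKERS", "GT")

# Toy long-format data with one extra (hypothetical) metadata column.
long <- data.frame(
  INDIVIDUALS = c("ind1", "ind2"),
  STRATA = c("pop1", "pop2"),
  MARKERS = c("locus_1", "locus_1"),
  GT = c("001002", "001001"),
  READ_DEPTH = c(12L, 30L),
  stringsAsFactors = FALSE
)

# Stop early if one of the 4 required columns is absent.
missing.cols <- setdiff(required.cols, colnames(long))
if (length(missing.cols) > 0) {
  stop("Missing required columns: ", paste(missing.cols, collapse = ", "))
}

# Everything that is not a required column is considered metadata.
metadata.cols <- setdiff(colnames(long), required.cols)

# Assumed behaviour: FALSE keeps only the 4 required columns.
import.metadata <- FALSE
kept <- if (import.metadata) long else long[required.cols]
```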

Genotypes with separators: ALL separators will be removed. Genotypes should be coded with 3 digits for each allele, 6 digits in total per genotype, e.g. 001002 or 111333 (for a heterozygote individual). 6 digits WITH a separator: e.g. 001/002 or 111/333 (for a heterozygote individual). The separator can be any of these: "/", ":", "_", "-", ".", and will be removed.
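A minimal base R sketch of the separator removal, assuming a simple regular expression is enough for the 5 separators listed above. `clean_gt()` is a hypothetical name, not a radiator function.

```r
# Hypothetical helper: drop any of the separators / : _ - . from genotypes.
clean_gt <- function(gt) gsub("[/:_.-]", "", gt)

gt <- c("001/002", "111:333", "001_001", "111-333", "001.002", "001002")
gt.clean <- clean_gt(gt)

# After cleaning, every genotype should be exactly 6 digits.
all(grepl("^[0-9]{6}$", gt.clean))
```

Genotypes that already have no separator, such as 001002, pass through unchanged.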

Separators in STRATA, INDIVIDUALS and MARKERS: Some separators can interfere with packages or code and are cleaned by radiator.

  • MARKERS: /, :, - and . are changed to an underscore _.

  • STRATA: white spaces in population names are replaced by underscore.

  • INDIVIDUALS: _ and : are changed to a dash -.
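The three cleaning rules above could be approximated in base R as follows. The function names and sample values are hypothetical, and replacing each whitespace character individually in STRATA is an assumption. This is a sketch of the rules, not radiator's own code.

```r
# MARKERS: / : - and . become an underscore.
clean_markers <- function(x) gsub("[/:.-]", "_", x)

# STRATA: each whitespace character becomes an underscore (assumption).
clean_strata <- function(x) gsub("\\s", "_", x)

# INDIVIDUALS: _ and : become a dash.
clean_individuals <- function(x) gsub("[_:]", "-", x)

clean_markers("chr1:1234-A/T.1")   # "chr1_1234_A_T_1"
clean_strata("North Sea")          # "North_Sea"
clean_individuals("pop1_ind:01")   # "pop1-ind-01"
```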

How to get a tidy data frame? radiator tidy_genomic_data can transform 6 genomic data formats into a tidy data frame.

Author

Thierry Gosselin thierrygosselin@icloud.com