% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utils.R
\name{vcf_sanity_check}
\alias{vcf_sanity_check}
\title{Perform a Sanity Check on a VCF File}
\usage{
vcf_sanity_check(
  vcf_path,
  n_data_lines = 100,
  max_markers = 10000,
  verbose = FALSE
)
}
\arguments{
\item{vcf_path}{A character string specifying the path to the VCF file. The file can be plain text or gzipped.}

\item{n_data_lines}{An integer specifying the number of data lines to sample for detailed checks. Default is 100.}

\item{max_markers}{An integer specifying the maximum number of markers allowed in the VCF file. Default is 10,000.}

\item{verbose}{A logical value indicating whether to print detailed messages during the checks. Default is FALSE.}
}
\value{
A list containing:
\itemize{
\item \code{checks}: A named vector indicating the results of each check (TRUE or FALSE).
\item \code{messages}: A data frame containing messages for each check, indicating success or failure.
\item \code{duplicates}: A list containing any duplicated sample or marker IDs found in the VCF file.
\item \code{ploidy_max}: The maximum ploidy detected from the genotype field, if applicable.
}
}
\description{
This function runs a series of checks on a VCF file to assess its validity and integrity. It verifies that the required headers, columns, and data fields are present, and flags common issues such as missing or malformed data.
}
\details{
The function performs the following checks:
\itemize{
\item \strong{VCF_header}: Verifies the presence of the \code{##fileformat} header.
\item \strong{VCF_columns}: Ensures required columns (\code{#CHROM}, \code{POS}, \code{ID}, \code{REF}, \code{ALT}, \code{QUAL}, \code{FILTER}, \code{INFO}) are present.
\item \strong{max_markers}: Checks if the total number of markers exceeds the specified limit.
\item \strong{GT}: Verifies the presence of the \code{GT} (genotype) field in the FORMAT column.
\item \strong{allele_counts}: Checks for allele-level count fields (e.g., \code{AD}, \code{RA}, \code{AO}, \code{RO}).
\item \strong{samples}: Ensures sample/genotype columns are present.
\item \strong{chrom_info} and \strong{pos_info}: Verifies the presence of \code{CHROM} and \code{POS} columns.
\item \strong{ref_alt}: Ensures \code{REF} and \code{ALT} fields contain valid nucleotide codes.
\item \strong{multiallelics}: Identifies multiallelic sites (\code{ALT} field with commas).
\item \strong{phased_GT}: Checks for phased genotypes (presence of \code{|} in the \code{GT} field).
\item \strong{duplicated_samples}: Checks for duplicated sample IDs.
\item \strong{duplicated_markers}: Checks for duplicated marker IDs.
}
}
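\examples{
# A minimal sketch, not taken from the package sources. The toy VCF below is
# hypothetical illustration data, and the elements read from the result are
# assumed to match the Value section above.
tab <- intToUtf8(9)  # tab character, built without a backslash escape
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "sample1", "sample2"), collapse = tab),
  paste(c("chr1", "100", "m1", "A", "G", ".", "PASS", ".", "GT:AD",
          "0/1:5,4", "1/1:0,9"), collapse = tab),
  paste(c("chr1", "250", "m2", "C", "T", ".", "PASS", ".", "GT:AD",
          "0/0:8,0", "0/1:3,6"), collapse = tab)
)
vcf_path <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_path)

\dontrun{
# Not run: behaviour on such a small toy file has not been verified.
res <- vcf_sanity_check(vcf_path, n_data_lines = 2, verbose = TRUE)
res$checks       # named TRUE/FALSE result per check
res$messages     # data frame of per-check messages
res$duplicates   # duplicated sample or marker IDs, if any
res$ploidy_max   # maximum ploidy detected from GT, if applicable
}

unlink(vcf_path)  # remove the temporary toy file
}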
