vignettes/standardise-variant-fields.Rmd
Every caller has its own way of naming the `VCF` fields that describe variant information. The following is the list of fields returned after parsing a VCF file with `parse_vcf_output()`, together with a description of the information extracted from each caller to populate each standardised field:
- `CHR_POS`, where `CHR` and `POS` are standard `VCF` fields for every caller.
- The `caller` argument.
- `CHROM` in the `VCF` file.
- `POS` in the `VCF` file.
- `REF` in the `VCF` file.
- `ALT` in the `VCF` file.

qual: base quality at the variant position, derived differently for each caller:

- `MuTect2`: the `QSS` field in the `FORMAT` fields reports the sum of the base qualities for the reference and alternative alleles, separated by a comma. We used the reference and alternative depths at each position to compute the overall average base quality at that position. This quantity populates the final `qual` field.
- `VarScan`: this field is the average of the `RBQ` and `ABQ` fields from the `FORMAT` fields. The header of the `VCF` file defines them as the average quality of reference-supporting and alternative-supporting bases, respectively.
- `VarDict`: there are two `QUAL` fields. One is the standard 6th field of a `VCF` file and the other is reported in the `INFO` fields. We use the latter to populate the `qual` field in our analysis, since the header of the `VCF` defines it as the average base quality at a variant position. `VarDict` uses a threshold of `QUAL >= 25` to report a variant.
- `freebayes`: this caller can report more than one alternative allele in output, even though most locations have only one entry. For this reason, this field is created by summing the base qualities reported for the reference allele and the first alternative allele. When using this caller, two extra columns, `qual_ref` and `qual_alt`, are also provided in output to allow the user to compute specific base qualities. See `?parse_vcf_output` for more information.

filter: standard field `FILTER` in the `VCF` file. Each caller populates this field differently, depending on the characteristics of its algorithm. In general, an entry is either `PASS`, if the mutation passes all the filters defined by the caller, or a description of the reason for filtering. The possible descriptions and their meanings can be found in the header of the VCF file generated by the caller.
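
The `MuTect2` and `VarScan` rules described for `qual` can be sketched as follows. This is a minimal illustration written in Python, not the package's R code. The function names are hypothetical, the inputs are assumed to be the raw comma-separated strings from the `VCF`, and the exact arithmetic (summed `QSS` divided by summed depth, and an unweighted mean of `RBQ` and `ABQ`) is one reading of the description above.

```python
import math


def mutect2_qual(qss: str, ad: str) -> float:
    """Average base quality from MuTect2 QSS and AD strings.

    Assumes qss is "ref_quality_sum,alt_quality_sum" and
    ad is "ref_depth,alt_depth" (hypothetical helper).
    """
    qss_ref, qss_alt = (float(x) for x in qss.split(",")[:2])
    ref_depth, alt_depth = (int(x) for x in ad.split(",")[:2])
    total_depth = ref_depth + alt_depth
    if total_depth == 0:
        # No supporting reads: the average is undefined.
        return math.nan
    return (qss_ref + qss_alt) / total_depth


def varscan_qual(rbq: str, abq: str) -> float:
    """Unweighted mean of the VarScan RBQ and ABQ values (an assumption)."""
    return (float(rbq) + float(abq)) / 2


print(mutect2_qual("900,300", "30,10"))  # 1200 summed quality over 40 reads
print(varscan_qual("30", "34"))
```

A depth-weighted mean would be an equally plausible choice for `VarScan`; the unweighted mean is used here only because the text describes a plain average.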
genotype: standard `GT` field in the `FORMAT` fields of the `VCF` file.
tot_depth: total read depth at each position as estimated by the caller. Each caller can report this information differently: `VarDict`, `VarScan`, and `freebayes` record it in the `DP` field, while `MuTect2` records the reference and alternative depths in the `AD` field, and their sum is used to define `tot_depth`.
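
As a rough sketch of the `tot_depth` rule, the Python snippet below selects the depth source by caller. It is illustrative only and is not the package's R implementation: the `total_depth` function, the caller name strings, and the `fields` dictionary (raw field name mapped to its string value) are all assumptions made for the example.

```python
def total_depth(caller: str, fields: dict) -> int:
    """Return the total read depth for one variant record.

    `fields` is a hypothetical mapping such as {"DP": "55"} or
    {"AD": "30,10"} holding the raw strings read from the VCF.
    """
    if caller == "MuTect2":
        # AD is "ref_depth,alt_depth"; the total is their sum.
        ref_depth, alt_depth = (int(x) for x in fields["AD"].split(",")[:2])
        return ref_depth + alt_depth
    if caller in {"VarDict", "VarScan", "freebayes"}:
        # These callers report the total depth directly in DP.
        return int(fields["DP"])
    raise ValueError(f"unsupported caller: {caller}")


print(total_depth("MuTect2", {"AD": "30,10"}))
print(total_depth("VarScan", {"DP": "55"}))
```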
VAF: variant allele frequency for the variants recorded at that position. `VarScan` records it in the `FREQ` field, while `MuTect2` and `VarDict` record it in the `AF` field. For `freebayes` this field is computed as the sum of the total counts for the alternative alleles divided by the count for the reference allele, even though one might want to compute the VAF for every alternative allele separately. Since `freebayes` can report more than one alternative allele in output, this field is created by summing the reference depth and the depth of the first alternative allele. See `?parse_vcf_output` for more information.
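
The difference between the `FREQ` and `AF` sources can be illustrated with a short Python sketch. It is not the package's R code. The `vaf_fraction` function and the `fields` dictionary are hypothetical, and rescaling to a fraction between 0 and 1 is an assumption made here so that callers are comparable. `VarScan` writes `FREQ` as a percentage string such as `12.5%`, whereas `AF` is already a fraction. `freebayes` is left out because its VAF is derived from read counts rather than read from a single field.

```python
def vaf_fraction(caller: str, fields: dict) -> float:
    """Return the variant allele frequency as a fraction in [0, 1].

    `fields` is a hypothetical mapping of raw VCF strings,
    e.g. {"FREQ": "12.5%"} or {"AF": "0.25"}.
    """
    if caller == "VarScan":
        # FREQ is a percentage with a trailing "%" sign.
        return float(fields["FREQ"].rstrip("%")) / 100.0
    if caller in {"MuTect2", "VarDict"}:
        # AF may list one value per alternative allele; take the first.
        return float(fields["AF"].split(",")[0])
    raise ValueError(f"no single VAF field for caller: {caller}")


print(vaf_fraction("VarScan", {"FREQ": "12.5%"}))
print(vaf_fraction("MuTect2", {"AF": "0.25"}))
```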
ADJVAF_ADJ_indels: the `ADJAF` field, reported only by `VarDict`. It represents the adjusted variant allele frequency for indels due to local realignment.
`ref_depth`, `alt_depth`, `ref_forw`, `ref_rev`, `alt_forw`, `alt_rev`: reference and alternative depths, overall and by strand, extracted as follows for each caller:

- `MuTect2`: `ref_depth` and `alt_depth` are the comma-separated values reported in the field `AD`; `ref_forw`, `ref_rev`, `alt_forw` and `alt_rev` are respectively the `MuTect2` fields `REF_F1R2`, `REF_F2R1`, `ALT_F1R2` and `ALT_F2R1`.
- `VarScan`: in order, the features listed above are extracted from the fields `RD`, `AD`, `RDF`, `RDR`, `ADF` and `ADR`.
- `VarDict`: `ref_depth` and `alt_depth` are computed after extracting information from the `REFBIAS` and `VARBIAS` fields. The field `REFBIAS` contains comma-separated values representing `ref_forw` and `ref_rev`, and `VARBIAS` contains comma-separated values representing `alt_forw` and `alt_rev`.
- `freebayes`: `ref_depth` and `alt_depth` are the fields `RO` and `AO` in the VCF file; the fields `SRF` and `SRR` contain the depth for the reference allele on the forward and reverse strand (`ref_forw` and `ref_rev`), and the same applies to `SAF` and `SAR` for the alternative alleles (`alt_forw` and `alt_rev`).

If the `VCF`
file was also annotated using the Variant Effect Predictor (VEP) [@McLaren2016-lv], all the other fields are exactly as generated by `VEP`: `Allele`, `Consequence`, `IMPACT`, `SYMBOL`, `Gene`, `Feature_type`, `Feature`, `BIOTYPE`, `EXON`, `INTRON`, `HGVSc`, `HGVSp`, `cDNA_position`, `CDS_position`, `Protein_position`, `Amino_acids`, `Codons`, `Existing_variation`, `DISTANCE`, `STRAND`, `FLAGS`, `VARIANT_CLASS`, `SYMBOL_SOURCE`, `HGNC_ID`, `CANONICAL`, `TSL`, `APPRIS`, `CCDS`, `ENSP`, `SWISSPROT`, `TREMBL`, `UNIPARC`, `GENE_PHENO`, `SIFT`, `PolyPhen`, `DOMAINS`, `AF`, `AFR_AF`, `AMR_AF`, `EAS_AF`, `EUR_AF`, `SAS_AF`, `AA_AF`, `EA_AF`, `ExAC_AF`, `ExAC_Adj_AF`, `ExAC_AFR_AF`, `ExAC_AMR_AF`, `ExAC_EAS_AF`, `ExAC_FIN_AF`, `ExAC_NFE_AF`, `ExAC_OTH_AF`, `ExAC_SAS_AF`, `MAX_AF`, `MAX_AF_POPS`, `CLIN_SIG`, `SOMATIC`, `PHENO`, `PUBMED`, `MOTIF_NAME`, `MOTIF_POS`, `HIGH_INF_POS`, `MOTIF_SCORE_CHANGE`, `SampleName`, `IMPACT_rank`. Visit the `VEP` page for more information: https://asia.ensembl.org/info/docs/tools/vep/index.html.
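
For readers who want to see where the VEP columns come from, the Python sketch below splits a VEP annotation string into named fields. It is illustrative only and is not the package's R code. VEP writes its annotations to the `CSQ` entry of the `INFO` column, with one pipe-separated record per annotated feature and the field order declared after `Format:` in the `CSQ` header description. The function names and the example values (`GENE1`, `GENE2`) are made up for the illustration.

```python
from typing import Dict, List


def csq_field_names(description: str) -> List[str]:
    """Extract field names from the Description of the CSQ header line."""
    _, separator, field_format = description.partition("Format:")
    if not separator:
        raise ValueError("no 'Format:' section in the CSQ description")
    return [name.strip() for name in field_format.strip().strip('"').split("|")]


def parse_csq(description: str, csq_value: str) -> List[Dict[str, str]]:
    """Return one dictionary per annotated feature in a CSQ value."""
    names = csq_field_names(description)
    records = []
    for entry in csq_value.split(","):  # one entry per feature
        values = entry.split("|")       # fields within an entry
        records.append(dict(zip(names, values)))
    return records


# Made-up header description and CSQ value, for illustration only.
description = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|IMPACT|SYMBOL"
)
csq = "T|missense_variant|MODERATE|GENE1,T|downstream_gene_variant|MODIFIER|GENE2"
for record in parse_csq(description, csq):
    print(record["SYMBOL"], record["Consequence"], record["IMPACT"])
```

Reading the field order from the header, rather than hard-coding it, keeps the sketch independent of the VEP options used to annotate the file.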