Every caller has its own way of naming the VCF fields that describe variant information. The following is the list of fields returned after parsing a VCF file with parse_vcf_output(), together with a description of what information was extracted from each caller to populate each standardised field:

  • Location: CHR_POS where CHR and POS are standard VCF fields for every caller.
  • caller: the caller passed to the caller argument.
  • chrom: field CHROM in the VCF file.
  • pos: field POS in the VCF file.
  • ref: field REF in the VCF file.
  • alt: field ALT in the VCF file.
  • qual: this field is not reported consistently in the VCF output of the different callers, but I made sure that its meaning is consistent and that it represents the average base quality at that position. Below is a description of how it is extracted for each caller.
    • In MuTect2 the QSS field in the FORMAT fields reports the sum of the base qualities for the reference and alternative alleles separated by a comma. We used the reference and alternative depths at each position to compute the overall average base quality at that position. This quantity will populate the final qual field.
    • In VarScan this field is the average of the RBQ and ABQ fields from the FORMAT fields. They are defined respectively as the average quality of reference and alternative supporting bases in the header of the VCF file.
    • In VarDict there are two QUAL fields, one is the standard 6th field of a VCF file and the other one is reported in the INFO fields. We will use the latter to populate the qual field in our analysis since it is the one defined as the average base quality at a variant position in the header of the VCF. VarDict uses a threshold of QUAL >= 25 to report a variant.
    • freebayes can report more than one alternative allele in output, even though most locations will only have one entry. For this reason, this field is created by summing up the base qualities reported for the reference allele and the first alternative allele. When using this caller, two extra columns, qual_ref and qual_alt, are also provided in output to allow the user to compute specific base qualities. See ?parse_vcf_output for more information.
  • filter: standard field FILTER in the VCF file. Each caller populates this field in different ways depending on the characteristics of the algorithm. In general, the entries for this field can be either PASS if that mutation passes all the filters defined by a caller or a description of the reason for filtering. The possible descriptions and their meaning can be found in the header of the VCF file generated by the caller.

  • genotype: standard GT field in the FORMAT fields of the VCF file.

  • tot_depth: total read depth at each position as estimated by the caller. This information is reported differently by each caller: VarDict, VarScan and freebayes record it in the DP field, while MuTect2 records the reference and alternative depths in the AD field, and their sum is used to define tot_depth.

  • VAF: variant allele frequency for the variants recorded at that position. VarScan records it in the FREQ field, while MuTect2 and VarDict record it in the AF field. For freebayes this field is computed as the sum of the total count for the alternative alleles divided by the count for the reference allele, even though one might want to compute the VAF for every alternative allele separately. Since freebayes can report more than one alternative allele in output, this field is created by summing up the reference depth and the depth of the first alternative allele. See ?parse_vcf_output for more information.

  • ADJVAF_ADJ_indels: the ADJAF field, reported only by VarDict, which represents the adjusted variant allele frequency for indels due to local realignment.

  • ref_depth, alt_depth, ref_forw, ref_rev, alt_forw and alt_rev: these fields represent the breakdown of supporting reference/alternative and forward/reverse reads at each location. Below I describe which fields I used from each caller to extract these values. The fields are listed in order:
    • MuTect2: ref_depth and alt_depth are the comma separated values reported in the field AD; ref_forw, ref_rev, alt_forw and alt_rev are respectively the MuTect2 fields REF_F1R2, REF_F2R1, ALT_F1R2 and ALT_F2R1.
    • VarScan2: the features listed above are extracted, in order, from the fields RD, AD, RDF, RDR, ADF and ADR.
    • VarDict: ref_depth and alt_depth are computed after extracting information from the REFBIAS and VARBIAS fields. The field REFBIAS contains comma separated values representing ref_forw and ref_rev and VARBIAS contains comma separated values representing alt_forw and alt_rev.
    • freebayes: ref_depth and alt_depth are the fields RO and AO in the VCF file; the fields SRF and SRR contain the depth for the reference allele on the forward and reverse strand (ref_forw and ref_rev) and same applies for SAF and SAR for the alternative alleles respectively (alt_forw and alt_rev).
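
Some of the derived quantities described above can be illustrated with a short sketch. This is not the code used by parse_vcf_output(); it is a minimal, language-neutral illustration written in Python. The function names (split_pair, mutect2_tot_depth, mutect2_qual, varscan_qual, standardise_varscan) and the sample values are hypothetical. The exact MuTect2 formula is not spelled out above, so dividing the summed QSS by the summed AD is an assumption about how the average base quality is obtained.

```python
# Hypothetical helpers illustrating how standardised fields could be derived.
# They are NOT the internals of parse_vcf_output().


def split_pair(value):
    """Split a comma-separated 'ref,alt' FORMAT value into two floats."""
    ref, alt = value.split(",")[:2]
    return float(ref), float(alt)


def mutect2_tot_depth(ad):
    """tot_depth for MuTect2: sum of the reference and alternative depths in AD."""
    ref_depth, alt_depth = split_pair(ad)
    return ref_depth + alt_depth


def mutect2_qual(qss, ad):
    """Average base quality from QSS (summed qualities) and AD (depths).

    Assumption: qual = (QSS_ref + QSS_alt) / (AD_ref + AD_alt).
    Returns None when no reads support either allele.
    """
    qss_ref, qss_alt = split_pair(qss)
    depth = mutect2_tot_depth(ad)
    if depth == 0:
        return None
    return (qss_ref + qss_alt) / depth


def varscan_qual(rbq, abq):
    """qual for VarScan: simple average of the RBQ and ABQ values."""
    return (float(rbq) + float(abq)) / 2


# VarScan2 FORMAT fields mapped, in order, to the standardised names.
VARSCAN_FIELDS = {
    "RD": "ref_depth",
    "AD": "alt_depth",
    "RDF": "ref_forw",
    "RDR": "ref_rev",
    "ADF": "alt_forw",
    "ADR": "alt_rev",
}


def standardise_varscan(sample):
    """Rename VarScan2 depth fields of one sample to the standardised names."""
    return {new: int(sample[old]) for old, new in VARSCAN_FIELDS.items()}


# Made-up example values, for illustration only.
print(mutect2_tot_depth("40,10"))
print(mutect2_qual("1200,310", "40,10"))
print(varscan_qual("32", "28"))
print(standardise_varscan(
    {"RD": "40", "AD": "10", "RDF": "22", "RDR": "18", "ADF": "6", "ADR": "4"}
))
```

Keeping each caller's conversion in its own small function mirrors the per-caller descriptions in the list above and makes it easy to check that every caller ends up with the same set of standardised column names.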

If the VCF file was also annotated using the Variant Effect Predictor (VEP) [@McLaren2016-lv], all the other fields are exactly as generated by VEP: Allele, Consequence, IMPACT, SYMBOL, Gene, Feature_type, Feature, BIOTYPE, EXON, INTRON, HGVSc, HGVSp, cDNA_position, CDS_position, Protein_position, Amino_acids, Codons, Existing_variation, DISTANCE, STRAND, FLAGS, VARIANT_CLASS, SYMBOL_SOURCE, HGNC_ID, CANONICAL, TSL, APPRIS, CCDS, ENSP, SWISSPROT, TREMBL, UNIPARC, GENE_PHENO, SIFT, PolyPhen, DOMAINS, AF, AFR_AF, AMR_AF, EAS_AF, EUR_AF, SAS_AF, AA_AF, EA_AF, ExAC_AF, ExAC_Adj_AF, ExAC_AFR_AF, ExAC_AMR_AF, ExAC_EAS_AF, ExAC_FIN_AF, ExAC_NFE_AF, ExAC_OTH_AF, ExAC_SAS_AF, MAX_AF, MAX_AF_POPS, CLIN_SIG, SOMATIC, PHENO, PUBMED, MOTIF_NAME, MOTIF_POS, HIGH_INF_POS, MOTIF_SCORE_CHANGE, SampleName, IMPACT_rank. Visit the VEP page for more information: https://asia.ensembl.org/info/docs/tools/vep/index.html.