R/combine_and_filter.R
combine_and_filter.Rd
This function takes a data frame of variants with specific columns and returns a filtered set of variants joined with each sample's clinical information. Samples without variants are also returned, with default values for the variant information.
combine_and_filter(variants = NULL, patientID = NULL, studyGenes = NULL, minQual = 20, clinicalData = NULL, min_vaf = 0.15, min_alt = 2, keep_impact = c("HIGH", "MODERATE"), variant_type = "indels-vardict")
variants | a data frame where every row is a variant for one sample at a specific time point. The variants can come from any caller, but the input should be standardised to have the following columns: 'SampleName', 'PID', 'Time', 'chrom', 'pos', 'alt', 'ref', 'ref_depth', 'alt_depth' and (gene) 'SYMBOL' (see more information in Details). The columns `Consequence` and `IMPACT` (as annotated by Variant Effect Predictor (VEP) https://asia.ensembl.org/info/genome/variation/prediction/predicted_data.html) are filled with default values if not found. If VEP `Consequence` is not available, the column can be populated with any other information, such as the exon number or an INDEL/SNV label, to add detail to each mutation (useful for plotting purposes). The `SampleName` column is unique for every sequencing sample, while `PID` is unique for every patient. |
---|---|
patientID | a character vector specifying the ID(s) of the patient(s) for which variants have to be imported. |
studyGenes | genes of interest. If none provided all genes will be used. |
minQual | minimum quality for a variant to be kept. |
clinicalData | clinical data about the patients in the cohort. It has to contain a column `SampleName`. |
min_vaf | numeric. Minimum variant allele frequency (VAF) for a variant to be kept at one time point. |
min_alt | numeric. Minimum number of reads supporting the alt allele at one time point for a patient. |
keep_impact | vector specifying the IMPACT values used to select variants. Allowed values are HIGH, MODERATE, LOW and MODIFIER (https://asia.ensembl.org/info/genome/variation/prediction/predicted_data.html). IMPACT should be a column of `variants`. If it is not found, all variants are kept. |
variant_type | Label for the type of variants imported, e.g. indels-vardict (the default). |
This function keeps only the variants for `patientID` that are found on `studyGenes` and have a quality of at least `minQual`. If a sample has no variants, only its clinical information is returned, with default values for the variant information. If some variants are found at some time points but not at others, the missing time points are populated with default (0) values for VAF, ref_depth and alt_depth, so that changes over time can be plotted consistently.
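The initial selection described above can be pictured with a small base R sketch. This is not the package's implementation: the toy data, the quality column name `qual` (taken from the example data on this page) and the use of `>=` for the quality threshold are assumptions made for illustration.

```r
# Toy input; column names follow the documented `variants` format.
# The `qual` column name is an assumption borrowed from the example data.
variants <- data.frame(
  PID    = c("D1", "D1", "D2"),
  SYMBOL = c("BCL2", "TP53", "BCL2"),
  qual   = c(49, 10, 60),
  stringsAsFactors = FALSE
)
patientID  <- "D1"
studyGenes <- "BCL2"
minQual    <- 20

# Keep rows for the requested patient(s), on the genes of interest,
# and with a quality of at least minQual (assumed inclusive threshold).
kept <- variants[variants$PID %in% patientID &
                   variants$SYMBOL %in% studyGenes &
                   variants$qual >= minQual, ]
```

Only the first row survives here: the second fails both the gene and the quality checks, and the third belongs to a different patient.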
More details about the `variants` input:
- The `Time` column can be defined in any way and should reflect the time of sample collection. For example, it could be defined as Time0, Time1, Time2, etc.
- If the column `IMPACT` is not found, it will be filled with NAs and no variants will be filtered out. Otherwise, the values of the column are checked and, if they are within the expected values (HIGH, MODERATE, LOW or MODIFIER), only variants with `keep_impact` entries are kept. If a mutation appears twice with different `IMPACT` values, only the most damaging one is kept.
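One way to picture the `IMPACT` handling is the base R sketch below. It is illustrative only: the toy data, the key used to identify a mutation (sample, position and alleles) and the order of the two steps (de-duplicate first, then filter) are assumptions, not the package's confirmed behaviour. The severity order HIGH > MODERATE > LOW > MODIFIER follows VEP's convention.

```r
# VEP impact levels, ordered from most to least damaging.
impact_levels <- c("HIGH", "MODERATE", "LOW", "MODIFIER")

# Toy data: the chr1:10 mutation is annotated twice with different impacts.
variants <- data.frame(
  SampleName = c("S1", "S1", "S1"),
  chrom      = c("chr1", "chr1", "chr2"),
  pos        = c(10, 10, 30),
  ref        = c("A", "A", "A"),
  alt        = c("ACT", "ACT", "AGG"),
  IMPACT     = c("MODERATE", "HIGH", "LOW"),
  stringsAsFactors = FALSE
)
keep_impact <- c("HIGH", "MODERATE")

# Rank each row by severity (1 = most damaging) and sort by that rank.
variants$impact_rank <- match(variants$IMPACT, impact_levels)
variants <- variants[order(variants$impact_rank), ]

# Hypothetical mutation key; the first occurrence after sorting is the most damaging.
key   <- paste(variants$SampleName, variants$chrom, variants$pos,
               variants$ref, variants$alt)
dedup <- variants[!duplicated(key), ]

# Keep only the requested impact classes.
filtered <- dedup[dedup$IMPACT %in% keep_impact, ]
```

In this toy case the chr1:10 mutation is kept once with `IMPACT` HIGH, and the LOW variant on chr2 is dropped.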
The variants are then merged with the clinical information of `patientID`. This step is needed so that, if no variants are returned at one time point for a `patientID`, default entries for Variant Allele Frequency (VAF), reference depth and alternative depth are created. The default value is 0 for all of the above. A variant is reported for a patient only if, at one or more time points, its VAF is >= `min_vaf` and the total depth is >= 10.
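The reporting rule and the zero-filling step can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the function's actual code: VAF is assumed to be `alt_depth / (ref_depth + alt_depth)`, the VAF and depth conditions are assumed to apply at the same time point, and the `mutation` identifier column is hypothetical.

```r
# Toy data: one mutation seen at two of three time points.
variants <- data.frame(
  SampleName = c("D1.Screen", "D1.Cyc1"),
  mutation   = "chr1_10_A_ACT",          # hypothetical mutation identifier
  ref_depth  = c(12, 9),
  alt_depth  = c(2, 20),
  stringsAsFactors = FALSE
)
clinical <- data.frame(
  SampleName = c("D1.Screen", "D1.Cyc1", "D1.Cyc2"),
  Time       = c("Screen", "Cyc1", "Cyc2"),
  stringsAsFactors = FALSE
)
min_vaf <- 0.15

# Assumed definitions of total depth and VAF.
variants$tot_depth <- variants$ref_depth + variants$alt_depth
variants$VAF       <- variants$alt_depth / variants$tot_depth

# Report a mutation if it passes both thresholds at any time point.
pass     <- variants$VAF >= min_vaf & variants$tot_depth >= 10
keep_mut <- unique(variants$mutation[pass])
variants <- variants[variants$mutation %in% keep_mut, ]

# Build every sample x mutation combination, then fill the gaps with 0.
grid <- expand.grid(SampleName = clinical$SampleName, mutation = keep_mut,
                    stringsAsFactors = FALSE)
full <- merge(grid, variants, by = c("SampleName", "mutation"), all.x = TRUE)
for (col in c("ref_depth", "alt_depth", "tot_depth", "VAF")) {
  full[[col]][is.na(full[[col]])] <- 0
}

# Attach the clinical information for each sample.
full <- merge(full, clinical, by = "SampleName")
```

The mutation fails the VAF threshold at Screen (2/14) but passes at Cyc1 (20/29), so it is reported at all three time points, with zeros at Cyc2 where it was not observed.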
indels <- data.frame(
  PID = rep("D1", 9),
  Outcome = "Rel",
  SampleName = c("D1.Screen.Rel", "D1.Screen.Rel", "D1.Screen.Rel",
                 "D1.Cyc1.Rel", "D1.Cyc1.Rel", "D1.Cyc1.Rel",
                 "D1.Cyc2.Rel", "D1.Cyc2.Rel", "D1.Cyc2.Rel"),
  Time = c(rep("Screen", 3), rep("Cyc1", 3), rep("Cyc2", 3)),
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr1", "chr1", "chr2", "chr1", "chr2"),
  pos = c(10, 20, 30, 10, 20, 30, 10, 20, 30),
  alt = c("ACT", "ACGTCG", "AGG", "ACT", "ACGTCG", "AGG", "ACT", "ACGTCG", "AGG"),
  ref = c("A", "G", "A", "A", "G", "AGG", "A", "G", "A"),
  ref_depth = c(12, 11, 9, 9, 12, 8, 13, 14, 8),
  alt_depth = c(2, 20, 10, 2, 10, 15, 20, 20, 100),
  SYMBOL = "BCL2",
  Consequence = rep(c("exon1", "exon1", "exon2"), times = 3),
  qual = 49
)

clinicalData <- data.frame(
  SampleName = c("D1.Screen.Rel", "D1.Cyc1.Rel", "D1.Cyc2.Rel", "D1.Cyc3.Rel"),
  PID = "D1",
  AgeDiagnosis = 65,
  Time = c("Screen", "Cyc1", "Cyc2", "Cyc3"),
  Sex = "F",
  BlastPerc = c(80, 5, 7, 40)
)

import_indels <- combine_and_filter(variants = indels,
                                    patientID = "D1",
                                    studyGenes = "BCL2",
                                    minQual = 20,
                                    clinicalData = clinicalData)
#> IMPACT is missing and will be filled with NAs.
#> Error: Can't find column `variant_type` in `.data`.