Variant Annotation

Introduction

Variant annotation is the process of assigning functional information to variants (for example, the coding and genetic consequences of each variant), and it is a crucial step in genomic sequence analysis. The results of the annotation matter because they can directly influence the conclusions reached in disease studies.

The Variant Annotation tool in OmicsBox uses the Ensembl Variant Effect Predictor (VEP) to retrieve information about the variants present in the input VCF file. In addition to determining the effect of each variant on genes, transcripts and/or protein sequences with VEP, the tool also computes the transition/transversion ratio and other population genetics statistics. To run it, you only need to provide your VCF file, a reference genome in FASTA format and an annotation file in GTF or GFF format.

Please cite VEP using:
McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R., Thormann, A., ... & Cunningham, F. (2016). The Ensembl Variant Effect Predictor. Genome Biology, 17(1), 1-14.

Run VEP for Variant Annotation

This tool can be found under Genetic Variation → Variant Annotation. The wizard consists of just one page, where the inputs are provided.

Input

VCF File: VCF file created in the Variant Calling step. This VCF might be filtered in the Variant Filtering step.
Genome File: FASTA file with the reference genome.
Annotation File: both GTF and GFF formats are accepted.

Make sure that the reference genome and the reference annotation correspond to the same version. In addition, the genome file provided here must be the same one used in the Variant Calling step.
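One quick way to catch a mismatch between the three inputs is to compare sequence names. The sketch below is not part of OmicsBox; the function names are hypothetical, and it assumes plain-text (uncompressed) files whose first tab-separated column holds the chromosome name, as in VCF and GTF/GFF. Matching names do not prove that the versions agree, but a missing name is a clear sign that they do not.

```python
def fasta_contigs(lines):
    """Return the sequence names found in FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def tabular_contigs(lines):
    """Return the first column of every non-comment line (VCF or GTF/GFF)."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def missing_contigs(genome_lines, vcf_lines, annotation_lines):
    """List the names used in the VCF/annotation that the genome lacks."""
    genome = fasta_contigs(genome_lines)
    return {
        "vcf": sorted(tabular_contigs(vcf_lines) - genome),
        "annotation": sorted(tabular_contigs(annotation_lines) - genome),
    }


# Tiny in-memory example; with real files, pass open file handles instead.
genome = [">chr1 assembly v2", "ACGT", ">chr2", "GGCC"]
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t10\t.\tA\tG",
    "chr3\t5\t.\tC\tT",
]
gtf = ['chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";']
print(missing_contigs(genome, vcf, gtf))
```

In this example the VCF refers to `chr3`, which is absent from the genome, so it is reported under the `vcf` key.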

Results

The Variant Annotation tool has the following outputs:

Table listing the information of all variants that have been annotated.
Summary Report with different types of information: general information, and the coding and genic consequences of the variants.
Type of Variation Distribution in a pie chart in order to have a quick outline of the annotated variants.
Quality-control Charts. These charts can be displayed using the sidebar buttons, and they help ensure that the variant dataset does not contain anomalies (a chromosome with a significantly higher number of variants, very long indels, etcetera).

Table

Table with information about all variants. This table has the following columns:
- Type of Variation: according to Ensembl, this column can take one of the following values:
  - SNP: Single Nucleotide Polymorphism. A change of one nucleotide.
  - Substitution: a sequence alteration where the length of the change in the variant is the same as that of the reference.
  - Insertion: addition of one or several nucleotides.
  - Deletion: removal of one or several nucleotides.
  - Indel: an insertion and a deletion, affecting 2 or more nucleotides.
  - Other: structural variation, etcetera.
- Chromosome/Scaffold: chromosome where the variant is located according to the VCF file.
- Location: 1-based position inside the chromosome.
- Reference: nucleotide or sequence that appears in the reference genome.
- Variation: nucleotide or sequence that appears in the VCF as the variant.
- Gene(s): genes affected by the variant. If more than one gene is affected, the genes are listed separated by semicolons.
- Pi: measures nucleotide divergence among all samples at that position. It is calculated as the average proportion of nucleotide differences between all pairs of sequences within a population. A higher value of Pi indicates a higher level of genetic diversity within a population.
- HWE p-value: reports a p-value for each site from a Hardy-Weinberg Equilibrium test. The Hardy-Weinberg equilibrium is a theoretical state in which the frequency of alleles and genotypes in a population remains constant from generation to generation in the absence of any genetic or environmental influences that can affect the distribution of alleles.
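To make the last two columns more concrete, the following sketch computes both statistics for a single site. It is only an illustration, not the code OmicsBox runs: the function names are hypothetical, Pi is computed directly from its definition above (proportion of differing pairs), and the Hardy-Weinberg p-value uses a chi-square goodness-of-fit approximation with one degree of freedom, which may differ from the test the tool actually applies (exact tests are common for this purpose).

```python
import math
from collections import Counter


def site_pi(alleles):
    """Proportion of allele pairs that differ at one site.

    `alleles` lists one allele per sampled chromosome, e.g. ["A", "A", "G"].
    """
    n = len(alleles)
    if n < 2:
        return 0.0
    counts = Counter(alleles)
    # Ordered pairs that differ = n^2 - sum(n_i^2); ordered pairs = n(n-1).
    differing = n * n - sum(c * c for c in counts.values())
    return differing / (n * (n - 1))


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 d.f.) Hardy-Weinberg p-value for a biallelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


print(site_pi(["A", "A", "G", "G"]))  # 4 of the 6 pairs differ
print(hwe_chi2_pvalue(25, 50, 25))    # genotype counts match expectation
print(hwe_chi2_pvalue(50, 0, 50))     # no heterozygotes at all
```

A site whose genotype counts match the Hardy-Weinberg expectation gets a p-value close to 1, whereas a strong excess or deficit of heterozygotes drives the p-value towards 0.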

Summary Report

The summary report has three main parts:

A general outline with a summary of the variant annotation job and the variant classes that were found in the VCF file. The job summary shows information about the process of annotation itself:
Number of different variants that the VCF file contains.
Proportion of novel variants versus existing ones (those already registered under some ID).
Number of overlapped genomic features (genes, regulatory features and transcripts).
Variants that were not annotated (filtered out).
Number of variant sites that were actually processed.

Although the first and the last entries of this table seem equivalent, they are not. The first entry counts the lines in the VCF file, excluding the header, whereas the last entry counts only the variants that could be mapped to the annotation file.

Variant classes. Refers to the types of variants used in the variant annotation process. These numbers must match the ones shown in the distribution pie chart.

The total number in this table might be higher than the number of ‘Variants processed’ in the Job Summary table as some sites might have multiallelic variants, that is to say, sites with more than one variation.
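The difference between counting sites and counting variant classes can be reproduced with a few lines of code. The sketch below is an assumption-laden simplification, not the logic VEP or OmicsBox uses: the function names are hypothetical, and the classification rules simply compare the lengths of the REF and ALT strings as they are written in the VCF.

```python
from collections import Counter


def classify_allele(ref, alt):
    """Assign one alternative allele to a variant class (simplified rules)."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "Other"  # symbolic alleles, breakends, spanning deletions
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "Substitution"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "Insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "Deletion"
    return "Indel"


def count_classes(vcf_lines):
    """Return (number of sites, Counter of classes over all ALT alleles)."""
    sites = 0
    classes = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        sites += 1
        for alt in alts:
            if alt != ".":  # "." means no alternative allele
                classes[classify_allele(ref, alt)] += 1
    return sites, classes


vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t10\t.\tA\tG",
    "chr1\t20\t.\tA\tG,AT",  # multiallelic site: a SNP and an insertion
    "chr1\t30\t.\tACT\tA",
]
sites, classes = count_classes(vcf)
print(sites, sum(classes.values()), dict(classes))
```

Here three sites yield four classified variants, because the multiallelic site at position 20 contributes both a SNP and an insertion.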

**Figure 3.** General Overview of the Variant Annotation Job

An overview of the effects of the variants both at the genetic and the coding level. In the case of the genetic consequences, the percentage of severe consequences is also disclosed. The most common severe consequences are:
Stop Gained: a sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript.
Stop Lost: a sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript.
Frameshift: variant that causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three.
Splice Donor or Splice Acceptor: variant in a splice donor or acceptor site that changes the outcome of the splicing event.
Initiator Codon: variant that changes at least one base of the initiator (start) codon.
Stop Retained Variant: a sequence variant where at least one base in the terminator codon is changed, but the terminator remains.
Missense Variant: variant that changes one amino acid in the protein but the length is preserved.
Inframe Insertion: an inframe non-synonymous variant that inserts bases into the coding sequence.
Inframe Deletion: an inframe non-synonymous variant that deletes bases from the coding sequence.
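The distinction between frameshift and inframe consequences in the list above comes down to whether the net length change is a multiple of three. A minimal sketch, assuming the whole change lies inside a coding sequence (the function name is hypothetical, and VEP additionally takes the transcript structure into account):

```python
def indel_frame_effect(ref, alt):
    """Classify a coding insertion/deletion by its effect on the reading frame."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        return "no length change"
    # The reading frame is preserved only when whole codons are added or removed.
    return "inframe" if length_change % 3 == 0 else "frameshift"


print(indel_frame_effect("A", "ACGT"))  # three bases inserted
print(indel_frame_effect("ACG", "A"))   # two bases deleted
```

The first call adds exactly one codon and is therefore inframe, whereas the second removes two bases and shifts the reading frame.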

**Figure 4.** Genetic Consequences of the Variants

**Figure 5.** Coding Consequences of the Variants

Population genetics statistics such as the transition/transversion ratio (Ts/Tv ratio) and a table with the inbreeding coefficient.
The Ts/Tv ratio can be used to check whether the analysed samples behave like a typical population, provided that you know the expected Ts/Tv ratio for the species being studied.
The inbreeding coefficient (F) of a sample is the probability that two alleles at any locus in that individual are identical by descent from the common ancestor(s) of the two parents. F stands for fixation index, because of the increase in homozygosity, or fixation, that results from inbreeding. If this coefficient is negative, the number of observed homozygous sites is smaller than the number of expected homozygous sites. If it is positive, there are more homozygous sites than expected.
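Both statistics can be illustrated with a short sketch. The function names are hypothetical and the formulas are the textbook ones, which are not necessarily the exact estimators implemented in the tool: transitions are purine-to-purine or pyrimidine-to-pyrimidine changes, and F is estimated as (O - E) / (N - E), where O is the number of observed homozygous sites, E the number expected under Hardy-Weinberg proportions and N the number of sites.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_ratio(snps):
    """Ts/Tv ratio for an iterable of (ref, alt) single-nucleotide changes."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in "ACGT" or alt not in "ACGT":
            continue  # skip non-variants and ambiguous bases
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


def inbreeding_coefficient(is_homozygous, alt_frequencies):
    """Estimate F for one sample from per-site data.

    is_homozygous: one boolean per site for this sample.
    alt_frequencies: population alternative-allele frequency per site.
    """
    n_sites = len(is_homozygous)
    observed = sum(is_homozygous)
    # Expected homozygosity at a biallelic site is 1 - 2pq.
    expected = sum(1 - 2 * p * (1 - p) for p in alt_frequencies)
    if n_sites == expected:
        return None  # no polymorphic sites: F is undefined
    return (observed - expected) / (n_sites - expected)


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))
print(inbreeding_coefficient([True, True, True, False], [0.5] * 4))
print(inbreeding_coefficient([True, False, False, False], [0.5] * 4))
```

In the example, a sample that is homozygous at three of four sites (two expected) gets a positive F, and a sample that is homozygous at only one site gets a negative F, in line with the interpretation given above.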

**Figure 6.** Population Genetic Results

Type of Variation Distribution

In this pie chart you can see all the variants that have been annotated. Note that different variants, for example a SNP and an indel, can be found at the same position; in that case, both variants are counted in the pie chart.

**Figure 7.** Pie Chart with the Types of Variants

Quality-control Charts

There are three different quality-control charts:

Distribution of Indel Lengths: it might follow a normal distribution.

**Figure 8.** Distribution of Indel Lengths

Variants per Chromosome: this histogram should be even for all chromosomes. That means that all chromosomes have approximately the same number of variants.

**Figure 9.** Distribution of Variants per Chromosome

Position in Protein: this histogram should also be even, as that will mean that variants that are in coding regions distribute equally.

**Figure 10.** Distribution of Coding Variants in Proteins
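The first two charts can be approximated outside the tool with a simple tally, which may help when investigating an anomaly. This is a sketch under stated assumptions rather than the tool's implementation: the function names are hypothetical, each record is a (chromosome, REF, ALT) tuple already extracted from the VCF, and the outlier rule (more than three times the median count) is an arbitrary threshold chosen for illustration. It also ignores chromosome length, which naturally affects the number of variants.

```python
from collections import Counter
from statistics import median


def summarise_variants(records):
    """Tally variants per chromosome and signed indel lengths."""
    per_chromosome = Counter()
    indel_lengths = Counter()
    for chrom, ref, alt in records:
        per_chromosome[chrom] += 1
        length_change = len(alt) - len(ref)  # > 0 insertion, < 0 deletion
        if length_change != 0:
            indel_lengths[length_change] += 1
    return per_chromosome, indel_lengths


def flag_outliers(per_chromosome, factor=3.0):
    """Chromosomes with more than `factor` times the median variant count."""
    if not per_chromosome:
        return []
    typical = median(per_chromosome.values())
    return sorted(c for c, n in per_chromosome.items() if n > factor * typical)


records = [("chr1", "A", "G"), ("chr1", "AT", "A"), ("chr2", "C", "CGG")]
per_chromosome, indel_lengths = summarise_variants(records)
print(dict(per_chromosome), dict(indel_lengths))
print(flag_outliers(Counter({"chr1": 10, "chr2": 12, "chr3": 50})))
```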

Detailed Information

If you want more information about a variant, right-click the row of that variant and select "Show Annotation Details". A report will then open with one table per gene affected by that variant. Each table has one row per gene feature affected by that variant, and the following columns:

Feature ID: name (or ID) of the gene feature affected by that variant.
Feature Type: class of gene feature. It can be: transcript, regulatory feature or motif feature.
Consequence: effect of that variant. It will be one of the consequences listed in the summary report.
cDNA Position: relative position of base pair in cDNA sequence. It will only appear if that variant falls inside a region that can be transcribed.
CDS Position: relative position in coding sequence. It will only appear if that variant is inside a region that is translated.
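Protein Position: relative position in the protein sequence. As with the CDS position, it will only appear if that variant is inside a region that is translated.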
Amino Acids: amino acid change. Only given if the variant affects the protein-coding sequence.
Codons: the alternative codons with the variant base in upper case.
Impact: the impact modifier for the consequence type.
Distance: shortest distance to transcript.
Strand: 1 for positive strand and -1 for negative strand.
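As an illustration of how the Codons, Amino Acids and Strand columns fit together, the sketch below decodes one row. It assumes the usual VEP notation for a single-base change, such as `aCg/aTg` for the codons and `T/M` for the amino acids, and assumes that a synonymous change lists a single amino acid; the function name and the returned keys are hypothetical. Insertions and deletions use other notations and are not handled here.

```python
def describe_codon_change(codons, amino_acids, strand):
    """Decode the Codons, Amino Acids and Strand values of one row."""
    ref_codon, alt_codon = codons.split("/")
    # The variant base is written in upper case; report its 1-based position.
    changed = [i + 1 for i, base in enumerate(alt_codon) if base.isupper()]
    ref_aa, _, alt_aa = amino_acids.partition("/")
    return {
        "ref_codon": ref_codon.upper(),
        "alt_codon": alt_codon.upper(),
        "changed_positions": changed,
        "ref_aa": ref_aa,
        "alt_aa": alt_aa or ref_aa,  # a single letter means no change
        "strand": "+" if strand == 1 else "-",
    }


print(describe_codon_change("aCg/aTg", "T/M", 1))
```

For this row, the second base of the codon changes (ACG to ATG), replacing threonine with methionine in a transcript on the positive strand.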