Gene Ontology Annotation

GO Annotation options

Introduction

GO annotation is the process of selecting GO terms from the GO pool obtained in the Mapping step and assigning them to the query sequences. In the current OmicsBox version, this is the core type of functional annotation.

GO annotation is carried out by applying an annotation rule (AR) on the found ontology terms. The rule seeks to find the most specific annotations with a certain level of reliability. This process is adjustable in specificity and stringency.

For each candidate GO an annotation score (AS) is computed. The AS is composed of two additive terms.

The first, the direct term (DT), represents the highest hit similarity of this GO weighted by a factor corresponding to its evidence code (EC).

The second term (AT) of the AS provides the possibility of abstraction. Abstraction is defined as annotation to a parent node when several child nodes are present in the GO candidate collection. This term multiplies the total number of GOs unified at the node by a user-defined GO weight factor that controls the possibility and strength of abstraction. When the GO weight is set to 0, no abstraction is done.

Finally, the AR selects the lowest term per branch that lies over a user-defined threshold. DT, AT, and the AR terms are defined as given in figure 1.

To better understand how the annotation score works, consider the following reasoning. When the EC weight is set to 1 for all ECs (no EC influence) and the GO weight equals 0 (no abstraction), the annotation score equals the maximum similarity value of the hits that have that GO term, and the sequence will be annotated with that GO term if the score is above the given threshold. When EC weights are lower than 1, higher similarities are required to reach the threshold. When the GO weight is different from 0, a parent node can reach the threshold even if its individual child nodes do not.
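The reasoning above can be sketched in a few lines of Python. This is only an illustration of the description in this section, not the OmicsBox implementation: the names `Hit`, `annotation_score` and `passes_threshold` are hypothetical, the weight values are made up for the example, and it assumes that the EC weight is applied per hit before taking the maximum and that a score equal to the threshold passes. The exact definitions are the ones given in figure 1.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    similarity: float   # hit similarity in percent (0-100)
    evidence_code: str  # evidence code of the GO term on this hit


def direct_term(hits, ec_weights):
    """DT: highest hit similarity for the GO, weighted by the EC weight.

    Assumption: the weight is applied to each hit before taking the maximum.
    """
    return max(h.similarity * ec_weights[h.evidence_code] for h in hits)


def abstraction_term(n_unified_gos, go_weight):
    """AT: number of GOs unified at the node times the GO weight."""
    return n_unified_gos * go_weight


def annotation_score(hits, ec_weights, n_unified_gos, go_weight):
    """AS = DT + AT, the two additive terms described in the text."""
    return direct_term(hits, ec_weights) + abstraction_term(n_unified_gos, go_weight)


def passes_threshold(score, threshold=55):
    # Assumption: a score equal to the threshold is accepted.
    return score >= threshold


# Case 1: all EC weights are 1 and the GO weight is 0, so AS is the maximum similarity.
neutral = {"IDA": 1.0, "IEA": 1.0}
hits = [Hit(80, "IEA"), Hit(60, "IDA")]
score_neutral = annotation_score(hits, neutral, n_unified_gos=0, go_weight=0)

# Case 2: a parent node whose DT alone (50) is below the threshold,
# but which unifies 3 GOs with a GO weight of 5.
score_parent = annotation_score([Hit(50, "IDA")], neutral, n_unified_gos=3, go_weight=5)
```

In the first case the score is simply the best similarity, and in the second the abstraction term lifts a parent node over the default threshold of 55.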

The annotation rule provides a general framework for annotation. The actual way annotation occurs depends on how the different parameters of the AS are set. These can be adjusted in the Annotation Configuration Dialog (figure 2) and in the Evidence Code Weight Configuration Dialog (figure 3).

Please cite:

Gotz S., Garcia-Gomez JM., Terol J., Williams TD., Nagaraj SH., Nueda MJ., Robles M., Talon M., Dopazo J. and Conesa A. (2008). High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids research, 36(10), 3420-35.

Run GO Annotation

The GO Annotation functionality can be found under the Side Panel → GO Annotation after loading a fasta file or a project.

Annotation Configuration

Annotation Cut-Off (threshold): The annotation rule selects the lowest term per branch that lies over this threshold (default=55).
GO-Weight: This is the weight given to the contribution of mapped children terms to the annotation of a parent term (default=5).
Filter GO by taxonomy: The filter will remove the Gene Ontology terms known not to be in the given taxonomy using the restrictions defined by Gene Ontology. You can select one of the given options or simply write a taxonomy id.
E-Value-Hit-Filter: This value can be understood as a pre-filter: only GO terms obtained from hits with an e-value lower than the given value will be used for annotation and/or shown in a generated graph (default=1.0E-6). The value to use will depend on how restrictive or permissive the annotation should be.
Hsp-HitCoverage CutOff: Sets the minimum required coverage between a hit and its HSP. For example, a value of 80 means that the aligned HSP must cover at least 80% of the length of its hit. Only annotations from hits fulfilling this criterion will be considered for annotation transfer.
Hit Filter: This option allows you to consider only the first N hits during annotation. It works in combination with the "Only hits with GOs" feature.
Only hits with GOs: Together with the "Hit Filter" option, this option restricts the first N hits to those that have a GO term candidate.
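How the hit-level options interact can be illustrated with a small sketch. The `BlastHit` class and the `prefilter_hits` function are hypothetical names made up for this example, and the order in which the filters are applied is an assumption rather than documented OmicsBox behaviour.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    e_value: float
    hit_length: int      # full length of the hit sequence
    hsp_length: int      # length of the aligned HSP on the hit
    go_ids: tuple = ()   # GO term candidates mapped to this hit


def prefilter_hits(hits, max_e_value=1.0e-6, min_coverage=0.0,
                   first_n=None, only_with_gos=False):
    """Return the hits whose GO terms would be considered for annotation."""
    candidates = list(hits)
    if only_with_gos:
        # Count the first N hits only among hits that carry GO candidates.
        candidates = [h for h in candidates if h.go_ids]
    if first_n is not None:
        candidates = candidates[:first_n]
    kept = []
    for hit in candidates:
        coverage = 100.0 * hit.hsp_length / hit.hit_length
        if hit.e_value <= max_e_value and coverage >= min_coverage:
            kept.append(hit)
    return kept


hits = [
    BlastHit(1e-50, 200, 180, ("GO:0005524",)),  # coverage 90%
    BlastHit(1e-20, 300, 150),                   # coverage 50%, no GOs
    BlastHit(1e-10, 100, 85, ("GO:0016301",)),   # coverage 85%
    BlastHit(1e-3, 100, 95, ("GO:0003677",)),    # fails the e-value filter
]

by_coverage = prefilter_hits(hits, min_coverage=80)
first_two = prefilter_hits(hits, first_n=2)
first_two_with_gos = prefilter_hits(hits, first_n=2, only_with_gos=True)
```

With "Only hits with GOs" enabled, the second hit no longer takes up one of the N slots, so the third hit is considered instead.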

Evidence Code Weights

Employing ECs promotes the assignment of annotations with experimental evidence and penalizes electronic annotations and annotations with low traceability.

EC weights can be modified as needed. If no influence of evidence codes is wanted, set all weights to 1. Alternatively, to exclude GO annotations with a certain EC (for example IEA), set the weight of that EC to 0.

**Figure 3:** Evidence Code weight configuration
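The effect of an EC weight on the similarity needed to reach the cut-off can be estimated with a short calculation. The sketch below assumes no abstraction (GO weight 0), so that the score is just similarity times EC weight. The function names are hypothetical and the weight values are illustrative only, not the OmicsBox defaults.

```python
import math

# A few GO evidence codes. The weights below are illustrative, not defaults.
EVIDENCE_CODES = ["IDA", "IMP", "ISS", "TAS", "NAS", "IEA"]


def neutral_weights(codes):
    """All weights set to 1: evidence codes have no influence."""
    return {code: 1.0 for code in codes}


def exclude(weights, code):
    """Return a copy of the weights with one evidence code set to 0."""
    updated = dict(weights)
    updated[code] = 0.0
    return updated


def required_similarity(threshold, ec_weight):
    """Similarity (in percent) needed to reach the threshold without abstraction."""
    if ec_weight == 0:
        return math.inf  # annotations with this EC can never reach the threshold
    return threshold / ec_weight


weights = exclude(neutral_weights(EVIDENCE_CODES), "IEA")
needed_ida = required_similarity(55, weights["IDA"])
needed_iea = required_similarity(55, weights["IEA"])
needed_half = required_similarity(55, 0.5)
```

A required similarity above 100, as for a weight of 0.5 with a cut-off of 55, means that such an annotation can only be reached through abstraction.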

Results

Successful annotation for each query sequence will result in a color change for that sequence from light-green to blue at the Main Sequence Table, and only the annotated GOs will remain in the GO IDs column.

Result Table

Selection CheckBox: This checkbox can be used to filter or select the table and to apply actions (extract data, generate charts) only to the selected part of the table.
Nr: A consecutive number for each row.
Tags: Depending on the status of a given sequence, the row will show different tags such as BLASTED, INTERPRO, MAPPED, ANNOTATED or GOSLIM.
SeqName: The unique name of the sequence. Duplicates are not allowed.
Description: The description line of a sequence. This description will be imported from the fasta file and can be overwritten during the annotation process or manually.
Length: The length of the sequence in residues. These can be amino acids or nucleotides depending on the type of the sequence.
#Hits (blast related): The number of hits obtained by blast.
e-Value (blast related): The lowest e-value obtained by blast.
sim-mean (blast related): The mean similarity obtained by blast.
#GO (mapping and annotation related): The number of gene ontology terms obtained during the mapping or annotation process.
GO IDs (mapping and annotation related): The gene ontology IDs obtained during the mapping or annotation process.
GO Names (mapping and annotation related): The gene ontology names obtained during the mapping or annotation process.
Enzyme Codes (annotation related): The enzyme codes linked to the GO terms of a given sequence.
Enzyme Names (annotation related): The enzyme code names linked to the GO terms of a given sequence.
InterPro IDs (interproscan related): The IDs obtained during the InterProScan step.
InterPro GO IDs (interproscan related): The GO IDs linked to the InterPro IDs obtained during the InterProScan step.
InterPro GO Names (interproscan related): The GO names linked to the InterPro IDs obtained during the InterProScan step.

Individual Annotation Results

Annotation results for each sequence can also be visualized on the GO DAG by selecting "Draw Graph of GO-Mapping with Annotation Score" in the context menu. Additionally, the "Change Annotation and Description" option of this menu (figure 4) makes it possible to adjust annotations for a single sequence.
This function edits the annotation of the selected sequence and allows annotations or the sequence description to be typed in or deleted. A manual annotation check-box (see figure 5) is available for marking sequences with manual annotation. Such a sequence gets a pink label in the Main Sequence Table.

**Figure 4:** Manually change Annotation and Description

Charts

GO Annotation Statistics

It is possible to summarise the number of sequences that have been annotated with the annotation rule and the following statistics are available:

Annotation Distribution: This chart informs about the number of GO terms assigned per sequence.
GO Annotation Level Distribution: A bar chart that shows all GO terms for all 3 categories for a given GO level taking into account the GO hierarchy (parent-child relationships).
Annotation Score Distribution: A chart that shows the number of sequences per annotation score.
Annotated Seqs/Seq-Length: This shows the relation between the amount of annotated sequences and sequence lengths.
Number of GOs/Seq-Length: This shows the relation between sequence length and the number of GOs.
GO Distribution by Level: A bar chart that shows all the GO terms for all 3 categories for GO level 2, taking into account the GO hierarchy.
Direct GO Count:
Molecular Function: A chart for the Molecular Function GO category, which shows the most frequent GO terms within a data-set without taking into account the GO hierarchy.
Biological Process: A chart for the Biological Process GO category.
Cellular Component: A chart for the Cellular Component GO category.

An overview of the extent and intensity of the annotation can be obtained from the Annotation Distribution Chart (figure 7), which shows the number of sequences annotated with different amounts of GO-terms.

EC-Code Statistics

To see the main enzyme classes in the dataset, it is possible to generate an Enzyme Code distribution chart.

Main Enzyme Classes: This shows the distribution of the 7 main enzyme classes across all sequences.
Second Level Classes: It is possible to create a distribution chart of the enzyme subclasses.

Export Annotation Results

The annotation results can be exported in a variety of formats. This function is available under Side Panel → Export → Export GO Annotations.

.annot: This is the default option for annotation export and the exchange annotation format in OmicsBox. Annotations are provided in three columns. The first column contains the sequence name, the second the annotation code and the third the sequence description. When multiple annotations for the same sequence are available, these come in subsequent rows. GO and EC annotations are exported jointly in the same format.
Custom: It is possible to customize the exported annotation file by choosing the information to include and the column separator (see figure 11).
Genespring format: One single row is given per sequence, with three different columns for Molecular Function, Biological Process, and Cellular Component. GO terms are denoted by their description rather than by their code.
GoStats format: One single row is given per sequence and GO terms are denoted only by integers (the "GO:" prefix and leading zeros are skipped).
WEGO format (native): One single row is given per sequence, including those without annotated GOs. The GOs belonging to each sequence are appended, separated by tabs. The format corresponds to the "WEGO native format", shown in this example:
http://wego.genomics.org.cn/docs/input01.lst.
Export Annotations in GO Annotation File Format (GAF v.2), which is the primary format currently used by the GO Consortium (see http://geneontology.org/page/go-annotation-file-formats).
Export GO Propagation: Exports the GO parents up to the root for the annotated sequences.
Export Sequences per GO (Gene Sets).

**Figure 10:** Export Annotation Configuration

**Figure 11:** Export Annotations Custom Configuration
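For downstream scripting, the three-column .annot layout and the GoStats numbering described above can be handled with a few lines of Python. This is a minimal sketch that assumes the columns are tab-separated and that the description column may be missing on some rows. The function names `parse_annot` and `to_gostats_id`, the sequence names and the descriptions are made up for the example.

```python
from collections import defaultdict


def parse_annot(lines, sep="\t"):
    """Group annotation codes (GO and EC) by sequence name.

    Assumption: columns are separated by tabs; the third column
    (sequence description) is optional and ignored here.
    """
    annotations = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split(sep)
        seq_name, code = fields[0], fields[1]
        annotations[seq_name].append(code)
    return dict(annotations)


def to_gostats_id(go_id):
    """Convert 'GO:0008150' to 8150 (prefix and leading zeros dropped)."""
    return int(go_id.split(":", 1)[1])


example = [
    "seq1\tGO:0008150\thypothetical protein",
    "seq1\tEC:2.7.11.1",
    "seq2\tGO:0005524\tanother hypothetical protein",
]

annotations = parse_annot(example)
gostats_ids = {
    name: [to_gostats_id(code) for code in codes if code.startswith("GO:")]
    for name, codes in annotations.items()
}
```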

Remove GO Annotation

Delete Annotation results for the selected sequences.

Merge EggNOG GOs

Once the sequences are annotated via EggNOG, it is possible to merge the GO terms and the EC codes (Enzyme Commission Codes) into a sequence project in order to add the new annotations. This can be done by clicking on Side Panel → GO Annotation → Merge EggNOG GOs (figure 12).

In the wizard, you have to select the EggNOG project that has the GO annotations to merge with the current project. If the sequences already have annotated GO terms and/or ECs, the new information generated from EggNOG will be added to the annotations found in the project.

In addition, you can filter the annotations by E-value or Bit-Score.

**Figure 12:** Merge EggNOG GO Annotations wizard.

Once finished, this step generates a bar chart showing the total number of GOs and ECs added to the original sequence project (figure 13).

**Figure 13:** Merge EggNOG GO Annotations graph.

Annotate GOs from Descriptions

This tool looks at every significant alignment (Right-Click → Show Blast Result on a sequence) for each sequence and searches the description lines for GO IDs. These GOs are then annotated directly to the sequence if the alignment's similarity passes the desired minimum. Validation can also be applied and is recommended, since it removes intermediate GO terms.
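The idea can be sketched as a regular-expression search over the hit descriptions. The function name `gos_from_descriptions`, the input layout (pairs of description and similarity) and the example descriptions are assumptions made for illustration. Only the GO ID format, "GO:" followed by seven digits, is standard.

```python
import re

# GO identifiers consist of the prefix "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def gos_from_descriptions(alignments, min_similarity):
    """Collect GO IDs from description lines of sufficiently similar alignments.

    `alignments` is an iterable of (description, similarity) pairs.
    The order of first appearance is kept and duplicates are dropped.
    """
    found = []
    for description, similarity in alignments:
        if similarity < min_similarity:
            continue  # alignment does not pass the desired minimum
        for go_id in GO_ID_PATTERN.findall(description):
            if go_id not in found:
                found.append(go_id)
    return found


alignments = [
    ("kinase-like protein GO:0005524 GO:0016301", 92.0),
    ("uncharacterized protein GO:0003677", 40.0),   # below the minimum
    ("ATP-binding protein GO:0005524", 75.0),       # duplicate GO ID
]

go_ids = gos_from_descriptions(alignments, min_similarity=60.0)
```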

There are still other annotation functions available in the submenu:

Other Annotation Functions in the Side Panel

Run EC-Code Mapping: This will map GO annotations to EC-Codes for fully annotated sequences. The mapping data is provided by the Gene Ontology Consortium.
Remove EC-Codes: This will remove the Enzyme Codes from the project.
Filter Annotation by GO Taxa
Validate Annotations. OmicsBox annotation generates the lowest node annotations. This is not always guaranteed when Annotations have been imported or changed manually. This function can be run to ensure that no parent-child redundancy is present in the annotated set.
Remove 1. Level Annotations
Annotate GOs from Blast Descriptions: Transfers GOs from the Blast hit descriptions to their sequences.
Compare GO Annotations: Compare a set of annotations for a given group of sequences against the annotations already loaded in OmicsBox.
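What "no parent-child redundancy" means for Validate Annotations can be shown on a toy hierarchy. The sketch below is not the OmicsBox implementation. The term names and the `parents` map are invented for the example, and a real check would use the parent relationships of the Gene Ontology itself. The same `ancestors` helper also illustrates what a propagation up to the root collects.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links up to the root."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def remove_redundant_parents(annotated, parents):
    """Keep only the lowest nodes: drop any term that is an ancestor
    of another annotated term."""
    annotated = set(annotated)
    redundant = set()
    for term in annotated:
        redundant |= ancestors(term, parents) & annotated
    return annotated - redundant


# Toy hierarchy (hypothetical): term_a is the root,
# term_b and term_d are its children, term_c is a child of term_b.
parents = {
    "term_a": set(),
    "term_b": {"term_a"},
    "term_c": {"term_b"},
    "term_d": {"term_a"},
}

validated = remove_redundant_parents({"term_a", "term_b", "term_c", "term_d"}, parents)
propagated = ancestors("term_c", parents)
```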