
Fisher Exact Test

Introduction

Fisher’s Exact Test (FET) can be used to find GO terms, or other annotations, that are over- or under-represented in a set of genes (test set) with respect to a reference group (reference set). The test set can be, for example, the differentially expressed genes from a differential expression analysis or a set of genes related to a phenotype of interest. Fisher’s Exact Test uses a contingency table to examine the association between two kinds of classification: here, whether a gene belongs to the test set or the reference set, and whether or not it carries a given annotation.

When the proportion of genes annotated with a given GO term in the test set is significantly higher than the proportion in the reference set, the GO term is reported as over-represented; when it is significantly lower, the term is reported as under-represented.
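
As an illustration of the contingency-table idea described above, the following minimal Python sketch computes one-sided Fisher's Exact Test p-values from the hypergeometric distribution, using only the standard library. The function names and the example counts are hypothetical, and the sketch is not the FatiGO or OmicsBox implementation; it only assumes that the test and reference sets do not overlap.

```python
from math import comb

def hypergeom_pmf(k, n_annot, n_total, n_test):
    """Probability of exactly k annotated genes in the test set under the null."""
    return comb(n_annot, k) * comb(n_total - n_annot, n_test - k) / comb(n_total, n_test)

def fisher_one_sided(nr_test, not_annot_test, nr_ref, not_annot_ref):
    """Return (p_over, p_under) for a 2x2 table built from disjoint sets.

    p_over:  probability of seeing at least nr_test annotated genes in the test set.
    p_under: probability of seeing at most nr_test annotated genes in the test set.
    """
    n_test = nr_test + not_annot_test
    n_total = n_test + nr_ref + not_annot_ref
    n_annot = nr_test + nr_ref
    k_min = max(0, n_test - (n_total - n_annot))  # smallest possible count
    k_max = min(n_test, n_annot)                  # largest possible count
    p_over = sum(hypergeom_pmf(k, n_annot, n_total, n_test)
                 for k in range(nr_test, k_max + 1))
    p_under = sum(hypergeom_pmf(k, n_annot, n_total, n_test)
                  for k in range(k_min, nr_test + 1))
    return p_over, p_under

# Made-up counts: 8 of 10 test genes carry the term, 2 of 10 reference genes do.
p_over, p_under = fisher_one_sided(8, 2, 2, 8)
print(p_over, p_under)  # p_over is 2126/184756, about 0.0115
```

A small p_over suggests over-representation in the test set and a small p_under suggests under-representation; either value would still need multiple testing correction before being interpreted.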

OmicsBox has integrated the FatiGO package for statistical assessment of annotation differences between two sets of sequences. This package uses Fisher's Exact Test and corrects for multiple testing. For this analysis, all of the involved sequences (though not necessarily only those), together with their annotations, must be loaded in the application. The annotations can either be the result of an OmicsBox annotation or be imported from a file (.annot); see the Gene Ontology Annotation section of this manual.

Run Fisher's Exact Test

This functionality can be found as a side panel button in the following tables:

If the FET analysis has been launched from an annotation project, Test and Reference Sequences can be selected by uploading text files or ID-List .box files containing the lists of sequence IDs for the two groups (Figure 1). When there is no reference set chosen, the whole dataset present in the project will be taken as Reference. A detailed description of each parameter is available by clicking the help icon next to the parameter.

If the Fisher’s Exact Test is applied to differential expression results, the Test Set can be selected from the significantly differentially expressed features (genes/transcripts). The Reference Set is then the rest of the annotated features from the Reference Set file provided. See the specific manual section for more detailed information (linked in the bullet points above).

The Fisher's Exact Test implementation is sensitive to the direction of the test: sequences that are present in both the test set and the reference set are removed from the reference set, but not from the test set.
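
The overlap rule above can be sketched with plain Python sets. The function name and the sequence IDs are hypothetical; this only illustrates the described behaviour and is not taken from the OmicsBox code.

```python
def prepare_sets(test_ids, reference_ids):
    """Return (test, reference) as sets, with shared IDs kept only in the test set."""
    test = set(test_ids)
    # IDs found in both lists are dropped from the reference, never from the test set.
    reference = set(reference_ids) - test
    return test, reference

# Hypothetical IDs: g2 and g3 appear in both lists.
test, reference = prepare_sets(["g1", "g2", "g3"], ["g2", "g3", "g4", "g5"])
print(sorted(test))       # ['g1', 'g2', 'g3']
print(sorted(reference))  # ['g4', 'g5']
```

Because the overlap is always resolved in favour of the test set, swapping the two input lists generally gives a different pair of sets, and therefore a different result.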

For further details please refer to the FatiGO publication: Al-Shahrour, F., Díaz-Uriarte, R., and Dopazo, J. (2004). FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4):578–580.

Figure 1: Fisher's Exact Test configuration wizard

Input parameters

  • Test-set Files. ID-list with sequences belonging to the test-set (Annotation project).
  • Test-set Genes. Subset of significant genes to be considered as the test set. It lets you pick between up-regulated and down-regulated genes (Pairwise Differential Expression and Combined Pathway Analysis results).
  • Type of List. Subset of significant genes to be considered as the test set. In this case, you can select the subset of genes according to either the regression variables or the experimental groups (Time Course Differential Expression).
  • Reference-set Files. ID-list with sequences belonging to the reference-set.
  • Two-tailed test. This option tests for both over- and under-representation: the test-set will be tested against the reference-set and vice versa.
  • Annotations. Defines how genes are grouped into gene sets for the enrichment analysis: by GO term, by Enzyme Code, by InterPro ID, etc.

Click on the Run button to start the analysis. It may take a while depending on the number of annotations.

Results Table

Once the analysis has completed, the results table is shown in a new tab (Figure 2). It contains all the annotation terms and displays a tag only for terms whose adjusted p-value is below the given threshold. The columns are:

  • Tag. It indicates if the GO term has been declared over or under-represented in the test-set.
  • GO Term. The GO Term ID.
  • GO Name. The more descriptive GO Name.
  • GO Category. There are three GO categories: Molecular Function (MF), detailing the biochemical activities of genes; Cellular Component (CC), specifying the physical locations within a cell where functions occur; and Biological Process (BP), encompassing the larger cellular pathways and processes genes are involved in.
  • Adj. P-value. P-value corrected by the chosen multiple test correction method (by default, False Discovery Rate control according to the Benjamini-Hochberg procedure).
  • P-value. Raw p-value without multiple testing correction.
  • Nr Test. The number of sequences in the Test Set that are annotated with the GO term.
  • Nr Reference. The number of sequences in the Reference Set that are annotated with the GO term.
  • Not Annot Test. The number of sequences in the Test Set that are not annotated with the GO term.
  • Not Annot Ref. The number of sequences in the Reference Set that are not annotated with the GO term.
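
The last four columns form the 2x2 contingency table of the test for each term. As a rough sketch of how such counts could be derived, the Python snippet below assumes a hypothetical `annotations` dictionary mapping each sequence ID to its set of terms; the function name, IDs and term names are made up for illustration and do not reflect how OmicsBox stores its data.

```python
def count_columns(term, annotations, test, reference):
    """Count annotated and non-annotated sequences for one term in both sets."""
    def has_term(seq_id):
        # Sequences missing from the mapping are treated as not annotated.
        return term in annotations.get(seq_id, set())

    nr_test = sum(1 for seq_id in test if has_term(seq_id))
    nr_ref = sum(1 for seq_id in reference if has_term(seq_id))
    return {
        "Nr Test": nr_test,
        "Nr Reference": nr_ref,
        "Not Annot Test": len(test) - nr_test,
        "Not Annot Ref": len(reference) - nr_ref,
    }

# Hypothetical annotation mapping and ID sets.
annotations = {
    "g1": {"termA"},
    "g2": {"termA", "termB"},
    "g3": {"termB"},
    "g4": {"termA"},
}
row = count_columns("termA", annotations, {"g1", "g2", "g3"}, {"g4", "g5"})
print(row)
```

In each row, Nr Test plus Not Annot Test equals the size of the test set, and Nr Reference plus Not Annot Ref equals the size of the reference set.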

Figure 2: Enrichment Results Table

Context Menu

A context menu appears by right-clicking on any row of the results table. The options listed will be applied to the selected rows. The specific options for FET results are:

  • Show Details. It opens a new tab with more details about the GO term, including a link to the GO database.
  • Create ID List of TestSet Sequences. It opens a new tab with a list containing the names of the Test Set sequences annotated with the GO term.
  • Create ID List of RefSet Sequences. It opens a new tab with a list containing the names of the Reference Set sequences annotated with the GO term.

The sidebar lists all the actions that can be performed on this enrichment result, including four options for the visual display of the results:

Actions

  • Set Over/Under Tags: this option lets you define which column (raw or adjusted p-value) is used to display the enrichment tag, as well as the threshold value. The adjusted p-value column can also be updated by selecting a different multiple test correction method from the following:
    • Benjamini-Hochberg: default value, the most commonly used method when controlling for FDR. For further information about how p-values are adjusted by FDR according to Benjamini-Hochberg procedure please refer to the publication: Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289-300.
    • Bonferroni: the most restrictive method, recommended to avoid type I errors at all costs. Controls the family-wise error rate (FWER), or the probability of making one or more false discoveries.
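    • Benjamini-Yekutieli: method for controlling the FDR, more conservative than Benjamini-Hochberg and valid under arbitrary dependence between the tests.
    • Holm: a step-down refinement of the Bonferroni method that is less restrictive while still controlling the FWER.
    • Hochberg: a step-up method similar to Holm that also controls the FWER; it is somewhat more powerful than Holm but assumes independent or positively dependent tests.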
  • Reduce to Most Specific (only for GO annotations): use this option to remove more general GO terms from the results and get only the most specific terms (with the lowest level in the GO DAG).
  • Summary Report. Generates a report containing basic statistics about the analysis, the configuration parameters and the bibliographic references.
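
To make the difference between the correction methods listed under Set Over/Under Tags more concrete, the following Python sketch adjusts a small list of made-up p-values with the Bonferroni, Holm and Benjamini-Hochberg procedures. These are textbook formulations written with the standard library only; the function names are hypothetical and the code is not the OmicsBox implementation.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Step-down Holm adjustment (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # smallest p-value first
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvalues):
    """Step-up Benjamini-Hochberg adjustment (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # largest p-value first
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four terms.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

For any input, the Benjamini-Hochberg values are never larger than the Holm values, which in turn are never larger than the Bonferroni values, so the same threshold tags the most terms under Benjamini-Hochberg and the fewest under Bonferroni.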

Charts

  • Bar Chart: this option generates a bar chart showing the percentage of sequences in both the test and reference sets for each annotation in the table (Figure 3). The configuration wizard allows you to select:
    • Tags: whether to plot GO terms with OVER, UNDER, or both tags.
    • GO Categories: whether to plot GO terms from BP, MF, and/or CC categories.
    • Column to plot: whether to display GO IDs or GO Names.
  • Bubble Plot: this option generates a dot plot that represents four dimensions: the annotation term on the Y axis, the gene ratio (Nr Test / [Nr Test + Not Annot Test]) on the X axis, the adjusted p-value as the dot color, and the number of test set sequences annotated with the term as the dot size (see Figure 4). The configuration wizard includes the same parameters as the Bar Chart:
    • Tags: whether to plot GO terms with OVER, UNDER, or both tags.
    • GO Categories: whether to plot GO terms from BP, MF, and/or CC categories.
    • Column to plot: whether to display GO IDs or GO Names.
  • Enriched GO Graph (only for GO annotations): use this option to generate a representation of the results on the GO DAG (see Figure 5). Nodes are color-highlighted proportionally to their significance value. You can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes. Additionally, the Filter intermediate checkbox hides non-enriched nodes. The configuration wizard allows you to select:
    • Tags: whether to plot GO terms with OVER, UNDER, or both tags.
    • GO Categories: whether to plot GO terms from BP, MF, and/or CC categories.
    More options are available in the graph viewer's sidebar. The Gene Ontology Graphs section of this manual gives further information on the graphical functions in OmicsBox.
  • Word Cloud: Generate a word cloud visualization based on the enriched GO names (Figure 6). The word cloud will display the GO names with sizes proportional to their associated statistical values, creating a visual representation of the data. The configuration wizard allows you to select:
    • Tags: whether to plot GO terms with OVER, UNDER, or both tags.
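
The gene ratio used on the X axis of the Bubble Plot follows directly from two columns of the results table. The short Python sketch below shows how the four plotted dimensions could be assembled from table rows; the row dictionaries, their values and the function names are hypothetical and serve only as an illustration.

```python
def gene_ratio(nr_test, not_annot_test):
    """Fraction of test set sequences annotated with the term."""
    total = nr_test + not_annot_test
    return nr_test / total if total else 0.0

def bubble_points(rows):
    """Map result rows to the four dimensions of the Bubble Plot."""
    return [
        {
            "y": row["GO Name"],  # annotation term
            "x": gene_ratio(row["Nr Test"], row["Not Annot Test"]),
            "color": row["Adj. P-value"],
            "size": row["Nr Test"],
        }
        for row in rows
    ]

# Made-up rows using the column names of the results table.
rows = [
    {"GO Name": "term A", "Adj. P-value": 0.001, "Nr Test": 12, "Not Annot Test": 48},
    {"GO Name": "term B", "Adj. P-value": 0.03, "Nr Test": 5, "Not Annot Test": 55},
]
for point in bubble_points(rows):
    print(point)
```

Since Nr Test plus Not Annot Test is the size of the test set, the denominator is the same for every row of a given analysis.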

Figure 3: Enriched Bar Chart

Figure 4: Bubble Plot Chart

Figure 5: Enriched Graph

Figure 6: Word Cloud