
Gene Set Enrichment Analysis (GSEA)

Introduction

OmicsBox includes the GSEA computational method, which determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. GSEA considers experiments with genome-wide expression profiles from samples belonging to two classes. Genes are ranked based on the correlation between their expression and the class distinction, using any suitable metric. Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout a ranked list of genes (L) or primarily found at its extremes (top or bottom).

If there is no association, genes in S will be uniformly distributed throughout L: that is the null hypothesis of GSEA. If there is association, genes in S will accumulate at the top or at the bottom of L. The magnitude of the association will be measured by the Enrichment Score statistic (ES), see GSEA user guide.
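
The running-sum idea behind the ES can be sketched in a few lines of Python. This is a minimal illustration of the weighted Kolmogorov-Smirnov-like statistic described in the GSEA paper, not the code OmicsBox runs; the function name `enrichment_score`, the weighting exponent `p`, and the toy gene IDs are assumptions made for this example.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, index of the peak) for gene_set in a ranked list.

    ranked_genes: gene IDs sorted by their ranking metric (list L).
    scores: ranking metric for each gene, in the same order.
    gene_set: the a priori defined set S.
    p: weighting exponent; p=0 gives the unweighted (classic) statistic.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n, n_hits = len(ranked_genes), sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("S must contain some, but not all, genes of L")

    # Hits are weighted by |score|^p; misses all share the same step size.
    total_weight = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    if total_weight == 0:
        raise ValueError("all genes in S have a zero score")
    miss_step = 1.0 / (n - n_hits)

    running, best, best_idx = 0.0, 0.0, -1
    for i, hit in enumerate(in_set):
        if hit:
            running += abs(scores[i]) ** p / total_weight
        else:
            running -= miss_step
        # ES is the maximum deviation of the running sum from zero.
        if abs(running) > abs(best):
            best, best_idx = running, i
    return best, best_idx


# Toy ranked list (hypothetical IDs and scores).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, _ = enrichment_score(genes, metric, {"g1", "g2"})
es_bottom, _ = enrichment_score(genes, metric, {"g5", "g6"})
```

In this toy list, a set concentrated at the top yields a positive ES and a set concentrated at the bottom yields a negative ES, which is the pattern GSEA looks for.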

Run GSEA

For this analysis, all of the involved sequences, together with their annotations, must be loaded in the application (the project may also contain other sequences). The annotations can either be the result of an OmicsBox annotation or be imported from a file (.annot); see Gene Ontology Annotation of this manual.

This functionality can be found as a Side Panel button in the following tables:

A dialog screen appears (Figure 1). A detailed description of each parameter is available by clicking the help icon next to the parameter. The following explanations refer to the Gene Set Enrichment Analysis run from an Annotated Sequences project in the functional analysis module. For other modules' specific implementations, please visit the corresponding user manual section (linked in the bullet points above).

Input Parameters

Configuration

  • Rank file. The ranked list of genes can be provided as a text file or as an ID-Value-List .box file containing the sequence IDs and a statistical value for each one.
  • Number of permutations. Number of gene set permutations used to assess the statistical significance of the Enrichment Score.
  • Enrichment Statistic. GSEA walks down the ranked list L and keeps a running-sum statistic that increases each time it encounters a gene in S and decreases each time it encounters a gene not in S. The Enrichment Score (ES) is the maximum deviation from zero of this running sum; it will be close to 0 if the genes in S are randomly distributed throughout L. This option changes the way in which the ES is calculated (see GSEA paper).
  • Number of Detailed Results. Set the number of GO terms to get further details.
  • Detailed Results of All GOs. Check this option to obtain detailed results for all GOs. Be aware that this task is both disk and time consuming.
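
To make the role of the permutation count concrete, here is a small sketch of how a nominal p-value can be estimated by gene set permutation: random gene sets of the same size are drawn from the ranked list, and the observed ES is compared with the resulting null distribution. It illustrates the general approach only. The function names, the `seed` parameter, and the choice to compare against null scores of the same sign are assumptions for this example, not a description of OmicsBox internals.

```python
import random


def running_sum_es(scores, hits, p=1.0):
    """ES for a set whose members sit at the list positions in `hits`."""
    n, n_hits = len(scores), len(hits)
    total = sum(abs(scores[i]) ** p for i in hits)
    if total == 0 or n_hits == n:
        raise ValueError("degenerate gene set for this ranked list")
    miss = 1.0 / (n - n_hits)
    running = best = 0.0
    for i in range(n):
        running += abs(scores[i]) ** p / total if i in hits else -miss
        if abs(running) > abs(best):
            best = running
    return best


def nominal_p_value(scores, hits, n_perm=1000, p=1.0, seed=0):
    """Return (observed ES, nominal p-value) using gene set permutation."""
    rng = random.Random(seed)
    observed = running_sum_es(scores, hits, p)
    positions = range(len(scores))
    # Null distribution: ES of random sets with the same number of genes.
    null = [
        running_sum_es(scores, set(rng.sample(positions, len(hits))), p)
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")
    if observed >= 0:
        extreme = sum(1 for x in same_sign if x >= observed)
    else:
        extreme = sum(1 for x in same_sign if x <= observed)
    return observed, extreme / len(same_sign)


# Toy ranked list of 20 scores; the set occupies the top three positions.
toy_scores = list(range(10, 0, -1)) + list(range(-1, -11, -1))
toy_es, toy_p = nominal_p_value(toy_scores, {0, 1, 2}, n_perm=500, seed=1)
```

More permutations give a finer-grained null distribution (the smallest non-zero p-value that can be reported shrinks), at the cost of a proportionally longer run time.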

Figure 1: GSEA Configuration Wizard Page

Advanced Configuration

  • GO Category. Select the Gene Ontology Category to run the analysis.
  • Gene Sets Max Size. Maximum number of genes allowed in a gene set. By default, GSEA ignores gene sets with more than 500 genes because normalization is not very accurate for extremely large gene sets.
  • Gene Sets Min Size. Minimum number of genes required in a gene set. By default, GSEA ignores gene sets with fewer than 15 genes because normalization is not very accurate for extremely small gene sets. For example, gene sets with fewer than 10 genes can generate significant results with just 2 or 3 genes.
  • Do Not Filter. By default, only IDs with an FDR or p-value lower than the specified filter value will be shown. Check this option to disable the filtering.
  • Filter Mode: Choose between FDR or P-Value to filter the enriched GOs. Note that FDR is the corrected p-value for multiple testing, so it provides more information about the statistical significance than the raw p-value.
  • Filter Value: The value of the FDR or P-Value cut-off.
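
The two kinds of filtering described above (gene set size before the test, significance after it) can be sketched as follows. This is a hedged illustration: the function names, the dictionary keys `fdr` and `pval`, the strict `<` comparison at the cut-off, and the choice to count only genes present in the ranked list are all assumptions for this example.

```python
def filter_gene_sets(gene_sets, ranked_ids, min_size=15, max_size=500):
    """Keep gene sets whose size, restricted to the ranked list, is in range."""
    ranked = set(ranked_ids)
    kept = {}
    for name, genes in gene_sets.items():
        present = set(genes) & ranked  # genes absent from the list are ignored
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


def filter_results(rows, mode="FDR", cutoff=0.05):
    """Keep result rows whose FDR (or nominal p-value) is below the cut-off."""
    key = "fdr" if mode == "FDR" else "pval"
    return [row for row in rows if row[key] < cutoff]


# Toy data with hypothetical identifiers.
ranked_ids = [f"seq{i}" for i in range(1000)]
gene_sets = {
    "set_ok": [f"seq{i}" for i in range(20)],       # 20 genes: kept
    "set_small": [f"seq{i}" for i in range(5)],     # 5 genes: below minimum
    "set_large": [f"seq{i}" for i in range(600)],   # 600 genes: above maximum
}
usable = filter_gene_sets(gene_sets, ranked_ids)

rows = [
    {"term": "set_ok", "pval": 0.001, "fdr": 0.02},
    {"term": "set_other", "pval": 0.04, "fdr": 0.30},
]
by_fdr = filter_results(rows, mode="FDR", cutoff=0.05)
by_pval = filter_results(rows, mode="P-Value", cutoff=0.05)
```

In the toy rows, the second term passes a p-value filter of 0.05 but not an FDR filter of 0.05, which shows why the filter mode matters when many gene sets are tested.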

Click on the Run button to start the analysis. It may take a while depending on the number of permutations selected.

Figure 2: GSEA Advanced Configuration Wizard Page

Results

Once the analysis is complete, the results table is shown in a new tab (Figure 3). It lists each annotation that passes the configured filter threshold, together with its statistics. The main columns are:

  • Tags: Indicates whether a GO term is considered enriched. GOs with the "TOP" tag are over-represented at the top of the ranked list, whereas GO terms with the "BOTTOM" tag are over-represented at the bottom.
  • GO ID: The Gene Ontology term identifier.
  • GO Name: The descriptive name of the Gene Ontology term.
  • GO Category: The GO category (Biological Process, Molecular Function, or Cellular Component).
  • Size: Number of genes in the gene set after filtering out those genes not in the expression dataset.
  • ES: Enrichment score for the gene set; that is, the degree to which this gene set is overrepresented at the top or bottom of the ranked list of genes in the expression dataset.
  • NES: Normalized enrichment score; that is, the enrichment score for the gene set after it has been normalized across analyzed gene sets.
  • Nominal p-val: Nominal p-value; that is, the statistical significance of the enrichment score. The nominal p-value is not adjusted for gene set size or multiple hypothesis testing; therefore, it is of limited use in comparing gene sets.
  • FDR q-val: False discovery rate; that is, the estimated probability that the normalized enrichment score represents a false positive finding.
  • FWER p-val: Familywise-error rate; that is, a more conservatively estimated probability that the normalized enrichment score represents a false positive finding. Because the goal of GSEA is to generate hypotheses, the GSEA team recommends focusing on the FDR statistic.
  • Rank at Max: The position in the ranked list at which the maximum enrichment score occurred. The more interesting gene sets achieve the maximum enrichment score near the top or bottom of the ranked list; that is, the rank at max is either very small or very large.
  • Leading Edge: Displays the three statistics used to define the leading edge subset:
    • Tags: percentage of gene hits before (positive ES) or after (negative ES) the enrichment peak.
    • List: percentage of genes in the ranked list before (positive ES) or after (negative ES) the enrichment peak.
    • Signal: enrichment signal strength combining the two previous statistics.
  • Core Enrichment Sequences: Genes in the gene set that belong to the leading-edge subset. This is the subset of genes that contributes most to the enrichment result, that is, the core enriched sequences that account for the enrichment of a certain function.
  • No-Core Enrichment Sequences: Genes in the gene set that do not contribute to the leading-edge subset.
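
The leading-edge columns can be illustrated with a short sketch that locates the enrichment peak and then derives Tags, List, Signal, and the core / non-core split. The Signal formula used here, tags × (1 − list) × N / (N − Nh), follows the one given in the GSEA User Guide; whether OmicsBox computes every value in exactly this way is an assumption, as are the function name and the toy data.

```python
def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return ES, leading-edge statistics, and core / non-core genes."""
    n = len(ranked_genes)
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    miss = 1.0 / (n - n_hits)

    # Locate the peak of the running sum (maximum deviation from zero).
    running, es, peak = 0.0, 0.0, 0
    for i, h in enumerate(hits):
        running += abs(scores[i]) ** p / total if h else -miss
        if abs(running) > abs(es):
            es, peak = running, i

    if es >= 0:
        span = range(0, peak + 1)   # positions up to and including the peak
    else:
        span = range(peak + 1, n)   # positions after the negative peak

    core = [ranked_genes[i] for i in span if hits[i]]
    non_core = [g for g, h in zip(ranked_genes, hits) if h and g not in core]
    tags = len(core) / n_hits       # fraction of gene hits in the leading edge
    list_frac = len(span) / n       # fraction of the ranked list it covers
    signal = tags * (1 - list_frac) * (n / (n - n_hits))
    return {"es": es, "rank_at_max": peak, "tags": tags, "list": list_frac,
            "signal": signal, "core": core, "non_core": non_core}


# Toy ranked list (hypothetical IDs and scores).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
split_set = leading_edge(genes, metric, {"g1", "g6"})
```

For the set {g1, g6}, only g1 lies before the peak, so it is the single core enrichment sequence and g6 is reported as non-core.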

For further details please refer to the GSEA User Guide.

Figure 3: GSEA result table

Context Menu

A context menu appears by right-clicking on any row of the results table. The options listed will be applied to the selected rows. The specific options for GSEA results are:

  • Show Details: shows more details about the GO term, its enrichment plot, its ES distribution, and the GSEA statistics for each sequence linked to the GO term.
  • Create ID List of Core Enrichment Sequences: opens an ID-List with the core enrichment sequences for the given GO term.

The sidebar lists all actions that can be performed on this enrichment result.

Actions

  • Enriched GO Graph: generate a representation on the GO DAG (Figure 3). Nodes are color-highlighted proportionally to their significance value. The user can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes.
  • Reduce to Most Specific: remove more general GO terms from the results and get only the most specific terms (with the lowest level in the GO DAG).
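
One way to picture the reduction to most specific terms is to drop every result term that is an ancestor of another result term in the GO DAG. The sketch below does this on a toy hierarchy; the `parents` mapping, the term names, and the rule itself are assumptions made for illustration, and OmicsBox may apply additional criteria.

```python
def ancestors(term, parents):
    """Collect all ancestors of a term by walking the child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def reduce_to_most_specific(enriched_terms, parents):
    """Drop terms that are ancestors of another term in the result list."""
    enriched_terms = list(enriched_terms)
    general = set()
    for term in enriched_terms:
        general |= ancestors(term, parents)
    return [term for term in enriched_terms if term not in general]


# Toy hierarchy (hypothetical term names, not real GO identifiers).
parents = {
    "lipid metabolism": ["metabolism"],
    "metabolism": ["root"],
    "transport": ["root"],
}
enriched = ["metabolism", "lipid metabolism", "transport"]
specific = reduce_to_most_specific(enriched, parents)
```

Here "metabolism" is removed because the more specific "lipid metabolism" is also in the results, while "transport" is kept because none of its descendants are.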

Charts

  • Show Bar Chart: compare the core and non-core enriched functions in terms of their abundance of annotated sequences in your dataset (Figure 4). The percentages in the bar chart are calculated as the number of sequences (core or non-core) annotated with each GO divided by the total number of sequences that were provided with the ranked list (all sequences provided for the test).
  • Bubble Plot: this option generates a dot plot, a chart representing 4 dimensions: the annotation term on the Y axis, the normalized enrichment score (NES) on the X axis, the FDR q-value as the dot color, and the number of test sequences as the dot size (Figure 5). The configuration wizard allows you to select:
    • Tags: whether to plot GO terms with enriched, non-enriched, or both tags.
    • GO Categories: whether to plot GO terms from BP, MF, and/or CC categories.
    • Column to plot: whether to display GO IDs or GO Names.
  • Word Cloud: Generate a word cloud visualization based on the enriched GOs (Figure 6). The word cloud will display the GO names with sizes proportional to their associated statistical values, creating a visual representation of the data. The configuration wizard allows you to select:
    • Tags: whether to plot GO terms with the OVER, UNDER or both tags.
  • ES Histogram Chart: this option generates a histogram of enrichment scores across gene sets, which provides a quick, visual way to grasp the number of enriched gene sets (Figure 7).
  • NES vs Significance Chart: this option generates a plot of p-values versus normalized enrichment scores, which provides a quick, visual way to grasp the number of enriched gene sets that are significant (Figure 8).

Figure 3: Enriched Graph

Figure 4: Bar Chart

Figure 5: Bubble Plot

Figure 6: Word Cloud

Figure 7: ES Histogram Chart

Figure 8: NES vs Significance Chart

References