Gene Set Enrichment Analysis (GSEA)

Introduction

OmicsBox includes Gene Set Enrichment Analysis (GSEA), a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. GSEA considers experiments with genome-wide expression profiles from samples belonging to two classes, labelled 1 or 2. Genes are ranked by the correlation between their expression and the class distinction, using any suitable metric. Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout a ranked list of genes (L) or primarily found at its extremes (top or bottom).

If there is no association, genes in S will be uniformly distributed throughout L: this is the null hypothesis of GSEA. If there is an association, genes in S will accumulate at the top or at the bottom of L. The magnitude of the association is measured by the Enrichment Score (ES) statistic (see the GSEA User Guide).
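As a rough illustration of how such a running-sum score can be computed, the following Python sketch follows the general description in the GSEA paper. It is not the OmicsBox implementation; the function name `enrichment_score` and its parameters are hypothetical, and the weighting exponent `p` is an assumption taken from the published method.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for gene_set over a ranked list.

    ranked_genes: gene IDs sorted by decreasing ranking metric.
    scores: ranking metric values, in the same order as ranked_genes.
    p: weighting exponent (assumption: 0 gives the unweighted statistic).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    # Weight of each hit; with p=0 every hit weighs 1.
    hit_weights = [abs(s) ** p if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        return 0.0, []
    running, running_sum = 0.0, []
    for w, hit in zip(hit_weights, in_set):
        # Step up on a hit, step down on a miss.
        running += w / total_hit if hit else -1.0 / n_miss
        running_sum.append(running)
    # ES is the running-sum value with the largest deviation from zero.
    es = max(running_sum, key=abs)
    return es, running_sum


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, _ = enrichment_score(genes, metric, {"g1", "g2"}, p=0)
es_bottom, _ = enrichment_score(genes, metric, {"g5", "g6"}, p=0)
```

In this toy example, a set concentrated at the top of the list yields a positive ES and a set concentrated at the bottom yields a negative ES.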

Run GSEA

For this analysis, the annotations of all the sequences involved (though not necessarily only those) must be loaded in the application. They can either be the result of an OmicsBox annotation or be imported from an annotation file (.annot); see Gene Ontology Annotation in this manual.

This functionality can be found as a side panel button in the following tables:

Annotated sequences from Functional Analysis.
Pairwise Differential Expression results.

A dialog screen appears (see Figure 1). A detailed description of each parameter is available by clicking the help icon next to the parameter.

Input Parameters

Rank file. The ranked list of genes can be provided by uploading a text file or an ID-Value-List .box file containing the sequence IDs and a statistical value for each one.
Number of permutations. Number of gene set permutations used to assess the statistical significance of the Enrichment Score.
Enrichment Statistic. GSEA walks down the ranked list L, increasing a running-sum statistic each time it encounters a gene in S and decreasing it when the gene is not in S. The Enrichment Score (ES) is the maximum deviation from zero reached by this running sum, and it stays close to 0 if the genes in S are randomly distributed throughout L. This option changes the way in which the ES is calculated (see the GSEA paper).
Detailed Results. Set the number of GO terms for which further details are generated.
GO Category. Select the Gene Ontology Category to run the analysis.
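To give an idea of how the number of permutations relates to the significance estimate, here is a minimal Python sketch of a gene set permutation test. It is an illustration under stated assumptions, not the code OmicsBox runs: the helper names `es_unweighted` and `nominal_p_value` are hypothetical, the unweighted statistic is used for brevity, and the p-value is taken as the fraction of same-sign null scores that are at least as extreme as the observed one.

```python
import random


def es_unweighted(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for hit in hits:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def nominal_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Estimate a nominal p-value from random gene sets of equal size."""
    rng = random.Random(seed)
    observed = es_unweighted(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    # Null distribution: ES of randomly drawn gene sets of the same size.
    null = [es_unweighted(ranked_genes, set(rng.sample(ranked_genes, size)))
            for _ in range(n_perm)]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")  # undefined in this sketch
    extreme = sum(abs(x) >= abs(observed) for x in same_sign)
    return observed, extreme / len(same_sign)


ranked = [f"g{i}" for i in range(100)]
es_top, p_top = nominal_p_value(ranked, set(ranked[:10]), n_perm=200)
```

More permutations give a finer-grained p-value estimate (the smallest non-zero value is roughly one over the number of same-sign permutations), at the cost of a longer run time.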

When running the gene set enrichment analysis, the GSEA software automatically normalizes the enrichment scores (ES) for variation in gene set size.
The normalization is not very accurate for extremely small or extremely large gene sets.
For example, for gene sets with fewer than 10 genes, just 2 or 3 genes can generate significant results. Therefore, by default, GSEA ignores gene sets that contain fewer than 15 genes or more than 500 genes.
To change these default values, use the Gene Sets Max Size and Gene Sets Min Size parameters.
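The size restriction described above can be pictured with a short Python sketch. The function name `filter_gene_sets_by_size` is hypothetical, and counting only the genes that are present in the ranked list is an assumption made for this illustration.

```python
def filter_gene_sets_by_size(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose size lies within [min_size, max_size]."""
    universe = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Assumption: only genes found in the ranked list count towards size.
        present = universe & set(genes)
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


ranked = [f"g{i}" for i in range(600)]
gene_sets = {
    "too_small": ranked[:10],
    "ok": ranked[:15],
    "upper_limit": ranked[:500],
    "too_large": ranked[:501],
}
kept = filter_gene_sets_by_size(gene_sets, ranked)
```

With the default limits, only the sets named `ok` and `upper_limit` are retained in this toy example.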

Filtering Options

Filtering. Only IDs with a p-value or FDR lower than the filter value will be shown. Note that the FDR is the p-value corrected for multiple testing, so it is a more reliable indicator of statistical significance than the raw p-value.
Do Not Filter. This option lets you choose whether or not to filter by significant GOs. With this option checked, all GOs will be taken into account.
Filter Mode: Choose between FDR or P-Value to filter the enriched GOs.
Filter Value: The value of the FDR or P-Value cut-off.

Click on the Run button to start the analysis. It may take a while depending on the number of permutations selected.

Results

Once the analysis is completed, the results table is shown in a new tab (see image below), listing the adjusted p-values of each annotation that passes the given threshold. The main columns are:

ES (Enrichment Score). Reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes.
NES (Normalized Enrichment Score). By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset.
FDR (False Discovery Rate). The estimated probability that a gene set with a given NES represents a false positive finding.
Nominal p-value. Estimates the statistical significance of the enrichment score for a single gene set.

For further details please refer to the GSEA User Guide.
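As a hedged sketch of what normalizing the enrichment score can look like, the snippet below divides an observed ES by the mean absolute ES of the permuted (null) scores that share its sign, which is the approach described in the GSEA paper. The function name `normalized_enrichment_score` is hypothetical and the null values are made-up toy numbers.

```python
def normalized_enrichment_score(es, null_es):
    """Divide ES by the mean absolute null ES of the same sign."""
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign:
        return float("nan")  # no comparable null scores in this sketch
    mean_abs = sum(abs(x) for x in same_sign) / len(same_sign)
    return es / mean_abs if mean_abs > 0 else float("nan")


# Toy null distribution (illustrative values only).
null_scores = [0.2, 0.4, -0.3, -0.5]
nes_pos = normalized_enrichment_score(0.6, null_scores)
nes_neg = normalized_enrichment_score(-0.4, null_scores)
```

Because each gene set is scaled by its own null distribution, NES values can be compared across gene sets of different sizes, which the raw ES does not allow.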

Using the context menu of the rows marked with the Details tag, it is possible to get more details about the GO term, including the enrichment statistics, and to create an ID-List with the core enrichment sequences for each GO term.

Core and Non-core Enriched

Core enrichment or leading-edge genes are described as follows:
Often it is useful to extract the core members of high scoring gene sets that contribute to the ES. We define the leading-edge subset to be those genes in the gene set S that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero. The leading-edge subset can be interpreted as the core of a gene set that accounts for the enrichment signal.

Taken from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1239896

In other words, core enriched sequences are the ones that contribute the most to the enrichment of a certain function.
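The definition quoted above can be sketched in Python as follows. This is an illustration only: the function name `leading_edge` is hypothetical, the unweighted running sum is used for brevity, and treating the genes after the trough as the leading edge for a negative ES is an assumption based on the usual GSEA convention.

```python
def leading_edge(ranked_genes, gene_set):
    """Return (ES, core genes) using an unweighted running sum."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0, []
    running, es, peak = 0.0, 0.0, 0
    for i, hit in enumerate(hits):
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es, peak = running, i  # remember where the extreme is reached
    if es >= 0:
        # Positive ES: set members at or before the peak.
        core = [g for g in ranked_genes[:peak + 1] if g in gene_set]
    else:
        # Negative ES (assumption): set members after the trough.
        core = [g for g in ranked_genes[peak + 1:] if g in gene_set]
    return es, core


ranked = [f"g{i}" for i in range(1, 11)]
es, core = leading_edge(ranked, {"g1", "g2", "g4", "g9"})
# core == ["g1", "g2", "g4"]; "g9" is a non-core member of the set
```

Here the running sum peaks at `g4`, so `g1`, `g2` and `g4` form the core, while `g9` belongs to the gene set but does not contribute to the peak.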

The sidebar lists all the actions that can be performed on this enrichment result, including several options for the visual display of the results:

Show Bar Chart: compares the core and non-core enriched functions in terms of the abundance of annotated sequences in your dataset (see Figure 6). Each percentage in the bar chart is the number of sequences (core or non-core) annotated with a given GO term divided by the total number of sequences provided with the ranked list (all sequences provided for the test).
NES vs Significance Chart: this option generates a plot of p-values versus normalized enrichment scores, which provides a quick, visual way to grasp the number of enriched gene sets that are significant (see Figure 4).
ES Histogram Chart: this option generates a histogram of enrichment scores across gene sets, which provides a quick, visual way to grasp the number of enriched gene sets (see Figure 5).
Make Enriched Graph: use this option to generate a representation on the GO DAG (see Figure 3). Nodes are color-highlighted proportionally to their significance value. The user can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes.
Reduce to Most Specific: use this option to remove more general GO terms from the results and get only the most specific terms (with the lowest level in the GO DAG).
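The idea behind reducing a result to its most specific terms can be sketched as follows: a term is dropped when it is an ancestor of another term that is also in the result. This is only an illustration; the function names are hypothetical, and the identifiers `A` to `D` stand in for GO terms in a toy child-to-parents map rather than real GO data.

```python
def ancestors(term, parents):
    """Collect all ancestors of a term in a child -> parents mapping."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def reduce_to_most_specific(terms, parents):
    """Drop every term that is an ancestor of another term in the list."""
    terms = list(terms)
    general = set()
    for term in terms:
        general |= ancestors(term, parents)
    return [t for t in terms if t not in general]


# Toy DAG: C is a child of B, and both B and D are children of A.
toy_parents = {"C": ["B"], "B": ["A"], "D": ["A"]}
specific = reduce_to_most_specific(["A", "B", "C", "D"], toy_parents)
# specific == ["C", "D"]
```

In the toy example, `A` and `B` are removed because the more specific terms `C` and `D` are already present in the result.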

Additionally, it is possible to display the enrichment results as a WordCloud that summarizes the relevant GO terms in a visually appealing way.