Pairwise Differential Expression Analysis

Introduction

This tool is designed to perform differential expression analysis of count data arising from RNA-seq technology. This application, based on the edgeR program, allows the identification of differentially expressed genomic features (e.g. genes) in a pairwise comparison of two different experimental conditions. The software package edgeR (empirical analysis of DGE in R), which belongs to the Bioconductor project, implements quantitative statistical methods to evaluate the significance of individual genes between two experimental conditions.

Please cite edgeR as: Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26, pp. -1.

**Figure 1:**Differential Expression Analysis Interface

Expression Data

The pairwise differential expression analysis application expects gene expression levels in the form of a count table. In OmicsBox, count tables can be generated via the Create Count Table application.

Count tables can also be imported from a text file. Go to transcriptomics → Load → Load RNA-Seq Count Table (expression data) (Figure 2) and select your .txt file containing the count table.

Notes:

This application only accepts raw counts without any type of normalization.
Replicates for each experimental condition are necessary.

Run Pairwise Differential Expression Analysis

Go to transcriptomics → Differential Expression Analysis. If there’s no count table project opened, the first wizard page (Figure 3) will ask to upload either a Count Table Project (.box file) or a Count Table File (.txt, .csv, or .tsv file). On the second wizard page, choose the "Pairwise Differential Analysis" option. If a count table is already loaded in OmicsBox (see above section), this one will be used to perform the analysis. In this case, the first wizard page will be to select the type of differential expression analysis (Figure 4). In the next pages, it is possible to specify different analysis parameters, which are divided into three different sections: Preprocessing Data (Figure 5), Experimental Design (Figure 6), and Comparison and Test (Figure 7).

**Figure 4:** Differential Expression Analysis Options wizard page.

Preprocessing Data Page

Filter Genes with low counts. Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations:

Counts Per Million: Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads).
- CPM Filter: Minimum CPM. Genes with a CPM lower than this number will be filtered out. Set to 0 to not apply this filter.
- Samples reaching CPM Filter: Set a minimum number of samples in which the gene's CPM is above the filter level (is expressed). If this value is set to e.g. 5, the gene's CPM has to be greater than the "CPM Filter" value in at least 5 of the samples. The number of samples in the smallest experimental group is usually used. That is, in an experiment that has two replicates for one condition (or group) and three for the other one, a gene should be expressed in at least two samples. Set to 0 to not apply this filter.
Automatic Filers: This function automatically calculates the filtering thresholds to keep genes with worthwhile counts in a minimum number of samples. For more information, please visit EdgeR's documentation in this link.

Calculate normalization factors to scale the raw library sizes.

Normalization Method: Here the normalization takes the form of scaling factors for library sizes that enter into the statistical model. These correctional factors are used to compute the effective library sizes. For further details please refer to the edgeR User's Guide. You can select the normalization method to be used:
- TMM: Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).
- TMM with Zero Pairing:This is a variant of TMM that should perform better for data with a high proportion of zeros.
- RLE: Relative log expression. Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).
- Upper-quartile: 75% quantile for the counts for each library is used to calculate the scale factors.
- None: No normalization method is applied.

Experimental Design Page

Experimental design file: Select your .txt file containing your experimental factors with the experimental conditions associated with each sample in tab-delimited format. As shown below, rows correspond to samples and columns to experimental factors. Make sure that the names in the first column of the experimental design table are exactly the same as the sample names in the count table header. If your experimental design file has fewer samples than in the count table, only the samples contained in this file will be analyzed.

Name    Strain
SRR3666079  FA1090
SRR3666080  FA1090cpxA
SRR3666081  FA1090cpxR
SRR3666082  FA1090
SRR3666083  FA1090cpxA
SRR3666084  FA1090cpxR
SRR3666085  FA1090
SRR3666086  FA1090cpxA
SRR3666087  FA1090cpxR
SRR3666088  FA1090
SRR3666089  FA1090cpxA
SRR3666090  FA1090cpxR

Comparison and Test Page

Design Type: Choose the design type to adjust the analysis
- Simple design: Makes a pairwise comparison between samples belonging to two experimental conditions. You only have to select the experimental factor of interest and establish the comparison by selecting the reference and contrast conditions in "Primary Target".
- Paired design: Makes a pairwise comparison between samples belonging to two experimental conditions, adjusting for baseline differences of other experimental factors. In this design, you have to establish the conditions for the comparison in "Primary Target" and the experimental factor for baseline difference in "Secondary Target". This design type is appropriate for paired or blocking design, or experiments with batch effects.
- Multifactorial Design: Makes a pairwise comparison between samples belonging to two experimental conditions with two experimental factors. For this design, you have to select the two experimental factors of interest and establish the reference and contrast group for each in "Primary Target" and "Secondary Target". This design type is appropriate if you want to analyze the effects of combined experimental conditions on gene expression.
Statistical Test: Select a statistical test.
- Exact Test: Based on the quantile-adjusted conditional maximum likelihood (qCML) methods (similar to Fisher's exact test). It is only applicable to datasets with a single factor design (simple design).
- GLM (Likelihood Ratio Test): Based on fitting negative binomial Generalized Linear Models (GLMs) with the Cox-Reid dispersion estimates. It is a good choice for inferences with GLMs.
- GLM (Quasi Likelihood F-Test): The empirical Bayes quasi-likelihood F-test is an alternative to the Likelihood Ratio Test and provides a more robust and reliable error rate control when the number of replicates is small.
Robust: Estimation is strengthened against potential outlier genes.

Results

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis'' tool, a new tab is opened containing the results (Figure 8). The results table contains the differential expression statistics, where each row corresponds to a feature:

logFC: A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).
logCPM: The average log2-counts-per-millions.
LR: Likelihood ratio statistic for the GLM (Likelihood Ratio Test).
F: Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).
FDR: False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).
Tags: Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC ≥ 0).

Genes that have not passed the filtering step are not shown in the new tab.

Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object.

**Figure 8:**Pairwise Differential Expression Results

A result page will show a summary of the pairwise differential expression analysis results (Figure 9).

Actions

The actions are available in the Side Panel→ Actions.

Summary Report

It generates the Summary Report explained above.

Set Up/Down Tags

It re-assigns the UP and DOWN labels based on different filtering cutoffs (Figure 10). Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs.

Enrichment Analysis

It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis can be found under the Side Panel → Actions of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.

Fisher's Exact Test

Fisher’s Exact Test can be used to find GO terms that are over and under-represented in a set of genes (test set) with respect to a reference group (reference set). Fisher’s Exact Test uses a contingency table-based method to examine the association between two kinds of classification.

With this tool, the subset of genes that will be considered as a Test-set will be the genes labeled as UP or DOWN regulated (Figure 11). Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".

The project containing the functionally annotated sequences that will be used as a reference background set should be provided in the "Reference Annotation" box.

The remaining parameters are explained in the Fisher's Exact Test section.

Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes). There’s more information about the analysis and the parameters in the Gene Set Enrichment Analysis section.

This analysis needs a ranked gene list, which will be automatically computed using the following formula:

Rank = sign(logFC) * -log10(P-Value)

The project containing the functionally annotated sequences that will be used as a reference background set should be provided (Figure 12).

**Figure 12:** Gene Set Enrichment Analysis

Charts and Statistics

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panel → Chartsof the Pairwise Results viewer.

Results Chart

Bar chart which shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features, and down-regulated features (Figure 13).

MDS Plot

Generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 14).

Volcano Plot

A scatter plot constructed by plotting the negative log of the adjusted p-values (FDR) on the y-axis versus the log of the fold changes on the x-axis (Figure 15). Upregulated and downregulated genes are shown in green and red respectively.

MA Plot

A scatter plot showing the log of the fold changes on the y-axis versus the average of the log of the CPM on the x-axis. Differentially expressed genes are highlighted (Figure 16).

Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 17). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).

The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.

Genes that will be displayed can be selected in the wizard. There are three options:

The Top 50 differentially expressed genes (ranked by FDR).
All differentially expressed genes.
Provide an ID list containing the genes to represent.

Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".

Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.