Pairwise Differential Expression Analysis
Introduction
This tool is designed to perform differential expression analysis of count data arising from RNA-seq technology. This application, based on the edgeR program, allows the identification of differentially expressed genomic features (e.g. genes) in a pairwise comparison of two different experimental conditions. The software package edgeR (empirical analysis of DGE in R), which belongs to the Bioconductor project, implements quantitative statistical methods to evaluate the significance of individual genes between two experimental conditions.
Please cite edgeR as: Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26, pp. -1.
Expression Data
The pairwise differential expression analysis application expects gene expression levels in the form of a count table. In OmicsBox, count tables can be generated via the Create Count Table application.
Count tables can also be imported from a text file. Go to transcriptomics → Load → Load RNA-Seq Count Table (expression data) (Figure 2) and select your .txt file containing the count table.
Notes:
- This application only accepts raw counts without any type of normalization.
- Replicates for each experimental condition are necessary.
Run Pairwise Differential Expression Analysis
Go to transcriptomics → Differential Expression Analysis. If there’s no count table project opened, the first wizard page (Figure 3) will ask to upload either a Count Table Project (.box file) or a Count Table File (.txt, .csv, or .tsv file). On the second wizard page, choose the "Pairwise Differential Analysis" option. If a count table is already loaded in OmicsBox (see above section), this one will be used to perform the analysis. In this case, the first wizard page will be to select the type of differential expression analysis (Figure 4). In the next pages, it is possible to specify different analysis parameters, which are divided into three different sections: Preprocessing Data (Figure 5), Experimental Design (Figure 6), and Comparison and Test (Figure 7).
Preprocessing Data Page
-
Filter low count genes:
-
CPM Filter: Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads).
- Samples reaching CPM Filter: Set a minimum number of samples in which the gene's CPM is above the filter level (is expressed). If this value is set to e.g. five, at least 5 of the samples have to be above the given CPM. The number of samples of the smallest group is usually used (e.g. in an experiment that has two replicates for each condition (or group), a gene should be expressed in at least two samples). Set value to 0 if no filter is desired.
-
Calculate normalization factors to scale the raw library sizes:
-
Normalization Method: Here the normalization takes the form of scaling factors for library sizes that enter into the statistical model. These correctional factors are used to compute the effective library sizes. For further details please refer to the edgeR User's Guide. You can select the normalization method to be used:
- TMM: Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).
- TMM with Zero Pairing:This is a variant of TMM that should perform better for data with a high proportion of zeros.
- RLE: Relative log expression. Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).
- Upper-quartile: 75% quantile for the counts for each library is used to calculate the scale factors.
- None: No normalization method is applied.
Experimental Design Page
- Experimental design file: Select your .txt file containing your experimental factors with the experimental conditions associated with each sample in tab-delimited format. As shown below, rows correspond to samples and columns to experimental factors. Make sure that the names in the first column of the experimental design table are exactly the same as the sample names in the count table header. If your experimental design file has fewer samples than in the count table, only the samples contained in this file will be analyzed.
Name Strain
SRR3666079 FA1090
SRR3666080 FA1090cpxA
SRR3666081 FA1090cpxR
SRR3666082 FA1090
SRR3666083 FA1090cpxA
SRR3666084 FA1090cpxR
SRR3666085 FA1090
SRR3666086 FA1090cpxA
SRR3666087 FA1090cpxR
SRR3666088 FA1090
SRR3666089 FA1090cpxA
SRR3666090 FA1090cpxR
Comparison and Test Page
-
Design Type: Choose the design type to adjust the analysis
- Simple design: Makes a pairwise comparison between samples belonging to two experimental conditions. You only have to select the experimental factor of interest and establish the comparison selecting the reference and contrast conditions in ``Primary Target''.
- Paired design: Makes a pairwise comparison between samples belonging to two experimental conditions, adjusting for baseline differences of other experimental factors. In this design, you have to establish the conditions for the comparison in
Primary Target'' and the experimental factor for baseline difference in
Secondary Target''. This design type is appropriate for paired or blocking design, or experiments with batch effects. - Multifactorial Design: Makes a pairwise comparison between samples belonging to two experimental conditions with two experimental factors. For this design, you have to select the two experimental factors of interest and establish the reference and contrast group for each in
Primary Target'' and
Secondary Target''. This design type is appropriate if you want to analyze the effects of combined experimental conditions on gene expression. -
Statistical Test: Select a statistical test.
-
Exact Test: Based on the quantile-adjusted conditional maximum likelihood (qCML) methods (similar to Fisher's exact test). It is only applicable to datasets with a single factor design (simple design).
- GLM (Likelihood Ratio Test): Based on fitting negative binomial Generalized Linear Models (GLMs) with the Cox-Reid dispersion estimates. Is a good choice for inferences with GLMs.
- GLM (Quasi Likelihood F-Test): The empirical Bayes quasi-likelihood F-test is an alternative to the Likelihood Ratio Test and provides a more robust and reliable error rate control when the number of replicates is small.
- Robust: Estimation is strengthened against potential outlier genes.
Results
Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis'' tool, a new tab is opened containing the results (Figure 8). The results table contains the differential expression statistics, where each row corresponds to a feature:
- logFC: A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).
- logCPM: The average log2-counts-per-millions.
- LR: Likelihood ratio statistic for the GLM (Likelihood Ratio Test).
- F: Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).
- FDR: False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).
- Tags: Indicate whether a gene is upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC ≥ 0).
Genes that have not passed the filtering step are not shown in the new tab.
Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object.
A result page will show a summary of the pairwise differential expression analysis results (Figure 9).
Actions
The actions are available in the Side Panel→ Actions.
Summary Report
It generates the Summary Report explained above.
Set Up/Down Tags
It re-assigns the UP and DOWN labels based on different filtering cutoffs (Figure 10). Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs.
Enrichment Analysis
It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis can be found under the Side Panel → Actions of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.
Fisher's Exact Test
Fisher’s Exact Test can be used to find GO terms that are over and under-represented in a set of genes (test set) with respect to a reference group (reference set). Fisher’s Exact Test uses a contingency table-based method to examine the association between two kinds of classification.
With this tool, the subset of genes that will be considered as a Test-set will be the genes labeled as UP or DOWN regulated (Figure 11). Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".
The project containing the functionally annotated sequences that will be used as a reference background set should be provided in the "Reference Annotation" box.
The remaining parameters are explained in the Fisher's Exact Test section.
Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes). There’s more information about the analysis and the parameters in the Gene Set Enrichment Analysis section.
This analysis needs a ranked gene list, which will be automatically computed using the following formula:
Rank = sign(logFC) * -log10(P-Value)
The project containing the functionally annotated sequences that will be used as a reference background set should be provided (Figure 12).
Charts and Statistics
Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panel → Chartsof the Pairwise Results viewer.
Results Chart
Bar chart which shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features, and down-regulated features (Figure 13).
MDS Plot
Generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 14).
Volcano Plot
A scatter plot constructed by plotting the negative log of the adjusted p-values (FDR) on the y-axis versus the log of the fold changes on the x-axis (Figure 15). Upregulated and downregulated genes are shown in green and red respectively.
MA Plot
A scatter plot showing the log of the fold changes on the y-axis versus the average of the log of the CPM on the x-axis. Differentially expressed genes are highlighted (Figure 16).
Heatmap
A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 17). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).
The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.
Genes that will be displayed can be selected in the wizard. There are three options:
- The Top 50 differentially expressed genes (ranked by FDR).
- All differentially expressed genes.
- Provide an ID list containing the genes to represent.
Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".
Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.