Pairwise Differential Expression Analysis (Without Replicates)
Introduction
Detecting genes that are differentially expressed between two experimental conditions (e.g. diseased vs healthy individuals) is a fundamental part of understanding the molecular basis of phenotypic variation. To carry out this task, there are statistical tools designed to perform differential expression analysis of the count data arising from RNA-seq technology (e.g. edgeR and maSigPro are statistical packages integrated into OmicsBox). However, these tools usually require the presence of replicates (both biological and technical) of each experimental condition that will be tested. This is a problem in cases where no replicates are available.
The Pairwise Differential Expression Analysis (Without Replicates) functionality offers a strategy for analyzing RNA-seq datasets that do not have replicates. It is based on the software package NOISeq, which belongs to the Bioconductor project. NOISeq is a novel nonparametric approach for the identification for differentially expressed genes from RNA-Seq count data. NOISeq creates a null or noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the genes in samples within the same condition. This reference distribution is then used to assess whether the M and D values computed between two conditions for a given gene are likely to be part of the noise or represent a true differential expression.
NOISeq method was designed to compute a differential expression on RNA-Seq data even when there are no replicates available for any of the experimental conditions. In this scenario, NOISeq can simulate technical replicates. The simulation relies on the assumption that read counts follow a multinomial distribution, where probabilities for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These mapping probabilities are approximated by using counts in the only sample of the corresponding experimental conditions. Given the sequencing depth (total amount of reads) of the unique available sample, the size of the simulated is a percentage of this sequencing depth, allowing a small variability.
Please remember that to obtain really reliable statistical results, biological replicates are needed.
Please cite NOISeq as:
Expression Data
The pairwise differential expression analysis application expects gene expression levels in the form of a count table. In OmicsBox, count tables can be generated via the Create Count Table application.
Count tables can also be imported from a text file. Go to transcriptomics → Load → Load RNA-Seq Count Table (expression data) (Figure 2) and select your .txt file containing the count table.
Run Pairwise Differential Expression Analysis (Without Replicates)
Go to transcriptomics → Run Differential Expression Analysis and choose the "Pairwise Differential Expression Analysis (Without replicates) option. This application requires a Count Table object as input data. The wizard allows to adjust analysis parameters (Figure 3 and Figure 4).
Preprocessing Data
-
Filter low count genes:
-
CPM FIlter:Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads). To pass the filter, the gene's CPM should be above the filter level in at least one sample (contrast or reference sample).
-
Normalization procedure:
-
Normalization Method:Normalization is an important step to make the samples comparable and to remove possible biases (as sequencing depth bias) in count data. The normalization methods available for this analysis are:
- TMM:Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).
- RPKM:Reads Per Kilobase per Million mapped reads. This method corrects for gene length and the number of sequencing reads (gene length is required).
- Upper-quartile:75% quantile for the counts for each library is used to calculate the scale factors for normalization.
- None:No normalization method is applied.
- Feature Length File: For RPKM normalization load a tab-delimited file (or ID-Value object) with two columns containing the name and length of each gene or genomic feature.
Analysis Options
-
Replicates Simulation:
-
Number of Simulated Replicates:Set the number of replicates to be simulated for each condition.
- Size of the Simulated Replicates:Establish the percentage of the total reads used to simulate each sample.
- Variability:Variability in the simulated sample total reads.
-
Targets:
-
Contrast Condition:Choose the sample to be treated as contrast condition. Genes classified as UP will be upregulated in this sample.
- Reference Condition:Choose the sample to be treated as reference condition. Genes classified as DOWN will be upregulated in this sample.
Results
Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis (Without Replicates)" feature, a new tab is opened containing the results (Figure 5). The results table contains the differential expression statistics, where each row corresponds to a feature:
- Contrast Condition:Normalized expression values for the contrast condition sample.
- Reference Condition:Normalized expression values for the reference condition sample.
- M:Is the log2-ratio of the two conditions.
- D:The value of the difference between the conditions.
- Probability:The probability of differential expression for each feature. It is obtained by comparing the M and D values of a given feature against the noise distribution. If the probability is higher than a given threshold (0.9 by default), the feature is considered to be differentially expressed between conditions.
- Ranking:Is a summary statistic of M and D values equal to -sign(M)*sqrt(M^2 + D^2), which can be used as a ranked value in gene set enrichment analysis (GSEA).
Genes that have not passed the filtering step are not shown in the results tab.
Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object.
A result page will show a summary of the pairwise differential expression analysis results (Figure 6).
If you want to filter for differential expression based on another probability threshold, you can go to Side Panel→ Set Up/Down Tags and establish a new threshold value (Figure 7). Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs. To view, the updated summary results go to Side Panel→ Result Summaryand it can be exported in PDF.
During the Pairwise Differential Expression Analysis (without replicates), raw counts are transformed according to the normalization method selected in the analysis configuration. Go to Export Normalized Counts(sidebar) to export normalized counts to a tabular text file.
Charts and Statistics
Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panelof the Pairwise Results Viewer.
Results Chart
A bar chart shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features, and down-regulated features (Figure 8).
MDS Plot
It generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 9).
This plot is only available if the input count table contains more than 2 samples (although only two of them are compared).
Expression Plot
A scatter plot showing the average expression values of each condition (Figure 10). Differentially expressed features considering the probability threshold (0.9 by default) will be highlighted in red.
MD Plot
A scatter plot showing the log-fold change (M) and the absolute value of the difference in expression between conditions (D). D values are displayed in a log scale (Figure 11).
Heatmap
A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 12). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).
The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.
Genes that will be displayed can be selected in the wizard. There are three options:
- The Top 50 differentially expressed genes (ranked by FDR).
- All differentially expressed genes.
- Provide an ID list containing the genes to represent.
Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".
Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.
Enrichment Analysis
It is possible to perform a functional enrichment analysis from the pairwise differential expression project. Both options, Fisher's Exact Test and Gene Set Enrichment Analysis can be found under the Side Panel of the Pairwise Results viewer. For a detailed tutorial on how to launch an Enrichment Analysis from Pairwise Differential Expression results, link here.
Fisher's Exact Test
Fisher’s Exact Test can be used to find GO terms that are over and under-represented in a set of genes (test set) with respect to a reference group (reference set). This set of genes can be the differentially expressed genes of differential expression analysis, a set of genes related to a phenotype of interest, etc. Fisher’s Exact Test uses a contingency table-based method to examine the association between two kinds of classification.
The subset of genes that will be considered as a Test-set (Figure 13) has to be provided. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".
The project containing the functionally annotated sequences that will be used as a reference background set should be provided.
The remaining parameters are explained in the Fisher's Exact Test section.
Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
The "Probability Threshold for Ranked List" parameters allow setting a filter to exclude those genes whose probability value is not above it (Figure 14). The ranked gene list will be created using the "ranking" statistic.
The project containing the functionally annotated sequences that will be used as a reference background set should be provided.
The rest of the parameters are explained in the Gene Set Enrichment Analysis section.