Skip to content

Pairwise Differential Expression Analysis (Without Replicates)

Introduction

Detecting genes that are differentially expressed between two experimental conditions (e.g. diseased vs healthy individuals) is a fundamental part of understanding the molecular basis of phenotypic variation. To carry out this task, there are statistical tools designed to perform differential expression analysis of the count data arising from RNA-seq technology (e.g. edgeR and maSigPro are statistical packages integrated into OmicsBox). However, these tools usually require the presence of replicates (both biological and technical) of each experimental condition that will be tested. This is a problem in cases where no replicates are available.

The Pairwise Differential Expression Analysis (Without Replicates) functionality offers a strategy for analyzing RNA-seq datasets that do not have replicates. It is based on the software package NOISeq, which belongs to the Bioconductor project. NOISeq is a novel nonparametric approach for the identification for differentially expressed genes from RNA-Seq count data. NOISeq creates a null or noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the genes in samples within the same condition. This reference distribution is then used to assess whether the M and D values computed between two conditions for a given gene are likely to be part of the noise or represent a true differential expression.

NOISeq method was designed to compute a differential expression on RNA-Seq data even when there are no replicates available for any of the experimental conditions. In this scenario, NOISeq can simulate technical replicates. The simulation relies on the assumption that read counts follow a multinomial distribution, where probabilities for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These mapping probabilities are approximated by using counts in the only sample of the corresponding experimental conditions. Given the sequencing depth (total amount of reads) of the unique available sample, the size of the simulated is a percentage of this sequencing depth, allowing a small variability.

Please remember that to obtain really reliable statistical results, biological replicates are needed.

Please cite NOISeq as:

Tarazona S, Furio-Tari P, Turra D, Di Pietro A, Nueda MJ, Ferrer A and Conesa, A (2015). "Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package." Nucleic Acids Research, 43(21), e140.

Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A and Conesa A (2011). "Differential expression in RNA-seq: a matter of depth." Genome Research, 21(12), 2213-2223.

Figure 1: Differential Expression Interface

Expression Data

The pairwise differential expression analysis application expects gene expression levels in the form of a count table. In OmicsBox, count tables can be generated via the Create Count Table application.

Count tables can also be imported from a text file. Go to transcriptomics → Load → Load RNA-Seq Count Table (expression data) (Figure 2) and select your .txt file containing the count table.

Figure 2:Count Table File

Run Pairwise Differential Expression Analysis (Without Replicates)

Go to transcriptomics → Run Differential Expression Analysis and choose the "Pairwise Differential Expression Analysis (Without replicates) option. This application requires a Count Table object as input data. The wizard allows to adjust analysis parameters (Figure 3 and Figure 4).

Preprocessing Data

  • CPM FIlter: Establish a filter to exclude genes with low counts across libraries, as those genes may interfere with the subsequent statistical approximations. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between samples (e.g. a CPM of 1 corresponds to a count of 6 in a sample with 6 million reads). To pass the filter, the gene's CPM should be above the filter level in at least one sample (contrast or reference sample).
  • Normalization Method: Normalization is an important step to make the samples comparable and to remove possible biases (as sequencing depth bias) in count data. The normalization methods available for this analysis are:

    • TMM: Weighted trimmed mean of M-values. In this method, weights are obtained from the delta method on Binomial Data (this method is recommended).
    • RPKM: Reads Per Kilobase per Million mapped reads. This method corrects for gene length and the number of sequencing reads (gene length is required).
    • Upper-quartile: 75% quantile for the counts for each library is used to calculate the scale factors for normalization.
    • None: No normalization method is applied.
  • Feature Length File: For RPKM normalization load a tab-delimited file (or ID-Value object) with two columns containing the name and length of each gene or genomic feature.

Figure 3: Preprocessing Data Page

Analysis Options

  • Replicates Simulation:

  • Number of Simulated Replicates: Set the number of replicates to be simulated for each condition.

  • Size of the Simulated Replicates: Establish the percentage of the total reads used to simulate each sample.
  • Variability: Variability in the simulated sample total reads.
  • Targets:

  • Contrast Condition: Choose the sample to be treated as contrast condition. Genes classified as UP will be upregulated in this sample.

  • Reference Condition: Choose the sample to be treated as reference condition. Genes classified as DOWN will be upregulated in this sample.

Figure 4: Analysis Options Page

Results

Once the input counts have been processed and analyzed via the "Pairwise Differential Expression Analysis (Without Replicates)" feature, a new tab is opened containing the results (Figure 5). The results table contains the differential expression statistics, where each row corresponds to a feature:

  • Contrast Condition: Normalized expression values for the contrast condition sample.
  • Reference Condition: Normalized expression values for the reference condition sample.
  • M: Is the log2-ratio of the two conditions.
  • D: The value of the difference between the conditions.
  • Probability: The probability of differential expression for each feature. It is obtained by comparing the M and D values of a given feature against the noise distribution. If the probability is higher than a given threshold (0.9 by default), the feature is considered to be differentially expressed between conditions.
  • Ranking: Is a summary statistic of M and D values equal to -sign(M)*sqrt(M^2 + D^2), which can be used as a ranked value in gene set enrichment analysis (GSEA).

Genes that have not passed the filtering step are not shown in the results tab.

Results can be saved as a Pairwise Results object. Note that it is not possible to perform the analysis on this object. For this purpose, you have to open the Count Table object.

Figure 5: Differential Expression Results

A result page will show a summary of the pairwise differential expression analysis results (Figure 6).

Figure 6: Results Summary

Side Panel

Actions

Summary Report

It shows the Summary report previously explained in the above "Results" section (Figure 6).

Rename Features

This option allows modifying the sequence IDs in the Feature column using different methods:

  • Add: Add a prefix or sufix to all IDs in the table.
  • Replace: Replace specific text within the IDs. The text to be replaced must be defined in the Find parameter using a regular expression (regex).
  • Mapping: Use a mapping file to rename features. The mapping file must be a tab-separated text file with two columns: the first column contains the original feature IDs from the dataset, and the second column contains the new feature names. If duplicate IDs occur during renaming, you can define how they are handled:
    • Sum Rows: Combine counts for all matching features.
    • First Row: Retain only the counts of the first occurrence.
Set Up/Down Tags

It re-assigns the UP and DOWN labels based on different filtering cutoffs (Figure 7). Tags will be updated, and the result section of the Result Summary and statistical charts will change according to the new cutoffs.

During the Pairwise Differential Expression Analysis (without replicates), raw counts are transformed according to the normalization method selected in the analysis configuration. Go to Export Normalized Counts(sidebar) to export normalized counts to a tabular text file.

Figure 7: Set Up/Down Tags

Fisher's Exact Test

Fisher’s Exact Test can be used to find GO terms that are over and under-represented in a set of genes (test set) with respect to a reference group (reference set). This set of genes can be the differentially expressed genes of differential expression analysis, a set of genes related to a phenotype of interest, etc. Fisher’s Exact Test uses a contingency table-based method to examine the association between two kinds of classification.

The subset of genes that will be considered as a Test-set (Figure 8) has to be provided. Up-regulated and down-regulated genes are those that are tagged according to the criteria established by the option "Set Up/Down Tags".

The project containing the functionally annotated sequences that will be used as a reference background set should be provided.

The remaining parameters are explained in the Fisher's Exact Test section.

Figure 8: Fisher’s Exact Test

Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

The "Probability Threshold for Ranked List" parameters allow setting a filter to exclude those genes whose probability value is not above it (Figure 9). The ranked gene list will be created using the "ranking" statistic.

The project containing the functionally annotated sequences that will be used as a reference background set should be provided.

The rest of the parameters are explained in the Gene Set Enrichment Analysis section.

Figure 9: Gene Set Enrichment Analysis

Charts

Different statistics charts can be generated for a global visualization of the results. These charts can be found under the Side Panelof the Pairwise Results Viewer.

Results Chart

A bar chart shows the number of total features, kept features (those who have passed the filtering step), differentially expressed features, up-regulated features, and down-regulated features (Figure 10).

Figure 10: Results Chart

MDS Plot

It generates a two-dimensional scatterplot in which the distances represent the typical log2 fold changes between samples. You can select an experimental factor by which you want to color the MDS graphic (Figure 11).

This plot is only available if the input count table contains more than 2 samples (although only two of them are compared).

Figure 11: MDS Plot

Expression Plot

A scatter plot showing the average expression values of each condition (Figure 12). Differentially expressed features considering the probability threshold (0.9 by default) will be highlighted in red.

Figure 12:Expression Plot

MD Plot

A scatter plot showing the log-fold change (M) and the absolute value of the difference in expression between conditions (D). D values are displayed in a log scale (Figure 13).

Figure 13: MD Plot

Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors (Figure 14). The dendrograms added to the left and top sides are produced by a hierarchical clustering method that takes as input the Euclidean distance computed between genes (left) and samples (top).

The heatmap supports zooming by keeping clicked a node of either of the two dendrograms. The first bars contain the experimental design of the data showing the association between samples and experimental covariates.

Genes that will be displayed can be selected in the wizard. There are three options:

  • The Top 50 differentially expressed genes (ranked by FDR).
  • All differentially expressed genes.
  • Provide an ID list containing the genes to represent.

Differentially expressed genes are those that are labeled as UP or DOWN in the table project ("Tags" column). The criteria for considering a gene as differentially expressed can be adjusted using the option "Set Up/Down Tags".

Furthermore, the wizard allows adjusting the type of expression data that will be represented, as well as the transformation that can be applied to this data.

Figure 14: Heatmap

Export

Besides the generic Export Table, this object contains the following export options.

Export Raw Counts

Export the raw counts to a text file. It will not contain the genes discarded during the filtering step.

Export Normalized Counts

Export the normalized counts to a text file. It will not contain the genes discarded during the filtering step.

Export Experimental Desing

Esport the experimental design to a tab-separated file. The first column will contain the samples, whereas the rest will be the experimental factors.

Context Menu

Besides the generic context menu options, the available action for this object is:

  • Extract Selection to New Tab: Extract the data from the selected rows and open it in a new tab.