Single Cell RNA-Seq Differential Expression Analysis

Introduction

This tool performs a differential expression analysis of single-cell RNA-seq data. It can be run after either the Single-cell RNA-seq Clustering or the Trajectory Analysis, since both tools assign each cell to a group (a cluster in the former, a pseudotime range in the latter).

This tool is based on the R package edgeR. Please cite edgeR as: Robinson MD, McCarthy DJ and Smyth GK (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics, 26(1), pp. 139-140.

Run scRNA-seq Differential Expression Analysis

From the Side Panel of a scRNA-seq Clustering object or Trajectory Analysis object, go to Actions → Differential Expression.

Input

A differential expression analysis requires two inputs: a count table and an experimental design. The count table has genes in rows and cells in columns, and each value is the expression level of a gene in a cell. The experimental design specifies which group each cell belongs to. Both inputs are retrieved automatically from the scRNA-seq Clustering or Trajectory Analysis object.

Configuration 1. Filtering and Normalization

Genes Filtering

This step removes genes with low counts from the analysis. Such genes can interfere with some of the statistical approximations used in the analysis and do not contribute meaningful information. The filtering can be applied in two different ways (Figure 1), depending on the type of value given:

Counts Per Million. Filtering is performed on a count-per-million (CPM) basis to account for differences in library size between cells. For example, a CPM of 1 corresponds to a count of 6 in a cell with 6 million reads.
CPM Filter. Establish the minimum CPM. Set this parameter to 0 to not filter.
Cells Reaching CPM Filter. Set the minimum number of cells in which the gene's CPM must be above the filter level. For example, if this value is set to 5, the gene's CPM must be above the filter level in at least 5 cells. Set it to 0 to not filter.
Raw Counts. The values entered in the parameters below refer to raw expression values. In summary, this strategy keeps genes that have at least a minimum read count in a worthwhile number of cells. More details can be found in edgeR's documentation.
Minimum Sample Count. Keep genes that have at least this minimum of counts in at least n cells, where n is the smallest group size. For example, in the case of testing clusters, the n would be the size of the smallest cluster.
Minimum Total Count. Keep genes that have at least this total number of counts across all the cells.
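
The two filtering strategies above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not the code run by the tool or by edgeR: the function names are hypothetical, the raw-count rule is applied directly to the counts, and the comparison ">=" is an assumption chosen so that a threshold of 0 keeps every gene.

```python
from collections import Counter


def cpm(counts):
    """Convert a genes x cells table of raw counts to counts per million."""
    n_cells = len(counts[0])
    # The library size of a cell is the sum of its column.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_cells)] for row in counts]


def filter_by_cpm(counts, min_cpm, min_cells):
    """Indices of genes reaching min_cpm in at least min_cells cells."""
    kept = []
    for i, row in enumerate(cpm(counts)):
        # Assumption: ">=" is used so that min_cpm = 0 keeps every gene.
        if sum(value >= min_cpm for value in row) >= min_cells:
            kept.append(i)
    return kept


def filter_by_raw_counts(counts, groups, min_count, min_total_count):
    """Indices of genes with min_count reads in at least n cells (n = smallest
    group size) and at least min_total_count reads over all cells."""
    n = min(Counter(groups).values())
    kept = []
    for i, row in enumerate(counts):
        enough_cells = sum(value >= min_count for value in row) >= n
        if enough_cells and sum(row) >= min_total_count:
            kept.append(i)
    return kept


# Toy table: 3 genes (rows) x 4 cells (columns); every cell has 100 reads.
counts = [
    [10, 12, 0, 1],
    [0, 0, 0, 1],
    [90, 88, 100, 98],
]
groups = ["cluster1", "cluster1", "cluster2", "cluster2"]

print(filter_by_cpm(counts, min_cpm=50_000, min_cells=2))
print(filter_by_raw_counts(counts, groups, min_count=5, min_total_count=15))
```

In this toy table the second gene is detected in a single cell only, so both strategies discard it. A cell with zero total reads would cause a division by zero in `cpm`; real data would need a guard for that case.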

Normalization

Normalization takes the form of scaling factors for the library sizes, which enter into the statistical model. These correction factors are used to compute effective library sizes. For further details, please refer to the edgeR User's Guide. You can select the normalization method to be used:

TMM (Trimmed Mean of M values): In this method, weights are obtained from the delta method on Binomial Data.
TMM with Zero Pairing: This is a variant of TMM that should perform better for data with a high proportion of zeros.
RLE (Relative Log Expression): Scale factors are the median ratio of each sample to the median library (geometric mean of all samples).
Upper-quartile: The 75% quantile of the counts in each library is used to calculate the scale factors.
None: All normalization factors are set to 1, so no normalization is performed.
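
To make the idea of scaling factors and effective library sizes concrete, here is a minimal sketch of the RLE method as described in the list above. It is an illustration under stated assumptions, not the tool's implementation: the function name is hypothetical, genes with a zero count in any sample are skipped when taking the median ratio, and the factors are rescaled so that their product is 1.

```python
import math
from statistics import median


def rle_norm_factors(counts):
    """Return (normalization factors, library sizes) for a genes x samples table."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]

    # Ratio of each sample to the "median library" (per-gene geometric mean).
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # assumption: genes with any zero count are ignored
        geo_mean = math.exp(sum(math.log(v) for v in row) / n_samples)
        for j, value in enumerate(row):
            ratios[j].append(value / geo_mean)

    # Median ratio per sample, expressed relative to the library size.
    raw = [median(r) / lib for r, lib in zip(ratios, lib_sizes)]

    # Assumption: rescale so that the factors multiply to 1.
    center = math.exp(sum(math.log(f) for f in raw) / n_samples)
    return [f / center for f in raw], lib_sizes


# Three genes are unchanged between samples; the fourth is 10x higher in sample 2.
counts = [[10, 10], [20, 20], [30, 30], [40, 400]]
factors, lib_sizes = rle_norm_factors(counts)
effective = [f * lib for f, lib in zip(factors, lib_sizes)]
print(factors, effective)
```

Although the raw library sizes differ (100 and 460 reads), the effective library sizes come out equal, which reflects that most genes are expressed at the same level in both samples.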

**Figure 1.** Filtering and Normalization Page.

Configuration 2. Metadata

This page shows the experimental design stored in the input object (Figure 2). It contains each of the samples or count matrices present in the object with the factors and conditions specified by the user.

In addition, it is possible to specify the column containing Biological Replicates. In this case, a pseudobulk analysis is performed: the counts of all cells belonging to the same biological replicate and cluster are aggregated (Figure 3). Otherwise, each cell is treated as a replicate. When biological replicates are available, the pseudobulk approach is highly recommended (Squair et al., 2021).
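
The aggregation step of a pseudobulk analysis can be sketched as follows. The function name and the toy data are hypothetical, and aggregation is assumed to be a plain sum of raw counts.

```python
from collections import defaultdict


def pseudobulk(counts, replicates, clusters):
    """Sum the counts of all cells sharing the same (replicate, cluster) pair.

    counts is a genes x cells table; replicates and clusters hold one label
    per cell. Returns a dict mapping (replicate, cluster) to a count vector.
    """
    n_genes = len(counts)
    sums = defaultdict(lambda: [0] * n_genes)
    for j, key in enumerate(zip(replicates, clusters)):
        column = sums[key]
        for i in range(n_genes):
            column[i] += counts[i][j]
    return dict(sums)


# Toy table: 2 genes x 4 cells.
counts = [[1, 2, 3, 4], [0, 1, 0, 1]]
replicates = ["r1", "r1", "r2", "r2"]
clusters = ["c1", "c1", "c1", "c2"]

for key, column in pseudobulk(counts, replicates, clusters).items():
    print(key, column)
```

Each resulting column then plays the role of one sample in the differential expression test, instead of each individual cell.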

Configuration 3. Design

This page is used to configure the design of the differential expression test (Figure 4). Please refer to the blog post "Tutorial: Single-cell RNA-Seq Differential Expression Analysis" for a detailed explanation of how to configure the analysis effectively.

Simple Design

This will test for differential expression taking into account only one factor.

Test Contrasts Separately. If checked, one DE test will be performed for each of the conditions specified in "Primary Contrast Conditions". Otherwise, only one DE test will be performed taking as contrast all the specified conditions together.
Primary Factor. Select which factor (or column) from the metadata to test for differential expression.
Primary Contrast Conditions. Select which condition(s) to use as contrast.
Primary Reference Conditions. Select which condition(s) to use as reference.
Blocking Factor. Adjust for baseline differences of the selected experimental factor. Please refer to the "3.4.2 Blocking" section of the edgeR User's Guide for a detailed description.

Note that if the "Test Contrasts Separately" option is checked, the same condition can be selected both as primary contrast and as reference. However, in each test the condition being used as contrast is excluded from the reference group.
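
The effect of "Test Contrasts Separately" on the list of tests can be illustrated with a short sketch. The function name is hypothetical, and the logic only mirrors the behaviour described above.

```python
def build_tests(contrast_conditions, reference_conditions, separately):
    """Return a list of (contrast group, reference group) pairs, one per DE test."""
    if not separately:
        # A single test with all contrast conditions pooled together.
        return [(list(contrast_conditions), list(reference_conditions))]
    tests = []
    for condition in contrast_conditions:
        # The condition under test never appears in its own reference group.
        reference = [c for c in reference_conditions if c != condition]
        tests.append(([condition], reference))
    return tests


for contrast, reference in build_tests(["A", "B"], ["A", "B", "C"], separately=True):
    print(contrast, "vs", reference)
```

With conditions A and B selected as contrast and A, B and C as reference, this yields two tests: A against B and C, and B against A and C.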

Multifactorial Design

This will test for differential expression between cells belonging to the Primary Contrast Condition + Secondary Contrast Condition and cells belonging to the Primary Contrast Condition + Secondary Reference Condition. This option is only available if the metadata contained in the object has more than one factor, for example, if an experimental design was provided during the Single-cell RNA-Seq Clustering.

Secondary Factor. Select which factor (or column) from the metadata to use as the secondary factor.
Secondary Contrast Conditions. Select which condition(s) to use as contrast.
Secondary Reference Conditions. Select which condition(s) to use as reference.

Results

When the analysis finishes, the Single-Cell Differential Expression (SCDE) results are opened in a new tab (Figure 5). The results table shows the differential expression statistics, where each row corresponds to a contrast and a feature:

Tags: Indicate whether a gene is considered upregulated (FDR ≤ 0.05, logFC ≥ 0) or downregulated (FDR ≤ 0.05, logFC < 0).
Contrast: which conditions from the primary factors have been used as contrast.
Reference: which conditions from the primary factors have been used as reference.
Feature: feature used for counting in the input Count Table and for DE testing (e.g. gene, exon, transcript, etc.).
FDR: False Discovery Rate calculated by the Benjamini-Hochberg method (multiple hypothesis testing corrections).
logCPM: The average log2 counts per million.
logFC: A measure that describes how much the expression changes between conditions (log2-fold-changes are shown).
LR: Likelihood ratio statistic for the GLM (Likelihood Ratio Test).
F: Quasi-likelihood F-statistic for the GLM (Quasi Likelihood F-test).
PValue: raw p-value.
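
The relationship between the PValue and FDR columns can be illustrated with a minimal sketch of the Benjamini-Hochberg adjustment. The function name is hypothetical and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is multiplied by the number of tests and divided by its rank, and the result is capped by the adjusted value of the next larger p-value.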

Genes that have not passed the filtering step are not shown in the new tab.

In addition, a Summary report and a chart showing an overview of the results are generated as well.

The Summary report (Figure 6) contains general information about the analysis, divided into these sections:

Dataset Overview: shows the number of features in the starting count table, the number of features filtered out, and the number of features in the final count table.
Results: shows the number of features considered UP and DOWN regulated in the entire project and for each of the primary contrast conditions. In addition, it is possible to obtain the list of UP or DOWN features of each contrast by clicking on the Id list button.
Analysis Parameters: shows the parameters used for the analysis.

The Results Overview chart (Figure 7) shows the total number of features present in the analysis as well as the number of UP and DOWN features for each of the primary contrast conditions.

Tip: it is possible to remove the "Total" column from the chart by applying the right filters in the Side Panel. In the "Filtering Options" section, select the following configuration:

Filter method: Absolute Value.
Show only: Lower Than.
Threshold: specify a value lower than the "Total" number of features, but greater than the rest.
Show others: disabled.

**Figure 5.** Single-cell Differential Expression results.

**Figure 6.** Single-cell Differential Expression results summary.

**Figure 7.** DE Results Overview chart.

Side Panel Features

Actions

Summary

It shows the Summary report explained in the "Results" section above (Figure 6).

Set UP/DOWN Tags

It re-assigns the UP and DOWN tags based on different filtering cutoffs (Figure 8). The tags are updated, and the Results section of the Summary report and the statistical charts change according to the new cutoffs.
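
As an illustration of how tags depend on the cutoffs, the sketch below assigns a tag from an FDR and a logFC value. It assumes that the cutoffs are a maximum FDR and a minimum absolute logFC; the function and parameter names are hypothetical, and the defaults reproduce the thresholds of the results table (FDR ≤ 0.05).

```python
def assign_tag(fdr, log_fc, fdr_cutoff=0.05, log_fc_cutoff=0.0):
    """Return "UP", "DOWN" or "" (no tag) for one feature of one contrast."""
    if fdr > fdr_cutoff:
        return ""
    if log_fc >= log_fc_cutoff:
        return "UP"
    if log_fc <= -log_fc_cutoff:
        return "DOWN"
    # Significant, but the change is smaller than the logFC cutoff.
    return ""


print(assign_tag(0.01, 2.0))
print(assign_tag(0.01, -2.0))
print(assign_tag(0.01, 0.5, log_fc_cutoff=1.0))
```

Raising the logFC cutoff or lowering the FDR cutoff can only remove tags, so the UP and DOWN counts in the Summary report shrink accordingly.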

Fisher’s Exact Test

Fisher’s Exact Test can be used to find GO terms that are over- or under-represented in a set of genes (test set) with respect to a reference group (reference set). In this case, the test set is composed of all the features tagged as UP or DOWN that belong to the primary contrast condition specified in "Contrast to Test" (Figure 9). Once the analysis has finished, the Fisher’s Exact Test results are opened in a new tab (Figure 10). Please refer to the Fisher's Exact Test section of the manual for further details about the analysis and the parameters.

**Figure 10.** Fisher’s Exact Test results from Single-cell Differential Expression.
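
For readers unfamiliar with the test, the over-representation case of Fisher's Exact Test for a single GO term can be sketched with the hypergeometric distribution. This is a simplified, one-sided illustration with a hypothetical function name, and it assumes that the test set and the reference set do not overlap; it is not the implementation used by the tool.

```python
from math import comb


def over_representation_p(test_with, test_size, ref_with, ref_size):
    """One-sided Fisher's exact p-value for over-representation of a term.

    test_with / ref_with: genes annotated with the term in each set.
    test_size / ref_size: total genes in each set (sets assumed disjoint).
    """
    total = test_size + ref_size
    annotated = test_with + ref_with
    upper = min(annotated, test_size)
    # Probability of drawing at least test_with annotated genes when
    # test_size genes are picked at random from all genes.
    favourable = sum(
        comb(annotated, k) * comb(total - annotated, test_size - k)
        for k in range(test_with, upper + 1)
    )
    return favourable / comb(total, test_size)


# 3 of 4 test genes carry the term, against 1 of 4 reference genes.
print(over_representation_p(3, 4, 1, 4))
```

A small p-value means that the term appears in the test set more often than expected by chance; the under-representation case sums the lower tail instead.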

Charts

Heatmap

A heatmap is a two-dimensional visual representation of data in which numerical values of points are represented by a range of colors. In this heatmap, rows correspond to the top differentially expressed features, columns to contrast conditions, and values to mean feature expression level.

The visualization can be configured in the wizard (Figure 11). First, select which genes to plot:

Top N Differentially Expressed Genes. Features are ranked according to the FDR and the top N are selected, where N can be set in the "Nº of DE Genes" parameter.
ID List. Features specified in the list will be plotted. A text file or an ID-List Object must be specified in the "ID List" parameter.
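
The "Top N" selection described above amounts to sorting features by FDR and keeping the first N. A minimal sketch, with a hypothetical function name and made-up values:

```python
def top_n_features(fdr_by_feature, n):
    """Return the n features with the smallest FDR, most significant first."""
    ranked = sorted(fdr_by_feature.items(), key=lambda item: item[1])
    return [feature for feature, _ in ranked[:n]]


fdr_by_feature = {"geneA": 0.2, "geneB": 0.001, "geneC": 0.04}
print(top_n_features(fdr_by_feature, 2))
```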

In addition, how to plot the data can be configured as well:

Expression Data. Which type of data to use for plotting: raw counts or CPM (Counts Per Million).
Logarithm: if checked, it applies the log2 for each expression value.
Z-Score: if checked, it applies the z-score.
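
The two optional transformations can be sketched as follows for one heatmap row. The details are assumptions made for the example and may differ from the tool: a pseudocount of 1 is added before taking log2 so that zero values are defined, and the z-score is computed per feature using the population standard deviation.

```python
import math
from statistics import mean, pstdev


def log2_transform(row, pseudocount=1.0):
    """log2 of each value; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(value + pseudocount) for value in row]


def z_score(row):
    """Center a row on its mean and scale it by its standard deviation."""
    mu = mean(row)
    sigma = pstdev(row)
    if sigma == 0:
        # A constant row carries no contrast between conditions.
        return [0.0 for _ in row]
    return [(value - mu) / sigma for value in row]


row = [0, 1, 3]  # mean expression of one feature in three conditions
logged = log2_transform(row)
print(logged)
print(z_score(logged))
```

The z-score puts every feature on the same scale, so that the heatmap colors show the relative pattern across conditions rather than absolute expression levels.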

Results Overview

This tool generates the chart explained in the "Results" section above (Figure 7).

Export

Export Raw Counts

Export the raw counts to a text file. It will not contain the genes discarded during the filtering step.

Export Normalized Counts

Export the normalized counts to a text file. It will not contain the genes discarded during the filtering step.

Export Table

Export the differential expression results to a text file.