Single Cell RNA-Seq Clustering

Introduction

This tool is designed to perform the clustering of cells coming from Single-Cell RNA Sequencing (scRNA-seq) experiments. That is, it aims to find groups of cells that share similar expression patterns, which should correspond to the same cell type or state. Prior to the clustering, this tool performs the preprocessing of the data in order to make it suitable for the clustering algorithm (Figure 1). This application is based on the widely-used Seurat package.

Please cite Seurat as:

Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018). https://doi.org/10.1038/nbt.4096

Stuart, T., Butler, A., Hoffman, P. et.al. Comprehensive Integration Of Single-Cell Data. Cell 177 (7): 1888-1902.e21 (2019). doi:10.1016/j.cell.2019.05.031.

Hao, Y., Stuart, T., Kowalski, M.H. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42, 293–304 (2024). https://doi.org/10.1038/s41587-023-01767-y

**Figure 1.** General workflow for the single-cell RNA seq clustering analysis

Run single-cell RNA sequencing clustering

In order to perform the Single Cell Clustering, a Count Table object has to be opened. It can be loaded from different formats by going to transcriptomics → Load → Single Cell RNA-Seq Count Matrix.

It can also be generated from FASTQ sequencing files with the Single Cell RNA-Seq Quantification tool available in transcriptomics → Single Cell RNA-Seq → Single Cell RNA-Seq Quantification.

Once the scRNA-Seq count table is loaded, go to the Side Panel → Actions → Clustering (Figure 2).

**Figure 2**. Launch clustering analysis from a scRNA-Seq Count Matrix Side Panel.

Configuration: Preprocessing.

These steps are meant to prepare data for the clustering analysis (Figure 3).

Normalization

Normalization aims to reduce differences in gene expression due to technical variation. This step tries to ensure that the observed heterogeneity within cells is due to biological reasons, rather than technical biases (Figure 3).

Normalize Data. Check this option to perform the normalization step. Extremely recommendable unless your data is already normalized or you are sure that your data does not need it.
Normalization Method. Which normalization method to use:
Log Normalization. Feature (gene) counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. This is then natural-log transformed using log1p.
Relative Counts. Feature counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. No log transformation is applied.
Centered Log Ratio Transformation (CLR). For each feature, the CLR-transformation is defined as the logarithm of the feature counts for one cell, divided by the geometric mean of all counts for that cell.

Data Adjustment

These steps are necessary whether the normalization is done or not, in order to prepare data for the dimensional reduction step.

High Variable Genes. Number of high variable genes to keep for further analysis. Keeping only those genes reduces both computational cost and non-relevant signals.
Scale Data. If checked, it scales features to have unit variance.
Center Data. If checked, it centers features to have zero mean.

Data Correction

Regress Out Mitochondrial Genes. Check this option to reduce the heterogeneity between cells associated with the expression of mitochondrial genes. This could prevent grouping cells during the clustering step that present a higher expression of those genes. In some cases, this information could not be informative since the higher expression of mitochondrial genes could be caused by technical reasons (e.g. because the mRNA has been leaked during cell manipulation) rather than by biological reasons. In order to perform this analysis, a file with a list of mitochondrial genes must be provided with one gene per line (Figure 4).
Regress Out Cell Cycle Genes. Check this option to reduce the heterogeneity between cells associated with the expression of cell cycle genes. This could prevent grouping cells during the clustering step that are in the same developmental stage, independently of their type. In some cases, these differences in cell cycle may be uninformative but, in other cases, they could be treated as indicative of proliferating cell populations which can be different across treatment conditions, for example. So whether to check this option or not will depend on the dataset under study and the target of the experiment. In order to perform this analysis, a file with cell cycle genes in one column and the cell cycle phase they belong to (S or G2/M) must be provided. One gene per line and columns separated by a tab (Figure 5).

Dimensional Reduction

Common single-cell RNAseq analysis implies from hundreds to thousands of cells and tens of thousands of genes. That is, it is really high dimensional data, so it is advisable to reduce the dimensionality of the dataset in order to reduce the computational cost of further analysis. The most widely used method for dimensional reduction is Principal Component Analysis (PCA). Keeping only the firsts Principal Components for further analysis reduces the dimensionality of the data while keeping the heterogeneity of the dataset, since they explain the largest amount of variance present in the sample. For more details please see "Orchestrating Single-Cell Analysis", 2020.

Principal Components. Number of Principal Components to compute from the Principal Component Analysis. It must be smaller than the number of cells present in the count table.

**Figure 3**. Preprocessing wizard page.

**Figure 4**. Mitochondrial genes file example.

**Figure 5**. Cell cycle genes file example.

Configuration: Multi-sample Data Integration.

This is an additional step needed when the input count table contains data coming from multiple samples e.g. from wild-type and mutant organisms, from control and stimulated samples, etc. All these situations could cause changes in gene expression that could make a joint analysis of all the data difficult, with cells clustering both by condition and by cell type (Figure 6-A). In order to avoid that, the Multi-sample Data Integration step aims to integrate scRNA-seq datasets by identifying common cell types based on common sources of variation. As a result, this step enables the identification of shared populations across datasets (Figure 6-B and C) and thus further downstream comparative analyses.

Integration Factor. Choose the condition to integrate datasets by. Datasets will be "integrated" by this condition so it doesn’t interfere during the clustering of cells. If no integration is to be done, select "None".
Integration Method. The statistical approach to use to integrate datasets.
Harmony. This method uses an iterative graph-based clustering algorithm to perform the integration. It performs a first soft clustering, favoring clusters with mixed dataset representation. Then, it gets the cluster centroids for each dataset and estimates a correction factor for each dataset and cluster. Finally, it applies the correction factor and moves the cells on each cluster accordingly. This procedure is repeated until the algorithm converges. For a detailed description of the algorithm, please see Korsunsky, I., Millard, N., Fan, J. et al. (2019).
- Theta. Diversity clustering penalty parameter. Adjusting this parameter can influence the diversity of the clusters. Larger values of theta result in more diverse clusters.
- Lambda. Ridge regression penalty parameter. It can affect the balance between underfitting and overfitting the model. Bigger values protect against over-correction.
- Tau. Protection against over-clustering small datasets with large ones.
- Nº Clusters. The number of clusters in the model. A value of 1 is equivalent to simple linear regression. This parameter can greatly influence the resolution of the data integration.
- Epsilon. Convergence tolerance for Harmony. It determines when the algorithm should stop iterating, based on the improvement in the objective function.
- Seurat Integrations. The Seurat algorithms are "anchor-based". These algorithms first embed all samples into a shared dimensional reduction space. Then, they find "anchors" between datasets, that is, equivalent cells across datasets. The anchors are identified with the mutual nearest neighbors (MNNs) algorithm. The different methods differ in the approach used to compute the dimensional reduction:
- Seurat-CCA. The classic integration algorithm, which performs Canonical Correlation Analysis. It is recommended when the same cell types are expected on different datasets, but there are substantial gene expression differences. This, this integration is suitable when experimental conditions or disease states introduce very strong expression shifts, or when integrating datasets across modalities and species. For a detailed description of the method please visit Butler et al. 2018.
- Seurat-Joint PCA. The dimensional reduction is one unique PCA (Principal Components Analysis) using all the datasets.
- Seurat-RPCA. One PCA is computed per each dataset, and then each dataset is projected into the other’s PCA. It is more conservative, faster, and doesn’t tend to overintegrate the datasets as much as the CCA method. It is recommended when a high proportion of cells in one dataset have no matching type in the other datasets, when the datasets originate from the same platform (i.e. multiple lanes of 10x genomics), or when there are a large number of datasets or cells to integrate. For a more detailed description please visit Stuart et al. 2019.
- All three methods use the same parameters for the "anchors" identification:
- N. Dimensions for Integration. The number of dimensions to use from the dimensional reduction space for the integration step.
- K Anchor. How many neighbors (k) to use during the MNNs when picking anchors. A higher number will result in a stronger integration.
- K Filter. How many neighbors (k) to use during the MNNs when filtering anchors.
- K Score. How many neighbors (k) to use during the MNNs when scoring anchors.
- K Weight. How many neighbors (k) to use during the MNNs when weighing anchors.

The Cell Metadata needed to perform integration can be specified while Merging Single-cell RNA-Seq counts or by Adding Cell Metadata.

**Figure 6.** UMAP plots of 10,799 cells coming from wild type (WT) and Rb mutant Drosophila eye disc, prior to **(A)** and post **(B)** alignment. After alignment, cells across conditions are grouped together based on shared cell type, allowing for a single joint clustering **(C)** to detect 11 populations.

**Figure 7**. Multi-sample Data Integration configuration wizard page.

Configuration: Clustering.

This step performs the actual cell clustering. It groups cells with similar expression patterns, which should correspond to the same cell type. For this analysis, Seurat uses a graph-based clustering algorithm. For more information, please see "Orchestrating Single-Cell Analysis", 2020.

Clustering

These parameters affect the clustering algorithm (Figure 8).

Define Dimensions by. How to decide the number of dimensions to use from the dimensional reduction (PCA) for the clustering step.
Elbow Point. This option automatically decides the number of dimensions to use. The assumption is that each of the top PCs capturing biological signals should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained from the last "biological" PC to the others. This is the so-called "Elow Point".
Manual. With this option, it is necessary that the user specifies the number of PCs to use in the "Number of Dimensions" parameter.

It should be taken into account that strong biological variation in the early PCs will shift the elbow point, potentially excluding weaker (but still interesting) variation in the next PCs immediately following it. So it could be interesting to repeat the analysis increasing the number of dimensions established by the Elbow Point to see if it improves the results.

Number of Dimensions. The number of dimensions to use from the dimensional reduction for the clustering step. This option is only available if the "Manual" option from the "Dimensions Choice" is selected.
k-value. The number of neighboring cells computed during the clustering algorithm. Normally, a greater k-value would produce a smaller number of clusters, and vice versa.
Resolution. This parameter determines how "fine" the clustering is: values above 1.0 would produce a larger number of clusters, and vice versa.

UMAP Configuration

These parameters affect only the UMAP visualization. The Uniform Manifold Approximation and Projection (UMAP) method is a non-linear dimensionality reduction technique (similar to PCA), but this time it is used to plot the high-dimensional single-cell data into two dimensions. The UMAP is used for visualization purposes because it represents the variability of single-cell data more accurately than the PCA. For more information, please refer to "Orchestrating Single-Cell Analysis", 2020.

Point’s Minimum Distance. This controls how tightly are the points (cells) compressed together.
Point’s Spread. This controls how expanded the points are. In combination with the minimum distance, this determines how clustered the points are.

**Figure 8**. Clustering configuration wizard page.

Results

Once the input counts have been processed and analyzed via the "scRNA-seq Clustering '' tool, a new tab is opened containing the Clustering Results (Figure 9). This new tab contains the same information as the count table and an additional column with the cluster assigned.

It also generates a Summary Report (Figure 9) with the following sections:

The "Data Overview" table shows some general statistics about the data.
The "Clustering Results" table shows the total number and a link to a list of cells belonging to each cluster. Before the table, there is a line specifying the number of dimensions from the PCA used for the clustering step.
The "Analysis Parameters" table shows the parameters used for the clustering analysis.

In addition, an interactive UMAP/tSNE viewer is also opened (Figure 10). This viewer allows coloring the cells by clusters, cell metadata, custom groups, or gene expression. In addition, cells can be selected with different tools and assigned to custom groups.

**Figure 9**. Clustering Results main viewer.

**Figure 10**. A detailed view of the interactive UMAP/tSNE viewer.

Side Panel Features

Summary Report

It shows the Summary Report previously explained in the above "Results" section (Figure 9).

Add Cell Metadata

Add per-cell information (e.g. cell type annotations, experimental conditions, etc.) from a text file. The annotations will be stored in the object and can be used for further analysis. For example, they can be visualized in the UMAP/tSNE viewer, used for differential expression analysis or for trajectory analysis, etc. It will open a wizard (Figure 11) to specify the text file and which group(s) to import. Columns must be tab-separated and the groups will be read from the first line in the text file (Figure 12).

**Figure 12**. Example of a text cell metadata file. Columns are separated by tab and the first line contains the column names.

Extract Counts

Extract the counts for cells belonging to the selected subgroup(s) (Figure 13).

For example, it may be interesting in the scenario in which one or more big clusters have been obtained, so it could be desirable to extract them and re-run the clustering to obtain sub-clusters of cells within it. That could make it possible to find more specific cell types.

Cell Type Prediction

Perform Cell Type Prediction with SingleR.

Trajectory Analysis

Order cells along pseudotime using Monocle3.

Differential Expression

Perform differential expression analysis between two or more clusters, cell types, conditions, etc.

Charts

Expression Profile

With this feature, it is possible to see the expression levels of known gene markers across the different clusters in order to identify putative cell types. To this end, a Bubble Plot is generated with clusters in rows and the specified genes in columns (Figure 16). The size of the dot represents the percentage of cells expressing the gene, that is, the percentage of cells that have a gene expression level greater than 0. The color represents the average gene expression in that cluster. You can configure the following options in the wizard (Figure 17):

Input genes.

Input Genes. You can specify here which genes to plot. The gene name or ID should correspond to the one in the input count table used during the clustering analysis. They can be specified in two ways:
Text: write the genes to plot in the "Genes List" text box, one per line.
File: specify a file containing the genes to plot, one gene per line.

Plot Options.

This affects the visualization.

Scale Gene Expression. When checked, it applies the Z-Score transformation to scale average gene expression across clusters. It allows the visualization of both highly and lowly expressed genes on the same color scale. It should be noted that this may exaggerate the results, but it is still advisable if you are going to plot genes with different expression level ranges. If unchecked, it colors the dots by the raw average gene expression of each cluster.

**Figure 16**. Expression Profile example.

**Figure 17**. Expression Profile Wizard.

Metadata Pie Chart

Generates a pie chart with the number of cells in each category (Figure 18).

**Figure 18**. Seurat clusters pie chart.

Elbow Plot

Elbow Plot (Figure 19) shows the amount of variance (given by the standard deviation) explained by successive PCs. This helps to decide how many principal components to use for the clustering algorithm, in case you want to re-run the clustering with different parameters.

Export

Export Cell Metadata

It generates a file containing one row per each cell, containing the Cell ID, the cluster label assigned, and the sample of origin (Figure 20). Columns are separated by a tab.