Single Cell RNA-Seq Cell Type Prediction with SingleR

Introduction

Identifying the cell types present in your dataset is one of the key steps in single-cell RNA-Seq (scRNA-Seq) analysis. Reference-based methods have proven to be powerful and sensitive for this task, and SingleR is one of the most widely used.

This method compares the gene expression patterns of the cells in a query scRNA-Seq dataset with the expression of a reference, annotated, single-cell dataset. The method is divided into the following steps:

Training. Marker genes are identified for each cell type in the reference dataset. These marker genes are used to identify each cell type in the prediction step.
Prediction. Spearman correlation scores are computed for each query cell and reference label. The label with the highest score is assigned to each cell.
Fine-tuning. For each cell, a second round of prediction is performed only with the highest scoring labels.
Pruning. Low-confidence label assignments are pruned.
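
The training and prediction steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not SingleR's implementation: each reference label is reduced to a single log-expression profile, markers are chosen by the largest difference between label profiles, and all gene names, values, and function names are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_markers(profiles, top_n):
    """Training: union of the top_n most upregulated genes for every ordered label pair."""
    markers = set()
    for first in profiles:
        for second in profiles:
            if first == second:
                continue
            # Values are log-expression, so a difference is a log-fold change.
            diffs = [a - b for a, b in zip(profiles[first], profiles[second])]
            top = sorted(range(len(diffs)), key=lambda g: diffs[g], reverse=True)[:top_n]
            markers.update(top)
    return sorted(markers)

def predict(cell, profiles, markers):
    """Prediction: score the cell against every label and keep the best one."""
    scores = {
        label: spearman([cell[g] for g in markers], [profile[g] for g in markers])
        for label, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical reference: one log-expression profile per label over five genes.
reference_profiles = {
    "T cell": [5.0, 0.2, 0.1, 1.0, 3.0],
    "B cell": [0.1, 4.5, 0.3, 0.2, 3.1],
    "Monocyte": [0.2, 0.1, 6.0, 0.4, 2.9],
}
query_cell = [4.0, 0.0, 0.5, 2.0, 3.5]

marker_genes = select_markers(reference_profiles, top_n=2)
best_label, scores = predict(query_cell, reference_profiles, marker_genes)
print(best_label, scores)
```

In this toy example the query cell correlates best with the "T cell" profile, so that label is assigned. Fine-tuning and pruning would then operate on the resulting per-label scores.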

Please cite SingleR as:

Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). "Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage." Nat. Immunol., 20, 163-172. doi:10.1038/s41590-018-0276-y.

Run Single-cell RNA-Seq Cell Type Prediction

This option is available on the Side Panel of a Seurat Clustering object (Side Panel > Actions > Cell Type Prediction) (Figure 1). This will open a wizard to specify the reference annotation and the execution parameters.

Figure 1. Cell Type Prediction tool in the Side Panel of a scRNA-Seq Clustering results object.

Input

Specify the input reference annotation on this page and which metadata to use during the analysis (Figure 2).

Reference Annotation

Reference Format. The format of the reference annotation.
H5 Annotated Data. Files with the .h5ad extension, called AnnData. This is a compressed format used by many single-cell data scientists. It can be visualized with the software HDFView or loaded in OmicsBox. Inside the h5ad file, the count matrix must be stored in a group named "X", the cell metadata must be in a group named "obs" and the feature metadata in a group named "var". For more details about the format, please visit the AnnData Documentation.
Text File. Plain text file containing the count table. Cells must be in columns and genes in rows. In order to provide the cell annotations, an additional file must be specified in the "Annotation" parameter.
OmicsBox File. A .box file containing a Single-cell Annotated object, which can be the result of a Clustering, Cell Type Prediction, or Trajectory analysis.
Reference. Select here the single-cell annotated file.
Annotation. Only available if the "Reference Format" parameter is set to "Text File". It must be a text file containing the labels for each cell in the "Reference" file. It must include a header, one row per cell, and columns must be separated by a tab.
Cell Types. Factor in the cell metadata to predict the cell types. The values present in this factor will be assigned to the cells in the query dataset.

Single-cell RNA-Seq reference annotations can be downloaded from public databases or generated with OmicsBox. Recommended databases include Tabula Sapiens, Tabula Muris, CellXGene, and scPlantDB.

Feature Matching

Select Query Feature. The type of feature to use from the query single-cell dataset: "ID" or "Name". They correspond to the "Feature ID" and "Name" columns in the opened clustering results, respectively.
Select Matching Reference Feature. The type of feature name or ID to use from the reference annotation. The selected feature type must match the feature type selected for the query dataset; otherwise, the genes of the two datasets cannot be paired.

Figure 2. Input wizard page.
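
The feature matching requirement can be checked before running the analysis. The following is a minimal sketch, assuming the feature identifiers of both datasets have been exported to plain Python lists; the function name and the toy identifiers are hypothetical and not part of OmicsBox.

```python
def feature_overlap(query_features, reference_features):
    """Return the shared features and the fraction of query features found in the reference."""
    query = set(query_features)
    reference = set(reference_features)
    shared = query & reference
    fraction = len(shared) / len(query) if query else 0.0
    return shared, fraction

# Toy example: the query uses gene IDs while the reference uses gene names.
query_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618"]
reference_names = ["TP53", "BRCA1", "BRCA2"]
reference_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618"]

_, mismatched_fraction = feature_overlap(query_ids, reference_names)
_, matched_fraction = feature_overlap(query_ids, reference_ids)
print(mismatched_fraction, matched_fraction)
```

An overlap close to zero suggests that the query and reference feature types do not match (for example, IDs on one side and names on the other).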

Configuration

This page allows configuring the parameters for the cell type prediction, refining, and pruning steps (Figure 3).

Classification Parameters

Gene Marker Selection Method. First, SingleR is trained with the reference single-cell annotation by computing marker genes for each of the cell types present. Select here the statistical method to find differentially expressed (DE) genes between pairs of cell type labels in the reference. The identified DE genes obtained for each cell type will be used as marker genes to perform the prediction. Available options:
Classic. For each gene, this method computes the log-fold change between the medians of each pair of labels. Then, it sorts the genes by log-fold change and takes the top DE genes. It is better suited to references that come from bulk RNA-Seq analysis.
Wilcoxon Ranked Sum-Test. Instead of comparing means, this test evaluates differences in the ranks of observations between two groups. Then, it takes the top 10 upregulated genes per comparison. It is more suitable when the data do not follow a normal distribution, as is the case for single-cell RNA-Seq references.
Welch T-test. This statistical test compares the means of each pair of labels, even when their variances are unequal. Then, it takes the top 10 upregulated genes per comparison. It is especially useful when assumptions of equal variances are not met.
Tune Threshold. SingleR performs a first round of cell type predictions and, for each cell in the query, an annotation score is calculated for all the labels in the reference. The label with the highest score is assigned to each cell. During fine-tuning, a second round of prediction is performed only with the highest-scoring labels. The 'tune threshold' sets a range below the highest score to decide which labels to keep for refining the prediction. Only labels within this range are considered in the next iteration during the classification process. For example, consider cell X with 3 cell type candidates: A = 1, B = 0.9, C = 0.85. With a 'tune threshold' of 0.1, only cell types A and B will be included in the next classification cycle of SingleR.
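
The 'tune threshold' example above can be reproduced with a short sketch. It assumes that a label is kept when its score lies within the threshold of the highest score (boundary included, as in the example); the function name is hypothetical, and the small tolerance only guards against floating-point rounding.

```python
def labels_for_next_round(scores, tune_threshold, tolerance=1e-12):
    """Keep the labels whose score is within tune_threshold of the best score."""
    best = max(scores.values())
    cutoff = best - tune_threshold - tolerance
    return [label for label, score in scores.items() if score >= cutoff]

# Candidate scores for cell X, taken from the example in the text.
cell_x_scores = {"A": 1.0, "B": 0.9, "C": 0.85}
kept = labels_for_next_round(cell_x_scores, tune_threshold=0.1)
print(kept)
```

With a threshold of 0.1, labels A and B are kept and C is discarded. A smaller threshold such as 0.05 would leave only A, so no further refinement would be needed for that cell.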

Pruning Parameters

After predicting and fine-tuning the prediction, a pruning step is performed. It removes low-quality assignments based on the cell-label score.

The SingleR algorithm inherently labels every cell, even when the cell's true label is not present in the reference set, which can lead to incorrect assignments. To identify and prune low-quality predictions, SingleR calculates a "delta" value for each cell, defined as the difference between the score for the assigned label and the median score across all labels. A small delta indicates that the cell matches all labels with similar confidence, so the assigned label is less reliable. There are two methods to prune labels:

Prune Outliers. For every label before fine-tuning, the distribution of delta values across all assigned cells is generated (Figure 4). Cells whose delta falls more than a given number of Median Absolute Deviations (MADs) below the median delta of their label are pruned. This approach assumes that the majority of cells are accurately assigned to their true labels and that cells sharing the same label exhibit a unimodal distribution of delta values.
Nº MADs. Numeric scalar specifying the number of Median Absolute Deviations (MADs) to use for defining low outliers in the per-label distribution of delta values. The default is 3, which is motivated by the fact that, for a normal distribution, about 99.7% of observations lie within 3 standard deviations of the mean. Smaller values for Nº MADs will increase the stringency of the pruning.
Prune by Threshold. Cell labels with deltas under a given threshold are pruned. This serves as an alternative filtering method when the assumptions underlying outlier detection are not met. For instance, if a label is consistently misassigned, its erroneous assignments will not appear as outliers and will therefore not be pruned by the outlier method. In such scenarios, setting a threshold helps remove low-scoring cells.
Min. Delta. The minimum acceptable delta for each cell.
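
Both pruning methods can be illustrated with a small sketch. It is not SingleR's code: the function names and the toy values are hypothetical, and scaling the MAD by 1.4826 (the default constant of R's mad function) is an assumption about how the outlier cutoff is computed.

```python
from statistics import median

def delta(scores, assigned_label):
    """Difference between the assigned label's score and the median score across all labels."""
    return scores[assigned_label] - median(scores.values())

def prune_outliers(deltas, n_mads=3.0, scale=1.4826):
    """Flag cells whose delta lies more than n_mads MADs below the median delta of the label."""
    center = median(deltas)
    mad = scale * median(abs(d - center) for d in deltas)  # scaled MAD (assumption)
    cutoff = center - n_mads * mad
    return [d < cutoff for d in deltas]

def prune_by_threshold(deltas, min_delta):
    """Flag cells whose delta is below a fixed minimum."""
    return [d < min_delta for d in deltas]

# Delta of a single cell assigned to label "A".
example_delta = delta({"A": 1.0, "B": 0.9, "C": 0.85}, "A")

# Hypothetical deltas of six cells assigned to the same label.
label_deltas = [0.30, 0.32, 0.28, 0.31, 0.29, 0.05]
outlier_flags = prune_outliers(label_deltas, n_mads=3.0)
threshold_flags = prune_by_threshold(label_deltas, min_delta=0.1)
print(example_delta, outlier_flags, threshold_flags)
```

In this toy example only the last cell (delta 0.05) is flagged by either method. If most cells of a label had low deltas, the outlier method would flag none of them, which is the situation where a fixed Min. Delta is more useful.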

Output

Save PNG Scores Heatmap. Location where the scores heatmap will be saved in PNG format.

Figure 3. Configuration wizard page.

Figure 4. Example of a delta score distribution for a cell type. Each dot represents a cell, colored in yellow if it has been pruned.

Results

Annotated Clustering Object

The main result is an updated clustering object with the cell-type predictions stored in the cell metadata. The new annotations can be viewed in the UMAP/tSNE viewer (Figure 5). The new annotation is named after the input reference file. A second annotation with the suffix "_pruned" is also generated, in which cells that did not pass the pruning thresholds are labeled as "Pruned". This visualization helps assess how many labels were pruned and whether they are concentrated in a particular cell type or spread across all cell types. The former case may indicate that the real cell type of the cells labeled as "Pruned" is not present in the reference dataset.

Figure 5. UMAP colored by the predicted cell types.
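
The same check can be done numerically. The sketch below assumes that the two annotations (the predicted labels and their "_pruned" counterpart) are available as parallel Python lists, one entry per cell; how they are exported, the function name, and the toy values are hypothetical.

```python
from collections import Counter

def pruned_fraction_per_label(labels, pruned_labels, pruned_value="Pruned"):
    """Fraction of cells of each predicted label that were marked as pruned."""
    totals = Counter(labels)
    pruned = Counter(
        label for label, pruned_label in zip(labels, pruned_labels)
        if pruned_label == pruned_value
    )
    return {label: pruned[label] / totals[label] for label in totals}

# Toy annotations for six cells.
predicted = ["T cell", "T cell", "B cell", "B cell", "B cell", "B cell"]
predicted_pruned = ["T cell", "T cell", "Pruned", "Pruned", "Pruned", "B cell"]

fractions = pruned_fraction_per_label(predicted, predicted_pruned)
print(fractions)
```

A label with a much higher pruned fraction than the rest, such as "B cell" in this toy example, is a candidate for a cell type that is missing from the reference.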

Delta Distribution Chart

For each predicted cell type, a violin plot showing the delta score distribution is shown (Figure 6). Each violin plot contains only the cells assigned to that label. The dots (cells) colored in yellow represent cells that have been pruned. Please see the above "Pruning Parameters" section for a more detailed description of the delta score and the pruning procedure.

Figure 6. Delta Score distribution for each predicted cell type. Cells colored in yellow have been pruned.

Scores Heatmap

This heatmap shows the scores for all cells across the reference labels, allowing an easy assessment of the confidence of predicted labels (Figure 7). Ideally, each cell (represented by a column in the heatmap) should exhibit one score significantly higher than the others, indicating a clear assignment to a single label. However, if scores for a cell are similar, it suggests uncertainty in the assignment. Nevertheless, this might be acceptable if the uncertainty spans similar cell types that are difficult to distinguish.

The labels displayed on the top legend are the final assignments. Note that the final label of a cell may not correspond to its highest-scoring label (the most yellow value in its column). This is because the scores displayed are those obtained before the fine-tuning step, since the scores after fine-tuning are not comparable between labels.

This heatmap is generated by SingleR and is stored in the location indicated by the "Save PNG Scores Heatmap" parameter.

Figure 7. Score heatmap with cells in columns and labels in rows. The value is the score given for a particular cell-label pair. The labels displayed on the top legend are the final label assignment.

Summary Report

This summary (Figure 8) provides a basic overview of the reference scRNA-Seq annotation used for prediction. It shows the total number of cells and genes in the reference, as well as the number of cells for each cell type. In addition, the number of query cells assigned to each cell type is displayed, along with the number of pruned cells.

The parameters used for the analysis and the citation are displayed at the bottom of the summary report.