Single Cell RNA-Seq Cell Type Prediction with CellKb

Introduction

CellKb is an advanced tool that combines a robust cell-type prediction algorithm with an extensive knowledge base.

The knowledge base contains thousands of manually curated references obtained from research papers. For each cell type identified in a paper, the ranked list of up-regulated genes is stored and enriched with sample metadata. This ranked gene list is called a gene signature. The same cell type can therefore have multiple signatures in the knowledge base, each coming from a different experiment.
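As a rough mental model (not CellKb's actual schema), the knowledge base can be pictured as a mapping from each cell type to several signatures, each holding a ranked gene list and its sample metadata. The class name, field names, and entries below are hypothetical and serve only to illustrate the one-to-many relationship.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Signature:
    """One ranked list of up-regulated genes from one experiment (hypothetical schema)."""
    cell_type: str
    ranked_genes: list                            # most strongly up-regulated gene first
    metadata: dict = field(default_factory=dict)  # e.g. species, tissue, condition


def build_knowledge_base(signatures):
    """Group signatures by cell type, so one cell type maps to many signatures."""
    kb = defaultdict(list)
    for sig in signatures:
        kb[sig.cell_type].append(sig)
    return dict(kb)


# Toy entries for illustration only; these are not real CellKb records.
toy_signatures = [
    Signature("B cell", ["CD79A", "MS4A1", "CD19"], {"species": "human", "tissue": "blood"}),
    Signature("B cell", ["MS4A1", "CD79B", "CD79A"], {"species": "human", "tissue": "spleen"}),
    Signature("T cell", ["CD3E", "CD3D", "IL7R"], {"species": "human", "tissue": "blood"}),
]
knowledge_base = build_knowledge_base(toy_signatures)
print({cell_type: len(sigs) for cell_type, sigs in knowledge_base.items()})
```

Here "B cell" ends up with two signatures from two different toy experiments, which mirrors how one cell type can be backed by several publications.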

The prediction algorithm compares a provided list of up-regulated genes against the signatures in the knowledge base and evaluates the similarity between the query and each curated signature. Because each cell type is represented by signatures from several reference datasets, a prediction draws on many independent experiments rather than on a single reference.

Therefore, before CellKb can be run, the list of up-regulated genes for each group of cells must be computed. The widely used Scanpy package performs this step.

For more information about the knowledge base curation and the cell type prediction approach, please see:

Ajay Patil, Ashwini Patil. CellKb Immune: a manually curated database of mammalian hematopoietic marker gene sets for rapid cell type identification. bioRxiv 2020.12.01.389890; doi: 10.1101/2020.12.01.389890.

For more information about CellKb please visit:

CellKb. Combinatics Inc. https://www.cellkb.com/.

For Scanpy please cite:

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018 Feb 6;19(1):15. doi: 10.1186/s13059-017-1382-0.

Run scRNA-Seq Cell Type Prediction with CellKb

This option is available in the Side Panel of a Seurat Clustering object (Side Panel > Actions > Cell Type Prediction) (Figure 1-A). To run CellKb, select it in the opening wizard (Figure 1-B).

**Figure 1.** (A) Cell Type Prediction tool in the Side Panel of a scRNA-Seq Clustering results object, and (B) opening wizard to select CellKb algorithm.

Configuration 1: Differential Expression Between Groups.

The first step is a differential expression analysis between the selected groups of cells (Figure 2). This generates a ranked list of up-regulated genes per subgroup, which is later used to query the CellKb knowledge base and obtain a cell type prediction.

Group: Select the group to perform the analysis.
Diff. Expression Method: Select the statistical test to perform differential expression between the subgroups.
- T-Test: This method performs a standard statistical test to compare the means of two groups, assuming normally distributed data. Scanpy's implementation is Welch's t-test, which does not assume equal variance between groups.
- T-Test with Variance Overestimation: A variation of the T-Test that overestimates variance to account for noise or outliers.
- Logistic Regression: Logistic regression models the probability of group membership based on gene expression, enabling the detection of subtle patterns at the cost of higher computational demand.
- Wilcoxon rank-sum: This is a non-parametric approach that compares ranks instead of raw values, making it robust to non-normal data and suitable for sparse single-cell datasets.

**Figure 2.** Differential Expression configuration wizard page.
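For readers who want to reproduce this step outside the wizard, the sketch below shows how the four options could correspond to the `method` argument of Scanpy's `sc.tl.rank_genes_groups`. The mapping from wizard labels to Scanpy values, the helper name, and the `n_genes=100` default are assumptions made for illustration; the tool's internal settings are not documented here.

```python
# Assumed correspondence between the wizard labels and Scanpy's `method` values.
WIZARD_TO_SCANPY_METHOD = {
    "T-Test": "t-test",
    "T-Test with Variance Overestimation": "t-test_overestim_var",
    "Logistic Regression": "logreg",
    "Wilcoxon rank-sum": "wilcoxon",
}


def ranked_up_regulated_genes(adata, group_key, wizard_method, n_genes=100):
    """Return {subgroup: [top-scoring genes, best first]} for an AnnData object.

    `adata` is expected to hold log-normalized expression, and `group_key`
    names a categorical column in `adata.obs`.
    """
    import scanpy as sc  # imported lazily so the mapping above works without Scanpy

    sc.tl.rank_genes_groups(
        adata,
        groupby=group_key,
        method=WIZARD_TO_SCANPY_METHOD[wizard_method],
        n_genes=n_genes,
    )
    # Structured array with one field per subgroup, genes ordered by score.
    names = adata.uns["rank_genes_groups"]["names"]
    return {group: [str(g) for g in names[group]] for group in names.dtype.names}
```

A call such as `ranked_up_regulated_genes(adata, "leiden", "Wilcoxon rank-sum")` would return one ranked gene list per cluster, where `"leiden"` stands in for whatever grouping column the dataset actually has.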

Configuration 2: Cell Annotation.

This configuration wizard page allows specifying filters for the knowledge base, so that the query data is compared only against the gene lists that meet the applied criteria (Figure 3). This allows for a more tailored annotation.

Input Species: Select the species of your data. Only species on this list can be analyzed. Please see the info panel above for more information.
Query Species: Query your data against gene lists from this species. This filter is mandatory. The query species can be different from the input species for cross-species annotation.
Tissues, Conditions, Cell Types: Compare your data against gene lists from the selected tissue(s), condition(s), and/or cell type(s). These filters are optional, and multiple options can be selected in each of them. Each time an option is selected in one filter, the options available in the others are updated. Start typing or press the space key to see the available options, and click to add one.

**Figure 3.** Cell Annotation configuration wizard page.
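The way the filters narrow the reference set, and the way choosing one option updates what the other filters offer, can be sketched as follows. The record fields, values, and function names are hypothetical and do not reflect how CellKb stores or queries its references.

```python
# Toy reference records; the field names and values are hypothetical.
REFERENCES = [
    {"species": "human", "tissue": "blood", "condition": "healthy", "cell_type": "B cell"},
    {"species": "human", "tissue": "blood", "condition": "healthy", "cell_type": "T cell"},
    {"species": "human", "tissue": "lung", "condition": "fibrosis", "cell_type": "fibroblast"},
    {"species": "mouse", "tissue": "lung", "condition": "healthy", "cell_type": "fibroblast"},
]


def apply_filters(references, query_species, **selected):
    """Keep references from the query species that match every non-empty filter.

    `selected` maps a field name to the accepted values, for example
    tissue={"blood", "lung"}. An empty or missing filter accepts anything.
    """
    kept = []
    for ref in references:
        if ref["species"] != query_species:  # the species filter is mandatory
            continue
        if all(ref[name] in values for name, values in selected.items() if values):
            kept.append(ref)
    return kept


def available_options(references, field_name):
    """Options still offered for one filter, given the references that remain."""
    return sorted({ref[field_name] for ref in references})


remaining = apply_filters(REFERENCES, "human", tissue={"blood"})
print(available_options(remaining, "cell_type"))  # ['B cell', 'T cell']
print(available_options(remaining, "condition"))  # ['healthy']
```

Selecting "blood" removes "fibroblast" from the cell type options in this toy set, which is the same cascading behavior the wizard shows.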

Adding more species

Only the species listed in the “Input Species” menu are available for cell type prediction. If you are missing a species, please feel free to contact support with your request (support@biobam.com).

Additionally, if you would like a reference to be added to the knowledge base, please contact support with your suggestion. Suggestions must be accompanied by the paper so that it can be evaluated before being added to the CellKb knowledge base.

Results

Annotated Clustering Object

The main result is an updated clustering object with the cell-type predictions stored in the cell metadata. The new annotations can be seen in the UMAP/tSNE viewer (Figure 4) under the “My Classifications” category. The new annotations will be named “CellKb-Broad” and “CellKb-Granular”. The first annotation contains more general cell-type terms, whereas the latter contains more specific terms. The names can be changed on the UMAP/tSNE viewer.

**Figure 4.** UMAP colored by the predicted cell types.
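Predictions are made per group of cells but stored per cell. The sketch below shows this propagation with plain dictionaries; the barcodes, cluster labels, and predicted cell types are invented, and the clustering object's real storage format is not shown here.

```python
# Toy inputs; cell barcodes, cluster labels and predictions are made up.
cell_to_cluster = {"cell_001": "0", "cell_002": "0", "cell_003": "1"}
cluster_predictions = {
    "0": {"CellKb-Broad": "T cell", "CellKb-Granular": "CD8+ T cell"},
    "1": {"CellKb-Broad": "B cell", "CellKb-Granular": "naive B cell"},
}


def annotate_cells(cell_to_cluster, cluster_predictions):
    """Copy each cluster's broad and granular label onto every cell in that cluster."""
    return {
        cell: dict(cluster_predictions[cluster])
        for cell, cluster in cell_to_cluster.items()
    }


cell_metadata = annotate_cells(cell_to_cluster, cluster_predictions)
print(cell_metadata["cell_003"]["CellKb-Granular"])  # naive B cell
```

Every cell in a group therefore carries the same pair of labels, which is why the UMAP/tSNE viewer colors whole groups uniformly.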

Summary Report

The summary (Figure 5) provides an overview of the cell type prediction. For each group, it shows the final broad (general) and granular (specific) cell type predictions, along with a detailed list of matches to the reference.

It also presents the candidate cell type predictions with their scores, as well as the cell type marker genes that were also found in the query list, if any. The candidate cell types are chosen from the top gene lists matching the query. The score assigned to each candidate is relative: it is calculated by comparing the rank-based scores of the top matching references with each other. Thus, a high score means that the rank-based score of that candidate cell type is clearly higher than those of all other candidate cell types.
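To make the idea of a relative score concrete, here is one simple normalization that behaves in the way described: a candidate scores high only when it stands out from the others. CellKb's actual formula is not given in this documentation, so the function and the numbers below are assumptions for illustration only.

```python
def relative_scores(top_hits):
    """Turn (cell_type, rank_based_score) top hits into per-cell-type relative scores.

    Illustrative normalization only: each candidate keeps its best rank-based
    score, and the scores are divided by their sum so that they add up to 1.
    """
    best = {}
    for cell_type, score in top_hits:
        best[cell_type] = max(score, best.get(cell_type, 0.0))
    total = sum(best.values())
    if total == 0:
        return {cell_type: 0.0 for cell_type in best}
    return {cell_type: score / total for cell_type, score in best.items()}


# Toy rank-based scores for the top matching references of one group of cells.
clear_case = [("T cell", 9.0), ("T cell", 8.0), ("NK cell", 1.0)]
ambiguous_case = [("T cell", 5.0), ("NK cell", 5.0)]
print(relative_scores(clear_case))      # "T cell" dominates
print(relative_scores(ambiguous_case))  # the two candidates tie
```

In the ambiguous case both candidates receive the same relative score even though their rank-based scores are not low, which shows why the relative score speaks to how distinct a prediction is rather than how strong the underlying match is.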

The detailed results (Figure 6) are shown after clicking on the button in the “Top Hits” column. The table shows the 10 most similar reference gene lists with their associated statistics. The rank-based score measures the degree of similarity between the query and the reference gene lists. It takes into account the number of genes in the query and in the reference gene list, the number of genes shared by the two, and the positions of those genes in each list.
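The following sketch is a stand-in similarity measure that uses the same ingredients (list lengths, shared genes, and their positions). It is not CellKb's rank-based score, whose formula is not given here; the function name, weighting scheme, and gene lists are assumptions chosen to keep the example short.

```python
def rank_weighted_overlap(query, reference):
    """Illustrative similarity in [0, 1] between two ranked gene lists.

    Shared genes contribute more when they sit near the top of both lists.
    Both lists are assumed to contain no duplicate genes.
    """
    n_q, n_r = len(query), len(reference)
    if n_q == 0 or n_r == 0:
        return 0.0
    ref_pos = {gene: i for i, gene in enumerate(reference)}
    raw = sum(
        ((n_q - i) / n_q) * ((n_r - ref_pos[gene]) / n_r)
        for i, gene in enumerate(query)
        if gene in ref_pos
    )
    # Best achievable value: the shorter list fully shared, in the same order.
    best = sum(((n_q - i) / n_q) * ((n_r - i) / n_r) for i in range(min(n_q, n_r)))
    return raw / best


# Toy ranked lists; gene names are used for illustration only.
query = ["CD3E", "CD3D", "IL7R", "CCR7"]
references = {
    "T cell (toy)": ["CD3D", "CD3E", "IL7R", "TRAC"],
    "B cell (toy)": ["MS4A1", "CD79A", "CD3E", "CD19"],
}
ranked_hits = sorted(
    ((rank_weighted_overlap(query, genes), name) for name, genes in references.items()),
    reverse=True,
)
for score, name in ranked_hits[:10]:  # keep at most the 10 best matches
    print(f"{name}: {score:.3f}")
```

The toy T cell list shares three genes near the top of both lists and therefore ranks first, while the toy B cell list shares a single gene placed lower in the reference and ranks second.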

Info

CellKb always returns an annotation as long as at least one query gene is found in the reference gene lists that match the filter criteria. It returns the cell type whose gene list is the closest match to the query, but the closest match is not necessarily a strong one. It is therefore highly recommended to inspect the detailed results to verify the robustness of each prediction.