Single Cell RNA-Seq Quantification
Introduction
There are two main groups of Single-cell RNA-Seq (scRNA-Seq) library construction technologies: Cell Barcoded (or 3' enriched) and Full-length technologies. The final structure of the sequencing reads will heavily depend on which technology has been used during the library preparation step.
The Cell Barcoded technologies attach two main types of tags to the sequencing reads: Cell Barcodes (CB) and Unique Molecular Identifiers (UMI). The CB is unique for each cell, and the UMI is unique for each RNA molecule. Thus, the CB helps to identify the cell and the UMI to identify the RNA molecule from which the read originated. Thanks to the CB, all the cells can be sequenced together, thus allowing a higher throughput with this type of technology. The UMI helps in reducing the amplification bias introduced during the sequencing. Since the CB and the UMI have to be sequenced, the total length of the transcript being sequenced tends to be short and biased towards one end of the molecule (3' or 5'). Examples of these technologies are 10x Chromium, Drop-seq, etc.
On the other hand, Full-length technologies don’t use Cell Barcodes nor UMIs. This allows the sequencing of the RNA molecule from both ends, that's why they are called "Full-length". Since no CB is used, cells have to be sequenced separately, so the throughput is lower in comparison with Cell Barcoded technologies. Examples of these technologies are SMART-Seq, SMARTer, etc.
In Single-cell RNA-Seq analysis, gene expression level is measured by cell. Thus, it is important to take into account the structure of the reads in order to separate them and count them at the cell level. The scRNA-Seq Create Count Table tool uses STARsolo, which is the Single-cell version of the well-known aligner STAR. This tool aligns the reads against the reference genome, demultiplexes them by cells, and counts them by the genes present in the GFF or GTF annotation. In addition, in the case of analyzing Cell Barcoded technologies, it includes different options to perform the Cell Barcode calling, and CB and UMI correction and filtering. This makes the tool highly flexible and capable of analyzing reads coming from most of the scRNA-Seq library construction technologies available.
Run Single Cell RNA-Seq Quantification
Go to transcriptomics → Single Cell RNA-Seq → Single Cell RNA-Seq Quantification. A new wizard will be opened (Figure 1). On the first wizard page, select the type of technology used during the library preparation step. Please refer to the above introductory section for more details. Depending on the type chosen, the following pages will display different configuration parameters.
Figure 1. First wizard page.
Cell Barcoded Technologies
The pipeline to obtain counts from barcoded scRNA-Seq data consists of many steps, summarized in Figure 2. Each of the steps can be configured in the following wizard pages. This makes the analysis very customizable and adaptable to the dataset under study.
Figure 2. scRNA-Seq Quantification pipeline for barcoded library construction technologies with STARsolo.
Input
- Reference Genome. Reference genome in FASTA format. Reads will be mapped against this reference.
-
Annotation GFF/GTF. Reference annotation in GTF or GFF format. Counts will be counted by the genes present in this annotation. The chromosome names present in the annotation must match the ones present in the genome sequence headers.
-
Exon Feature. Select the feature from the GFF/GTF file containing exons.
- Overhang. This parameter must correspond to the length of the sequenced transcript minus 1. That is, the length of the downstream read minus 1.
-
Single-cell RNA-Seq Reads. Select the FASTQ files containing the scRNA-seq reads. If using reads coming from 10x Chromium, avoid specifying the index FASTQ, that is, the files containing I1 and I2. In case of having multiple FASTQ for the same sample (e.g. multiple sequencing lanes), specify all the fastq files in the same run. However, in the case of analyzing multiple samples or batches, one STARsolo run must be performed for each of them.
-
Upstream Files Pattern. Pattern to recognize upstream FASTQ files.
- Downstream Files Pattern. Pattern to recognize downstream FASTQ files.
- Strandness. Whether the sequencing is strand-specific (Forward or Reverse) or not (Unstranded).
Figure 3. Cell-barcoded technologies input wizard page.
Configuration1: Reads Configuration.
- Library Technology. When something different than "Custom" is selected, the parameters regarding read structure are automatically updated to match the technology and version specified.
-
Barcode Mate. In which read is the Cell Barcode and UMI located.
-
Separate Read. The upstream read is composed entirely of the Cell Barcode and UMI. The downstream read contains only the transcript sequence.
- Part of Mate 1. The upstream read contains CB, UMI, and part of the transcript sequence. The downstream read is composed only of transcript sequence.
- Part of Mate 2. The downstream read contains CB, UMI, and part of the transcript sequence. The upstream read is composed only of transcript sequence.
- Cell Barcode Start. Starting base of the Cell Barcode in the barcoded read.
- Cell Barcode Length. Length in basepairs of the Cell Barcode.
- UMI Start. Starting base of the UMI in the barcoded read.
- UMI Length. Length in basepairs of the UMI.
-
Clip from 5' end. In the case that the read(s) containing transcript sequence contain CB, UMI, or adapter sequence as well, they have to be clipped before aligning. Check this option in order to remove those sequences from the read. The bases will be removed from the read specified in the "Barcode Mate" parameter. One example is the 10x Chromium 5' protocol, in which the Mate 1 contains CB (16bp) + UMI (10bp) + adapter (13bp) which in total sum up to 39 bp. If Mate 1’s sequencing extends beyond 39 base pairs, the additional bases correspond to cDNA and can be aligned with Mate 2, which exclusively contains cDNA.
-
5' Number of Bases. Specify the number of bases to trim from the 5' end for each of the bases. The first number is for Mate 1 and the second is for Mate 2. They must be separated by a space.
-
Clip from 3' end. Clip bases from the 3' end of the read(s). See the "Clip from 5' end" parameter above for a detailed explanation.
-
3' Number of Bases. Specify the number of bases to trim from the 3' end for each of the bases. The first number is for Mate 1 and the second is for Mate 2. They must be separated by a space.
-
Add Cell Barcode Whitelist. If checked, a file with the list of Cell Barcodes used by the library preparation technology must be provided. If not checked, Cell Barcodes will be automatically inferred from the data.
-
Cell Barcodes Whitelist. Specify here the file containing the Cell Barcodes sequences.
-
Cell Barcode Match Type. It is likely to find sequencing errors in the region of the read containing the Cell Barcode, which will produce differences between the Cell Barcodes detected in the data and the ones present in the whitelist. With this parameter, it is possible to choose how to match the sequences identified in the reads with the Cell Barcodes whitelist:
- Exact Matches. Only Cell Barcodes with exact matches to the whitelist are kept.
- 1MM. Cell Barcodes with only one match to the whitelist with a maximum of one mismatch are kept.
- 1MM Multi. Cell Barcodes with multiple matches to the whitelist with one mismatch are allowed. However, allowed CBs have to have at least one read with an exact match. Then, a posterior probability calculation is used to choose one of the matches. This option matches best with CellRanger 2.2.0
- 1MM Multi + Pseudocounts. It follows the same procedure as the 1MM Multi option, but pseudocounts of 1 are added to all whitelist barcodes.
- 1MM Multi + Pseudocounts + N Bases. It follows the same procedure as the 1MM Multi option, but pseudocounts of 1 are added to all whitelist barcodes, and Cell Barcodes with N’s are allowed. This option matches best with CellRanger >= 3.0.0.
Figure 4. Cell Barcode Reads Configuration wizard page.
Configuration 2: Alignment Parameters
- 2-pass Mapping. This option allows a most sensitive novel junction discovery. The aligner algorithm is executed first to collect the junctions. These junctions are used for second-pass mapping.
- Min. Intron Length. Specify the minimum intron size. A genomic gap is considered an intron if its length is equal to or greater than the given value. Otherwise, it is considered a deletion.
- Max. Intron Length. Specify the maximum intron size.
- Max. # of Mismatches. Set the maximum number of mismatches allowed per read pair.
- Max. # of Multiple Alignments. Establish the maximum number of multiple alignments allowed per read. If exceeded, the read is considered unmapped.
- Include Chimeric Alignments. This option allows including the chimeric alignments together with normal alignments in the main BAM file. The format of chimeric alignments follows the latest SAM/BAM specifications.
- Max. Distance Between Mates. Specify the maximum genomic distance between two mate pairs.
Figure 5. Alignment Parameters configuration wizard page.
Configuration 3: Counting Parameters
-
UMI Collapsing. Algorithm for collapsing UMIs, that is, aggregate the counts of the equivalent UMIs. There are different approaches to trying to identify UMIs produced by sequencing errors:
-
One Mismatch. UMIs that differ by only one mismatch are collapsed, meaning they are considered as the same one and thus counted together.
- UMI tools. Follows the "directional" method developed by Smith, Heger, and Sudbery (Genome Research 2017), first used in the UMI-tools package.
- Directional. Same as "UMI tools", but with more stringent criteria for duplicate UMIs.
- Exact. Only UMIs that are an exact match are counted together.
- No Deduplication. Do not collapse UMIs. This will produce read-level counts.
- CellRanger Algorithm. Algorithm performed by CellRanger v2.4 for UMI collapsing.
-
UMI Filtering. How to filter poor-quality UMIs:
-
Basic. Remove UMIs with N and homopolymers. This behavior is similar to the filtering performed in CellRanger 2.2.0.
- Multi Mapping UMIs. In addition to the basic filtering, this option removes lower-count UMIs that map to more than one gene.
- CellRanger > 3.0.0. The same filters as the above "Multi Mapping UMIs" option, but matching the behavior of CellRanger > 3.0.0 . It only works along with the "UMI Collapsing" parameter set to "CellRanger Algorithm".
-
Multimapping Reads. How to handle the multi-mapping reads:
-
Discard. Do not count UMIs/reads that map to more than one gene.
- Uniform. Distribute uniformly the UMI/read counts among all the mapping genes.
- Proportional. Distribute the multi-gene UMIs counts in proportion to the number of unique UMIs for each gene. UMIs mapped to genes without unique UMI support are distributed evenly.
- MLE-EM Model. Uses Maximum Likelihood Estimation (MLE) and Expectation-Maximization (EM) algorithms to distribute multi-mapping UMIs among the genes. This algorithm has previously been used in TEtranscripts, Alevin, and Kallisto-bustools tools.
- Rescue. Use the approach described by Mortazavi et. al.. It distributes the counts uniformly among the multi-mapping UMIs in a way proportionally to the sum of uniquely mapped UMIs.
-
Feature Counting.
-
Exon. For each gene, count only the reads aligning to the exonic regions.
- Exon + Intron. For each gene, count both the reads aligning to exonic and intronic regions.
-
Cell Filtering. It is very common that the resulting count table has spurious cells, that is, Cell Barcodes that have been detected as such but they come from sequencing errors. There are different approaches to filtering detected cells:
-
None. Do not filter resulting cells.
- Top Cells. Only report top cells by UMI count, followed by the exact number of cells
- CellRanger. Use the approach followed by CellRanger 2.2. It needs the expected number of cells.
- Empty Drops. Use the approach first developed by EmptyDrops filtering in CellRanger flavor. It needs the expected number of cells.
Figure 6. Cell Barcode counting configuration wizard page.
Output
- Project Name. Give a name to the project. The resulting count table will be named after it.
-
Save Raw Matrix. The default output is the Coun Table obtained after the Cell Filterig. With this option checked, the prior Count Table containing all the detected cells is saved as well.
-
Matrix Folder. Select the folder to save the unfiltered Count Table. It is saved in MTX format, thus, three files will be saved: one text file containing the counts (.mtx), one text file containing the cell barcodes (barcodes.tsv), and one text file containing the genes (features.tsv).
-
Save BAM File. Check this option to save the BAM file generated during the read alignment step.
-
BAM File. Specify the path to save the BAM file.
- Sort by Coordinate. Sort the output BAM file by coordinates. If not checked, the BAM file will be unsorted.
-
Add Read Group Information. Include the 'Read Group' header (@RG) in output BAM files. This information may be required for downstream analysis or third-party tools. If this option is checked, the following read group tags will be included:
- Identifier (ID), automatically generated.
- Name of the sample (SM), inferred from file names.
- Sequencing Platform (PL), provided in the "Sequencing Platform" parameter.
- Sequencing Platform. Choose the sequencing platform that was used to obtain the input data.
Figure 7. Cell Barcode output configuration wizard page.
Full-length Technologies
Input
- Reference Genome. Reference genome in FASTA format. Reads will be mapped against this reference.
-
Annotation GFF/GTF. Reference annotation in GTF or GFF format. Counts will be counted by the genes present in this annotation. The chromosome names present in the annotation must match the ones present in the genome sequence headers.
-
Exon Feature. Select the feature from the GFF/GTF file containing exons.
- Overhang. This parameter must correspond to the length of the sequenced transcript minus 1.
-
Single-cell RNA-Seq Reads. Select the FASTQ files containing the scRNA-seq reads. Each FASTQ file is supposed to belong to one individual cell. In the case of analyzing multiple samples or batches, one STARsolo run must be performed for each of them.
-
Upstream Files Pattern. Pattern to recognize upstream FASTQ files.
- Downstream Files Pattern. Pattern to recognize downstream FASTQ files.
- Strandness. Whether the sequencing is strand-specific (Forward or Reverse) or not (Unstranded).
-
Provide Cell IDs. If checked, it will use the cell IDs specified in the "FASTQ to Cell IDs" file to name the resulting cells. If not, the FASTQ file name is used for naming the cells.
-
FASTQ to Cell IDs. Specify a file containing the FASTQ file names and the cell ID wanted for each of them, separated by a tab. One FASTQ file - cell ID pair per line.
Figure 8. Full-length input wizard page.
Configuration: Alignment Parameters.
- 2-pass Mapping. This option allows a most sensitive novel junction discovery. The aligner algorithm is executed first to collect the junctions. These junctions are used for second-pass mapping.
- Min. Intron Length. Specify the minimum intron size. A genomic gap is considered an intron if its length is equal to or greater than the given value. Otherwise, it is considered a deletion.
- Max. Intron Length. Specify the maximum intron size.
- Max. # of Mismatches. Set the maximum number of mismatches allowed per read pair.
- Max. # of Multiple Alignments. Establish the maximum number of multiple alignments allowed per read. If exceeded, the read is considered unmapped.
- Include Chimeric Alignments. This option allows including the chimeric alignments together with normal alignments in the main BAM file. The format of chimeric alignments follows the latest SAM/BAM specifications.
- Max. Distance Between Mates. Specify the maximum genomic distance between two mate pairs.
Figure 9. Alignment configuration wizard page.
Output
- Project Name. Give a name to the project. The resulting count table will be named after it.
-
Save BAM File. Check this option to save the BAM file generated during the read alignment step.
-
BAM File. Specify the path to save the BAM file.
- Sort by Coordinate. Sort the output BAM file by coordinates. If not checked, the BAM file will be unsorted.
-
Add Read Group Information. Include the 'Read Group' header (@RG) in output BAM files. This information may be required for downstream analysis or third-party tools. If this option is checked, the following read group tags will be included:
- Identifier (ID), automatically generated.
- Name of the sample (SM), inferred from file names.
- Sequencing Platform (PL), provided in the "Sequencing Platform" parameter.
- Sequencing Platform. Choose the sequencing platform that was used to obtain the input data.
Figure 10. Full-length output configuration wizard page.
Results
The following outputs are produced.
- Single-cell count matrix. Two independent tables for cells on the left and features on the right side (Figure 11). Both tables can be ordered and filtered separately, while the number of visible rows is shown in the top-right corner.
- Report with a detailed summary of the quantification process and some common metrics.
-
UMIs per cell plot that helps to judge the quality of the quantification process (Figure 12).
-
This plot is inspired by the Barcode Rank Plot from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/barcode-rank-plot )
Figure 11. scRNA-Seq Count Matrix object main viewer.
Figure 12. UMIs Per Cell plot output of scRNA-Seq Quantification tool.