Identification and Quantification with IsoQuant
Introduction
With the rapid advancements made in the field of long-read sequencing, new computational tools are constantly published to aid in processing and analyzing these data. IsoQuant is one such tool, which allows long reads to be aligned to a reference genome (using Minimap2) and subsequently reconstructs and quantifies transcript models. While this can also be achieved in OmicsBox using FLAIR, IsoQuant offers not just a distinct, alternative algorithm, but also additional features. Whether FLAIR or IsoQuant should be used depends on the nature of the data set and the goal of the analysis.
IsoQuant is a computational tool for the genome-based analysis of long RNA read data originating from technologies such as PacBio or Oxford Nanopore. The tool can be run with or without a reference annotation and consists of two stages:
- Reference-based analysis: If a reference annotation is provided, the long reads undergo reference-guided splice site correction, are assigned to the reference transcripts, and are then quantified.
- Transcript discovery: IsoQuant reconstructs transcript models based on the provided reads and performs abundance quantification for discovered isoforms.
Please cite IsoQuant as:
Prjibelski, A.D., Mikheenko, A., Joglekar, A. et al. Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol 41, 915–918 (2023).
As IsoQuant utilizes a gffutils database, you may also want to consider citing the gffutils GitHub repository as:
Dale, R. (2023). gffutils v0.12. Retrieved 2024, from https://github.com/daler/gffutils.
Run IsoQuant for Long-Read Isoform Definition
IsoQuant can be found under Transcriptomics → Long-Reads Analysis → Transcript Identification → Identification and Quantification with IsoQuant. The wizard consists of four pages, on which the input and output options as well as the analysis parameters are defined (Figure 2, Figure 3, Figure 4, and Figure 5).
Input
IsoQuant requires at least the following files:
- Long-Reads Files: Reads in FASTA or FASTQ format, or already aligned reads in BAM format. Note that, when supplying reads in FASTA or FASTQ format, they will be aligned to the given reference genome using minimap2.
- When supplying files in FASTA or FASTQ format, the Read Strandness option can additionally be set, which influences the alignment step. This may be important, for example, when working with stranded Nanopore dRNA reads. Otherwise, it should be left at the default value of "None".
- Reference Genome: FASTA file with the reference genome.
Info
- If you are supplying BAM files as inputs, ensure that the reference genome you supply is the same version that was used to align your reads.
- Note that, for quantification, every file you supply will be treated as an individual sample. If you have multiple files belonging to the same sample, consider concatenating them into a single file first.
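The concatenation suggested above can be sketched in Python as follows. This is a minimal sketch, assuming uncompressed FASTA or FASTQ files that all share the same format; the function name `concatenate_reads` and the file names in the usage note are hypothetical.

```python
import os
import shutil
from pathlib import Path


def concatenate_reads(parts, merged_path):
    """Append the given read files, in order, into one file per sample.

    Assumes plain-text (uncompressed) files of the same format.
    """
    merged_path = Path(merged_path)
    with merged_path.open("wb") as merged:
        for part in parts:
            part_path = Path(part)
            with part_path.open("rb") as handle:
                shutil.copyfileobj(handle, merged)
                # A missing final newline would fuse the last record of this
                # file with the first record of the next one.
                if part_path.stat().st_size > 0:
                    handle.seek(-1, os.SEEK_END)
                    if handle.read(1) != b"\n":
                        merged.write(b"\n")
    return merged_path
```

For example, `concatenate_reads(["run1.fastq", "run2.fastq"], "sampleA.fastq")` would produce a single file that is then treated as one sample during quantification.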
Annotation Inputs
Additionally, a reference annotation file may also be provided:
- Transcriptome Annotation: GTF file with annotations of the reference transcriptome.
- When using a transcriptome annotation which already includes "gene" and "transcript" level features, you may also enable the Detailed Gene Database checkbox. This saves some time during the conversion process, as transcript and gene entries will not have to be inferred. If you are unsure, leave this option off.
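If you are unsure whether an annotation already contains "gene" and "transcript" level features, the feature types in the third column of the GTF file can be inspected. The following is a minimal sketch, assuming an uncompressed, tab-separated GTF file; the function names are hypothetical and not part of OmicsBox or IsoQuant.

```python
def gtf_feature_types(gtf_path):
    """Return the set of feature types (third column) found in a GTF file."""
    features = set()
    with open(gtf_path, encoding="utf-8") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 3:
                features.add(columns[2])
    return features


def has_gene_and_transcript_records(gtf_path):
    """True if both 'gene' and 'transcript' features are present."""
    return {"gene", "transcript"} <= gtf_feature_types(gtf_path)
```

If `has_gene_and_transcript_records` returns `False`, the annotation contains no explicit gene or transcript records, and the Detailed Gene Database option should stay off.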
Info
- If you are providing a reference annotation, ensure that its version matches that of the provided reference genome.
- While the use of a reference annotation is not mandatory, it is likely to increase precision and recall. If you have access to a fitting reference annotation, it is recommended to use it.
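A common symptom of a mismatch between annotation and genome is that sequence names used in the GTF file (for example "chr1" versus "1") do not occur in the FASTA file. The check below is a minimal sketch under that assumption: it uses uncompressed files and hypothetical function names, and it cannot prove that two versions match. It only flags sequence names that are missing from the genome.

```python
def fasta_sequence_names(fasta_path):
    """Collect the first word of every FASTA header line."""
    names = set()
    with open(fasta_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].split()
                if header:
                    names.add(header[0])
    return names


def gtf_sequence_names(gtf_path):
    """Collect the sequence names (first column) used in a GTF file."""
    names = set()
    with open(gtf_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_in_genome(fasta_path, gtf_path):
    """Sequence names that the annotation uses but the genome lacks."""
    return sorted(gtf_sequence_names(gtf_path) - fasta_sequence_names(fasta_path))
```

An empty result is necessary but not sufficient for matching versions; a non-empty result is a strong hint that the two files do not belong together.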
Optionally, short reads may be provided:
- Short-Read BAM Files: One or more short-read alignment files in BAM format, which are used to correct the splice junction alignments of the provided long reads. Note that the short reads are NOT used for transcript discovery or quantification.
Info
If you are providing short reads, ensure that they were aligned to the same version of the reference genome as the one you are supplying.
Algorithm Options
This page provides some more detailed options to configure the algorithm:
- Transcript and Gene Quantification: Which quantification strategy is used to assess abundance at the transcript and gene level.
- Unique Only: Only count reads that are uniquely assigned and consistent with a transcript (default for transcript-level).
- With Ambiguous: Ambiguously assigned reads are split with equal weights.
- Unique Splicing Inconsistent: Uniquely assigned reads which do not contradict annotated splice sites are included (default for gene-level).
- Unique Inconsistent: Uniquely assigned reads are included, allowing any kind of inconsistency.
- All: Both ambiguous and inconsistent reads are included.
- Data Type: Most importantly, the Data Type has to be specified:
- PacBio CCS or FLNC,
- ONT dRNA or cDNA,
- assembled / corrected transcript sequences.
- Full-Length Transcripts: Whether both ends of the sequences can be considered reliable (e.g. in PacBio FLNC reads).
- Matching Strategy: How exact or loose the read-to-isoform matching algorithm should be.
- Exact: All minor errors are treated as inconsistencies.
- Precise: Only minor alignment errors are allowed. (default for PacBio)
- Flexible: Alignment errors typical for Nanopore are allowed, short novel introns are treated as deletions. (default for ONT)
- Loose: Even more serious inconsistencies are ignored, ambiguity is resolved based on nucleotide similarity.
- Splice Correction: Which splice correction strategy should be employed.
- None: No correction is applied.
- Default PacBio: Optimal settings for PacBio CCS reads. (default for PacBio)
- Default ONT: Optimal settings for ONT reads. (default for ONT)
- Conservative ONT: Conservative settings for ONT reads, only incorrect splice junctions and skipped exons are fixed.
- Assembly: Optimal settings for a transcriptome assembly. (default for Transcript Assembly)
- All: Correct all discovered minor inconsistencies, may result in overcorrection.
- Model Construction: Which model construction strategy should be employed.
- Reliable: Only the most abundant and reliable transcripts are reported; precise, but not sensitive.
- Default PacBio: Optimal settings for PacBio CCS reads. (default for PacBio)
- Sensitive PacBio: Sensitive settings for PacBio CCS reads, more transcripts are reported possibly at a cost of precision.
- Full-Length PacBio: Optimal settings for full-length PacBio CCS reads.
- Default ONT: Optimal settings for ONT reads. (default for ONT)
- Sensitive ONT: Sensitive settings for ONT reads, more transcripts are reported possibly at a cost of precision.
- Assembly: Optimal settings for a transcriptome assembly: input sequences are considered to be reliable and each transcript to be represented only once, so abundance is not considered. (default for Transcript Assembly)
- All: Reports almost all novel transcripts, losing precision in favor of recall.
- Report Mono-Exonic Transcripts: Whether to report novel mono-exonic transcripts. (default: off for ONT, on for PacBio and Transcript Assembly)
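The difference between the Transcript and Gene Quantification strategies listed above can be illustrated with a toy counting function. This is a simplified reading of the option descriptions, not IsoQuant's actual implementation: the `ReadAssignment` fields, the strategy identifiers, and the assumption that "With Ambiguous" only adds consistent reads are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ReadAssignment:
    candidates: tuple        # transcript IDs the read was assigned to
    consistent: bool         # read fully agrees with the transcript(s)
    splice_consistent: bool  # read does not contradict annotated splice sites


def count_reads(assignments, strategy):
    """Toy per-transcript counts for one quantification strategy."""
    counts = defaultdict(float)
    for read in assignments:
        unique = len(read.candidates) == 1
        if strategy == "unique_only":
            use = unique and read.consistent
        elif strategy == "with_ambiguous":
            use = read.consistent
        elif strategy == "unique_splicing":  # stands for "Unique Splicing Inconsistent"
            use = unique and read.splice_consistent
        elif strategy == "unique_inconsistent":
            use = unique
        elif strategy == "all":
            use = True
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        if use and read.candidates:
            # Ambiguous reads are split with equal weights.
            weight = 1.0 / len(read.candidates)
            for transcript_id in read.candidates:
                counts[transcript_id] += weight
    return dict(counts)
```

With one consistent unique read for T1 and one consistent read that is ambiguous between T1 and T2, "unique_only" counts only the first read, while "with_ambiguous" additionally adds 0.5 to each of T1 and T2.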
Info
When selecting the Data Type, the subsequent options will automatically be set to appropriate default settings. However, you may still adjust them for your specific purposes.
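The relationship between the Data Type and the dependent defaults can be summarized as a lookup table with user overrides. The sketch below only restates the defaults named in the option list above; the dictionary keys and the function `resolve_options` are hypothetical, and the default Matching Strategy for Transcript Assembly is left unset because it is not stated here.

```python
# Defaults per Data Type, as listed in the Algorithm Options above.
DATA_TYPE_DEFAULTS = {
    "PacBio": {
        "matching_strategy": "Precise",
        "splice_correction": "Default PacBio",
        "model_construction": "Default PacBio",
        "report_mono_exonic": True,
    },
    "ONT": {
        "matching_strategy": "Flexible",
        "splice_correction": "Default ONT",
        "model_construction": "Default ONT",
        "report_mono_exonic": False,
    },
    "Transcript Assembly": {
        "matching_strategy": None,  # default not stated in the option list
        "splice_correction": "Assembly",
        "model_construction": "Assembly",
        "report_mono_exonic": True,
    },
}


def resolve_options(data_type, overrides=None):
    """Start from the data-type defaults, then apply manual adjustments."""
    options = dict(DATA_TYPE_DEFAULTS[data_type])
    for key, value in (overrides or {}).items():
        if key not in options:
            raise KeyError(f"unknown option: {key}")
        options[key] = value
    return options
```

For instance, `resolve_options("ONT", {"model_construction": "Sensitive ONT"})` keeps the ONT defaults for all other options and changes only the model construction strategy.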
Output
- Output File Prefix: Set a name which will serve as a prefix for all output files.
- Save Transcript Model Annotations: Whether you want to save the transcript models identified by IsoQuant as a .gtf file. If so, select a location for the .gtf file below.
- Count Table Outputs: As IsoQuant also performs quantification, you can choose several options on which quantification outputs to receive:
- Reference Gene Counts: Quantification at gene-level, grouped by input files. This option will be unavailable if you have not provided a reference annotation.
- Reference Transcript Counts: Quantification at transcript-level, using only the transcript models present in the supplied reference annotation, grouped by input files. This option will be unavailable if you have not provided a reference annotation.
- Transcript Counts: Quantification at transcript-level, using the transcript models identified by IsoQuant, grouped by input files.
Tip
Note that if only Reference Gene and/or Transcript Counts are selected as desired outputs, IsoQuant runs with the “--no_model_construction” flag. This means that only quantification based on the provided reference is performed, and the model construction step is skipped. This saves considerable computational resources and results in a much shorter runtime.
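The rule described in this tip can be written down as a small decision function. This is a sketch of the stated rule only, not OmicsBox code: the output identifiers and function names are hypothetical, and only the flag name itself is taken from the text above.

```python
# Hypothetical identifiers for the reference-based count table outputs.
REFERENCE_ONLY_OUTPUTS = {"reference_gene_counts", "reference_transcript_counts"}


def skips_model_construction(selected_outputs):
    """True if every selected output relies on the reference annotation alone."""
    selected = set(selected_outputs)
    return bool(selected) and selected <= REFERENCE_ONLY_OUTPUTS


def extra_flags(selected_outputs):
    """Flags added to the IsoQuant call for the given output selection."""
    if skips_model_construction(selected_outputs):
        return ["--no_model_construction"]
    return []
```

Selecting `["reference_gene_counts"]` yields the flag, whereas adding `"transcript_counts"` or the transcript model annotation to the selection requires model construction and yields no extra flag.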
Results
IsoQuant has the following outputs:
- Transcript Models as Annotation (GTF file): An annotation of transcript models for which IsoQuant finds sufficient evidence in the given data. This can be used to run a SQANTI3 quality control and filtering analysis.
- Extended Annotation (GTF file): When supplying a reference annotation to IsoQuant, an extended annotation can be obtained which contains all reference transcripts (even those not observed in the given data), extended by any novel transcripts discovered by IsoQuant.
- Report with information on the assignment and alignment of reads, as well as the categories of identified transcript models.
- Count Tables as specified in the output page. These can be used to run differential expression analyses.
- Length Distribution Chart which shows the distribution of isoform lengths.
Report
The summary report of an IsoQuant run in OmicsBox first provides a description of all provided input and reference files, as well as the chosen algorithm configuration. The report further gives an overview of statistics on the assignment and alignment of reads, as well as the structural classification of transcript models as “Known”, “Novel In Catalog”, or “Novel Not In Catalog”. An example of this section can be seen in Figure 6.
Transcript Models
The transcript models identified by IsoQuant are provided as a .gtf file. When running IsoQuant with a reference annotation, an extended annotation can also be obtained.
Count Tables
Depending on the selection in the outputs, the different count tables will be provided as OmicsBox objects. These can be used in downstream differential expression analyses through the use of the Sidebar action “Differential Expression Analysis.”
Length Distribution Chart
This histogram shows the distribution of isoform lengths. This information can be useful for judging the acceptable range of isoform lengths, as well as for setting the length threshold when running SQANTI3.
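A comparable length distribution can be derived outside OmicsBox from the transcript model GTF file. The following is a minimal sketch, assuming that isoform length means the summed length of a transcript's exons and that the GTF file is uncompressed with standard `transcript_id "..."` attributes; the function names and the bin width are hypothetical, and the chart in OmicsBox may be computed differently.

```python
import re
from collections import Counter, defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_path):
    """Sum the exon lengths of every transcript in a GTF file."""
    lengths = defaultdict(int)
    with open(gtf_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 9 or columns[2] != "exon":
                continue
            match = TRANSCRIPT_ID.search(columns[8])
            if match:
                start, end = int(columns[3]), int(columns[4])
                # GTF coordinates are 1-based and inclusive.
                lengths[match.group(1)] += end - start + 1
    return dict(lengths)


def length_histogram(lengths, bin_width=500):
    """Count transcripts per length bin, keyed by the lower bin edge."""
    bins = Counter((length // bin_width) * bin_width for length in lengths.values())
    return dict(sorted(bins.items()))
```

The resulting bins can help choose a length threshold, for example by checking how many isoforms fall into the shortest bins.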