Long Read Transcript Identification and Quantification
OmicsBox offers several tools for identifying and quantifying transcripts from long-read RNA-sequencing data, each suited to different use cases. We strongly recommend following any such workflow with careful curation of the discovered transcripts using SQANTI3. After curation, FLAIR or IsoQuant can re-quantify the reads against the curated transcriptome.
-
PacBio-based Identification with IsoSeq: PacBio's IsoSeq pipeline preprocesses PacBio single-molecule sequencing data and defines transcript models. This composable workflow combines existing tools and algorithms with a novel clustering technique to handle the increasing data output from PacBio sequencing platforms. We recommend using this tool primarily for pre-processing PacBio data. It supports the following steps:
- CCS Calling from subreads
- Kinnex-Demultiplexing with skera
- Primer removal and demultiplexing with lima
- Refine to trim poly(A) tails and remove artificial concatemers (chimeric reads)
IsoSeq can also perform transcript identification. Compared to FLAIR and IsoQuant, it is notably more permissive in defining transcript models, which leads to high redundancy and potential artifacts. While we recommend following any transcriptome reconstruction workflow with careful curation in SQANTI3 regardless of the tool used, this is especially important when identifying isoforms with IsoSeq.
-
Identification and Quantification with FLAIR: FLAIR enables transcriptome reconstruction and quantification from long-read RNA sequencing data. During reconstruction, it uses reference annotations and short-read data to correct splice junctions observed in long reads, then identifies both known and novel transcript isoforms. For quantification, FLAIR can map long reads to either a newly reconstructed transcriptome or a provided reference transcriptome.
FLAIR is a top-performing transcriptome reconstruction tool in benchmarks such as the LRGASP challenges. When both reference transcriptome annotations and short-read RNA-seq data are available, it excels at discovering novel isoforms. Without short-read data, however, it cannot discover novel splice junctions, which limits the novel isoforms it can detect.
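The role of short-read data in junction correction can be illustrated with a toy sketch. This is not FLAIR's implementation: the function name `snap_splice_sites`, the `window` parameter, and all coordinates are hypothetical, and the logic is reduced to snapping each observed splice site to the nearest trusted site within a fixed distance. Sites with no trusted neighbour are reported as unsupported.

```python
from bisect import bisect_left


def snap_splice_sites(observed, trusted, window=10):
    """Snap observed splice-site coordinates to the nearest trusted site.

    Toy illustration only. `window` is an assumed tolerance in bases.
    Returns one entry per observed site: the trusted coordinate it was
    snapped to, or None if no trusted site lies within `window`.
    """
    trusted_sorted = sorted(set(trusted))
    corrected = []
    for pos in observed:
        i = bisect_left(trusted_sorted, pos)
        # Only the trusted sites directly left and right of pos can be nearest.
        neighbours = trusted_sorted[max(i - 1, 0):i + 1]
        best = min(neighbours, key=lambda t: abs(t - pos), default=None)
        if best is not None and abs(best - pos) <= window:
            corrected.append(best)
        else:
            corrected.append(None)  # unsupported site
    return corrected


# Hypothetical coordinates: two annotated sites, one site seen only in short reads.
annotated_sites = [100, 200]
short_read_sites = [350]
long_read_sites = [103, 198, 352]

annotation_only = snap_splice_sites(long_read_sites, annotated_sites)
with_short_reads = snap_splice_sites(long_read_sites, annotated_sites + short_read_sites)
```

In this sketch the site at 352 is only retained when the short-read sites are part of the trusted set, which mirrors why novel splice junctions depend on short-read support in this kind of correction scheme.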
-
Identification and Quantification with IsoQuant: IsoQuant performs genome-based analysis of long RNA reads, enabling reconstruction and quantification of transcript models with high precision and good recall. When a reference annotation is provided, IsoQuant assigns reads to annotated isoforms based on intron-exon structure and performs quantification at both gene and isoform levels.
IsoQuant has also shown strong performance in benchmarks such as the LRGASP challenges. While it is generally more conservative than FLAIR in reporting novel isoforms, it can discover novel splice junctions even without reference transcriptome annotations or short-read data. However, when reference annotations or short-read data are available, using them is still recommended.
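As a rough illustration of assigning reads to annotated isoforms by intron-exon structure, the sketch below matches each read's intron chain against the intron chains of annotated transcripts and counts unique matches. It is a simplified assumption-laden example, not IsoQuant's algorithm: the helper names, the exact-match rule, and the coordinates are all hypothetical, and real tools also handle alignment errors, partial matches, and ambiguous assignments.

```python
from collections import Counter


def intron_chain(exons):
    """Return the introns implied by a sorted list of (start, end) exons."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_read(read_exons, annotation):
    """Return the single transcript whose intron chain equals the read's.

    Returns None when no transcript, or more than one, matches exactly.
    """
    chain = intron_chain(read_exons)
    matches = [name for name, exons in annotation.items()
               if intron_chain(exons) == chain]
    return matches[0] if len(matches) == 1 else None


# Hypothetical two-isoform gene: tx2 skips the middle exon of tx1.
annotation = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (500, 600)],
}
reads = [
    [(110, 200), (300, 400), (500, 590)],  # same introns as tx1, shorter ends
    [(120, 200), (500, 600)],              # same introns as tx2
    [(100, 200), (300, 400), (500, 600)],  # identical to tx1
    [(100, 250), (500, 600)],              # unannotated splice site
]

# Isoform-level counts; None collects reads without a unique exact match.
counts = Counter(assign_read(r, annotation) for r in reads)
```

Comparing intron chains rather than full exon coordinates lets reads with truncated ends still match their isoform, since long reads often do not cover transcript ends exactly.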
-
Reference-free Isoform Reconstruction with the isON-pipeline: The isON-pipeline reconstructs transcriptomes from long-read sequencing data (PacBio or ONT) without requiring reference annotations or genomes. This three-component pipeline (isONclust3, isONcorrect, and isONform) is particularly well-suited for non-model organisms.
Because the isON-pipeline should only be used when no reference genome is available, and SQANTI3 requires a reference genome, its output cannot be curated with SQANTI3.
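The guidance in this section can be summarised as a small decision helper. This is only one possible reading of the recommendations above, not an official selection rule: the function `suggest_workflow` and its parameters are hypothetical, and in practice both FLAIR and IsoQuant remain valid choices whenever a reference genome is available.

```python
def suggest_workflow(has_reference_genome, has_annotation=False, has_short_reads=False):
    """Suggest a reconstruction tool and whether SQANTI3 curation can follow.

    Returns a (tool, curate_with_sqanti3) tuple. Hypothetical helper that
    encodes one interpretation of the guidance in this section.
    """
    if not has_reference_genome:
        # Reference-free reconstruction; curation with SQANTI3 is not possible.
        return ("isON-pipeline", False)
    if has_annotation and has_short_reads:
        # FLAIR is described as excelling when both resources are available.
        return ("FLAIR", True)
    # IsoQuant can find novel splice junctions without annotation or short reads.
    return ("IsoQuant", True)


non_model = suggest_workflow(has_reference_genome=False)
full_resources = suggest_workflow(True, has_annotation=True, has_short_reads=True)
genome_only = suggest_workflow(True)
```

IsoSeq is left out of the helper on purpose, since it is recommended here primarily for pre-processing PacBio data rather than as the main identification tool.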