Curation of Long-Read Transcriptomes with SQANTI3

Introduction

SQANTI3 is a bioinformatics tool designed for the quality control and filtering of full-length transcripts sequenced with PacBio’s long-read technology. It is designed as the next step of the IsoSeq pipeline. The interest in this tool comes from the usefulness of long-read transcriptome sequencing to describe eukaryotic transcriptomes and replace the use of second-generation sequencing. Illumina short-reads cannot contain a whole transcript and are not able to well-characterize eukaryotic transcriptomes.

This tool can reveal the nature and novelty found by long-read sequencing by classifying transcripts based on the comparison between their splice junctions and the reference transcriptome provided. It combines the long read-defined transcripts (in a FASTA/Q or GTF format) with the reference annotation and with other optional data to provide a wide range of descriptors of transcript quality. SQANTI3 generates a comprehensive report to facilitate quality control and filtering of the isoform models and performs two different tasks, both of them equally important:

Isoform classification and quality control for long read-defined transcriptomes: the categories in which transcripts are classified, together with a long list of attributes and descriptors, allow to carefully inspect the properties of their transcriptome and identify potential problems generated data processing.
Filter for long read-defined transcriptomes: with the supplied information, users can set different parameters to remove potential false positive isoforms.

In SQANTI3 version 5, another step was added. In this step, SQANTI3 rescues isoforms that have been removed from the curated transcriptome but that have evidence in the reference transcriptome. The idea is to avoid losing transcripts and/or genes that are detected as expressed by long read sequencing but could not be confidently validated using orthogonal data.

Please cite SQANTI3 as:

Pardo-Palacios, F. J., Arzalluz-Luque, A., Kondratova, L., Salguero, P., Mestre-Tomás, J., Amorín, R., ... & Conesa, A. (2024). SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nature Methods, 1-5.

Run SQANTI3 for transcriptome characterization

SQANTI3 can be found in the Transcriptomics Module of OmicsBox under Transcriptomics → Long-Reads Analysis → Transcript Identification → Curation of Long-Read Transcriptomes with SQANTI3.The wizard consists of 5 pages and allows to define the input and output options as well as the analysis parameters (Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6).

Input 1

First of all, SQANTI3 can be fed with long-read information in two different ways:
Transcriptome Annotation File: annotation file created in your transcriptome reconstruction pipeline. This option is recommended in order to analyze non-redundant isoforms precisely. If you do not have a long-read annotation file, please use the other option.
Transcriptome FASTA/Q File: FASTA/Q File with Transcripts. If long-read data is introduced this way, transcripts will be mapped using minimap2. The BAM file will be transformed in a GTF file respecting splice sites.
Raw Long Reads: FASTA/Q File with Raw Long Reads both from PacBio or Nanopore Technology. If long-read data is introduced this way, long reads in this file (FASTA or FASTQ format) will be mapped using minimap2, and mapped reads in the SAM file will be collapsed using cDNA Cupcake to generate non-redundant isoforms.

It is recommended that users generate their own transcriptome files (in GTF or FASTX format) so as to know exactly which isoforms are being analysed.

Reference Genome: this tool also needs the reference genome in FASTA format. Please check that the chromosome/scaffolds names are the same in the reference annotation and the reference genome.
Annotation of the Reference Genome: it must be introduced in GTF format. This file will be taken as reference to describe the degree of novelty of each transcript. Make sure your annotation file is based on the correct reference genome version.

Input 2

This tool can also accept short reads (paired-end or single-end) to validate isoforms taking into account the coverage in splice junctions. The expression of each isoform can also be estimated with paired-end short-reads (not with single-end because it is very time-consuming).

If paired-end date is introduced, the upstream and downstream files patterns must be typed in order to differentiate each kind of file.

It is strongly recommended to use short-read data, even if it comes from a different experiment, to validate isoforms using splice junctions coverage.

Input 3

In this page, extra files can be introduced in order to have additional details in the output.

Transcription Start Site (TSS) Annotation File: bed file with information of the TSS in a genome. This kind of information can only be used for human and mouse yet. To have more information please visit the referenceTSS webpage.
File with PolyA Motifs: text file with one polyA motif for that species in each line. In the PolyASite webpage different files for H. sapiens, C. elegans and M. musculus can be found with PolyA information, and the last column can be parsed to get the polyA motifs.
File with PolyA Peaks: complementary to polyA motif information, polyA site data can be supplied. For human, mouse, and worm, you can download public polyA site data from the PolyASite atlas.
File With FL Counts: text or TSV file with Full Length Counts. This information can be obtained, for example, using a cDNA cupake script (get_abundance_post_collapse.py).

Configuration

In this page the parameters for SQANTI3 Quality Control and SQANTI3 Filtering Stepcan be set:

Quality control:
Ignore Transcript ID Nomenclature: allow the usage of transcript IDs non related with PacBio's nomenclature (PB.X.Y).
Min. length of Reference Transcript: minimum reference transcript length.
Skip ORF Prediction: check it to skip ORF prediction so as to save time. If ORF prediction is checked, the translated transcriptome could not be returned.
Set of Splice Sites: set of splice sites to be considered as canonical. If the set is going to be changed, type the new splice sites separated with comma and no spaces.
Filtering Rules Parameters:
Filter by Rules: check this option to filter isoforms that do not comply some rules.
Rules Filter: select the rules to filter out isoforms.
Filtering ML Parameters:
Filter by Machine Learning: select this option to filter isoforms with a Random Forestclassifier that labels possible transcripts as true isoforms or artifacts using SQANTI QC descriptors as predictive variables.
Classification Threshold: Machine Learning probability threshold to classify transcripts as positive isoforms. Probability has to be higher than this value to be classified as isoform. Select a higher value to be more stringent on which possible transcripts are considered actual isoforms.
Adenine Percentage: adenine percentage at genomic 3' end to flag an isoform as intra-priming. Intra-priming is the process of generation transcript artefacts due to cDNA poly-dT priming off genomic A stretches that are not true polyA tails.
Retain FSM: when checked, forces retaining FSM transcripts regardless of ML filter result (FSM are threfore automatically classified as isoforms).
Filter Mono Exonic Transcripts: remove transcripts made of just one exon. Normally, this transcripts are removed as they might be truncated isoforms.

Quality control and filtering steps are mandatory. In the case of the filtering step, you have to select one of the two methods: rules filter or ML filter.

Picture 5. Configuration Page

Configuration

In this page the parameters for the Rescue step can be set:

Rescue Transcripts: the SQANTI3 Rescue Step is designed to be run after the filtering step and uses the long read-based evidence provided by discarded isoforms to recover transcripts in the associated reference transcriptome. The idea behind this strategy is to avoid losing transcripts/genes that are detected as expressed by long read sequencing, but whose start/end/junctions could not be confidently validated using orthogonal data, resulting in the removal of those genes/transcripts from the transriptome. As a result, SQANTI3 rescue will generate an expanded transcriptome GTF including a set of reference transcripts as well as the long read-defined isoforms that passed the filter.
Rescue Mono Exonic Transcripts: you can choose whether to enable or disable the rescue of transcripts with only one exon, regardless of their category, or limit the rescue to those in the FSM category.
Rescue Mode:
Automatic Mode: all reference transcripts that were represented by at least one FSM or ISM in the original post-QC transcriptome are retrieved. Then, those reference transcripts for which all FSM representatives were removed by the filter are rescued.
Full Mode: also NIC and NNC categories will be used to rescue transcripts.Since there is no associated transcript information, all transcripts classified as artifacts will be included in the rescue candidate list. As a result, we consider all reference or long read-defined transcripts from genes that have at least one rescue candidate to be rescue targets.

Picture 6. Configuration 2 Page

Output

Set Prefix for Output Files: prefix to add to the output filenames.
Information Files: directory to save files with all information of isoforms and junctions, and the transcriptome in GTF format.
Isoform FASTA file: directory to save isoform sequences in FASTA format.
Translated transcriptome FASTA file: directory to save translated isoform sequences in FASTA format.This is only possible if ORF prediction has not been skipped.

Results

SQANTI3 has the following outputs:

Table with the main information of the isoforms (Figure 8).
Report with information of the number of genes, isoforms per category and types of junctions before and after filtering isoforms (if not filtering is applied, only information of the first step will be shown) (Figure 9).
Charts:
General information charts.
Distance to TSS and TTS.
Filtering.
Short and Long Read Coverage.
Redundancy Analysis.
Quality Features.
Rescue Step
Files:
Transcriptome Annotation File.
Transcriptome File.
Translated Transcriptome File (optional).
Junctions File.
Classification File.

Table with information of the isoforms

The next columns can be seen (Figure 8):

Category: structural category of the transcript.
Subcategory: deeper classification than the general category for each transcript.
Isoform: ID of the isoform that comes from the collapsing algorithm.
Chr: name of the chromosome/scaffold where the encoding gene is located.
Length: isoform length.
Exons: number of exons of the isoform.
Gene ID: gene ID that appears in the annotation file, if annotated.
Transcript ID: transcript ID that appears in the annotation file, if annotated.
Gene exons: number of exons of the gene if that gene is annotated. If it is novel, this information is not shown.
Difference to TSS: distance of query isoform 5' start to reference transcript start end. Negative value means query starts downstream of reference.
Difference to TTS: distance of query isoform 3' end to reference annotated end site. Negative value means query ends upstream of reference.
RT switching: TRUE if one of the junctions could be a RT switching artifact.
All Canonical Junctions: whether all junctions are canonical or not.
Min. coverage: minimum junction coverage based on STAR algorithm. If no short reads are given, this column is not shown.
FL counts: FL count associated with this isoform if a FL counts file is provided. Otherwise this column is not shown.
Isoforms expression: short read expression for this isoform if paired-end reads are provided, otherwise this column is not shown.
Gene expression: short read expression for the gene associated with this isoform (summing over all isoforms) if paired-end reads are provided, otherwise this column is not shown.
Coding Type: whether the isoform encodes a protein or not.
Predicted NMD: TRUE if there's a predicted ORF and CDS ends at least 50 bp before the last junction; FALSE if otherwise. NA if non-coding.
CAGE: TRUE if the PacBio transcript start site is within a CAGE Peak. If a TSS annotation file is not given, this column is not shown.
PolyA: shows the location of the last base of the hexamer. Position 0 is the putative polyA site. This distance is hence always negative because it is upstream.
PolyA motif: if a polyA motif list is given, shows the top ranking polyA motif found within 50 bp upstream of end. Otherwise, this column is not shown.
TSS ratio: the short-read mean coverage of the 100bp upstream and downstream a reported TSS are measured. Then the ratio coverage inside isoform + 0.01/ coverage outside isoform + 0.01 is calculated.

Report

In the report, there are two main parts: in the first one, some Job Information can be seen (Figure 9). There are four tables:

Table at the gene level: information of the number of annotated genes and novel genes before and after filtering.
Table at the isoform level: information of the number of isoforms in the different structural categories before and after filtering and description of each structural category.
Table at the splice junction level: information of the number of each type of splice junction before and after filtering and description of each type of splice junction.
Filtering Information: if rules filter is applied, a table with the rules applied will be shown. If the ML filter is used, different ML metrics are displayed.
Rescue Information: if the rescue step has been applied, two additional tables will be shown in the report. The first one will show the category and the number of isoforms rescued. The second one will show the reasons why other transcripts were excluded from being rescued:
Reference Already Present: if the mapping hit is a reference transcript that is already represented by an isoform, be it an FSM that passed the filter or a transcript that was obtained during automatic rescue.
ML Probability: if the mapping hit did not pass the supplied ML probability threshold or the rules in the JSON file.
Long Read Transcript: if the mapping hit is a long read-defined isoform, and is therefore already present in the transcriptome.

Intergenic and Genic Intron transcripts are the structural categories that belong to novel genes. These categories, along with Genic Genomic and Antisense, are not common, so they should not be frequent in the report.

In the second part, the main parameters used in SQANTI3 job are shown (Figure 10).

**Figure 8.** Classification table of transcripts in OmicsBox

**Figure 9.** Report with information of genes, isoforms and splice junctions

**Figure 10.** Parameters information in the report

Charts

In the side panel of the table (Figure 8) there are different action buttons with different chart categories:

General Information Charts: these charts are related to general characteristics.
Isoforms per Gene: frequency of the number of isoforms.
Isoforms per Structural Category: number of isoforms in the different structural categories.
Transcript Length per Structural Category: boxplots of the length of the different structural categories.
Isoforms per Type of Gene: stacked barcharts with the number of isoforms per gene in annotated genes and novel genes.
Exons per Structural Category: boxplots of the number of exons that are present in the different structural categories.
Distance to TSS and TTS: these charts are related to the distance of Full Splice Match (FSM) and Incomplete Splice Match (ISM) transcripts to annotated Transcription Start Sites (TSS) and Transcription Termination Sites (TTS).
Distances of FSM to TTS: histogram of the distribution of the distance of FSM isoforms to a TTS. If a polyA motif file is introduced, only isoforms with a polyA motif found will be shown in this chart.
Distances of ISM to TTS: idem with ISM isoforms.
Distances of FSM to TSS: histogram of the distribution of the distance of FSM isoforms to a TSS. If a CAGE annotation file is introduced, only isoforms within a CAGE will be shown in this chart.
Distances of ISM to TSS: idem with ISM isoforms.
Filtering: these charts are related to the removal of transcripts after filtering.
Reasons of Transcripts Removal: percentage of isoforms removed because of intrapriming, RT Switching, or because of Low Coverage/non-canonical junctions. The current filtering rules are as follow:
- If a transcript is FSM, then it is kept unless the 3' end is unreliable (intrapriming).
- If a transcript is not FSM, then it is kept only if all of below are true:
- 3' end is reliable.
- Does not have a junction that is labeled as RTSwitching.
- All junctions are either canonical or has short read coverage above the threshold that was set by the user.
- Isoforms Before and After Filtering: stacked barcharts with the number of isoforms for each structural category before and after filtering.
Short and Long Read Coverage: these charts are related to coverage analysis of transcripts by short reads and long reads.
Short Read Coverage:stacked barchart with the number of transcripts whose splice junctions are covered by short reads by structural category.
Long Read Counts on Genes: boxplots with the number of FL counts per type of gene (annotated or novel). The plotted value is the logarithm to base 2 of the Counts per Million plus 1.
Long Read Counts on Categories: idem with structural categories.
Saturation Plot: saturation plots are used to assess the performance of RNA-seq experiments and determine the optimal sequencing depth for a given experiment. Saturation plots are generated by plotting the number of unique transcripts detected versus the total number of reads (sequencing depth) for a given RNA-seq experiment. The vertical line shows the saturation point, that is to say, the point where the curve starts flattening (e.g when more sequencing depth is not going to make you find new transcripts).
Increment Plot: this plot is similar to the other one, but shows the increase of discovered transcripts as sequencing depth is increased. Because of that, after the saturation point, the bars showing the increase of discovered transcripts should be shorter.
Redundancy Analysis: these charts are related to redundancy analysis of transcripts. Redundancy reflects the fact that transcripts, after being validated, can often incorporate 3’ and 5’ end variability, which can lead to the detection of multiple FSM and/or ISM isoforms per reference transcript differing in their start and/or end positions. If a CAGE annotation file and/or a polyA motif file is/are introduced, that data is taken into account.
Redundancy Analysis of FSM Transcripts
Redundancy Analysis of ISM Transcripts
Redundancy Analysis of Both Types
Quality Features: these charts summarize features that represent good/bad quality of transcripts per structural categories.
Good quality features: percentage of transcripts per structural category with different features that show good quality:
- All_CJ: every splice junction is a canonical junction.
- Annotated: Distance of query isoform 3' start to the closest start end of any transcripts of the matching gene is smaller than the distance introduced in Configuration 2 wizard (Figure 5)
- Cage_support: if CAGE annotation file is added, this bar appear in the chart. Isoform is within a CAGE site.
- Covered: if short reads files are introduced, this bar is added to the chart. Isoform has short read coverage.
- PolyA_support: if polyA motif file is introduced, this bar is added to the chart. Isoform has a polyA motif.
- Bad quality features: percentage of transcripts per structural category with different features that show bad quality:
- RTSwitching: isoform has at least one junction that might be marked as RT Switching. This information only appear if the filtering step has run.
- Non_all_CJ: not every splice junction is a canonical junction.
- Non_covered: if short reads files are introduced, this bar is added to the chart. Isoform has not read coverage.
- Predicted_NMD: isoform’s predicted ORF and CDS ends at least 50bp before the last junction. This information only appear if the filtering step has run.
- Rescue Step: this charts give information about the rescue step:
Categories used to rescue transcripts: barchart with the number of isoforms per category used to rescue a reference transcript.
Reasons of artifact exclusion: reasons to finally not use an artifact to rescue a reference transcript.

Files

Transcriptome Annotation File: this GTF annotation file shows the location in the genome of every transcript and exon.
Transcriptome File: Multifasta file with the nucleotide sequence of every isoform.
Translated Transcriptome File (optional): Multifasta file with the translated amino acidic sequence of every isoform. This file can only be returned if ORF prediction is not deactivated.
Junctions File: Text file with information of every splice junction found.
Classification File: Text file with information of every isoform found.

If the rescue step is performed, the Transcriptome Annotation File will be an extended version with rescued reference transcripts.

**Figure 11.** Isoforms per Structural Category Chart

**Figure 12.** Histogram of Distances of FSM to annotated TTS

**Figure 13.** Isoforms per Structural Category Before and After Filtering

**Figure 14.** Saturation Lines per Structural Category with 5 FL-reads required to call an isoform

**Figure 15.** Boxplots of FL Counts per Structural Category

**Figure 16.** Redundancy Analysis for FSM and ISM

**Figure 17.** Stacked barcharts with percentages of features of good quality

Isoform extraction from FASTA

Another useful tool that has been implemented is the extraction of isoforms from the transcriptome file that SQANTI3 returns. To use this tool, just select every isoform whose sequence is needed and right click in one of them and then click in "Extract Isoforms from FASTA". Then a wizard will open (Figure 19), and the FASTA transcriptome file and the output folder must be introduced. The output filename will be extracted_transcriptome-file-name.fasta.