Identification and Quantification with FLAIR

Introduction

Long-read sequencing technologies are becoming increasingly popular for transcriptome analysis because they can capture a full RNA molecule within a single read, enabling more detailed analyses of properties such as alternative splicing. However, whereas short reads have to be assembled in order to form transcript models, long reads require processing in order to separate artifacts and noise from genuine novel transcripts.

FLAIR is a computational pipeline designed to identify both known and novel transcript isoforms in long-read RNA-sequencing data. It consists of the following steps (see Figure 1):

FLAIR-align: Long reads are aligned to the reference genome using minimap2.
FLAIR-correct: Splice junctions in the long reads are corrected using reference transcriptome annotations and/or short-read data.
FLAIR-collapse: Long reads are grouped and collapsed by their splice junctions and transcript ends are defined to identify and annotate transcript models.
FLAIR-quantify: Long reads are quantified on the identified transcript models. In OmicsBox, this step can also be run by itself to 1) quantify long reads based solely on a reference annotation, without discovering novel isoforms; or 2) re‑quantify long reads on a custom transcriptome after curation with SQANTI3.
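The idea behind the correct step can be illustrated with a small sketch. This is not FLAIR's implementation; it only assumes that a splice site observed in a long read is moved to the nearest trusted splice site (from the annotation or short reads) if one lies within a given window, and that the read is otherwise flagged as uncorrectable. The function name `correct_splice_site` and its parameters are hypothetical.

```python
from bisect import bisect_left


def correct_splice_site(position, known_sites, window):
    """Snap `position` to the nearest known splice site within `window` nt.

    `known_sites` must be a sorted list of genomic coordinates.
    Returns the corrected coordinate, or None if no known site is close enough.
    (Illustrative only; FLAIR's own logic may differ.)
    """
    if not known_sites:
        return None
    i = bisect_left(known_sites, position)
    # Only the neighbours just below and just above `position` can be nearest.
    candidates = known_sites[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda site: abs(site - position))
    return best if abs(best - position) <= window else None


known = [100, 250, 400]
print(correct_splice_site(103, known, window=10))  # 100
print(correct_splice_site(300, known, window=10))  # None
```

A larger window rescues more reads with imprecise junctions, at the cost of a higher chance of snapping a genuinely novel splice site onto a nearby known one.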

Please cite FLAIR as:

Tang, A. D., Soulette, C. M., van Baren, M. J., Hart, K., Hrabeta-Robinson, E., Wu, C. J., & Brooks, A. N. (2020). Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nature Communications, 11(1), 1-12.

**Figure 1.** FLAIR pipeline; graphic adapted from https://flair.readthedocs.io/

Run FLAIR for Long-Read Isoform Definition

FLAIR can be found under Transcriptomics → Long-Reads Analysis → Transcript Identification → Identification and Quantification with FLAIR. The wizard consists of five pages on which you define the input and output options as well as the analysis parameters (see Figures 2 to 6).

Required Inputs

This page includes input options for the basic files required by FLAIR:

Long Reads Files: FASTA/Q files containing long reads originating from PacBio or ONT technologies. These should already be pre-processed (e.g. FLNC reads in the case of PacBio).
Reference Genome: FASTA file with the reference genome.

**Figure 2.** "Required Inputs" page of the FLAIR wizard in OmicsBox.

Reconstruction and/or Quantification

Next, the desired analysis has to be configured with the following options:

Transcriptome Reconstruction: Whether or not to perform transcriptome reconstruction, consisting of the FLAIR align, correct, and collapse steps. This step identifies known and novel transcript isoforms and produces a transcriptome in GTF format as its primary output. It requires at least one of the following: a reference transcriptome annotation in GTF/GFF3 format, and/or short-read data, either as alignments in BAM format or as splice junctions in the SJ.out.tab format produced by the STAR aligner.
Quantify Reads: Whether or not to perform quantification of long reads to produce a count matrix. When both transcriptome reconstruction and quantification are selected, quantification will be performed on the newly reconstructed transcriptome. When only quantification is selected, quantification will be performed on the provided transcriptome annotation file.
Use Annotation File: Whether or not to supply a transcriptome annotation file.
Transcriptome Annotation: The transcriptome annotation in GTF/GFF3 format. If transcriptome reconstruction is performed, this file will be used to correct splice junctions in long reads. If only quantification is performed, this file will be used as a basis for quantification.
Use Short Reads: Whether or not to supply short-read alignments or splice junctions for the splice junction correction in transcriptome reconstruction.
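For orientation, the SJ.out.tab file written by STAR is a tab-separated table with one splice junction per line. Its columns are: chromosome, first and last base of the intron (1-based), strand (0 = undefined, 1 = +, 2 = -), intron motif, annotation status, number of uniquely mapping reads, number of multi-mapping reads, and maximum spliced alignment overhang. The sketch below parses one such line; the `SpliceJunction` type and `parse_sj_line` function are hypothetical names used only for illustration and are not part of FLAIR or OmicsBox.

```python
from typing import NamedTuple

# STAR encodes the strand numerically in column 4.
STRAND_CODES = {"0": ".", "1": "+", "2": "-"}


class SpliceJunction(NamedTuple):
    chrom: str
    intron_start: int   # 1-based first base of the intron
    intron_end: int     # 1-based last base of the intron
    strand: str
    unique_reads: int   # uniquely mapping reads crossing the junction


def parse_sj_line(line):
    """Parse one line of a STAR SJ.out.tab file into a SpliceJunction."""
    fields = line.rstrip("\n").split("\t")
    return SpliceJunction(
        chrom=fields[0],
        intron_start=int(fields[1]),
        intron_end=int(fields[2]),
        strand=STRAND_CODES[fields[3]],
        unique_reads=int(fields[6]),
    )


example = "chr1\t14830\t14969\t2\t2\t1\t35\t12\t70\n"
print(parse_sj_line(example))
```

Inspecting the read-support column in this way can help judge whether the short-read junctions are reliable enough to guide the correction.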

**Figure 3.** "Reconstruction and/or Quantification" page of the FLAIR wizard in OmicsBox.

Alignment

This page defines whether pre-existing alignments are supplied, or whether FLAIR should perform its own alignment in the FLAIR-align step.

Existing Alignment:
Use Own Alignment Files: Whether or not to provide custom alignments. This may be useful if you want to use an aligner other than minimap2 or if you want to configure your alignment in more detail.
Aligned Reads: Alignments of the provided reads in BAM format. These alignments are merged into a single file for the FLAIR correct step and therefore do not necessarily have to correspond 1:1 to the provided read files.

Alignment:

Native RNA: Use native‑RNA‑specific alignment parameters for minimap2. This flag indicates that the input consists of native RNA sequences rather than pre‑processed or adapter‑trimmed sequences.
Min. Mapping Quality: Minimum mapping quality score of read alignments to the genome.
Retain Secondary Alignments: Number of secondary alignments (i.e. alignments of the same read to other parts of the genome) to retain from minimap2. Proceed with caution: changing this setting is only useful if you know there are closely related homologs elsewhere in the genome, and it will likely decrease the quality of FLAIR's final results.
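To make the effect of the Min. Mapping Quality and Retain Secondary Alignments options more concrete, the following sketch filters SAM records the way such settings are commonly applied. It relies only on the SAM format itself (the FLAG is the second field, MAPQ the fifth, and bit 0x100 marks a secondary alignment). The function `keep_alignment` is hypothetical, and it simplifies the secondary-alignment setting to an on/off switch rather than a number; it is not a description of how FLAIR or minimap2 apply these options internally.

```python
SECONDARY_FLAG = 0x100  # SAM FLAG bit: "secondary alignment"


def keep_alignment(sam_line, min_mapq, keep_secondary=False):
    """Return True if a SAM record passes the MAPQ and secondary-alignment filters.

    Simplified illustration: real pipelines also handle unmapped and
    supplementary records, which are ignored here.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & SECONDARY_FLAG and not keep_secondary:
        return False
    return mapq >= min_mapq


records = [
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",    # primary, confident
    "read1\t256\tchr2\t500\t0\t50M\t*\t0\t0\t*\t*",   # secondary copy of read1
    "read2\t16\tchr1\t200\t0\t50M\t*\t0\t0\t*\t*",    # primary, ambiguous (MAPQ 0)
]
print([keep_alignment(r, min_mapq=1) for r in records])  # [True, False, False]
```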

Configuration

This page allows for the configuration of FLAIR's correct, collapse, and quantify steps.

Correct:
Window Size: Window size for correcting splice sites.
Collapse:
Minimum Supporting Reads: Minimum number of long reads required to call an isoform.
Window Size for TSS and TTS: Window size for comparing transcript starts (TSS) and ends (TTS).
Ends Determined at Isoform Level: When specified, TSS/TTS for each isoform will be determined from supporting reads for individual isoforms rather than from genes.
Get TSS and TTS from Supporting Reads: Do not use TSS/TTS from the input GTF to adjust isoform TSS/TTS. Instead, each isoform’s TSS/TTS will be determined from supporting reads.
How to Treat Redundant Isoforms:
- No redundancy control: best TSS/TTS chosen for each unique set of splice junctions.
- TSS/TTS that maximize length: choose this to maximize transcript length.
- Most supported TSS/TTS: single most‑supported TSS/TTS by reads.
How to Filter Isoforms:
- Filter based on support: this is the default filter.
- Filter out subset isoforms: any isoforms that are a proper subset of another isoform are removed.
- Both options: applies both of the filters above.
- Both options and remove single-exon isoforms: as above, but also removes single‑exon isoforms. These isoforms are typically considered to be noise in transcriptome sequencing data and are often removed.
Collapse & Quantify:
Minimum Mapping Quality: Minimum mapping quality of a read assignment to an isoform.
Stringent Mode: Supporting reads must cover 80% of their isoform and extend at least 25 nt into the first and last exons. If those exons are themselves shorter than 25 nt, the read must instead start within 4 nt of the isoform start or end within 4 nt of the isoform end.
Check Splice Sites: Enforces coverage of 4 out of 6 bp around each splice site and disallows insertions greater than 3 bp at the splice site.
Trust Ends: Enable this option if the reads were generated with a long-read method with minimal fragmentation.
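The Stringent Mode criteria can be expressed as a short check. The sketch below is one possible reading of the rule, not FLAIR's code. It assumes half-open genomic intervals, treats the read as a single contiguous span whose splice junctions already match the isoform, and interprets the short-exon exception as "the read start must lie within 4 nt of the isoform start" for a short first exon and "the read end must lie within 4 nt of the isoform end" for a short last exon. The function name and parameter names are hypothetical.

```python
def supports_isoform_stringent(read_start, read_end, exons,
                               min_coverage=0.8, min_overhang=25, slack=4):
    """Check whether a read supports an isoform under stringent-style rules.

    `exons` is a sorted list of (start, end) half-open intervals;
    the read spans [read_start, read_end). Illustrative interpretation only.
    """
    isoform_length = sum(end - start for start, end in exons)
    covered = sum(max(0, min(end, read_end) - max(start, read_start))
                  for start, end in exons)
    if covered < min_coverage * isoform_length:
        return False

    first_start, first_end = exons[0]
    last_start, last_end = exons[-1]

    if first_end - first_start >= min_overhang:
        # Read must extend at least `min_overhang` nt into the first exon.
        first_ok = read_start <= first_end - min_overhang
    else:
        first_ok = abs(read_start - first_start) <= slack

    if last_end - last_start >= min_overhang:
        # Read must extend at least `min_overhang` nt into the last exon.
        last_ok = read_end >= last_start + min_overhang
    else:
        last_ok = abs(read_end - last_end) <= slack

    return first_ok and last_ok


exons = [(100, 200), (300, 400), (500, 600)]
print(supports_isoform_stringent(110, 590, exons))  # True
print(supports_isoform_stringent(150, 560, exons))  # False (covers only 70%)
```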

**Figure 5.** "Configuration" page of the FLAIR wizard in OmicsBox.

Output

This page defines where output files are saved.

Transcriptome Annotation: Transcriptome annotation in GTF format. SQANTI3 can be used to check the quality of the assembled transcriptome using this file.
Transcriptome Sequences: Sequences for each isoform of the transcriptome in FASTA format.
Counts File: Counts per isoform and per sample; only produced if quantification has been performed. This file can be provided to SQANTI3 as full-length counts.

**Figure 6.** "Output" page of the FLAIR wizard in OmicsBox.

Results

FLAIR has the following outputs:

Files:
Transcriptome Annotation (GTF file). This file can be included in SQANTI3 for quality control and characterization of transcripts.
Transcriptome Sequences (FASTA file). File with the sequences of all the defined isoforms.
Counts File. File with the counts per isoform and per sample. This file can also be provided to SQANTI3 as full-length counts.
Report with information about the input files, as well as the correction and collapsing steps.
Length Distribution Chart.
Count Table (only if quantification was performed).

Report

This report first shows the input data used and then some summary metrics of the defined isoforms (number of isoforms and maximum, minimum, and average length). After this summary, it shows the number of valid and dismissed transcripts during the correction step, and then the number of isoforms created and the number of transcripts used to create them. Finally, the chosen parameters are displayed.

Length Distribution Chart

Histogram of the lengths of the isoforms defined in the collapsing step. This histogram may be useful for determining the acceptable range of isoform lengths and which threshold to set in SQANTI3.

**Figure 8.** Distribution of Isoform Lengths

Count Table

If quantification was performed, a transcript count table will also appear when FLAIR finishes. In the sidebar of this table you will see a button to perform Differential Expression Analysis. The transcript names are the same as those that appear in the transcriptome generated by FLAIR.

The available tools for Differential Expression Analysis are the same as those for short reads.

Count Table Charts

Different statistical charts can be generated from the count table. These charts provide additional information about the quantification, as well as a quality assessment of the resulting counts. They can be found in the Side Panel → Charts of the Count Table Viewer.

Library Size per Sample

Bar chart showing the total number of read counts assigned to features (here, transcripts) in each sample (Figure 10).
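The library size of a sample is simply the sum of its counts over all features. A minimal sketch, assuming the count table is held as a nested dictionary (the data layout and the name `library_sizes` are illustrative assumptions, not the OmicsBox file format):

```python
def library_sizes(count_table):
    """Sum counts per sample.

    `count_table` maps transcript ID -> {sample name: count}.
    """
    totals = {}
    for per_sample in count_table.values():
        for sample, count in per_sample.items():
            totals[sample] = totals.get(sample, 0) + count
    return totals


table = {
    "tx1": {"A": 10, "B": 0},
    "tx2": {"A": 5, "B": 7},
}
print(library_sizes(table))  # {'A': 15, 'B': 7}
```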

Distribution of Counts

Box plot showing how counts are distributed within each sample for all transcripts (Figure 11). Features with 0 counts in all samples will be discarded for this chart. The binary logarithm (log2) of raw counts is shown.
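The values behind such a box plot can be sketched as follows. Features with zero counts in every sample are dropped first, and the remaining raw counts are log2-transformed per sample. How zeros within a single sample are handled is not stated here, so the pseudocount of 1 used below is purely an assumption made so that the logarithm is defined; the function name is likewise hypothetical.

```python
import math


def log2_counts_per_sample(count_rows, pseudocount=1.0):
    """Return one list of log2-transformed counts per sample.

    `count_rows` maps transcript ID -> list of raw counts (one per sample).
    Transcripts with 0 counts in all samples are discarded.
    The pseudocount is an assumption, added to avoid log2(0).
    """
    kept = [row for row in count_rows.values() if any(c > 0 for c in row)]
    if not kept:
        return []
    n_samples = len(kept[0])
    return [[math.log2(row[i] + pseudocount) for row in kept]
            for i in range(n_samples)]


rows = {"tx1": [0, 0], "tx2": [3, 7], "tx3": [1, 0]}
print(log2_counts_per_sample(rows))  # [[2.0, 1.0], [3.0, 0.0]]
```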

Distribution of Library Sizes

Box plot showing the distribution of library sizes across all samples being quantified (Figure 12).

PCA Plot

This feature performs a Principal Component Analysis and generates a 2D (Figure 13) or 3D (Figure 14) plot with the first two or three principal components, respectively. This chart helps to identify which samples are similar to each other in terms of transcript expression. Ideally, samples belonging to the same condition should appear close together in the plot.

The 3D PCA plot is available only for datasets with three or more samples.