Identification and Quantification with FLAIR
Introduction
Long-read sequencing technologies are becoming increasingly popular for transcriptome analysis because they can provide more accurate and complete transcript assemblies and gene annotations, especially for complex transcriptomes. However, just as short reads must be assembled to obtain a sequenced transcriptome, long reads need to be preprocessed in order to define isoforms. To achieve this, OmicsBox now offers FLAIR (Full-Length Alternative Isoform analysis of RNA).
The FLAIR tool is a computational pipeline designed for the correction, isoform definition and quantification of transcriptomes using long-read sequencing technologies such as PacBio or Oxford Nanopore. The pipeline is based on a combination of alignment-based methods (using Minimap2) and subsequent de novo assembly to collapse long reads and get isoforms. This tool is optimized for the specific characteristics of long-read sequencing data such as high error rates and long read lengths. The tool is able to handle complex gene structures and alternative splicing events that may be challenging to detect with short-read data alone. In addition, FLAIR is able to quantify the discovered isoforms.
Please cite FLAIR as:
Tang, A. D., Soulette, C. M., van Baren, M. J., Hart, K., Hrabeta-Robinson, E., Wu, C. J., & Brooks, A. N. (2020). Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nature communications, 11(1), 1-12.
Run FLAIR for Long-Read Isoform Definition
FLAIR can be found under Transcriptomics → Long-Reads Analysis → Transcript Identification → Identification and Quantification with Flair. The wizard consists of 4 pages and lets you define the input and output options as well as the analysis parameters (Figure 2, Figure 3, Figure 4 and Figure 5).
Input 1
First of all, FLAIR requires the following files:
- Long-Reads Files: FASTA/Q files containing long reads from PacBio or ONT technologies.
- Reference Genome: FASTA file with the reference genome.
In addition, at least one of these two kinds of files must be provided as input:
- Genome Annotation: GTF file with annotations of the reference genome.
- Short-Reads Information: splice-junction information obtained with a splice-aware aligner such as STAR. You can add either the BAM files (provided that they have the XS tag, which denotes the strand orientation of an intron) or the tab files that STAR also produces.
Make sure that the reference genome and the reference annotation have the same version and that short reads were aligned to the same reference genome used here.
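A quick way to spot a mismatch between the two references is to compare the sequence names they use. The following is a minimal sketch, not part of OmicsBox or FLAIR: the function names and the toy data are hypothetical, and it only checks that every sequence name in the first column of the GTF also appears as a FASTA header. Matching names do not prove that both files come from the same genome version, but a missing name is a strong hint that they do not.

```python
def fasta_sequence_names(fasta_text):
    """Return the set of sequence names (first word of each '>' header line)."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_sequence_names(gtf_text):
    """Return the set of values found in the first (seqname) column of a GTF."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        names.add(line.split("\t")[0])
    return names


def missing_from_genome(fasta_text, gtf_text):
    """Sequence names used by the annotation but absent from the genome."""
    return sorted(gtf_sequence_names(gtf_text) - fasta_sequence_names(fasta_text))


# Toy data (hypothetical): the annotation refers to chrM, the genome lacks it.
genome = ">chr1 example\nACGT\n>chr2\nGGCC\n"
annotation = (
    "# toy annotation\n"
    "chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g1\"; transcript_id \"t1\";\n"
    "chrM\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g2\"; transcript_id \"t2\";\n"
)
print(missing_from_genome(genome, annotation))  # ['chrM']
```

For real files, the text would be read from disk first; very large genomes are better scanned line by line than loaded as a single string.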
Input 2
Additionally, some optional files can be included:
- Aligned Reads: BAM files obtained by aligning the input long-read FASTA/Q files. Enable this option if you want to map your long reads with a tool other than Minimap2 (the aligner used by FLAIR) or with a specific set of parameters.
- Short-read BAM Files: These BAM files must be the result of aligning short reads with a gapped mapper like STAR. They will be converted into a BED file that will be used in the alignment and correction steps to support the existence of the splice junctions that appear in the long-read BAM file.
- Reads Manifest: A tab-delimited file used in the quantification step with no header and these columns: sample name, condition, batch, filename. If this file is added, a count table will also appear as an output of FLAIR.
The reads manifest can only be uploaded if the ‘Quantify Reads’ option is checked. In addition, the last column (the filename) must match the filename of one of the inputs added in the Long-Reads Files box on the first page.
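The manifest layout described above can be checked before launching the wizard. The sketch below is illustrative only: `validate_manifest` and the example file names are hypothetical, and it assumes that the filename column is compared as an exact string against the names of the long-read inputs.

```python
def validate_manifest(manifest_text, long_read_filenames):
    """Return a list of problems found in a reads manifest (empty if none)."""
    problems = []
    known = set(long_read_filenames)
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append(
                f"line {number}: expected 4 columns, found {len(fields)}"
            )
            continue
        sample, condition, batch, filename = fields
        if filename not in known:
            problems.append(
                f"line {number}: '{filename}' is not one of the long-read inputs"
            )
    return problems


# Hypothetical example: two samples, no header, tab-delimited.
manifest = (
    "s1\tcontrol\tb1\tctrl_rep1.fastq\n"
    "s2\ttreated\tb1\ttreat_rep1.fastq\n"
)
inputs = ["ctrl_rep1.fastq", "treat_rep1.fastq"]
print(validate_manifest(manifest, inputs))  # []
```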
Configuration
On this page, the parameters for every FLAIR step can be set:
- Alignment:
- Native RNA: use native-RNA specific alignment parameters for minimap2. This flag indicates that the input consists of the original RNA sequences rather than pre-processed or adapter-trimmed sequences.
- Min. Mapping Quality: minimum mapping quality score of read alignment to the genome.
- Retain Secondary Alignments: number of secondary alignments from minimap2 to retain (i.e. alignments of the same read to other parts of the genome). Please proceed with caution: changing this setting is only useful if you know there are closely related homologues elsewhere in the genome, and it will likely decrease the quality of FLAIR's final results.
- Correction:
- Window Size: window size for correcting splice sites.
- Collapsing:
- Minimum Supporting Reads: minimum number of long reads to call an isoform.
- Window Size for TSS and TTS: window size for comparing transcript starts (TSS) and ends (TTS).
- Ends Determined at Isoform Level: when specified, TSS/TTS for each isoform will be determined from supporting reads for individual isoforms and not from genes.
- Get TSS and TTS from Supporting Reads: do not use TSS/TTS from the input GTF to adjust isoform TSS/TTS. Instead, the ends of each isoform will be determined from its supporting reads.
- How to Treat Redundant Isoforms:
- No redundancy control: best TSSs/TTSs chosen for each unique set of splice junctions.
- TSS/TTS that maximize length: choose this to maximize the length of transcripts.
- Most supported TSS/TTS: the single TSS/TTS most supported by reads.
- How to Filter Isoforms:
- Filter based on support: this is the default filter.
- Filter out subset isoforms: any isoforms that are a proper subset of another main isoform are removed.
- Both options: as the name states, both previous options are used.
- Both options and remove single-exon isoforms: the same as before, but single-exon isoforms are also removed. These isoforms are typically considered to be noise in transcriptome sequencing data and are often removed.
- Common Parameters (used by more than one of the steps above):
- Minimum Mapping Quality: minimum mapping quality of a read assignment to an isoform.
- Stringent Mode: supporting reads must cover 80% of their isoform and extend at least 25 nt into the first and last exons. If those exons are themselves shorter than 25 nt, the requirement is that the read must start within 4 nt from the start or end within 4 nt from the end.
- Check Splice Sites: enforces coverage of 4 out of 6 bp around each splice site and no insertions greater than 3 bp at the splice site.
- Trust Ends: specify if reads are generated from a long read method with minimal fragmentation.
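The Stringent Mode rule can be illustrated with a small sketch. This is not FLAIR's code: the function names are hypothetical, exons are given as half-open genomic intervals in ascending order, strand is ignored, and the read is simplified to its aligned start and end under the assumption that it follows the isoform's splice junctions. The handling of short terminal exons is one reading of the description above: the read start is compared with the isoform start, and the read end with the isoform end.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def end_is_supported(exon, read_start, read_end, is_first,
                     min_extension=25, tolerance=4):
    """Apply the terminal-exon rule to the first or last exon of an isoform."""
    exon_start, exon_end = exon
    if exon_end - exon_start >= min_extension:
        # Regular case: the read must extend at least 25 nt into the exon.
        return overlap(exon_start, exon_end, read_start, read_end) >= min_extension
    # Short terminal exon: the read boundary must lie within 4 nt of the
    # isoform boundary (assumed interpretation of the rule).
    if is_first:
        return abs(read_start - exon_start) <= tolerance
    return abs(read_end - exon_end) <= tolerance


def supports_stringently(exons, read_start, read_end, min_fraction=0.8):
    """True if the read covers 80% of the isoform and reaches both end exons."""
    isoform_length = sum(end - start for start, end in exons)
    covered = sum(overlap(s, e, read_start, read_end) for s, e in exons)
    if covered < min_fraction * isoform_length:
        return False
    return (end_is_supported(exons[0], read_start, read_end, is_first=True)
            and end_is_supported(exons[-1], read_start, read_end, is_first=False))


isoform = [(100, 150), (300, 700), (800, 900)]  # exons of 50, 400 and 100 nt
print(supports_stringently(isoform, 110, 890))  # True
print(supports_stringently(isoform, 140, 900))  # False: 10 nt into first exon
```

In the second call the read covers 510 of 550 nt, so the coverage requirement is met, but it extends only 10 nt into the first exon and is therefore rejected.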
Output
- Transcriptome Annotation: transcriptome annotation in GTF format. SQANTI3 can check the quality of the assembled transcriptome using this file.
- Transcriptome Sequences: sequences of each isoform of the transcriptome in FASTA format.
- Isoform-Read Map File: text file that links each defined isoform with the long-reads collapsed.
- Counts File: only if quantification has been applied. This file can be also used in SQANTI3 as the file with the full-length counts.
Results
FLAIR has the following outputs:
- Files:
- Transcriptome Annotation (GTF file). This file can be included in SQANTI3 to do the quality control and characterization of transcripts.
- Sequence Transcriptome (FASTA file). File with the sequence of all the defined isoforms.
- Isoform-Read Map File. This file links each defined isoform with the long reads used to define it.
- Counts File. File with the number of counts per isoform and per sample. This file can also be included in SQANTI3 as an additional file.
- Report with information of the correction and collapsing steps.
- Length Distribution Chart.
- Counts Matrix (if the Reads Manifest was added).
Report
This report first shows the input data used and then some summary metrics of the defined isoforms (number of isoforms and maximum, minimum and average length). After this summary, it shows the number of valid and dismissed transcripts during the correction step, and then, the number of isoforms created and of transcripts used for it. Finally, the chosen parameters are displayed.
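The summary metrics of the report (number of isoforms and minimum, maximum and average length) can be reproduced from the Transcriptome Sequences FASTA file. The sketch below is only an illustration with hypothetical function names and toy sequences; it is not how OmicsBox computes the report.

```python
def isoform_lengths(fasta_text):
    """Map each sequence name to its length, summing wrapped sequence lines."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # name is the first word of the header
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def length_summary(lengths):
    """Number of isoforms plus minimum, maximum and average length."""
    values = list(lengths.values())
    return {
        "isoforms": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Toy transcriptome (hypothetical): lengths 12, 4 and 8.
example = ">iso1\nACGTACGT\nACGT\n>iso2\nACGT\n>iso3\nACGTACGT\n"
print(length_summary(isoform_lengths(example)))
# {'isoforms': 3, 'min': 4, 'max': 12, 'mean': 8.0}
```

The same dictionary of lengths could be binned to approximate the Length Distribution Chart.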
Length Distribution Chart
Histogram with the distribution of lengths of the isoforms defined in the collapsing step. This histogram can help to determine the acceptable range of isoform lengths and which length threshold to set in SQANTI3.
Count Table
If a reads manifest was added, a transcript count table will also appear when FLAIR finishes. In the sidebar of this table you will see a button to perform Differential Expression Analysis. The names of the transcripts are the same as those that appear in the transcriptome generated by FLAIR.
The available tools to perform a Differential Expression Analysis are the same as the ones for short reads.
Count Table Charts
Different statistical charts can be generated from the count table. These charts provide additional information about the quantification, as well as a quality assessment of the resulting counts. These charts can be found under the Side Panel → Charts of the Count Table Viewer.
Library Size per Sample
Bar chart showing the number of read counts aligned to genomic features contained in each sample (Figure 9).
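The value plotted for each sample is the sum of its column in the count table. A minimal sketch, using a hypothetical in-memory table (transcript name mapped to one count per sample) rather than the OmicsBox count table object:

```python
def library_sizes(sample_names, counts_by_transcript):
    """Sum the counts of every transcript for each sample (column)."""
    totals = [0] * len(sample_names)
    for counts in counts_by_transcript.values():
        for index, value in enumerate(counts):
            totals[index] += value
    return dict(zip(sample_names, totals))


# Hypothetical count table with three samples.
samples = ["ctrl_1", "ctrl_2", "treat_1"]
table = {"iso1": [10, 12, 0], "iso2": [0, 0, 0], "iso3": [5, 3, 40]}
print(library_sizes(samples, table))
# {'ctrl_1': 15, 'ctrl_2': 15, 'treat_1': 40}
```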
Distribution of Counts
Box plot showing how counts are distributed within each sample across all transcripts (Figure 10). Features with 0 counts in all samples are discarded for this chart. The binary logarithm of the raw counts is represented.
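The values behind this box plot can be sketched as follows. The function name and the table are hypothetical. The pseudocount of 1 is an assumption added here so that zero counts in individual samples have a defined logarithm; the text above does not state how such zeros are handled.

```python
import math


def log2_counts_per_sample(sample_names, counts_by_transcript, pseudocount=1):
    """Drop all-zero transcripts, then return log2(count + pseudocount) per sample."""
    kept = [
        counts for counts in counts_by_transcript.values()
        if any(value > 0 for value in counts)
    ]
    return {
        name: [math.log2(row[index] + pseudocount) for row in kept]
        for index, name in enumerate(sample_names)
    }


# Hypothetical count table; iso2 has 0 counts everywhere and is discarded.
samples = ["ctrl_1", "ctrl_2", "treat_1"]
table = {"iso1": [7, 15, 0], "iso2": [0, 0, 0], "iso3": [3, 1, 31]}
print(log2_counts_per_sample(samples, table))
# {'ctrl_1': [3.0, 2.0], 'ctrl_2': [4.0, 1.0], 'treat_1': [0.0, 5.0]}
```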
Distribution of Library Sizes
Box plot showing the distribution of library sizes across all quantified samples (Figure 11).
PCA Plot
This feature performs a Principal Component Analysis and generates a 2D (Figure 12) or 3D (Figure 13) plot with the first two or three Principal Components, respectively. This chart helps to identify which samples are similar to each other in terms of gene expression. Ideally, samples belonging to the same condition should appear closer together in the plot.
PCA Plot in 3 Dimensions is only available for datasets with 3 or more samples.