Reference-free Isoform Reconstruction

Introduction

Advances in long-read sequencing have made various reference-guided tools for transcriptome reconstruction publicly available. As the name suggests, these tools rely on a reference genome and its annotation. For non-model organisms, however, a differential expression analysis is often needed when no characterized genome or transcriptome is available. To address this, OmicsBox incorporates the isONpipeline (isONclust, isONcorrect and isONform), which reconstructs a transcriptome from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) long reads.

Reference-free Isoform Reconstruction

Reference-free Isoform Reconstruction can be found in the Transcriptomics Module of OmicsBox under Transcriptomics → Long-Reads Analysis → Transcript Identification → Reference-free Isoform Reconstruction. The wizard consists of 3 pages, where you define the input and output options as well as the analysis parameters (Figure 1, Figure 2, Figure 3).

Input Page

On this page you can select the long-read files from which the transcriptome is reconstructed and set the read-length filtering parameters.

Long-Read Files: FASTQ files with long reads that come from PacBio or ONT sequencing technologies.

If you select multiple files, the reads from all of them are combined into a single file before the tool runs. If you have multiple transcriptomes sequenced in different files, run this tool on each file separately.

Minimum Read Length: reads shorter than this value will be filtered out.
Maximum Read Length: reads longer than this value will be filtered out.

While filtering outliers can improve the runtime of the algorithm, it is also vital not to exclude too many reads in order to reconstruct the complete sequenced transcriptome.
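The length filter described above can be sketched in Python as follows. This is a minimal illustration and not the OmicsBox implementation: the function names `read_fastq` and `filter_by_length` are hypothetical, the parser assumes four-line FASTQ records, and reads whose length equals either threshold are assumed to be kept.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def filter_by_length(
    records: Iterable[FastqRecord], min_len: int, max_len: int
) -> Iterator[FastqRecord]:
    """Keep reads whose length lies within [min_len, max_len] (assumed inclusive)."""
    for header, sequence, quality in records:
        if min_len <= len(sequence) <= max_len:
            yield header, sequence, quality


# Toy input: three reads of length 4, 10 and 2.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read3\nAC\n+\nII\n"
)
kept = list(filter_by_length(read_fastq(example), min_len=3, max_len=8))
print([header for header, _, _ in kept])  # prints ['@read1']
```

Widening the interval keeps more reads at the cost of runtime, which is the trade-off noted above.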

**Figure 1:** Input page of the OmicsBox isONpipeline wizard.

Configuration

General Parameters: these parameters do not belong to a specific step.
- Reconstruction Pipeline: The option to run the isONpipeline with PyChopper integrated has been temporarily removed as we work to restructure and improve the implementation of the isONpipeline in OmicsBox. If you need help to run PyChopper to pre-process your ONT data, please contact support for assistance.
  - ONT Pipeline without Pychopper: only the isONpipeline is executed. This might be useful for already-corrected cDNA datasets or for data from non-cDNA ONT technologies.
  - PacBio Pipeline: for PacBio reads, only isONclust and isONform are executed, as specified by the isONpipeline.
- Isoform Read Support: minimum number of reads required to call a cluster (in isONclust) and a transcript (in isONform).

Clustering (isONclust):
- K-mer Size: length of the k-mers used to build a hash table before clustering.
- Window Size: length of the sliding window from which the k-mers for the hash table are obtained.
- Minimum Mapped Fraction: minimum mapped fraction of a read for it to be included in a cluster. The density of minimizers required to classify a region as mapped depends on the quality of the read.
- Minimum Aligned Fraction: minimum aligned fraction of a read for it to be included in a cluster. The required alignment identity depends on the quality of the read.
- Minimizers Shared with Cluster: minimum number of minimizers that a read must share with a cluster to be assigned to it.

Info

For ONT data, a k-mer size of 13 and a window size of 20 seem to be optimal. For PacBio data, the corresponding values are 15 and 50.
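To illustrate how the k-mer size and window size interact, the following Python sketch selects minimizers from a sequence and counts those shared by two sequences. It is a simplified illustration under stated assumptions, not the isONclust code: the function names are hypothetical, a window is assumed to span `w` bases (so it holds `w - k + 1` consecutive k-mers), and the minimizer of a window is assumed to be its lexicographically smallest k-mer. Small values of `k` and `w` are used only to keep the example readable.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[str, int]]:
    """Return the (k-mer, position) minimizers of seq.

    Assumptions: a window spans w bases, i.e. w - k + 1 consecutive k-mers,
    and the minimizer is the lexicographically smallest k-mer in the window
    (the leftmost one on ties).
    """
    if k > w or len(seq) < w:
        return []
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    per_window = w - k + 1
    selected: List[Tuple[str, int]] = []
    for start in range(len(kmers) - per_window + 1):
        window = kmers[start:start + per_window]
        offset = min(range(per_window), key=lambda i: window[i])
        hit = (window[offset], start + offset)
        # Consecutive windows often pick the same k-mer; store it once.
        if not selected or selected[-1] != hit:
            selected.append(hit)
    return selected


def shared_minimizers(a: str, b: str, k: int, w: int) -> int:
    """Count distinct minimizer k-mers that two sequences have in common."""
    set_a = {kmer for kmer, _ in minimizers(a, k, w)}
    set_b = {kmer for kmer, _ in minimizers(b, k, w)}
    return len(set_a & set_b)


read = "ACGTTGCA"
representative = "ACGTTTTT"
print(minimizers(read, k=3, w=5))
print(shared_minimizers(read, representative, k=3, w=5))
```

A larger window keeps fewer minimizers per read, which reduces the work done per comparison but also leaves fewer minimizers that can be shared with a cluster.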

Correction (isONcorrect):
- K-mer Size: length of the k-mers used to build a hash table before the correction step.
- Window Size: length of the sliding window to obtain the k-mers.

Reconstruction (isONform):
- K-mer Size: length of the k-mers used to build a hash table before clustering.
- Window Size: length of the sliding window from which the k-mers for the hash table are obtained.
- Maximum Difference in 3': maximum length difference at the 3' end for which subisoforms are still merged into longer isoforms.
- Maximum Difference in 5': maximum length difference at the 5' end for which subisoforms are still merged into longer isoforms.
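The effect of the two end-difference thresholds can be pictured with the following Python sketch. It is a deliberately simplified illustration and does not reproduce how isONform works internally: the function `can_merge` is hypothetical, and it assumes that the shorter isoform occurs as an exact substring of the longer one, which real, noisy reads would not guarantee.

```python
def can_merge(longer: str, shorter: str, max_diff_5: int, max_diff_3: int) -> bool:
    """Decide whether a subisoform may be merged into a longer isoform.

    Assumption: the subisoform matches the longer isoform exactly, so the
    length differences at both ends follow directly from the match position.
    """
    start = longer.find(shorter)
    if start < 0:
        return False  # not contained in the longer isoform at all
    diff_5 = start                                  # bases missing at the 5' end
    diff_3 = len(longer) - (start + len(shorter))   # bases missing at the 3' end
    return diff_5 <= max_diff_5 and diff_3 <= max_diff_3


isoform = "AAACGTACGTTT"
subisoform = "CGTACG"  # lacks 3 bases at each end of the longer isoform

print(can_merge(isoform, subisoform, max_diff_5=3, max_diff_3=3))
print(can_merge(isoform, subisoform, max_diff_5=2, max_diff_3=3))
```

With both thresholds set to 3 the subisoform is absorbed into the longer isoform. Lowering the 5' threshold to 2 keeps it as a separate isoform.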

**Figure 2:** Configuration page of the OmicsBox isONpipeline wizard.

Output

Transcriptome FASTA File: location to save the final transcriptome FASTA file.

Quantification File: location to save the transcript count table (.csv file). Quantification is grouped by input files.

**Figure 3:** Output page of the OmicsBox isONpipeline wizard.

Results

The main outputs are the reconstructed transcriptome in FASTA format and the quantification count table.

A summary report is also generated (see Figure 4). It lists the input FASTQ filenames, information about the reconstructed transcriptome (number of isoforms and maximum, minimum and average length), the parameters used and the references. In addition, two charts will be output:

Isoform Length Distribution: distribution of the lengths of the reconstructed isoforms (see Figure 5).
Isoform Support Distribution: distribution of the number of reads used to reconstruct each transcript (see Figure 6).
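The transcriptome statistics listed in the summary report (number of isoforms and maximum, minimum and average length) can be recomputed from the output FASTA file, for example to compare several runs. The following Python sketch shows one way to do this. The function names `fasta_lengths` and `summarize` are hypothetical, and the in-memory example stands in for a real output file.

```python
import io
from statistics import mean
from typing import Dict, List, TextIO


def fasta_lengths(handle: TextIO) -> List[int]:
    """Return the sequence length of every record in a FASTA stream."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)  # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Compute the isoform count and length statistics (needs at least one record)."""
    return {
        "isoforms": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }


# Toy transcriptome with three isoforms of length 6, 10 and 2.
example = io.StringIO(">t1\nACGT\nAC\n>t2\nACGTACGTAC\n>t3\nAC\n")
lengths = fasta_lengths(example)
summary = summarize(lengths)
print(lengths)
print(summary)
```

The list returned by `fasta_lengths` is also the raw data behind a length distribution chart, so it can be passed to any plotting library to draw a histogram.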

**Figure 4:** Example Summary Report of an isONpipeline run in OmicsBox.

**Figure 5:** Customizable chart of the length distribution of discovered transcripts.

**Figure 6:** Customizable chart of the support distribution of discovered transcripts.