Reference-free Isoform Reconstruction

Introduction

With the advancement of transcriptomics through long-read technologies, various reference-guided tools for reconstructing a transcriptome from long reads have become publicly available. As their name suggests, these tools rely on a reference genome and its annotation. However, for non-model organisms it is common to need a differential expression analysis without access to a characterized genome or transcriptome. To address this challenge, we have incorporated the isONpipeline (isONclust, isONcorrect and isONform) into OmicsBox. This pipeline reconstructs a transcriptome from long reads produced by PacBio or Oxford Nanopore Technologies (ONT).

Reference-free Isoform Reconstruction

Reference-Free Isoform Reconstruction can be found in the Transcriptomics Module of OmicsBox under Transcriptomics → Long-Reads Analysis → Transcript Identification → Reference-free Isoform Reconstruction. The wizard consists of three pages, on which you define the input and output options and the analysis parameters (Figure 1, Figure 2, Figure 3).

Input Page

On this page you can select the long-read files of a sequenced transcriptome and set the parameters for filtering reads by length.

Long-Read Files: FASTQ files with long reads that come from PacBio or ONT sequencing technologies.

If you select multiple files containing reads, all the reads from those files are combined into a single file before the tool runs. If you have multiple transcriptomes sequenced in different files, please make sure to run this tool on each file separately.

Minimum Read Length: reads shorter than this value will be filtered out.
Maximum Read Length: reads longer than this value will be filtered out.

While filtering outliers can improve the runtime of the algorithm, it is also vital not to exclude too many reads in order to reconstruct the complete sequenced transcriptome.
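
The length filter described above can be sketched in a few lines of Python. This is only an illustration of the rule (reads shorter than the minimum or longer than the maximum are discarded), not the code OmicsBox runs. The function names `parse_fastq` and `filter_by_length` are hypothetical, and the parser assumes well-formed four-line FASTQ records.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records.

    Assumption: every record has exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def filter_by_length(
    records: Iterable[FastqRecord], min_len: int, max_len: int
) -> Iterator[FastqRecord]:
    """Keep only reads whose length lies within [min_len, max_len]."""
    for record in records:
        if min_len <= len(record[1]) <= max_len:
            yield record


# Toy input: three reads of length 4, 10 and 20.
example_reads = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@read3", "ACGTACGTACGTACGTACGT", "+", "I" * 20,
]
kept = list(filter_by_length(parse_fastq(example_reads), min_len=5, max_len=15))
kept_names = [header for header, _, _ in kept]
```

With a minimum of 5 and a maximum of 15, only the 10-base read is kept. Reads from several files could be combined by chaining their records before filtering, which mirrors how multiple input files are treated as one dataset.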

Configuration

General Parameters: these parameters do not belong to a specific step.
Reconstruction Pipeline: pipeline that you want to run on your long-read dataset.
- ONT Pipeline with Pychopper: in this pipeline, before running the isONpipeline (isONclust, isONcorrect and isONform), Pychopper is executed in order to trim and filter cDNA ONT reads.
- ONT Pipeline without Pychopper: only the isONpipeline is executed. This might be useful for already-corrected cDNA datasets or for data from non-cDNA ONT technologies.
- PacBio Pipeline: for PacBio reads, only isONclust and isONform are executed, as specified by the isONpipeline.
Isoform Read Support: number of reads required to call a cluster (in isONclust) and a transcript (in isONform).
Clustering (isONclust):
K-mer Size: length of the k-mers used to build a hash table before clustering.
Window Size: length of the sliding window used to obtain the k-mers for the hash table.
Minimum Mapped Fraction: minimum mapped fraction of a read for it to be included in a cluster. The density of minimizers needed to classify a region as mapped depends on the quality of the read.
Minimum Aligned Fraction: minimum aligned fraction of a read for it to be included in a cluster. Alignment identity depends on the quality of the read.
Minimizers Shared with Cluster: minimum number of minimizers that a read must share with a cluster in order to be assigned to it.

For ONT data, a k-mer size of 13 and a window size of 20 seem to be optimal. For PacBio data, the corresponding values are 15 and 50.
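
To make the roles of the k-mer size and the window size more concrete, the following Python sketch extracts minimizers from a sequence and counts how many are shared between two reads. It is an illustration under stated assumptions, not isONclust's implementation. It assumes that a window spans w bases (that is, w - k + 1 consecutive k-mers) and that the minimizer is the lexicographically smallest k-mer in the window. The function names `minimizers` and `shared_minimizers` are hypothetical.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[str, int]]:
    """Return the (k-mer, position) minimizers of seq.

    Assumptions: a window covers w bases (w - k + 1 consecutive k-mers) and
    the minimizer is the lexicographically smallest k-mer, leftmost on ties.
    """
    if k > w:
        raise ValueError("k must not exceed w")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    per_window = w - k + 1
    result: List[Tuple[str, int]] = []
    for start in range(len(kmers) - per_window + 1):
        window = kmers[start:start + per_window]
        # min() returns the first smallest element, so ties go to the left.
        offset = min(range(len(window)), key=lambda i: window[i])
        hit = (window[offset], start + offset)
        # Adjacent windows often select the same k-mer; store it only once.
        if not result or result[-1] != hit:
            result.append(hit)
    return result


def shared_minimizers(read_a: str, read_b: str, k: int, w: int) -> int:
    """Count distinct minimizer k-mers that occur in both reads."""
    set_a = {kmer for kmer, _ in minimizers(read_a, k, w)}
    set_b = {kmer for kmer, _ in minimizers(read_b, k, w)}
    return len(set_a & set_b)


# Toy example with small values; real runs use values such as k=13, w=20.
example = minimizers("ACGTTGCA", k=3, w=5)
```

A larger window selects fewer minimizers per read, which reduces the size of the hash table but also reduces the number of minimizers available for comparing a read with a cluster.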

Correction (isONcorrect):
K-mer Size: length of the k-mers used to build a hash table before the correction step.
Window Size: length of the sliding window used to obtain the k-mers.
Reconstruction (isONform):
K-mer Size: length of the k-mers used to build a hash table before clustering.
Window Size: length of the sliding window used to obtain the k-mers for the hash table.
Maximum Difference in 3': maximum length difference at 3' end, for which subisoforms are still merged into longer isoforms.
Maximum Difference in 5': same at 5' end.
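
The effect of the two maximum-difference parameters can be illustrated with a simplified Python sketch. It is not isONform's algorithm. It assumes that both sequences are given in 5' to 3' orientation and that the shorter isoform occurs as an exact substring of the longer one, which real, error-containing reads generally do not satisfy. The function names `end_differences` and `should_merge` are hypothetical.

```python
from typing import Optional, Tuple


def end_differences(longer: str, shorter: str) -> Optional[Tuple[int, int]]:
    """Return (5' difference, 3' difference) in bases, or None if no match.

    Assumption: sequences are 5'->3' and `shorter` matches `longer` exactly.
    """
    start = longer.find(shorter)
    if start == -1:
        return None
    diff_5 = start                               # bases missing at the 5' end
    diff_3 = len(longer) - start - len(shorter)  # bases missing at the 3' end
    return diff_5, diff_3


def should_merge(longer: str, shorter: str, max_diff_5: int, max_diff_3: int) -> bool:
    """Merge the subisoform only if both end differences are within limits."""
    diffs = end_differences(longer, shorter)
    if diffs is None:
        return False
    diff_5, diff_3 = diffs
    return diff_5 <= max_diff_5 and diff_3 <= max_diff_3


# Toy example: the shorter isoform lacks 3 bases at 5' and 1 base at 3'.
long_isoform = "AAACCCGGGTTT"
sub_isoform = "CCCGGGTT"
toy_diffs = end_differences(long_isoform, sub_isoform)
```

In this toy case the subisoform would be merged with limits of 5 at both ends, but kept as a separate isoform if the 5' limit were lowered to 2.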

Output

Transcriptome FASTA File: folder to save the final transcriptome FASTA file.

Results

The main output is the reconstructed transcriptome in FASTA format.

A summary report is also generated. It contains the input long-read filenames, information about the reconstructed transcriptome (number of isoforms and maximum, minimum and average length), the parameters set and the references (see Figure 4). In addition, two charts are produced:

Isoform Length Distribution: distribution of isoform lengths in the reconstructed transcriptome (see Figure 5).
Isoform Support Distribution: distribution of the number of reads used to reconstruct a transcript (see Figure 6).

**Figure 5.** Isoform Length Distribution

**Figure 6.** Isoform Support Distribution
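
The length statistics listed in the summary report (number of isoforms and maximum, minimum and average length) can be recomputed from the output FASTA file with a short Python sketch. This is an independent illustration, not the code used to build the report. The function names `fasta_lengths` and `summarize` are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in a FASTA file."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)  # sequences may be wrapped over several lines
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Compute summary statistics; expects at least one isoform."""
    return {
        "isoforms": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }


# Toy FASTA content with three isoforms of length 8, 4 and 12.
example_fasta = [">t1", "ACGT", "ACGT", ">t2", "ACGT", ">t3", "ACGTACGTACGT"]
summary = summarize(fasta_lengths(example_fasta))
```

For a real file, the lines can be supplied by opening the FASTA file and passing the file object to `fasta_lengths`. The resulting list of lengths is also the data behind a length distribution chart.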