PacBio-Based Identification with IsoSeq
Introduction
IsoSeq is a composable workflow of existing tools and algorithms, combined with a new clustering technique, that can process the ever-increasing yield of PacBio machines. Starting from subreads or CCS reads, it identifies transcripts in PacBio single-molecule sequencing data. The IsoSeq pipeline consists of up to eight steps:
- CCS Calling: Each sequencing run is processed by the ccs software to generate one representative circular consensus sequence (CCS) for each ZMW (Zero-mode Waveguide).
- Kinnex-Demultiplexing: Kinnex reads are demultiplexed into individual transcript reads using skera.
- Primer removal and demultiplexing: Removal of primers and identification of barcodes is performed using lima.
- Refine: This step consists of trimming of poly(A) tails and identification and removal of artificial concatemers (chimeric reads).
- Clustering: Clustering using hierarchical n*log(n) alignment and iterative cluster merging.
- Polishing (optional): Generate per base QVs for transcript consensus sequences and improve results.
- Mapping: Clustered reads are mapped to a reference genome.
- Collapsing: The mapped reads are collapsed into transcripts in order to define isoforms and obtain a transcriptome.
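For orientation, the steps above roughly correspond to the following chain of command-line calls. This is a minimal sketch, not the commands the wizard actually runs: the tool names ccs, lima and isoseq come from the step list, while pbmm2 as the mapper, every sub-command and flag, and all file names are assumptions that should be checked against the installed versions. The Kinnex (skera) and polishing steps are left out.

```python
from typing import List


def isoseq_commands(
    subreads: str,
    primers: str,
    reference: str,
    prefix: str = "sample",
    primer_pair: str = "primer_5p--primer_3p",
) -> List[List[str]]:
    """Build (but do not run) one command per step for a non-Kinnex subread input."""
    ccs_bam = f"{prefix}.ccs.bam"
    fl_bam = f"{prefix}.fl.bam"
    # Assumption: lima adds the detected primer pair to the output file name.
    fl_pair_bam = f"{prefix}.fl.{primer_pair}.bam"
    flnc_bam = f"{prefix}.flnc.bam"
    clustered_bam = f"{prefix}.clustered.bam"
    mapped_bam = f"{prefix}.mapped.bam"
    collapsed_gff = f"{prefix}.collapsed.gff"
    return [
        # CCS calling: one consensus read per ZMW.
        ["ccs", subreads, ccs_bam],
        # Primer removal and demultiplexing.
        ["lima", ccs_bam, primers, fl_bam, "--isoseq", "--peek-guess"],
        # Refine: poly(A) trimming and concatemer removal.
        ["isoseq", "refine", fl_pair_bam, primers, flnc_bam, "--require-polya"],
        # Clustering of full-length non-chimeric reads.
        ["isoseq", "cluster2", flnc_bam, clustered_bam],
        # Mapping (assumed mapper and argument order).
        ["pbmm2", "align", "--preset", "ISOSEQ", "--sort",
         reference, clustered_bam, mapped_bam],
        # Collapsing mapped reads into isoforms.
        ["isoseq", "collapse", mapped_bam, collapsed_gff],
    ]
```

Each inner list can be printed for inspection or passed to `subprocess.run` once the flags have been verified.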
Please cite IsoSeq as:
IsoSeq v4. Scalable De Novo Isoform Discovery. Töpfer, A. and Tseng, E. 2018. Retrieved 2024 from https://github.com/PacificBiosciences/IsoSeq
Run IsoSeq
This functionality can be found under Transcriptomics → Long-Reads Analysis → Transcript Identification → PacBio-Based Identification with IsoSeq3. The wizard allows adjusting analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).
Input
- Data Type:
  - Subreads: Subreads are the continuous raw sequences produced from a single pass of the polymerase around the circular DNA template.
  - Circular Consensus Sequence / HiFi: CCS refers to the high-accuracy consensus sequence derived from multiple subreads of the same circularized DNA molecule. HiFi (High-Fidelity) reads are high-accuracy consensus sequences generated by this CCS technology.
- Long-Read Files: Select the files containing PacBio sequencing reads in the selected data type.
- Primers/Barcodes File: Specify a FASTA file with primer or barcoded primer sequences. This file will be used in lima in order to remove primers and demultiplex input reads.
- Perform Deconcatenation with Skera: Check this option if you want to perform deconcatenation. Skera is used to deconcatenate or split the HiFi reads generated using Kinnex (formerly MAS-Seq) methodology at adapter positions, generating segmented reads (S-reads). For each input BAM file (e.g., HiFi), skera will create a BAM file with deconcatenated reads. A parent HiFi read can contain many S-reads.
- Deconcatenation Barcodes File: Specify a FASTA file with Kinnex barcodes to be used by skera.
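Before starting a run it can help to sanity-check the primers FASTA. The sketch below parses FASTA text and reports suspicious records. It assumes that primer names should end in `_5p` or `_3p`, which is the naming convention lima is understood to expect in its IsoSeq mode; treat that rule as an assumption and verify it against the lima documentation. The function names are hypothetical helpers, not part of any tool.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: upper-case sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_primers(records: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: List[str] = []
    for name, seq in records.items():
        # Assumed convention: 5' primers end in _5p, 3' primers in _3p.
        if not name.endswith(("_5p", "_3p")):
            problems.append(f"{name}: name lacks a _5p or _3p suffix")
        if not seq:
            problems.append(f"{name}: empty sequence")
        elif set(seq) - set("ACGTN"):
            problems.append(f"{name}: unexpected characters in sequence")
    return problems
```

For example, `check_primers(read_fasta(open("primers.fasta").read()))` returns an empty list when every record passes both checks (the file name is a placeholder).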
Configuration 1
Circular Consensus Sequence Calling
- Minimum Passes: Minimum number of full-length subreads required to generate CCS for a ZMW.
- Minimum SNR: Minimum SNR of subreads to use for generating CCS.
- Minimum Length: Minimum draft length.
- Skip Polishing: Only output the initial draft template (faster, less accurate). It does not refer to the last optional polishing step.
- Minimum Predicted Accuracy: Establish the minimum predicted accuracy (0 - 1).
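To get a feel for the accuracy threshold, a predicted accuracy can be converted to a Phred-scaled quality value with QV = -10 * log10(1 - accuracy), so 0.99 corresponds to roughly Q20 and 0.999 to roughly Q30. The sketch below shows that conversion together with a simple filter combining the four thresholds listed above. The `CcsCandidate` record and `passes_filters` function are hypothetical illustrations, not the ccs implementation, and no default values are implied.

```python
import math
from dataclasses import dataclass


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a predicted accuracy in [0, 1] to a Phred-scaled QV."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy == 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - accuracy)


@dataclass
class CcsCandidate:
    """Hypothetical per-ZMW summary used only for this illustration."""
    full_passes: int
    snr: float
    draft_length: int
    predicted_accuracy: float


def passes_filters(
    read: CcsCandidate,
    min_passes: int,
    min_snr: float,
    min_length: int,
    min_accuracy: float,
) -> bool:
    """Return True if the candidate meets all four minimum thresholds."""
    return (
        read.full_passes >= min_passes
        and read.snr >= min_snr
        and read.draft_length >= min_length
        and read.predicted_accuracy >= min_accuracy
    )
```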
Primer Removal and Demultiplexing
- Minimum Score: Reads below the minimum barcode score are removed from downstream analysis.
- Minimum End Score: Minimum end barcode score threshold is applied to the individual leading and trailing ends.
- Minimum Signal Increase: The minimal score difference, between first and combined, required to call a barcode pair different.
- Minimum Score Lead: The minimal score lead required to call a barcode pair significant.
- Peek Guess: Try to infer the used barcodes subset, by peeking at the first 50000 ZMWs, whitelisting barcode pairs with more than 10 counts and mean score >= 45. Check this option to remove spurious false-positive signals.
- Merge by Barcode: If this option is checked, reads will be merged by barcode. This is useful if the input consists of multiple files, e.g. from multiple SMRTCells, containing multiple barcodes, e.g. when the same samples were run in multiple SMRTCells.
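The Peek Guess rule described above can be illustrated with a short sketch: look at the first 50000 ZMWs, count how often each barcode pair is called, and keep only pairs with more than 10 counts and a mean score of at least 45. This only illustrates the stated thresholds and is not lima's actual implementation; the input format (one `(barcode_pair, score)` tuple per ZMW) is an assumption.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, List, Set, Tuple

BarcodePair = Tuple[str, str]


def peek_guess(
    calls: Iterable[Tuple[BarcodePair, float]],
    peek: int = 50000,
    min_count: int = 10,
    min_mean_score: float = 45.0,
) -> Set[BarcodePair]:
    """Return the barcode pairs that pass the whitelist thresholds."""
    # For each pair, store [number of ZMWs, sum of scores].
    totals: Dict[BarcodePair, List[float]] = defaultdict(lambda: [0, 0.0])
    for pair, score in islice(calls, peek):
        totals[pair][0] += 1
        totals[pair][1] += score
    return {
        pair
        for pair, (count, score_sum) in totals.items()
        if count > min_count and score_sum / count >= min_mean_score
    }
```

Pairs that appear only a handful of times, or only with low scores, are treated as spurious and dropped from the whitelist.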
Configuration 2
Refine
- Remove Poly(A) Tails: Check this option if your sample has poly(A) tails. This filters for FL reads that have a poly(A) tail with at least the number of base pairs set in the following parameter. It removes identified tails.
- Minimum Poly(A) Tail Length: Establish the minimum poly(A) tail length.
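The effect of these two parameters can be sketched as follows: a read is kept only if it ends in a run of at least the minimum number of A bases, and that run is then trimmed. This is a deliberately simplified illustration that requires an exact run of A's at the 3' end; the actual refine step can be expected to tolerate sequencing errors within the tail. The function name is hypothetical.

```python
from typing import Optional


def trim_polya(seq: str, min_length: int) -> Optional[str]:
    """Return the read without its poly(A) tail, or None if the tail is too short."""
    # Length of the uninterrupted run of A's at the 3' end.
    tail_length = len(seq) - len(seq.rstrip("Aa"))
    if tail_length < min_length:
        return None  # read is filtered out
    return seq[: len(seq) - tail_length]
```

For instance, with a minimum length of 20, a read ending in 25 A's is kept and trimmed, while a read ending in only 4 A's is discarded.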
Clustering
- Perform Clustering: Whether the clustering step should be performed. This is required for the subsequent mapping and collapsing steps.
- Use CCS QVs: Use the per-base quality values (QVs) of the CCS reads during clustering. If this option is checked, the POA (Partial Order Alignment) coverage is set to 100.
Polishing
- Perform Polishing: If the input files were subreads, this optional step can improve results by generating per base QVs for transcript consensus sequences.
Note that this step is outdated and only offered to support legacy data sets. It is very time-consuming, and the improvement in quality is often unnecessary.
- RQ Cutoff: RQ cutoff for fastx output.
- Coverage: Maximum number of subreads used for polishing.
Configuration 3
Map and Collapsing Step
- Perform Collapsing: Whether the mapping and collapsing steps should be performed.
- Reference Genome: Reference genome to which the clustered reads are aligned in order to build the transcriptome model.
- Minimum Alignment Coverage: Ignore alignments with less than minimum query read coverage.
- Minimum Alignment Identity: Ignore alignments with less than minimum alignment identity.
- Max. Size of Fuzzy Junction: Ignore mismatches or indels shorter than or equal to N.
- 5' Difference in Exon: Maximum allowed 5' difference on same exon.
- 3' Difference in Exon: Maximum allowed 3' difference on same exon.
- Collapse Smaller Transcripts: Collapse 5' shorter transcripts which miss one or multiple 5' exons to a longer transcript.
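How the junction and end-difference parameters interact can be sketched with a small comparison function. It assumes that two alignments are collapsible when they lie on the same chromosome and strand, have the same number of splice junctions with each junction coordinate differing by no more than the fuzzy-junction size, and have 5' and 3' ends within the allowed differences. This is a simplified reading of the parameters above, not the collapse algorithm itself; the `Alignment` record is hypothetical, and the "Collapse Smaller Transcripts" option (missing 5' exons) is not modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one mapped transcript."""
    chrom: str
    strand: str                              # "+" or "-"
    start: int                               # leftmost genomic coordinate
    end: int                                 # rightmost genomic coordinate
    junctions: Tuple[Tuple[int, int], ...]   # intron (left, right) in genomic order


def transcript_ends(aln: Alignment) -> Tuple[int, int]:
    """Return (5' end, 3' end) taking the strand into account."""
    return (aln.start, aln.end) if aln.strand == "+" else (aln.end, aln.start)


def can_collapse(
    a: Alignment,
    b: Alignment,
    max_fuzzy_junction: int,
    max_5p_diff: int,
    max_3p_diff: int,
) -> bool:
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return False
    if len(a.junctions) != len(b.junctions):
        return False
    # Every junction must match within the fuzzy-junction tolerance.
    for (left_a, right_a), (left_b, right_b) in zip(a.junctions, b.junctions):
        if abs(left_a - left_b) > max_fuzzy_junction:
            return False
        if abs(right_a - right_b) > max_fuzzy_junction:
            return False
    a5, a3 = transcript_ends(a)
    b5, b3 = transcript_ends(b)
    return abs(a5 - b5) <= max_5p_diff and abs(a3 - b3) <= max_3p_diff
```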
Output
- FLNC Reads: In this directory, the pre-processed FLNC (Full-Length Non-Chimeric) reads will be saved. If collapsing was performed, an abundance file will also be saved here.
- Clustered Sequences in FASTA Format: In this directory, the clustered high-quality transcript sequences will be saved. This is only available if clustering is performed. If collapsing is performed as well, a clustering report will also be generated.
- Polished Sequences in FASTQ Format: In this directory, the polished sequences will be saved. This is only available if polishing is performed.
- Transcriptome FASTA File: The collapsed transcript sequences. This is only available if collapsing is performed.
- Transcriptome Annotation File: The collapsed transcripts in GTF format. This is only available if collapsing is performed.
Results
The main output is the clustered/polished .fasta file. It contains the transcripts identified from the input data. Additional BAM and FASTQ files contain the same information in a different format.
The report.csv file contains information about how many PacBio reads have contributed to the reconstruction of each transcript.
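The number of reads supporting each transcript can be tallied from such a report with a few lines of Python. The sketch below assumes a comma-separated file with one row per read and a `cluster_id` column naming the transcript; the column names and the example rows are assumptions made for illustration and should be checked against the header of the actual report.csv.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def reads_per_transcript(handle: TextIO, id_column: str = "cluster_id") -> Counter:
    """Count how many rows (reads) are assigned to each transcript."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        counts[row[id_column]] += 1
    return counts


# Made-up example content, used only to show the expected shape.
example = io.StringIO(
    "cluster_id,read_id,read_type\n"
    "transcript/0,movie/1/ccs,FL\n"
    "transcript/0,movie/2/ccs,FL\n"
    "transcript/1,movie/3/ccs,FL\n"
)
example_counts = reads_per_transcript(example)
```

With a real file, the same function can be called as `reads_per_transcript(open("report.csv", newline=""))`.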
In addition, a report and several charts are generated with complementary information. The report shows a summary of the IsoSeq results (Figure 6).
A separate report can also be opened for each input sample (Figure 7). These reports contain additional details about the processing of each sample.
Three types of charts are generated:
Length Distribution
Shows the distribution of the lengths of the resulting transcripts (Figure 8).
Coverage Distribution
Shows the distribution of the coverage, that is, the number of reads supporting each transcript (Figure 9).
Subreads Distribution
This chart is only generated if the polishing step is applied. It shows the distribution of subreads supporting the resulting transcripts (Figure 10).