Long-Read Alignment

Introduction

Minimap2 is a versatile sequence alignment tool widely used in genomics and transcriptomics research. It efficiently aligns DNA or RNA long reads to a reference genome or among them, allowing for accurate identification of sequence similarities and aligning reads to a reference. Minimap2's speed, sensitivity, and ability to handle long reads make it an essential tool for tasks like genome assembly or transcriptome reconstruction. With its diverse range of applications, Minimap2 provides researchers with a powerful tool to explore genomic data and gain deeper insights into the complex biological processes underlying diverse organisms.

Minimap2 follows a typical seed-chain-align procedure as is used by most full-genome aligners. It begins by identifying short seed matches between the query sequence and the reference genome, using a hash minimizer index for rapid retrieval. Then, it constructs longer alignments by chaining and extending the seed matches, considering their quality and proximity. Minimap2 employs bidirectional extension and dynamic programming to accurately align sequences, considering parameters such as gaps and mismatches.

Please cite Minimap2 as:

Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100.

Run Minimap2 to Align Sequences

Minimap2 can be found under Transcriptomics → Long-Read Analysis → Long-Read Alignment with Minimap2.The wizard consists of 5 pages and allows to define the input and output options as well as the analysis parameters (Figure 1, Figure 2, Figure 3, Figure 4 and Figure 5).

Input

First of all, FLAIR requires some necessary files:
Long-Reads Files: FASTA/Q files containing long reads proceeding from PacBio or ONT technologies.
Reference Genome: FASTA file with the reference genome.

Configuration 1

Ignore Base Quality: check it to ignore base quality in the input file.
Preset Options:
Preset Options:
- Map ONT Reads (Default): align noisy long reads of ~10% error rate to a reference genome. This option can be used with ONT long reads.
- Map PacBio HiFi Reads: align PacBio high-fidelity (HiFi) reads to a reference genome.
- Map PacBio CLR Reads: align older PacBio continuous long (CLR) reads to a reference genome.
- Long-Read Spliced Mapping: long-read spliced alignment. In the splice mode:
- Long deletions are taken as introns and represented as the ‘N’ CIGAR operator.
- Long insertions are disabled.
- Deletion and insertion gap costs are different during chaining.
- The computation of the ‘ms’ tag ignores introns to demote hits to pseudogenes.
- Long-Read Splice Alignment for PacBio CCS Reads: This preset is similar to the last one but it's designed for high-quality data or small genomes.
- Indexing Options: Indexing referes to the first step of minimap2 and the majority of full-genome aligners. Minimap2 obatins minimizers and index them in a hash table. When chosen a preset index, recommended index parameters are already chosen. Nevertheless, you are able to modify these parameters:
Minimizer k-mer Length: Number of nucleotides for each minimizer. A minimizer is the smallest k-mer in a window of consecutive k-mers.
Minimizer Window Size: Number of consecutive k-mers. It is recommended that this value represents 2/3 of the k-mer length.
Use HPC Minimizers: This option is recommended if you are going to map PacBio Long Reads. A Homopolymer-Compressed Minimizer (HPC) sequence is constructed by contracting homopolymer runs to a single base. An HPC minimizer is a minimizer on the HPC sequence.
Seeding Options: Seeding refers to the second step of minimap2. In this step, minimizers are taken as seeds to initiate alignments. This is an efficient method to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences.
Ignore Most Frequent Minimizers: If a fraction, ignore the top fraction of most frequent minimizers. If an integer, ignore minimizers occurring more than that number of times.
Lower bound of K-Mer occurrences: Prevents excessively small numbers of ignored minimizers.
Upper bound of K-Mer occurrences. Prevents excessively big numbers of ignored minimizers.
Ignore Most Frequent Minimizers: Discard a query minimizer if its occurrence is higher than that fraction of query minimizers and the reference occurrence threshold. Set 0 to disable.
Basepairs to Sample Minimizer: Number of basepairs before sampling a high-frequency minimizer.

Configuration2

Chaining Options: Chaining is the next step after seeding. This step consists of elongating the seed anchored before.
Basepairs to Stop Chain Elongation: Number of basepairs to look for minimizers after stopping chain elongation.
Chain Gap Scale: Scale of gap cost during chaining.
Bandwidth for chaining: Bandwidth used for initial chaining and alignment extension.
Minimum Number of Minimizers in a Chain: Discard chains consisting of less than this number of minimizers.
Minimum Chaining Score: Minimum chaining score not to discard a chain. Chaining score equals the approximate number of matching bases minus a concave gap penalty. It is computed with dynamic programming.
Chain Gap Scale: Scale of gap cost during chaining.
All-vs-All Overlapping: Primarily used for all-vs-all read overlapping.
Secondary Alignment Options:
Retain Secondary Alignments: Check this option to get secondary alignments. Secondary chains are segments mapped in the same place as another chain with better quality.
Retain All Chains: Retain all chains and don’t attempt to set primary chains.
Fraction to Mark a Chain as Secondary: Mark as secondary a chain that overlaps with a better chain by this fraction or more of the shorter chain.
Minimal Secondary-to-Primary Score Ratio: Between two chains overlapping the fraction set in the previous parameter, the chain with a lower score is secondary to the chain with a higher score. If the ratio of the scores is below this fraction, the secondary chain will not be outputted.
Number of Secondary Alignments: Maximum number of secondary alignments.
Splice Options:
Splice Mode: Enable the splice alignment mode. In this mode, the chaining gap cost distinguishes insertions to and deletions from the reference.
Maximum Gap on the Reference: Maximum number of nucleotides in a insertion or deletion.
Where to Find Splice Sites:
- Both Strands: find splice sites in both strands, not only in the strand that has been transcribed.
- Transcript Strands: find splice sites only in the transcript strand.

Configuration 3

Alignment Options:
Matching Score: Score of a nucleotide in the query matching a nucleotide in the reference.
Mismatch Penalty: Penalty when a nucleotide in the query and a nucleotide in the reference do not match.
First Gap Open Penalty: Penalty when a first gap is opened in the alignment.
Subsequent Gap Open Penalty: Penalty for the next gaps after the first one is opened in the alignment.
First Gap Extension Penalty: Penalty for each nucleotide that the first gap is extended.
Subsequent Gap Extension Penalty: Penalty for each nucleotide that the next gaps are extended.
Non-Canonical Splice Site Penalty: Cost for a non-canonical GT-AG splicing.
End Bonus Score: Score bonus when alignment extends to the end of the query sequence.
Ambiguous Base Penalty: Penalty of a mismatch involving ambiguous bases.
Use Splice Flank: Assume the next base to a GT donor site tends to be A/G (91% in human and 92% in mouse) and the preceding base to a AG acceptor tends to be C/T. This trend is evolutionarily conservative, all the way to S. cerevisiae. Specifying this option generally leads to higher junction accuracy by several percents, so it is applied by default with the splice mode. However, the SIRV control does not honor this trend (only ~60%). This option reduces accuracy. If you are benchmarking minimap2 on SIRV data, please do not use this option.
Splice Junctions Information:
Use Junction File: Check this option if you want to use a Junction File.
Junction Bonus: Bonus Score when a Junction is present in the Junction File.
Junction Bed File: You can introduce a BED, BAM or GTF file as Junction File. The easiest option, which is also efficient, is to map short reads using STAR and then introduce that BAM as input here.

Output

BAM file: path to save BAM files with aligned long reads. Each fastq file with generate one BAM file.
Output (BAM) Options:
Make CIGAR: Generate CIGAR in BAM file.
Make Long CIGAR: Write CIGAR with >65535 operators at the CG tag. Older tools are unable to convert alignments with >65535 CIGAR ops to BAM. This option makes minimap2 SAM compatible with older tools. Newer tools recognizes this tag and reconstruct the real CIGAR in memory.
Add MD Tag to BAM: Check this option to add the MD tag, important to do Variant Calling without the reference genome.
Add CS Tag to BAM: Check this option to add the CS tag. This tag is similart to the MD tag as it shows nucleotidic differences between the reference and the read.

Results

The main outputs are the BAM files. A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not.

In addition, a report and two charts are generated with complementary information. The report (Figure 6) shows a summary of the DNA-Seq Alignment results. This page contains information about the reference genome sequences, the input FASTQ files, and a results overview. The last section is divided into several subsections: globals, paired information, ACTG content, coverage, mapping quality, insert size, mismatches, and indels.

The bar charts (Figures 7 and 8) show the number of mapped and unmapped reads of each input file.