DNA-Seq Bowtie 2

Introduction

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small. This algorithm has some interesting features:

Bowtie 2 supports gapped alignment with affine gap penalties.
Bowtie 2 supports local alignment, which doesn't require reads to align end-to-end. Local alignments might be "trimmed" ("soft clipped") at one or both ends in a way that optimizes the alignment score.
Bowtie 2 allows alignments to overlap ambiguous characters (e.g. Ns) in the reference.
Bowtie 2's paired-end alignment is flexible: for pairs that do not align in a paired fashion, Bowtie 2 attempts to find unpaired alignments for each mate.

Please cite Bowtie 2 as:

Langmead B. and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-9.

Run DNA-Seq Alignment (Bowtie 2)

This functionality can be found under Genome Analysis → DNA-Seq Alignment → Bowtie 2. The wizard lets you select the input files and adjust the analysis parameters (Figure 1, Figure 2, Figure 3, Figure 4, and Figure 5).

Input

Input Reads: Select the files containing sequencing reads. These files are assumed to be in FASTQ/FASTA format. Both single-end and paired-end data are accepted.
Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The patterns are searched for in the filenames, immediately before the extension. The beginning of the filenames should be the same for both files of each sample.
Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.

For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.
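
The pairing rule described above can be illustrated with a minimal Python sketch. It is not the wizard's actual implementation: the function name `pair_reads` is hypothetical, and the sketch assumes a single, uncompressed extension (such as `.fastq`) and non-empty patterns.

```python
from pathlib import Path


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group read files into (upstream, downstream) pairs by sample name.

    Assumption: the pattern sits right before a single extension,
    e.g. SRR037717_1.fastq (a name like x_1.fastq.gz is not handled).
    """
    samples = {}
    for name in filenames:
        stem = Path(name).stem  # filename without its last extension
        if stem.endswith(upstream):
            samples.setdefault(stem[: -len(upstream)], {})["up"] = name
        elif stem.endswith(downstream):
            samples.setdefault(stem[: -len(downstream)], {})["down"] = name
    # Keep only the samples for which both mates were found.
    return {
        sample: (mates["up"], mates["down"])
        for sample, mates in samples.items()
        if "up" in mates and "down" in mates
    }


files = ["SRR037717_1.fastq", "SRR037717_2.fastq", "SRR037718_1.fastq"]
print(pair_reads(files))
```

In this sketch, SRR037718 is dropped because no downstream file with the same beginning exists.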

Reference Genome: Specify a FASTA file with the genome reference sequences. Multiple reference sequences (e.g. chromosomes or scaffolds) are allowed.

Providing masked genome sequences is not recommended: reads that originate in masked repeats would be forced to map (falsely) somewhere else in the genome.

General Options

Parameter Preset: Bowtie 2 comes with some useful combinations of parameters packaged into shorter ‘preset’ parameters. The preset options are designed to cover a wide area of the speed/sensitivity/accuracy trade-off space, with the presets ending in ‘fast’ generally being faster but less sensitive and less accurate, and the presets ending in ‘sensitive’ generally being slower but more sensitive and more accurate.

Selecting a preset will overwrite most of the parameters. These parameters can be readjusted, even if a preset has been selected.

Alignment:
End to End: By default, Bowtie 2 performs end-to-end read alignments. It searches for alignments involving all of the read characters. It is also known as ‘untrimmed’ or ‘unclipped’ alignment.
Local: In this mode, Bowtie 2 might ‘trim’ or ‘clip’ read characters from one or both ends of the alignment if doing so maximizes the alignment score.
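
For readers who also use Bowtie 2 on the command line, the sketch below shows how a preset and an alignment mode might translate into a `bowtie2` call. It is an illustration only: how the wizard invokes Bowtie 2 internally is an assumption, the helper names and file paths are hypothetical, and the flags used (`--end-to-end`, `--local`, the preset flags, `-x`, `-U`, `-S`) are those of the Bowtie 2 command-line tool.

```python
PRESETS = ("very-fast", "fast", "sensitive", "very-sensitive")


def preset_flag(preset, mode):
    """Return the command-line preset flag for a preset name and alignment mode."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    if mode == "end-to-end":
        return f"--{preset}"
    if mode == "local":
        # On the command line, local-mode presets carry a '-local' suffix.
        return f"--{preset}-local"
    raise ValueError(f"unknown alignment mode: {mode}")


def build_command(index, reads, output, preset="sensitive", mode="end-to-end"):
    """Assemble an argument list for a single-end run (hypothetical paths)."""
    return [
        "bowtie2", f"--{mode}", preset_flag(preset, mode),
        "-x", index,   # basename of the genome index
        "-U", reads,   # unpaired reads
        "-S", output,  # SAM output file
    ]


print(" ".join(build_command("genome_index", "sample.fastq", "sample.sam",
                             preset="very-sensitive", mode="local")))
```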

Alignment Options

Max # Mismatches: Sets the number of mismatches allowed in a seed alignment during multiseed alignment. It can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity.
Length of Seed Substrings: Sets the length of the seed substrings to align during multiseed alignment. Smaller values make alignment slower but more sensitive.
Interval Between Seed Substrings: Sets a function governing the interval between seed substrings to use during multiseed alignment. Since it's best to use longer intervals for longer reads, this parameter sets the interval as a function of the read length, rather than a single one-size-fits-all number. For instance, specifying "S,1,1.15" sets the interval function f to f(x) = 1 + 1.15 * sqrt(x), where x is the read length.

To rapidly narrow the number of possible alignments that must be considered, Bowtie 2 begins by extracting substrings ("seeds") from the read and its reverse complement and aligning them in an ungapped fashion with the help of the FM Index. This is ‘multiseed alignment’.

Max # of non-A/C/G/T: Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. Reads exceeding this ceiling are filtered out. For instance, specifying "L,0,0.15" sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length.
Include DP: ‘Pads’ dynamic programming problems by this number of columns on either side to allow gaps.
Disallow Gaps at Tips: Disallows gaps within this number of positions at the beginning or end of the read.
Ignore Qualities: When calculating a mismatch penalty, always consider the quality value at the mismatched position to be the highest possible, regardless of the actual value.
Do not Align Forward: If specified, Bowtie 2 will not attempt to align reads to the forward (Watson) reference strand.
Do not Align Reverse-Complement: If specified, Bowtie 2 will not attempt to align reads against the reverse-complement (Crick) reference strand.
Do not Allow 1 Upfront Mismatch: By default, Bowtie 2 will attempt to find either an exact or a 1-mismatch end-to-end alignment for the read before trying the multiseed heuristic. This option prevents Bowtie 2 from searching for 1-mismatch end-to-end alignments before using the multiseed heuristic.
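
The function specifications used by Interval Between Seed Substrings and Max # of non-A/C/G/T can be evaluated with a few lines of Python. This is a minimal sketch that handles only the two function types shown in this section (S for square root, L for linear); the name `parse_function` is hypothetical.

```python
import math

# Only the two function types that appear in this section are handled here.
TRANSFORMS = {
    "L": lambda x: x,  # linear:      f(x) = constant + coefficient * x
    "S": math.sqrt,    # square root: f(x) = constant + coefficient * sqrt(x)
}


def parse_function(spec):
    """Turn a specification such as 'S,1,1.15' into a function of the read length."""
    kind, constant, coefficient = (part.strip() for part in spec.split(","))
    transform = TRANSFORMS[kind]
    b, a = float(constant), float(coefficient)
    return lambda read_length: b + a * transform(read_length)


interval = parse_function("S,1,1.15")   # interval between seed substrings
n_ceiling = parse_function("L,0,0.15")  # maximum number of ambiguous characters

for read_length in (50, 100, 250):
    print(read_length,
          round(interval(read_length), 2),
          round(n_ceiling(read_length), 2))
```

For a 100-bp read, the interval function evaluates to about 12.5 and the N-ceiling to about 15, which shows how both values grow with the read length.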

Scoring Options

Match Bonus: Sets the match bonus. In local mode, this value is added to the alignment score for each position where a read aligns to a reference character and the characters match. This parameter is not used in the ‘End to End’ mode.
Max/Min Penalty: Sets the maximum and minimum mismatch penalties, both integers.
Penalty for non-A/C/G/Ts in Ref: Sets the penalty for positions where the read, reference, or both, contain an ambiguous character such as N.
Read Gap Open: Sets the read gap open (first value) and extend (second value) penalties.
Reference Gap Open: Sets the reference gap open (first value) and extend (second value) penalties.
Min Acceptable Align. Score: Sets a function governing the minimum alignment score needed for an alignment to be considered valid. This is a function of read length. For instance, specifying "L,0,-0.6" sets the minimum-score function f to f(x) = 0 + (-0.6) * x, where x is the read length.
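
To see how the scoring options interact, the following sketch scores a hypothetical end-to-end alignment and compares it with the minimum acceptable score. The numeric defaults (maximum/minimum mismatch penalty 6/2, gap open/extend 5/3) and the quality-aware mismatch formula are assumptions taken from the Bowtie 2 command-line documentation; the values preselected in the wizard may differ.

```python
import math


def mismatch_penalty(quality, max_penalty=6, min_penalty=2, ignore_qualities=False):
    """Penalty for one mismatch at a position with the given Phred quality."""
    if ignore_qualities:
        # 'Ignore Qualities': treat every mismatch as if it had the highest quality.
        return max_penalty
    scaled = (max_penalty - min_penalty) * (min(quality, 40.0) / 40.0)
    return min_penalty + math.floor(scaled)


def gap_penalty(length, open_penalty=5, extend_penalty=3):
    """Penalty for one gap of the given length: open + length * extend."""
    return open_penalty + length * extend_penalty


def min_score(read_length, constant=0.0, coefficient=-0.6):
    """Linear minimum-score function, i.e. the specification L,0,-0.6."""
    return constant + coefficient * read_length


# Hypothetical 50-bp read aligned end-to-end (no match bonus in this mode):
# one mismatch at quality 40 and one read gap of length 2.
read_length = 50
score = -(mismatch_penalty(40) + gap_penalty(2))  # -(6 + 11) = -17
print(score, score >= min_score(read_length))     # prints: -17 True
```

With these assumed values the alignment scores -17, which is above the threshold of -30 for a 50-bp read, so it would be considered valid.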

Reporting Options

Reporting: Bowtie 2 searches for distinct, valid alignments for each read. The way in which they are reported can be adjusted:
Default: Only the best alignment found for each read (the one with the highest alignment score) is reported.
Report up to X: Report up to X alignments per read. For reads that have more than X distinct, valid alignments, Bowtie 2 does not guarantee that the X alignments reported are the best possible in terms of alignment score.
Report all alignments: There is no upper limit on the number of alignments to search for. This mode can be very slow in repetitive genomes.
Report up to: When the 'Report up to X' mode is used, Bowtie 2 behaves differently. It searches for at most X distinct, valid alignments for each read. The search terminates when it can't find more distinct valid alignments, or when it finds X, whichever happens first. All alignments found are reported in descending order by alignment score. Each reported alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.

Effort Options

Give Up Extending After: Set a value up to which consecutive seed extension attempts can 'fail' before Bowtie 2 moves on, using the alignments found so far. A seed extension 'fails' if it does not yield a new best or a new second-best alignment.
Try Sets of Seeds: Set the maximum number of times Bowtie 2 will 're-seed' reads with repetitive seeds. When 're-seeding', Bowtie 2 simply chooses a new set of seeds (same length, same number of mismatches allowed) at different offsets and searches for more alignments. A read is considered to have repetitive seeds if the total number of seed hits divided by the number of seeds that aligned at least once is greater than 300.
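
The repetitive-seed criterion can be written down directly. In this sketch the function name and the input (a list with the number of hits for each seed of a read) are hypothetical; only the ratio and the threshold of 300 come from the description above.

```python
def has_repetitive_seeds(hits_per_seed, threshold=300):
    """Return True if a read's seeds count as repetitive.

    hits_per_seed: number of hits found for each seed extracted from the read.
    """
    total_hits = sum(hits_per_seed)
    aligned_seeds = sum(1 for hits in hits_per_seed if hits > 0)
    if aligned_seeds == 0:
        return False  # no seed aligned, so the ratio is undefined
    return total_hits / aligned_seeds > threshold


print(has_repetitive_seeds([1000, 0, 200]))  # 1200 hits / 2 aligned seeds = 600
print(has_repetitive_seeds([10, 5, 0]))      # 15 hits / 2 aligned seeds = 7.5
```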

Paired-end Specific Options

Minimum Fragment Length: The minimum fragment length for valid paired-end alignments. For example, if 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (20 + 20 + 20 = 60). A 19-bp gap would not be valid in that case.
Maximum Fragment Length: The maximum fragment length for valid paired-end alignments. For example, if 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (20 + 60 + 20 = 100). A 61-bp gap would not be valid in that case.
Read Order: The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand.
Forward / Reverse: If there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2, that alignment is valid. If mate 2 appears upstream of the reverse complement of mate 1, that alignment is valid too.
Reverse / Forward: This mode requires that an upstream mate 1 be reverse-complemented and a downstream mate 2 be forward-oriented.
Forward / Forward: This mode requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented.
No Mixed: When Bowtie 2 cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that default behavior.
No Discordant: Bowtie 2 looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (Min and Max Fragment Length, and Read Order). This option disables that default behavior.
Dovetail: If one mate alignment extends past the beginning of the other such that the wrong mate begins upstream, consider that to be concordant.
No Contain: If one mate alignment contains the other, consider that to be non-concordant. By default, a mate can contain the other in a concordant alignment.
No Overlap: If one mate alignment overlaps the other at all, consider that to be non-concordant. By default, mates can overlap in a concordant alignment.
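
The fragment length examples given for Minimum and Maximum Fragment Length can be checked with a short sketch. Mates are represented here as hypothetical (start, end) pairs of 1-based, inclusive reference coordinates; the default limits of 0 and 500 are assumptions based on the Bowtie 2 command-line defaults, and mate orientation is deliberately left out.

```python
def fragment_length(mate1, mate2):
    """Number of bases from the leftmost to the rightmost aligned base."""
    start = min(mate1[0], mate2[0])
    end = max(mate1[1], mate2[1])
    return end - start + 1


def within_fragment_limits(mate1, mate2, min_length=0, max_length=500):
    """Check only the length constraint of a paired-end alignment."""
    return min_length <= fragment_length(mate1, mate2) <= max_length


mate1 = (1, 20)  # 20-bp alignment
# 20-bp gap, then a 20-bp mate: fragment of 60 bp.
print(within_fragment_limits(mate1, (41, 60), min_length=60))
# 19-bp gap: fragment of 59 bp, too short for a minimum of 60.
print(within_fragment_limits(mate1, (40, 59), min_length=60))
# 61-bp gap: fragment of 101 bp, too long for a maximum of 100.
print(within_fragment_limits(mate1, (82, 101), max_length=100))
```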

Output Options

Add Read Group Information: Include the ‘Read Group’ header (@RG) in output BAM files. This information may be required for downstream analysis with third-party tools. If this option is checked, the following read group tags will be included for each sample:
Identifier (ID), automatically generated.
The name of the sample (SM), inferred from file names.
Sequencing Platform (PL), provided by the user.
Sequencing Platform: Choose the sequencing platform that was used to obtain the input data. Note that if this option is provided, all output BAM files will be tagged with the same platform.
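
As an illustration of what the resulting header line looks like, the sketch below assembles an @RG line from the three tags listed above. The identifier and sample name are made-up values, and the way the wizard generates the identifier is not shown here.

```python
def read_group_header(identifier, sample, platform):
    """Build a TAB-delimited @RG header line with the ID, SM and PL tags."""
    return "\t".join(["@RG", f"ID:{identifier}", f"SM:{sample}", f"PL:{platform}"])


# Hypothetical values: the ID is generated automatically by the wizard,
# and the sample name is inferred from the input file names.
print(read_group_header("rg1", "SRR037717", "ILLUMINA"))
```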

Output

Alignment Files: Select a destination folder to save output BAM files.

Results

The main outputs are the BAM files. A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as the mapping position, and a variable number of optional fields for flexible or aligner-specific information.

SAM Format Description

QNAME: Query template (read) name. In a SAM file, a read may occupy multiple alignment lines when its alignment is chimeric or when multiple mappings are given.
FLAG: SAM flags summarize many properties of reads, represented by flag bits, into a single number. The FLAG value is the sum of the bits that are set:
1: Read is paired.
2: Read is mapped in a proper pair.
4: Read is unmapped.
8: Mate is unmapped.
16: Read reverse strand.
32: Mate reverse strand.
64: Read is the first in the pair.
128: Read is the second in the pair.
256: Alignment isn't primary.
512: Read fails platform/vendor quality checks.
1024: Read is PCR or optical duplicate.
RNAME: Reference sequence name. If @SQ header lines are present, RNAME must be present in one of the SQ-SN tags.
POS: 1-based leftmost mapping position of the first CIGAR operation. The first base in a reference sequence has coordinate 1.
MAPQ: Mapping quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
CIGAR: A string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases to which the operator applies. Operators are:
M: Alignment match (can be a sequence match or mismatch).
I: Insertion to the reference.
D: Deletion from the reference.
N: Skipped region from the reference.
S: Soft clipping.
H: Hard clipping.
P: Padding (silent deletion from padded reference).
=: Sequence match.
X: Sequence mismatch.
RNEXT: Reference sequence name of the primary alignment of the next read in the template.
PNEXT: 1-based position of the primary alignment of the next read in the template.
TLEN: Signed observed template length. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base.
SEQ: Segment sequence.
QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format).

In addition to these 11 obligatory fields, optional fields may be included. All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string.
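
The field layout described above can be explored with a small parser. This is a minimal sketch for illustration, not a replacement for a dedicated SAM/BAM library: the example line is made up, only integer-typed optional fields are converted, and the helper names are hypothetical.

```python
import re

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line):
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = {}
    for field in columns[11:]:
        tag, value_type, value = field.split(":", 2)  # TAG:TYPE:VALUE
        record["TAGS"][tag] = int(value) if value_type == "i" else value
    return record


def parse_cigar(cigar):
    """Return the CIGAR components as (number of bases, operator) pairs."""
    return [(int(count), op) for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def mapping_error_probability(mapq):
    """Invert MAPQ = -10 * log10(Pr{mapping position is wrong})."""
    return 10 ** (-mapq / 10)


# Made-up alignment line: 5 soft-clipped bases followed by 45 aligned bases.
line = "\t".join(["read1", "0", "chr1", "100", "20", "5S45M", "*", "0", "0",
                  "A" * 50, "I" * 50, "AS:i:-5", "YT:Z:UU"])
record = parse_alignment(line)
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
print(record["TAGS"], mapping_error_probability(record["MAPQ"]))
```

A mapping quality of 20 corresponds to an estimated 1 in 100 chance that the mapping position is wrong.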

For more information about the SAM format, visit the SAM Format Specification Page.

You can check the meaning of a FLAG number using the SAM Flag Translator.
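
A FLAG number can also be decoded programmatically. The sketch below uses the bit values defined in the SAM specification for the properties listed in the FLAG description; the function name `decode_flag` is hypothetical.

```python
# Bit values as defined in the SAM specification.
FLAG_BITS = {
    1: "read is paired",
    2: "read is mapped in a proper pair",
    4: "read is unmapped",
    8: "mate is unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "read is the first in the pair",
    128: "read is the second in the pair",
    256: "alignment is not primary",
    512: "read fails platform/vendor quality checks",
    1024: "read is PCR or optical duplicate",
}


def decode_flag(flag):
    """List the properties whose bit is set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
# 256 marks the additional alignments reported in 'Report up to X' mode.
print(decode_flag(256))
```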

In addition, a report and two charts are generated with complementary information. The report (Figure 6) shows a summary of the DNA-Seq Alignment results. This page contains information about the reference genome sequences, the input FASTQ files, and a results overview. The last section is divided into several subsections: globals, paired information, ACTG content, coverage, mapping quality, insert size, mismatches, and indels.

The bar charts (Figure 7) show the number of mapped and unmapped reads of each input file.

**Figure 7:** Alignments per Category Charts

Finally, the Genome Browser allows you to visualize genomic coordinates (GFF/GTF) in a side-scrolling way. Several tracks can be added to the browser; the currently supported tracks are VCF, DNA Fasta, and BAM. The BAM track (Figure 8) shows the reads of a BAM file and, if the sequence track is active, also highlights the differences between the read sequence and the sequence track.