DNA-Seq BWA
Introduction
The Burrows-Wheeler Alignment algorithm (BWA) is a read alignment package that is based on a backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. The BWA-MEM algorithm is used, which performs local alignment. It may produce multiple primary alignments for different parts of a query sequence.
Please cite BWA as:
Run DNA/RNA-Seq Alignment (BWA)
This functionality can be found under Genome Analysis → DNA-Seq Alignment → BWA. The wizard allows to select input files and adjust analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).
Input
- Input Reads: Select the files containing sequencing reads. These files are assumed to be in FASTQ/FASTA format. Both, single and paired-end data are accepted.
-
Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.
-
Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.
- Reference Genome: Specify a FASTA file with the genome reference sequences. Multiple reference sequences (e.g. chromosomes or scaffolds) are allowed.
It is not recommended to provide masked genome sequences since the algorithm will force those reads that originate in repeats to map (falsely) somewhere else in the genome.
Algorithm Options
- Minimum Seed Length: Matches shorter than this value will be missed. The alignment speed is usually intensive to this value unless it significantly deviates 20.
- Band Width: Essentially, baps longer than this value will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option.
- Z-dropoff: Also known as Off-diagonal X-dropoff. Stop extension when the difference between the best and the current extension score is above |i-j|*A+Z-dropoff, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST’s X-dropoff except that it does not penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension but also reduces poor alignments inside a long good alignment.
- Trigger Re-seeding: Look for internal seeds inside a seed longer than {Minimum Seed Length} * Trigger Re-seeding. This is a key heuristic parameter for tuning the performance. Larger values yield fewer seeds, which leads to faster alignment speed but lower accuracy.
- Seed Occurrence: Seed occurrence for the 3rd round seeding.
- Skip Seeds: Discard a seed if it has more than this number of occurrences in the genome.
- Drop Chains: Drop chains shorter than this fraction of the longest overlapping chain.
- Discard Chains: Discard a chain if seeded bases shorter than this value.
- Mate Rescue Rounds: Perform at most this number of rounds of mate rescues for each read.
- Skip Mate Rescue: Skip the mate rescue procedure.
- Skip Pairing: In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair. The mate rescue is performed, unless the Skip Mate Rescue option is also in use.
Scoring Options
- Matching Score: Score for a sequence match.
- Mismatch Penalty: Penalty for a mismatch. The sequence error rate is approximately: {.75 * exp[-log(4) * Mismatch Penalty/Matching Score]}.
- Gap Open Penalty for Deletions: Gap open penalty for deletions.
- Gap Open Penalty for Insertions: Gap open penalty for insertions.
- Gap Extension Penalty for Deletions: Gap extension penalty for deletions. A gap of size k cost '{-O} + {-'Gap Extension Penalty'}*k'.
- Gap Extension Penalty for Insertions: Gap extension penalty for insertions. A gap of size k cost '{-O} + {-'Gap Extension Penalty'}*k'.
- 5'-end Clipping Penalty: Penalty for 5'-end clipping. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of the query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; the clipping penalty is not deducted.
- 3’-end Clipping Penalty: Penalty for 3'-end clipping.
- Unpaired Read Penalty: Penalty for an unpaired read pair.
Output Options
- Minimum Score: Minimum score to output.
- Mark Split Alignments as Primary: For split alignment, take the alignment with the smallest coordinate as primary.
- Not Modify mapQ of Supp. Alignments: Do not modify the mapping quality of supplementary alignments.
- Output All SE/Unpaired PE Alignments: Output all alignments for Single-End or unpaired Paired-End reads.
- Soft Clipping for Supplementary: Use soft clipping for supplementary alignments.
- Mark Shorter Split Hits as Secondary: Mark shorter split hits as secondary.
- Sort BAM File: Establish how output BAM files should be sorted.
-
Add Read Group Information: Include the ‘Read Group’ header (@RG) in output BAM files. This information may be required for downstream analysis of third-party tools. If this option is checked, the following read group tags will be included for each sample:
-
Identifier (ID), automatically generated.
- The name of the sample (SM), inferred from file names.
- Sequencing Platform (PL), provided by the user.
- Sequencing Platform: Choose the sequencing platform which was used to obtain the input data. Consider that if this option is provided, all output BAMs will be tagged with the same platform.
Output
- Alignment Files: Select a destination folder to save output BAM files.
Results
The main outputs are the BAM files. A BAM file (*.bam) is a compressed binary version (BGZF format) of a SAM file that is used to represent aligned sequences. SAM is a TAB-delimited text format consisting of a header section and an alignment section. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as the mapping position, and a variable number of optional fields for flexible or aligner-specific information.
- QNAME: Query template (read) name.In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
-
FLAG: SAM flags summarize many properties of reads, represented by flag bits, into a single number:
-
Read is paired.
- Read is mapped in a proper pair.
- Read is unmapped.
- Mate is unmapped.
- Read reverse strand.
- Mate reverse strand.
- Read is from the first pair.
- Read is from the second pair.
- Alignment isn't primary.
- Read fails platform/vendor quality checks.
- Read is PCR or optical duplicate.
- RNAME: Reference sequence name. If @SQ header lines are present, RNAME must be present in one of the SQ-SN tag.
- POS: 1-based leftmost mapping position of the first CIGAR operation. The first base in a reference sequence has coordinate 1.
- MAPQ: Mapping quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
-
CIGAR: A string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are:
-
M: Align match.
- I: Insertion to the reference.
- D: Deletion from the reference.
- N: Skipped region from the reference.
- S: Soft clipping.
- H: Hard clipping.
- P: Padding (silent deletion from padded reference).
- =: Sequence match
- X: Sequence mismatch
- RNEXT: Reference sequence name of the primary alignment of the next read in the template. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base.
- PNEXT: a 1-based position of the primary alignment of the next read in the template.
- TLEN: Signed observed template length.
- SEQ: Segment sequence.
- QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format).
In addition to these 11 obligatory fields, optional fields may be included. All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string.
For more information about the SAM format, visit the SAM Format Specification Page.
You can check the meaning of a FLAG number using the SAM Flag Translator.
In addition, a report and two charts are generated with complementary information. The report (Figure 5) shows a summary of the DNA-Seq Alignment results. This page contains information about the reference genome sequences, the input FASTQ files, and a results overview. The last section is divided into several subsections: globals, paired information, ACTG content, coverage, mapping quality, insert size, mismatches, and indels.
The bar charts (Figure 6) show the number of mapped and unmapped reads of each input file.
Finally, the Genome Browser allows you to visualize genomic coordinates (GFF/GTF) in a side-scrolling way. Several tracks can be added to the browser, the currently supported tracks are VCF, DNA Fasta, and BAM. The BAM track (Figure 7) shows the reads of a BAM file and if the sequence track is active, it will also highlight the differences between the read sequence and the sequence track.