ABySS
Introduction
ABySS 2.0 is a multistage de novo assembly pipeline consisting of unitig, contig, and scaffold stages.
- At the unitig stage, the program performs the initial assembly of sequences according to the De Bruijn graph assembly algorithm. The unitig stage loads the full set of k-mers from the input sequencing reads into a hash table and stores auxiliary data for each k-mer such as the number of k-mer occurrences in the reads and the presence/absence of possible neighbor k-mers in the De Bruijn graph.
- At the contig stage, the paired-end reads are aligned to the unitigs and the pairing information is used to orient and merge overlapping unitigs.
- At the scaffold stage, the mate-pair reads are aligned to the contigs to orient and join them into scaffolds, inserting runs of "N" characters at gaps in coverage and for unresolved repeats.
The main innovation of ABySS 2.0 is a Bloom filter-based implementation of the unitig assembly stage. It reduces the overall memory requirements, enabling assembly of large genomes. A Bloom filter is a compact data structure for representing a set of elements that supports operations of inserting elements and querying the presence of elements. The Bloom filter data structure consists of a bit vector and one or more hash functions, where the hash functions map each k-mer to a corresponding set of positions within the bit vector (bit signature for the k-mer).
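The bit vector and hash functions described above can be illustrated with a minimal sketch. The class name `BloomFilter`, its parameters, and the salted SHA-256 hashing are assumptions chosen for readability; they are not how ABySS implements its Bloom filter.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit vector plus num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # One bit position per hash function, obtained by salting the k-mer
        # with the hash index (illustrative choice, not ABySS's hashing).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        # Set every bit in the k-mer's bit signature.
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # Present only if all bits of the signature are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("ACGTA")
print("ACGTA" in bf)  # True: an inserted k-mer is always reported as present
```

A query for a k-mer that was never inserted usually returns False, but it can return True if all of its bit positions happen to have been set by other k-mers.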
During unitig assembly, two passes are made through the input sequencing reads:
- In the first pass, k-mers are extracted from the reads and are loaded into a Bloom filter. The program discards all k-mers with an occurrence count below a user-specified threshold (typically in the range of two to four). In this way, k-mers caused by sequencing errors are filtered out. The retained k-mers are known as solid k-mers.
- In the second pass, the program identifies reads that consist entirely of solid k-mers and extends them left and right within the De Bruijn graph to create unitigs.
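The two passes can be sketched roughly as follows. This is a simplified illustration: it counts k-mers exactly with a `Counter` instead of a Bloom filter, ignores reverse complements, and stops before the graph extension step. The function names are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k contained in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def solid_kmers(reads, k, min_count):
    """Pass 1: count k-mers and keep those seen at least min_count times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n >= min_count}


def solid_reads(reads, k, solid):
    """Pass 2 (selection only): reads made up entirely of solid k-mers."""
    return [r for r in reads
            if len(r) >= k and all(km in solid for km in kmers(r, k))]


reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
solid = solid_kmers(reads, k=4, min_count=2)
seeds = solid_reads(reads, k=4, solid=solid)
print(sorted(solid))  # ['ACGT', 'CGTA', 'GTAC']
print(seeds)          # ['ACGTAC', 'ACGTAC']
```

In this toy input, the third read contains the k-mers CGTT and GTTC, which occur only once, so it is not used as a seed for extension.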
Please cite ABySS 2.0 as:
Run ABySS Assembly
This functionality can be found under Genome Analysis → DNA-Seq de novo Assembly → ABySS. The wizard lets you select the input files and adjust the analysis parameters (Figure 1, Figure 2, and Figure 3).
Input
- Input Reads: First, choose the type of sequencing data. Then, select the files of this type of data for the assembly. Both paired-end and single-end short reads can be provided, and both types of data can be combined in the same run.
- Additional Data: ABySS supports additional data types as supplementary information:
- Additional Paired-end Libraries: Paired-end libraries that will be used only for merging unitigs into contigs and will not contribute toward the consensus sequence.
- Mate-pair Libraries: Mate-pair libraries that will be used for scaffolding. Mate-pair libraries do not contribute toward the consensus sequence.
- Linked Reads: Linked reads from 10x Genomics Chromium. The linked reads are used to correct assembly errors and for scaffolding.
- Long Sequences Libraries: Provide long sequence libraries (such as RNA-Seq contigs) that will be used for rescaffolding. Long sequence libraries do not contribute toward the consensus sequence.
- Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched for in the filenames, right before the extension. The beginning of the filenames should be the same for both files of each sample.
- Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.
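The matching rule can be illustrated with a short sketch. The function `pair_files` is hypothetical and only mimics the behaviour described above (pattern located right before the extension, shared filename prefix); it is not the code used by the wizard.

```python
from pathlib import Path


def pair_files(filenames, upstream="_1", downstream="_2"):
    """Group files into (upstream, downstream) pairs by shared name prefix."""
    ups, downs = {}, {}
    for name in filenames:
        stem = Path(name).stem  # filename without its last extension
        if stem.endswith(upstream):
            ups[stem[:-len(upstream)]] = name
        elif stem.endswith(downstream):
            downs[stem[:-len(downstream)]] = name
    # Keep only samples for which both mates were found.
    return [(ups[s], downs[s]) for s in sorted(ups) if s in downs]


files = ["SRR037717_2.fastq", "SRR037717_1.fastq", "orphan_1.fastq"]
print(pair_files(files))  # [('SRR037717_1.fastq', 'SRR037717_2.fastq')]
```

Note that this sketch strips only the last extension, so a double extension such as .fastq.gz would need extra handling.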
Configuration
- K-mer Size: The term k-mer refers to all possible subsequences of the given length that are contained in a read. In sequence assembly, k-mers are used during the construction of De Bruijn graphs. The choice of the k-mer size affects the assembly in several ways, so it is advisable to try different values and compare the results to choose the best one. It is recommended to use odd values of at least half the length of the reads.
- Use paired De Bruijn graph: Assembly will be performed using a paired De Bruijn graph. In this mode, k-mer pairs are used, which consist of two equal-sized k-mers separated by a fixed distance. To assemble using the paired De Bruijn graph mode, specify the k-mer pair span (distance between k-mers).
- K-mer Pair Span: Set the span of a k-mer pair (distance between k-mers).
- Minimum Alignment Length: Establish the minimum alignment length of a read (bp). This means that there must be a perfect match of the established length between each read and its target contig.
- Hash Functions: Set the number of Bloom filter hash functions. K-mers from each input sequencing read are loaded into the Bloom filter by computing the hash values of each k-mer sequence and setting the corresponding bits.
- K-mer Count Threshold: Set the k-mer count threshold for Bloom filter assembly. Optimal values are typically in the range of 2-4. K-mers with an occurrence count below the threshold will be discarded.
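To make the k-mer pair idea more concrete, the sketch below extracts k-mer pairs from a single read. It assumes that the span is measured from the first base of the left k-mer to the last base of the right k-mer; the function name `kmer_pairs` and this interpretation of the span are illustrative assumptions, not a description of ABySS internals.

```python
def kmer_pairs(read, k, span):
    """Yield (left, right) k-mer pairs whose total extent is `span` bases."""
    if span < 2 * k:
        raise ValueError("span must be at least 2 * k")
    for i in range(len(read) - span + 1):
        window = read[i:i + span]
        # The pair is the first and the last k bases of the window;
        # the bases in between are not part of the pair.
        yield window[:k], window[-k:]


pairs = list(kmer_pairs("ACGTACGTAC", k=3, span=8))
print(pairs)  # [('ACG', 'CGT'), ('CGT', 'GTA'), ('GTA', 'TAC')]
```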
Output
- Unitigs Fasta: Where to store the FASTA file containing the assembled unitigs.
- Contigs Fasta: Where to store the FASTA file containing the assembled contigs.
- Scaffolds Fasta: Where to store the FASTA file containing the assembled scaffolds.
- Long Scaffolds Fasta: Where to store the FASTA file containing the assembled long scaffolds.
Note that this file is only generated if long sequence data were provided.
- Save Graph Files: Save the final repeat graphs in a .dot file. Graph files are generated for contigs and scaffolds.
- Graph Files: Select a folder to store the graph files (.dot).
Results
ABySS returns the assembled sequences in three FASTA files (four if long sequence libraries were provided). Each one corresponds to a different stage of the assembly procedure:
- Unitigs: Contains sequences assembled without using paired-end information. If you provide only single-end data, this will be the only result file, since pairing information is required to assemble contigs.
- Contigs: Contains sequences assembled with paired information, scaffolding over gaps in sequencing coverage but not over repeats.
- Scaffolds: Contains sequences assembled with paired information, scaffolding over sequencing coverage gaps and repeats.
- Long Scaffolds: Contains sequences that were obtained by rescaffolding using long sequence libraries.
Unitigs/contigs/scaffolds names in ABySS output FASTA files have the following format:
4 678 16718
- 4 is the name of the sequence.
- 678 is the sequence length in nucleotides.
- 16718 is the number of k-mers that mapped to the sequence during assembly.
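A header in this format can be split into its three fields with a few lines of code. The names `AbyssHeader` and `parse_header` are hypothetical helpers introduced only for this example.

```python
from typing import NamedTuple


class AbyssHeader(NamedTuple):
    name: str
    length: int      # sequence length in nucleotides
    kmer_count: int  # number of k-mers mapped to the sequence


def parse_header(line):
    """Parse a header such as '>4 678 16718' (leading '>' is optional)."""
    name, length, kmer_count = line.strip().lstrip(">").split()[:3]
    return AbyssHeader(name, int(length), int(kmer_count))


header = parse_header(">4 678 16718")
print(header.name, header.length, header.kmer_count)  # 4 678 16718
```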
If the "Save Graph Files" option is checked, ABySS returns the sequence overlap graphs in Graphviz DOT format. The Graphviz DOT syntax is well defined and implemented by a number of existing graph tools.
For further information about how ABySS represents a sequence overlap graph, please visit the ABySS File Formats page.
In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 4). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly:
- N50: This statistic describes the assembly quality in terms of contiguity. N50 is calculated by first ordering every unitig, contig or scaffold from longest to shortest. Next, starting from the longest sequence, the sequence lengths are summed until the running sum reaches or exceeds one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the last (shortest) sequence added to this sum. Higher values of N50 generally indicate a more contiguous assembly. Note that any Nx statistic is calculated in the same way, e.g. N75 is calculated by summing the lengths until the sum reaches 75% of the total length.
- L50: Defined as the smallest number of sequences whose combined length makes up at least half of the total assembly length.
- Bloom filter False Positive Rate (FPR): The Bloom filter can generate false positives when the bit signatures of different k-mers overlap by chance. This means that a certain fraction of k-mer queries will return true even though the k-mers do not exist in the input sequencing data. It is recommended to target a Bloom filter false positive rate (FPR) smaller than 5%. Parameters such as the k-mer size, the number of hash functions or the k-mer count threshold can influence the false positive rate.
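The statistics above can be reproduced with a short sketch. The function `nx` follows the N50/L50 definitions given in the list. The function `bloom_fpr` uses the standard textbook approximation for a Bloom filter with m bits, h hash functions and n inserted elements, (1 - e^(-hn/m))^h; it is an assumption for illustration and not necessarily how the reported FPR is computed. Both function names are hypothetical.

```python
import math


def nx(lengths, x=50):
    """Return (Nx, Lx) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            # length is the Nx value, count is the matching Lx value
            return length, count
    raise ValueError("lengths must not be empty")


def bloom_fpr(num_bits, num_hashes, num_items):
    """Approximate false positive rate of a Bloom filter."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


lengths = [80, 70, 50, 40, 30, 20]  # total length: 290
print(nx(lengths))        # (70, 2): 80 + 70 = 150 >= 145
print(nx(lengths, x=75))  # (40, 4): 80 + 70 + 50 + 40 = 240 >= 217.5
```

Calling `nx` for every x from 0 to 100 yields the values needed to draw an Nx curve for a set of sequences.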
The Nx plot (Figure 5) shows Nx values as x varies from 0 to 100%. The Nx values are displayed for unitigs, contigs and scaffolds.