SPAdes
Introduction
SPAdes is a de novo genome assembly pipeline that can deal with data coming from several sequencing technologies and supports hybrid and single-cell assemblies. The SPAdes assembly pipeline consists of four stages:
- Assembly graph construction: SPAdes uses the multisized de Bruijn graph, implements new bulge/tip removal algorithms, detects and removes chimeric reads, aggregates biread information into distance histograms, and allows the performed graph operations to be backtracked.
- k-bimer adjustment: SPAdes derives accurate distance estimates between k-mers in the genome using joint analysis of distance histograms and paths in the assembly graph.
- Paired assembly graph construction: This stage is inspired by the Paired de Bruijn graph (PDBG) approach.
- Contig construction: SPAdes constructs the DNA sequences of contigs and the mapping of reads to contigs by backtracking graph simplifications.
SPAdes uses a modification of Hammer for error correction and quality trimming prior to assembly.
In general, SPAdes uses two techniques for scaffolding.
- SPAdes tries to estimate the size of the gap separating contigs using read pairs.
- SPAdes uses the assembly graph to join contigs that are separated by a complex tandem repeat that cannot be resolved exactly, inserting a fixed gap of 100 bp.
Contigs produced by SPAdes do not contain N symbols.
Please cite SPAdes as:
- Nurk, Bankevich et al., 2013.
- Bankevich, Nurk et al., 2012.
- Antipov et al., 2015 (in case you perform hybrid assembly using PacBio or Nanopore reads).
- Prjibelski et al., 2014 (if you use multiple paired-end and/or mate-pair libraries).
- Vasilinetc et al., 2015 (if you use multiple paired-end and/or mate-pair libraries).
Run SPAdes Assembly
This functionality can be found under Genome Analysis → DNA-Seq de novo Assembly → SPAdes. The wizard lets you select input files and adjust analysis parameters (Figure 1, Figure 2, Figure 3, and Figure 4).
Input
- Input Reads: Select the files containing the sequencing libraries (reads). The assembly strategy requires at least one of these types of sequencing libraries:
  - Illumina single-end, paired-end, or high-quality mate-pairs.
  - IonTorrent single-end, paired-end, or high-quality mate-pairs.
  - PacBio CCS reads (should be provided as single-end data).
  These files are assumed to be in FASTQ format. For IonTorrent data, SPAdes supports unpaired reads in unmapped BAM format.
- IonTorrent Data: This option is required when assembling IonTorrent data. Illumina and IonTorrent libraries should not be assembled together. For IonTorrent data, SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by the Torrent Server).
- Single-cell Data: This option is required for Multiple Displacement Amplification (MDA) single-cell data assembly.
- Paired-end Configuration: If paired-end reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.
  - Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
  - Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
For example, if the upstream file is SRR037717_1.fastq and the downstream SRR037717_2.fastq, "_1" should be established as the upstream pattern and "_2" as the downstream pattern.
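The pattern matching described above can be sketched in Python. This is only an illustration of the rule (pattern located right before the extension, shared filename beginning), not the tool's actual implementation; the function names `split_extension` and `pair_reads` and the handling of a trailing `.gz` are assumptions.

```python
import os


def split_extension(filename):
    """Return (stem, extension); a trailing ".gz" is treated as part of the extension."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext == ".gz":  # assumption: compressed FASTQ files may be supplied
        stem, inner = os.path.splitext(stem)
        ext = inner + ext
    return stem, ext


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group files into {sample: (upstream_file, downstream_file)}."""
    if not upstream or not downstream:
        raise ValueError("patterns must not be empty")
    pairs = {}
    for name in filenames:
        stem, _ = split_extension(name)
        for index, pattern in enumerate((upstream, downstream)):
            # The pattern must sit right before the extension.
            if stem.endswith(pattern):
                sample = stem[: -len(pattern)]
                pairs.setdefault(sample, [None, None])[index] = name
                break
    # Keep only samples for which both mates were found.
    return {s: tuple(p) for s, p in pairs.items() if None not in p}


if __name__ == "__main__":
    print(pair_reads(["SRR037717_1.fastq", "SRR037717_2.fastq"]))
```

Files that match neither pattern, or whose mate is missing, are left out of the result in this sketch.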
Input 2
- Use Additional Mate-Pair Data: SPAdes supports mate-pair only assembly. However, high-quality mate-pair libraries are recommended in these cases. Here, regular mate-pair libraries can be provided as supplementary information. Upstream and downstream files will be distinguished using the pattern established on the previous page (Paired-end Configuration).
- Use Data for Hybrid Assembly:
  - PacBio (CLR), Oxford Nanopore, and Sanger reads can be provided for hybrid assemblies (e.g. with Illumina or IonTorrent data). SPAdes uses this data for gap closure and repeat resolution.
  - Contigs of the same genome (trusted) generated by other assemblers can be specified to merge them into the SPAdes assembly.
  - Less reliable contigs (untrusted) can be used only for gap closure and repeat resolution.
Only contigs of the same genome should be specified since SPAdes does not work with genomes of closely related species.
Configuration
- Automatic K-mer Sizes: K-mer sizes are selected automatically based on the read length and data set type:
  - If single-cell data is provided, the default values are 21, 33 and 55.
  - For multicell datasets, K values are automatically selected using maximum read length.
- K-mer Sizes: Define a comma-separated list of k-mer sizes to be used. These must be odd and less than 128. You can find recommendations about K-mer sizes in the SPAdes documentation.
- Read Error Correction: Performs a read error correction before assembly. Depending on the sequencing platform, the BayesHammer (Illumina) or the IonHammer (IonTorrent) tools are used for this task. This procedure is recommended to obtain high-quality assemblies but can be turned off if read error correction has been done previously.
- Mismatch Careful Mode: Tries to reduce the number of mismatches and short indels. It also runs MismatchCorrector, a post-processing tool that uses BWA.
This option is recommended only for the assembly of small genomes. It is not recommended for medium-size and large eukaryotic genomes.
- Read Coverage Cutoff: Configure the read coverage cutoff value that SPAdes will use to obtain the most reliable assembled sequences. The value must be a positive decimal number, automatic, or off. When set to "Automatic", SPAdes computes the coverage threshold itself using a conservative strategy.
- Read Coverage Cutoff Value: If the "Defined by User" option is selected above, set a positive float value.
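For readers who also run SPAdes from the command line, the constraints above can be sketched as a small Python helper that validates the values and assembles an argument list. The mapping between the wizard options and the `spades.py` flags (`-k`, `--careful`, `--only-assembler`, `--cov-cutoff`, `--sc`) is an assumption made for illustration; this page does not state how the wizard invokes SPAdes, and the function names are hypothetical.

```python
def parse_kmer_sizes(text):
    """Parse a comma-separated k-mer list; every value must be odd and < 128."""
    sizes = [int(item) for item in text.split(",")]
    for k in sizes:
        if k <= 0 or k % 2 == 0 or k >= 128:
            raise ValueError(f"invalid k-mer size {k}: must be odd and less than 128")
    return sizes


def parse_cov_cutoff(value):
    """Accept 'auto', 'off', or a positive decimal number (given as text)."""
    if value in ("auto", "off"):
        return value
    if not float(value) > 0:
        raise ValueError("coverage cutoff must be a positive number")
    return value


def build_spades_args(upstream, downstream, out_dir, kmers=None, careful=False,
                      error_correction=True, cov_cutoff="off", single_cell=False):
    """Build an argument list for one paired-end library (assumed flag mapping)."""
    args = ["spades.py", "-1", upstream, "-2", downstream, "-o", out_dir]
    if single_cell:
        args.append("--sc")
    if kmers is not None:  # None leaves k-mer selection to SPAdes
        args += ["-k", ",".join(str(k) for k in parse_kmer_sizes(kmers))]
    if careful:
        args.append("--careful")
    if not error_correction:
        args.append("--only-assembler")  # skip the read error correction stage
    args += ["--cov-cutoff", parse_cov_cutoff(cov_cutoff)]
    return args


if __name__ == "__main__":
    print(" ".join(build_spades_args("SRR037717_1.fastq", "SRR037717_2.fastq",
                                     "spades_out", kmers="21,33,55", careful=True,
                                     cov_cutoff="auto")))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting issues with file paths.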
Output
- Contigs Fasta: Where to store the Fasta file containing the assembled contigs.
- Scaffolds Fasta: Where to store the Fasta file containing the assembled scaffolds. Recommended for use as resulting sequences.
- Save Graph Files: Save the final graphs in .gfa files. Two files are generated: assembly_graph_after_simplification.gfa and assembly_graph_with-scaffolds.gfa.
- Graph Files: Select a folder to store the graph files (.gfa).
Results
SPAdes returns the assembled sequences in two FASTA files:
- Contigs: Contains resulting contigs.
- Scaffolds: Contains resulting scaffolds (recommended for use as resulting sequences).
Contigs/scaffolds names in SPAdes output FASTA files have the following format:
NODE_3_length_237403_cov_243.207
- 3 is the number of the contig/scaffold.
- 237403 is the sequence length in nucleotides.
- 243.207 is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage.
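The naming scheme above is easy to parse when filtering sequences by length or coverage. The following is a minimal sketch; the function name `parse_node_name` and the regular expression are illustrative, not part of SPAdes.

```python
import re

# Matches names such as "NODE_3_length_237403_cov_243.207",
# with or without the leading ">" of a FASTA header line.
NODE_PATTERN = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_node_name(name):
    """Return (number, length, k-mer coverage) from a contig/scaffold name."""
    match = NODE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a SPAdes-style sequence name: {name!r}")
    number, length, coverage = match.groups()
    return int(number), int(length), float(coverage)


if __name__ == "__main__":
    print(parse_node_name("NODE_3_length_237403_cov_243.207"))
```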
If the "Save Graph Files" option is checked, SPAdes returns the assembly graph and scaffold paths in GFA 1.0 format. The "assembly_graph_after_simplification.gfa" file corresponds to contigs before repeat resolution (edges of the assembly graph). Paths corresponding to contigs after repeat resolution (scaffolding) are stored in "assembly_graph_with-scaffolds.gfa".
To view GFA files, the Bandage visualization tool is recommended.
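For a quick check without a viewer, a GFA 1.0 file can also be summarized with a few lines of Python, since each record is a tab-separated line whose first field gives its type (`S` for segments, `L` for links, `P` for paths). This is a minimal sketch; the function name `summarize_gfa` and the returned keys are illustrative.

```python
from collections import Counter


def summarize_gfa(lines):
    """Count segment, link and path records and sum the segment sequence lengths."""
    counts = Counter()
    segment_bases = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        record = fields[0]
        if record not in ("S", "L", "P"):
            continue  # headers, comments and other record types are ignored
        counts[record] += 1
        # An S record is: S <name> <sequence> [tags]; "*" means no sequence stored.
        if record == "S" and len(fields) > 2 and fields[2] != "*":
            segment_bases += len(fields[2])
    return {
        "segments": counts["S"],
        "links": counts["L"],
        "paths": counts["P"],
        "segment_bases": segment_bases,
    }


if __name__ == "__main__":
    example = [
        "H\tVN:Z:1.0",
        "S\t1\tACGT",
        "S\t2\tGG",
        "L\t1\t+\t2\t+\t0M",
        "P\tNODE_1\t1+,2+\t*",
    ]
    print(summarize_gfa(example))
```

To summarize a real file, pass an open file handle, e.g. `summarize_gfa(open(path))`.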
In addition to the resulting FASTA files, a report and a chart are generated. The report shows a summary of the DNA-Seq De Novo Assembly results (Figure 5). This page contains information about the input sequencing data and a results overview. The Results Overview table shows a number of common statistics used to describe the quality of a sequence assembly (see the explanation in the previous section).
- N50: This statistic describes the assembly quality in terms of contiguity. To calculate N50, all contigs or scaffolds are first ordered from the longest to the shortest. Then, starting from the longest sequence, the lengths are summed until the running sum reaches or exceeds one-half of the total length of all sequences in the assembly. The N50 of the assembly is the length of the last (shortest) sequence added to this sum. Higher N50 values generally indicate a more contiguous assembly. Any Nx statistic is calculated in the same way, e.g. N75 is obtained by summing the lengths until the running sum reaches 75% of the total length.
- L50: Defined as the smallest number of contigs whose summed length makes up at least half of the total assembly length.
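The two definitions above can be written as one short routine. This is a minimal sketch for checking values by hand, not the code used to produce the report; the function name `nx_lx` and its `fraction` parameter are illustrative.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    fraction=0.5 gives N50/L50, fraction=0.75 gives N75/L75, and so on.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # 'length' is the shortest sequence needed to reach the threshold,
            # 'count' is how many sequences were needed.
            return length, count


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # total length 290
    print(nx_lx(contigs))         # N50 and L50
    print(nx_lx(contigs, 0.75))   # N75 and L75
```

For the example lengths, half of the total is 145; the two longest sequences sum to 150, so N50 is 70 and L50 is 2.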
The Nx plot (Figure 6) shows Nx values as x varies from 0 to 100 %. The Nx values are displayed for contigs and scaffolds.