Long Read Quality Assessment with LongQC

Introduction

LongQC is a computationally efficient, platform-independent QC tool to spot issues before a full analysis. The tool visualizes statistics designed for error-prone long-read data to highlight potential problems originating from the biological samples as well as those introduced at the sequencing stage. It supports the major third-generation sequencing (TGS) file formats. LongQC relies on k-mer-based internal overlaps and skips alignment; therefore, it operates efficiently without a reference genome.

Please cite LongQC as:

LongQC: A Quality Control Tool for Third Generation Sequencing Long Read Data.
Yoshinori Fukasawa, Luca Ermini, Hai Wang, Karen Carty, Ming-Sin Cheung.
G3: Genes, Genomes, Genetics, 10(4): 1193-1196, 2020

Quality Assessment with LongQC

LongQC can be found in the General Tools Module of OmicsBox under FASTQ tools → Long Read Quality Assessment with LongQC. The wizard allows selecting one or more sequencing files to assess, optionally saving an output file, and setting some analysis parameters (Figure 1).

Input

LongQC analyses TGS read files from either PacBio or Oxford Nanopore sequencing technologies. These files should be in one of the following standard formats:

FASTA file: Plain text file containing all the read sequences obtained in the sequencing process.
FASTQ file: Plain text file that stores both the nucleotide sequences of the reads and their per-base quality scores. It has become the de facto standard for storing the output of the sequencing process.
PacBio BAM file: Binary, compressed container format for storing PacBio sequencing reads. It is based on the BAM/SAM specifications, although it does not contain any alignment information. More information about this format can be found in the PacBio documentation.

Note that LongQC uses per-base sequencing quality scores in some of its analyses. If these values are not provided, those analyses will not be performed.
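Whether a plain-text input carries per-base quality scores can usually be told from its first record. The following is a minimal sketch and not part of LongQC or OmicsBox: the function name `sniff_read_format` is hypothetical, it only inspects the first non-empty line, and it does not handle binary PacBio BAM files.

```python
def sniff_read_format(lines):
    """Guess the read format from the first non-empty line.

    Returns "fastq" (quality scores present) or "fasta" (no quality scores).
    Hypothetical helper for illustration only; not part of LongQC.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith("@"):
            return "fastq"  # FASTQ record headers start with '@'
        if stripped.startswith(">"):
            return "fasta"  # FASTA record headers start with '>'
        raise ValueError(f"unrecognized header line: {stripped[:20]!r}")
    raise ValueError("no records found")


if __name__ == "__main__":
    print(sniff_read_format(["@read1", "ACGT", "+", "IIII"]))
    print(sniff_read_format([">read1", "ACGT"]))
```

With a real file, an open text handle can be passed directly, since iterating over it yields lines.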

Configuration

Several parameters can be tuned to perform a more accurate analysis. These parameters should be adjusted according to the type of the input reads:

Sequencing Technology: Applies the configuration of internal parameters that best suits each sequencing technology. This parameter is particularly relevant for the adapter analysis, since it determines which adapter sequences are searched for.
Transcript Mode: Enables a LongQC configuration that fits the specific features of transcript (RNA, cDNA) data.
Short Mode: Activates a highly sensitive setting for very short and error-prone reads.

LongQC works even if the selection of the parameters does not fit the input. However, this could lead to incorrect results or warnings.

Output

LongQC provides an optional FASTQ file that contains the trimmed reads after the adapter detection analysis.

The parameters that control the output are the following:

Save trimmed: Enable to save the trimmed reads. When enabled, a destination folder can be selected.
Directory to save the trimmed reads: Destination folder for the trimmed reads.

Results

LongQC provides the following outputs:

Table with the most relevant statistics about each sample (Figure 2).
Report with general statistics, warnings/errors, GC composition, and the adapter detection results (Figure 3).
Charts:
- Adapters detection (Line Plot/Histogram).
- Sequences Composition:
  - GC content plot (Line Plot/Histogram).
  - Masked bases plot (Line Plot/Histogram).
- Reads Coverage:
  - Binned coverages plot (Line Plot/Histogram).
  - Coverage over length plot (Line Plot/Box Plot).
- Reads Length (Line Plot/Histogram).
- Reads Quality:
  - Quality over length plot (Line Plot/Box Plot).
  - Quality over coverage plot (Box Plot).

Box Plots and Histograms are only available for representing a single sample in each chart. Line Plots allow comparing several samples in the same chart.

File:
Trimmed reads (optional).

Table with general statistics

It contains the following columns (Figure 2):

Tags: A colored label that indicates whether the analysis of a sample returned a warning, an error, or neither (correct). These warnings and errors refer to problems detected by LongQC in the samples. Minor issues are reported as "warnings", while significant issues are reported as "errors".
Sample: Name of the file without extensions.
Total Number of bp: Total number of base pairs across all read sequences in the file.
Mean Read Length: Average length of all reads in the file.
N50: Statistic commonly used to assess the quality of a genome assembly. Here, it represents a length-weighted median: the length such that reads of that length or longer contain at least half of the total bases.
Longest Read: Length (in number of bp) of the longest read in the sample.
Number of reads: Total number of read sequences in the file.
% Non-sense Reads: Fraction (in percentage) of the reads having a coverage value of 0.

Coverage is calculated by mapping all reads against each other. Therefore, non-sense reads are unique reads that cannot be mapped onto any other sequence in the same file.

% Q>7 Bases: Fraction (in percentage) of bases having a quality value of 7 or higher. This column is optional since PacBio Sequel technologies do not provide any per-base quality score.

A quality value of 7 corresponds to an error rate of about 20% (10^(-7/10) ≈ 0.2).

Warnings: Brief description of all warnings and errors found in the sample.
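The length and quality columns above can be reproduced on a small scale. The following is a minimal sketch, not LongQC's own implementation: all function names are hypothetical, and the quality strings are assumed to be Phred+33 encoded, as in standard FASTQ files.

```python
def n50(lengths):
    """Length-weighted median: the length L such that reads of length >= L
    together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no reads


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def percent_bases_at_least(quality_strings, threshold=7, offset=33):
    """Percentage of bases with quality >= threshold.

    Assumption: quality strings use the Phred+33 ASCII encoding.
    """
    total = passed = 0
    for qs in quality_strings:
        for ch in qs:
            total += 1
            if ord(ch) - offset >= threshold:
                passed += 1
    return 100.0 * passed / total if total else 0.0


def summarize(lengths):
    """Basic length statistics for a non-empty list of read lengths."""
    return {
        "total_bp": sum(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "n50": n50(lengths),
        "longest": max(lengths),
        "num_reads": len(lengths),
    }


if __name__ == "__main__":
    print(summarize([2, 3, 4, 5, 6]))
    print(round(phred_to_error(7), 3))
```

For the lengths 2, 3, 4, 5 and 6 (20 bp in total), the two longest reads already cover 11 bp, which is more than half, so the N50 is 5 even though the mean length is 4.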

LongQC Report

The report contains the following sections:

General Statistics: Some of the statistics shown in the table (Figure 2).
GC Content Statistics: Mean and Standard deviation of the GC content. The sample’s mean should be close to the mean GC content of the organism genome.
Warnings/Errors: Table with all warnings and errors detected in all samples. The table only appears if any kind of issue has been detected.
Adapter Statistics: This table is shown if adapters sequences have been detected in, at least, one sample. It contains the following columns:
Number of trimmed reads: The number of reads having adapter-like sequences (75% or higher identity) at either end. If this number is unexpectedly low and trimming was not conducted beforehand, it suggests that the adapter ligation step had some problems.
Max seq identity: Maximum identity between the adapter sequence and the read sequences. This value should be quite high (90%) if adapters are still present in the dataset.
Average trimmed length: The average end position of the aligned sequences. This should be consistent with the kit description and with the peak in the flanking region analysis plots.
Parameters: Execution parameters of the analysis.
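The GC Content Statistics entry above reports a mean and a standard deviation. A minimal sketch of such a computation is shown below, assuming the GC fraction is taken per read over unambiguous bases only; LongQC may compute it differently (for example on subsampled or chunked sequences), and the function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C among the A, C, G and T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # sequence holds only ambiguous bases, or is empty
    return (seq.count("G") + seq.count("C")) / acgt


def gc_summary(reads):
    """Mean and population standard deviation of per-read GC fractions."""
    fractions = [gc_fraction(read) for read in reads]
    return mean(fractions), pstdev(fractions)


if __name__ == "__main__":
    print(gc_summary(["ATGC", "GGCC", "ATAT"]))
```

The resulting mean can then be compared with the expected GC content of the organism's genome.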

Charts

LongQC charts can be accessed through the side panel action buttons (Figure 3). Each button opens a specific wizard where several plotting options can be selected.

All charts are grouped in the following sections:

Adapters Detection: These charts represent the count of sequence fragments that match adapter sequences and their distance to the 5' and 3' ends. If the tool detects any adapter, the chart should show a peak distinct from 0.
There are two options to display the adapter detection results:
- Histogram: Very useful since it allows changing the size of the bins. Only available for one sample (Figure 4).
- Line Plot: Suitable for comparing two or more samples; bin sizes are fixed (Figure 5).
Sequences Composition: There are two types of charts regarding the sequence composition of the reads:
GC Content Plot: It displays the GC content distribution of the samples. It should show one peak if the quality of the sample is correct. This peak should be near the average GC content of the studied genome. The presence of more than one peak could imply the existence of contaminant sequences from other organisms. However, samples from metagenomics analysis could show more than one peak. The options available to plot the GC content are the Histogram and the Line Plot (Figure 6).
Masked Bases Plot: This chart shows the distribution of the per-read sequence complexity score computed by the DUST algorithm. A high fraction of masked bases could point to problems in the sequencing process. The options available, in this case, are the Histogram and the Line Plot (Figure 7).
Reads Coverage: These charts represent, in two different ways, the distribution of calculated coverage on every sample. The two groups of plots within this section are:
Binned Coverage Plot: It displays the binned coverage distribution for all samples. It should show only one peak if the sample has no issues. However, metagenomics samples can present more than one peak and still be correct. It can be plotted as a Histogram or a Line Plot (Figure 8).
Coverage over Length Plot: It shows the fluctuation of coverage over the length of the reads. According to the LongQC authors, significant divergences between length intervals could indicate contamination, a low-quality library, or overloading in PacBio. Coverage over length can be plotted as:
- Box Plot: Shows one box for each length interval. More informative and robust, but only available for one sample.
- Line Plot: Much less accurate than the box plot, but it allows comparing samples (Figure 9).
Reads Length: These charts represent the length distribution of the samples. This distribution heavily depends on the technology and the type of sequenced molecule (genomic DNA or RNA). Samples from the same experiment and organism should display highly similar distribution curves. These distributions can be represented either as a Histogram (Figure 10) or as a Line Plot.
Reads Quality: These charts are only available if the sample files provide per-base quality scores. These scores represent the accuracy of the base calling for each base pair.
Quality over Length Plot: It shows the quality fluctuation over the length of the reads. If the data is correct, quality should have a similar behavior between length intervals. Quality over length could be plotted as Box Plots (Figure 11) or Line Plots.
Quality over Coverage Box Plot: This chart represents in two boxes the distribution of per-read average quality values according to their coverage value (Figure 12). Ideally, normal reads should have a better overall quality than non-sense reads. Also, the median of the non-sense reads should be in the red region. If both medians are similar, there are two possible scenarios:
- The two medians are located in the green region: The coverage along the dataset may be low. Sequencing depth should be improved.
- The two medians are located in the red region: There are issues with the quality of the data that could affect downstream analysis.
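The comparison behind the Quality over Coverage Box Plot can be sketched as follows. This is an illustration only: the per-read average qualities and coverage values are assumed to be already available (LongQC computes them internally), the function name is hypothetical, and the boundaries of the green and red regions are not reproduced here.

```python
from statistics import median


def quality_by_coverage(reads):
    """Compare normal reads (coverage > 0) with non-sense reads (coverage 0).

    `reads` is a list of (mean_quality, coverage) pairs, one per read.
    Both groups are assumed to be non-empty.
    """
    normal = [q for q, cov in reads if cov > 0]
    nonsense = [q for q, cov in reads if cov == 0]
    return {
        "median_normal": median(normal),
        "median_nonsense": median(nonsense),
        "percent_nonsense": 100.0 * len(nonsense) / len(reads),
    }


if __name__ == "__main__":
    example = [(12.0, 5), (11.0, 3), (13.0, 8), (6.0, 0), (5.0, 0)]
    print(quality_by_coverage(example))
```

In this toy input, the normal reads have a clearly higher median quality than the non-sense reads, which is the pattern described above as ideal.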