
Contaminant Removal

Introduction

In host-associated studies, it is often necessary to separate host-derived DNA from the rest of the sequencing data.

Reads that contain host DNA most likely cannot be classified with Kraken 2 and simply add noise to the dataset. Mapping the reads to the host genome identifies these reads so they can be set aside, which can help reduce the number of unclassified reads. The process can be repeated to refine the data step by step (e.g. with different phylogenetically close target genomes).

  • Sequencing Data: Choose the type of input data: FASTA, single-end, or paired-end. If paired-end is selected, two files per sample are required and the file pattern has to be provided.
  • Reads: Select the files that contain the desired input data.
  • Paired-end configuration: When working with paired-end libraries, a so-called pattern has to be established to help the software distinguish between upstream and downstream read files. By default, we assume the following pattern:

  • upstream: SampleA_1.fastq

  • downstream: SampleA_2.fastq

For example, with SRR037717_1.fastq and SRR037717_2.fastq as the upstream and downstream files, select "_1" and "_2" respectively as the patterns.
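The pairing rule above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function name pair_reads is hypothetical, and the sketch assumes that the pattern sits directly before the first dot of the file name and that sample names contain no dots.

```python
from pathlib import Path


def pair_reads(filenames, upstream="_1", downstream="_2"):
    """Group read files into (upstream, downstream) pairs per sample.

    Hypothetical helper for illustration only. Assumes the pattern
    appears right before the first dot in the file name.
    """
    pairs = {}
    for name in filenames:
        # "SRR037717_1.fastq" -> "SRR037717_1"
        base = Path(name).name.split(".", 1)[0]
        if base.endswith(upstream):
            sample = base[: -len(upstream)]
            pairs.setdefault(sample, [None, None])[0] = name
        elif base.endswith(downstream):
            sample = base[: -len(downstream)]
            pairs.setdefault(sample, [None, None])[1] = name
    # Keep only samples for which both files were found.
    return {s: tuple(p) for s, p in pairs.items() if None not in p}


if __name__ == "__main__":
    files = ["SRR037717_1.fastq", "SRR037717_2.fastq", "SampleA_1.fastq"]
    print(pair_reads(files))
```

In this sketch SampleA is dropped because its downstream file is missing, which mirrors the requirement of two files per sample.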

  • Database Index: Choose one of the included target genomes (Homo sapiens, Mus musculus, PhiX, etc.), or select "Create database from genome" and provide your own genome.
  • Target Genome: Select the target genome in FASTA format to map the selected read data against.

Figure 1
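For readers who want to reproduce the Database Index and Target Genome steps on the command line, the following sketch builds the corresponding commands. It assumes that Bowtie 2 (cited in the references) is the aligner; this page does not describe how the tool calls it internally. The function names, file names, and the default thread count are hypothetical. The commands are only assembled as argument lists, not executed.

```python
def build_index_command(genome_fasta, index_prefix):
    """Command that creates a Bowtie 2 index from a genome in FASTA format."""
    return ["bowtie2-build", genome_fasta, index_prefix]


def build_mapping_command(index_prefix, reads, sam_out,
                          paired=False, fasta=False, threads=4):
    """Command that maps reads against an existing Bowtie 2 index.

    reads: list with one file (single-end or FASTA input)
           or two files (paired-end: upstream, downstream).
    """
    cmd = ["bowtie2", "-p", str(threads), "-x", index_prefix]
    if fasta:
        cmd.append("-f")  # reads are FASTA instead of FASTQ
    if paired:
        if len(reads) != 2:
            raise ValueError("paired-end input needs two files per sample")
        cmd += ["-1", reads[0], "-2", reads[1]]
    else:
        if len(reads) != 1:
            raise ValueError("single-end input needs one file per sample")
        cmd += ["-U", reads[0]]
    cmd += ["-S", sam_out]  # write alignments in SAM format
    return cmd


if __name__ == "__main__":
    print(" ".join(build_index_command("host.fa", "host_idx")))
    print(" ".join(build_mapping_command(
        "host_idx", ["SRR037717_1.fastq", "SRR037717_2.fastq"],
        "SRR037717.sam", paired=True)))
```

The lists can be passed to subprocess.run once Bowtie 2 is installed and on the PATH.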

  • Save Results: Choose whether to save only the contaminant sequences, only the non-contaminant sequences, or both.

Figure 2
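The distinction behind the Save Results option can be sketched from an alignment in SAM format: a record whose FLAG has bit 0x4 set is unmapped, so the read did not match the target genome. The sketch below is an assumption about how such a split could be done, not the tool's own code. The function name split_by_mapping and the save values are hypothetical, and each record is treated independently, so paired-end mates are not combined into one decision.

```python
UNMAPPED = 0x4         # SAM FLAG bit: segment unmapped
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def split_by_mapping(sam_lines, save="both"):
    """Sort read names into contaminant (mapped) and non-contaminant (unmapped).

    save: "contaminant", "non-contaminant", or "both" (hypothetical values).
    """
    if save not in ("contaminant", "non-contaminant", "both"):
        raise ValueError(f"unknown save mode: {save}")
    contaminant, clean = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        (clean if flag & UNMAPPED else contaminant).append(name)
    result = {}
    if save in ("contaminant", "both"):
        result["contaminant"] = contaminant
    if save in ("non-contaminant", "both"):
        result["non-contaminant"] = clean
    return result


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6",
        "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    ]
    print(split_by_mapping(sam))
```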

References

  • Langmead B. and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-9.
  • Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G. and Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078-9.
  • Okonechnikov K., Conesa A. and Garcia-Alcalde F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics (Oxford, England), 32(2), 292-4.