Demultiplexing with Cutadapt
Introduction
Demultiplexing, or barcode splitting, is the processing step in which barcode information is used to determine which sequences came from which sample after all samples were sequenced together. Barcodes are the unique sequences ligated to each sample's genetic material before the samples are pooled. Depending on your sequencing facility, you may receive your reads already split into individual FASTQ files, or combined in a single FASTQ file with the barcodes still attached, leaving the splitting to you. In the latter case, you should also have a mapping (barcode) file that tells you which barcodes correspond to which samples.
This tool takes FASTA/FASTQ files and splits them into several smaller files based on barcode matching. Cutadapt is used for this task.
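For orientation, the sketch below assembles a comparable single-end demultiplexing call for the Cutadapt command line. It is only an illustration: the file names (pooled.fastq.gz, barcodes.fasta), the "demux-" output prefix, and the helper build_demultiplex_command are assumptions, and the command the wizard actually runs may differ. The command is built and printed, not executed.

```python
import shlex


def build_demultiplex_command(reads, barcode_fasta, max_errors=1):
    """Assemble a single-end Cutadapt demultiplexing command (hypothetical helper)."""
    return [
        "cutadapt",
        # Values of 1 or more are read as an absolute error count
        # in recent Cutadapt versions (assumption: such a version is installed).
        "-e", str(max_errors),
        # "^" anchors the barcodes at the 5' end; "file:" reads them from a FASTA file.
        "-g", f"^file:{barcode_fasta}",
        # Cutadapt replaces {name} with the ID of the matched barcode.
        "-o", "demux-{name}.fastq.gz",
        reads,
    ]


# Hypothetical input names, used only for illustration.
command = build_demultiplex_command("pooled.fastq.gz", "barcodes.fasta")
print(shlex.join(command))
```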
Cutadapt Wizard
Page 1 - Input
Parameters
Input Reads - Select the FastQ/A files containing the sequences whose attached barcodes link them to their respective samples. Single-End and Paired-End files are allowed.
Paired-end Configuration - If Paired-End reads are provided, a pattern to distinguish upstream files from downstream files is required. The provided patterns are searched in the filenames right before the extension. The beginning of the filenames should be the same for both files of each sample.
- Upstream Files Pattern: Establish the pattern to recognize upstream FASTQ files.
- Downstream Files Pattern: Establish the pattern to recognize downstream FASTQ files.
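The pairing rule described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the list of recognized extensions, the example patterns "_R1" and "_R2", and the helper names are assumptions.

```python
# Extensions recognized by this sketch (assumption, longest variants first).
KNOWN_SUFFIXES = (".fastq.gz", ".fq.gz", ".fasta.gz", ".fa.gz",
                  ".fastq", ".fq", ".fasta", ".fa")


def split_suffix(filename):
    """Split a file name into (stem, extension) using KNOWN_SUFFIXES."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return filename, ""


def pair_files(filenames, upstream_pattern, downstream_pattern):
    """Group files by sample, looking for each pattern right before the extension.

    Both patterns are assumed to be non-empty.
    """
    pairs = {}
    for filename in filenames:
        stem, _ = split_suffix(filename)
        if stem.endswith(upstream_pattern):
            sample = stem[: -len(upstream_pattern)]
            pairs.setdefault(sample, {})["upstream"] = filename
        elif stem.endswith(downstream_pattern):
            sample = stem[: -len(downstream_pattern)]
            pairs.setdefault(sample, {})["downstream"] = filename
    return pairs


files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
         "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"]
paired = pair_files(files, "_R1", "_R2")
for sample, pair in paired.items():
    print(sample, pair["upstream"], pair["downstream"])
```

Because the sample key is whatever precedes the pattern, two files pair up only when the beginning of their names is identical.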
Barcode File - Select the mapping file that establishes the connection between each barcode and sample. In the case of Paired-End, Barcodes from this file will be matched against the Upstream Files.
Downstream Barcodes - Select to activate the barcode search in the Downstream Files. Only allowed if Paired-End data has been selected as input reads.
Downstream Barcode File - A text file containing the barcodes to be mapped against the Downstream Files.
Barcode File Format
Barcode sequences can be provided in three different formats:
1. Two-column TXT/CSV file: A plain TXT or CSV/TSV file. Each line contains an identifier (a descriptive name for the barcode) and the barcode itself (A/C/G/T), separated by a TAB character. Example:
BC1 GATCT
BC2 ATCGT
BC3 GTGAT
BC4 TGTCT
2. Three-column TXT/CSV file: This format is the same as the previous one, with a third column containing the name of the file in which each barcode should be searched. Example:
BC1 GATCT filename1
BC2 ATCGT filename1
BC3 GTGAT filename2
BC3 GTGAT filename3
BC4 TGTCT filename3
In this case, each barcode will be searched only in the files indicated in the third column.
3. FASTA file: Each barcode sequence is preceded by its barcode ID as a FASTA header (a line starting with ">"). Example:
>BC1
GATCT
>BC2
ATCGT
>BC3
GTGAT
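To make the three formats concrete, here is a minimal parsing sketch. It assumes TAB-separated columns and one sequence line per FASTA header; the function name parse_barcodes and the tuple layout it returns are hypothetical and not part of the tool.

```python
def parse_barcodes(text):
    """Return a list of (barcode_id, sequence, target_file_or_None) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []

    # FASTA format: a header line starting with ">" followed by the sequence.
    if lines and lines[0].startswith(">"):
        current_id = None
        for line in lines:
            if line.startswith(">"):
                current_id = line[1:].strip()
            else:
                records.append((current_id, line.upper(), None))
        return records

    # Two- or three-column format, TAB-separated (assumption of this sketch).
    for line in lines:
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) not in (2, 3):
            raise ValueError(f"Expected 2 or 3 columns, got {len(fields)}: {line!r}")
        target = fields[2] if len(fields) == 3 else None
        records.append((fields[0], fields[1].upper(), target))
    return records


two_columns = "BC1\tGATCT\nBC2\tATCGT\n"
three_columns = "BC1\tGATCT\tfilename1\nBC3\tGTGAT\tfilename2\n"
fasta = ">BC1\nGATCT\n>BC2\nATCGT\n"

print(parse_barcodes(two_columns))
print(parse_barcodes(three_columns))
print(parse_barcodes(fasta))
```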
For each barcode and file, a new FASTA/FASTQ file will be created (with the barcode's identifier as part of the output file name). Sequences matching the barcode will be stored in the appropriate file. The name of the new files will contain the name of the original input file as well.
Running the tool with a barcode file that contains the barcodes above will create the following files:
[filename]-BC1.fastq.gz
[filename]-BC2.fastq.gz
[filename]-BC3.fastq.gz
[filename]-BC4.fastq.gz
[filename]-unknown.fastq.gz
Note that in this example .fastq.gz was chosen as the file suffix.
The 'unknown' file will contain all sequences that didn't match any barcode.
Page 2 - Configuration
Adapter Position - Match the barcodes at the beginning (5') of the sequences, at the end (3'), or anywhere along the upstream sequences.
Downstream Adapter Position - Available only if a downstream barcode file has been provided. Sets the position at which barcodes are searched in the downstream input files.
Paired-End Adapter Strategy - If both upstream and downstream barcode files have been provided, it allows you to choose between 2 search strategies:
- Unique Dual Indices: Cutadapt only looks for the R1 and R2 barcodes in pairs. That is, the first R1 barcode is always used with the first R2 barcode, and so on.
- Combinatorial Dual Indices: Cutadapt uses all possible combinations between R1 barcodes and R2 barcodes.
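The difference between the two strategies comes down to how the R1 and R2 barcode lists are combined. A minimal sketch, with made-up barcode IDs and hypothetical helper names:

```python
from itertools import product


def unique_dual_pairs(r1_barcodes, r2_barcodes):
    """Pair barcodes by position: first with first, second with second, and so on."""
    if len(r1_barcodes) != len(r2_barcodes):
        raise ValueError("Unique dual indices need the same number of R1 and R2 barcodes")
    return list(zip(r1_barcodes, r2_barcodes))


def combinatorial_dual_pairs(r1_barcodes, r2_barcodes):
    """Use every possible combination of an R1 barcode with an R2 barcode."""
    return list(product(r1_barcodes, r2_barcodes))


# Illustrative barcode IDs only.
r1 = ["F1", "F2", "F3"]
r2 = ["R1", "R2", "R3"]

print(unique_dual_pairs(r1, r2))
print(len(combinatorial_dual_pairs(r1, r2)))
```

With three barcodes on each side, the unique strategy considers 3 pairs while the combinatorial strategy considers 9, so the latter can produce many more output files.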
Allowed Errors - Maximum number of allowed errors (mismatches and indels, if allowed) for barcodes, ranging from 0 to 10.
Allow Indels - Enable considering insertions and deletions as allowed errors.
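The interaction between these two parameters can be illustrated with a simplified error count for a barcode at the 5' end of a read. This is a sketch under stated assumptions, not Cutadapt's alignment algorithm: without indels it counts mismatches position by position, and with indels it takes the smallest edit distance over read prefixes of nearby lengths. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (mismatches, insertions, and deletions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or mismatch
        previous = current
    return previous[-1]


def barcode_errors(read, barcode, max_errors, allow_indels):
    """Return the error count of a 5' barcode match, or None if it exceeds max_errors."""
    if not allow_indels:
        prefix = read[: len(barcode)]
        if len(prefix) < len(barcode):
            return None
        errors = sum(1 for x, y in zip(prefix, barcode) if x != y)
        return errors if errors <= max_errors else None

    # With indels the matched prefix may be shorter or longer than the barcode.
    shortest = max(0, len(barcode) - max_errors)
    longest = min(len(read), len(barcode) + max_errors)
    candidates = [edit_distance(barcode, read[:length])
                  for length in range(shortest, longest + 1)]
    best = min(candidates) if candidates else None
    return best if best is not None and best <= max_errors else None


barcode = "GATCT"
print(barcode_errors("GTTCTAAAA", barcode, max_errors=1, allow_indels=False))
print(barcode_errors("GTCTAAAA", barcode, max_errors=1, allow_indels=False))
print(barcode_errors("GTCTAAAA", barcode, max_errors=1, allow_indels=True))
```

The second read has lost one base of the barcode: counted as mismatches only, it exceeds the limit, but it is accepted as a single deletion once indels are allowed.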
Action - Specifies what to do with the matched sequences. There are 4 options:
- Trim: Cutadapt removes the matched sequences from the original input sequences.
- None: It does not modify the matched sequences.
- Mask: Write N characters at the positions where adapters have been found.
- Lowercase: Transform the matched section to lowercase. Leaves the rest of the input sequences as Uppercase.
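The effect of the four actions on a single read can be sketched as below. The example assumes a barcode matched at positions 0 to 5 at the 5' end of the read; apply_action is a hypothetical helper, and for the trim case it removes the match together with anything preceding it, which is how a 5' match is normally handled.

```python
def apply_action(read, start, end, action):
    """Apply an action to the matched region read[start:end] (5' match assumed)."""
    if action == "none":
        return read
    if action == "trim":
        # Remove the matched barcode and anything before it.
        return read[end:]
    if action == "mask":
        return read[:start] + "N" * (end - start) + read[end:]
    if action == "lowercase":
        return read[:start].upper() + read[start:end].lower() + read[end:].upper()
    raise ValueError(f"Unknown action: {action}")


read = "GATCTACGTACGT"  # barcode GATCT followed by the insert
for action in ("trim", "none", "mask", "lowercase"):
    print(action, apply_action(read, 0, 5, action))
```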
Reverse Complement Search - Check to search for both the adapter sequences and their reverse complements in the input sequences. If unchecked, Cutadapt searches for barcodes only in the same orientation as the input sequences (from 5' to 3').
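As an illustration of what the reverse complement search adds, the sketch below looks for a barcode in a read in the forward orientation and, optionally, as its reverse complement. It uses exact substring matching for brevity, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement each base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_orientation(read, barcode, search_reverse_complement):
    """Return 'forward', 'reverse', or None depending on where the barcode is found."""
    if barcode in read:
        return "forward"
    if search_reverse_complement and reverse_complement(barcode) in read:
        return "reverse"
    return None


barcode = "GATCT"
read = "AAAGATCAAA"  # contains AGATC, the reverse complement of GATCT

print(reverse_complement(barcode))
print(find_orientation(read, barcode, search_reverse_complement=False))
print(find_orientation(read, barcode, search_reverse_complement=True))
```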
Output Configuration
Save Unmatched Sequences - Check to save all unmatched sequences in an 'unknown' FastQ/A file.
Include Sample Name - Check to include the input file name as a prefix of the output files.
Disabling this option generates output files with the following file name structure: [BarcodeID].fastq.gz
In this case, all reads from all input files that match with a single barcode will be placed in the same output file.
File Format - Selects the output format, FASTA or FASTQ, and whether the output files are compressed (.gz) or not.
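The output options above combine into a file name roughly as in the following sketch. It follows the naming patterns shown in this document ([filename]-[BarcodeID] and [BarcodeID] plus the suffix); the helper output_file_name and its parameter names are hypothetical.

```python
def output_file_name(barcode_id, input_name, include_sample_name=True,
                     file_format="fastq", compressed=True):
    """Build an output file name from the output configuration options."""
    if file_format not in ("fasta", "fastq"):
        raise ValueError("file_format must be 'fasta' or 'fastq'")
    suffix = "." + file_format + (".gz" if compressed else "")
    prefix = f"{input_name}-" if include_sample_name else ""
    return f"{prefix}{barcode_id}{suffix}"


print(output_file_name("BC1", "filename1"))
print(output_file_name("BC1", "filename1", include_sample_name=False))
print(output_file_name("unknown", "filename1", file_format="fasta", compressed=False))
```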
Example
In Figure 3, black lines symbolize sequencing reads, while colored boxes denote barcodes. The representation is divided into three sections:
A) Input Files: Comprising sequencing Fastq files (1-3) and barcode files.
B) Output files resulting from splitting the sequences by barcodes and input files.
C) Output files resulting from splitting the sequences only by barcodes.
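The two grouping modes in sections B and C can be reproduced on toy data. The sketch below uses made-up reads and exact matching at the 5' end; the demultiplex function is a hypothetical stand-in for the real tool.

```python
from collections import defaultdict


def demultiplex(reads_by_file, barcodes, include_sample_name):
    """Group reads by barcode, and also by input file when include_sample_name is True."""
    groups = defaultdict(list)
    for filename, reads in reads_by_file.items():
        for read in reads:
            # First barcode found at the start of the read, else "unknown".
            label = next((barcode_id for barcode_id, sequence in barcodes.items()
                          if read.startswith(sequence)), "unknown")
            key = f"{filename}-{label}" if include_sample_name else label
            groups[key].append(read)
    return dict(groups)


# Toy input, for illustration only.
barcodes = {"BC1": "GATCT", "BC2": "ATCGT"}
reads_by_file = {
    "file1": ["GATCTAAAA", "ATCGTCCCC", "TTTTTTTTT"],
    "file2": ["GATCTGGGG"],
}

by_file_and_barcode = demultiplex(reads_by_file, barcodes, include_sample_name=True)
by_barcode_only = demultiplex(reads_by_file, barcodes, include_sample_name=False)
print(sorted(by_file_and_barcode))
print(sorted(by_barcode_only))
```

In the second mode the BC1 reads from both input files end up in the same group, which corresponds to section C.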
Page 3 - Output
Output Folder - Define a folder to save the results.
Save Counts Table - Check to save a table containing the results of matching all provided barcodes in each input file.
Counts File - Define a file name to save the barcode counts.
Cutadapt Results
- Report with information for each input sample regarding the proportion of reads matched with any provided barcode.
- Two Charts:
  - Matches per Category Chart: Stacked bar plot showing the absolute number of reads in every input file and how many of them matched any provided barcode.
  - Relative Matches per Category Chart: Stacked bar plot similar to the previous chart, showing the relative number of reads per sample file (Figure 5). Useful when the number of reads differs greatly between input files.
- Output FastQ/A files containing all matched reads demultiplexed with their adapters trimmed. The demultiplexed reads can be grouped into these files by barcode and input file if the "Include Sample Name" parameter has been checked. Otherwise, they will be grouped only by the provided barcodes, even if they come from different input files.
- Counts table in a tabular TXT file. This file includes the count of all barcode matches along all the input samples. It also includes the total number of matches per sample and the number of unmatched sequences. It is formatted as a table, having the barcodes as rows and the samples as columns. Furthermore, it is compatible with any spreadsheet program.
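A table with this shape could be produced as in the sketch below. The exact layout of the real counts file is not specified here, so the row labels ("Barcode", "Total matched", "Unmatched"), the numbers, and the helper write_counts_table are assumptions made for illustration.

```python
import csv
import io


def write_counts_table(counts, unmatched, barcodes, samples):
    """Return a TAB-separated table with barcodes as rows and samples as columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Barcode"] + samples)
    for barcode in barcodes:
        writer.writerow([barcode] + [counts.get((barcode, sample), 0) for sample in samples])
    # Summary rows: total matches and unmatched sequences per sample.
    writer.writerow(["Total matched"] +
                    [sum(counts.get((barcode, sample), 0) for barcode in barcodes)
                     for sample in samples])
    writer.writerow(["Unmatched"] + [unmatched.get(sample, 0) for sample in samples])
    return buffer.getvalue()


# Made-up counts, for illustration only.
counts = {("BC1", "file1"): 10, ("BC2", "file1"): 5, ("BC1", "file2"): 3}
unmatched = {"file1": 2, "file2": 1}
table = write_counts_table(counts, unmatched, ["BC1", "BC2"], ["file1", "file2"])
print(table)
```

Because the columns are TAB-separated, the resulting text opens directly in a spreadsheet program.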
References
Please cite Cutadapt as:
Cutadapt removes adapter sequences from high-throughput sequencing reads.
Marcel Martin. EMBnet.journal, 17(1):10-12, May 2011.
DOI: http://dx.doi.org/10.14806/ej.17.1.200