Variant Calling using Freebayes
Introduction
Freebayes is a variant calling tool notable for its support for polyploid genomes.
This algorithm is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
Run Freebayes for Variant Calling
Freebayes can be found under Genetic Variation → Variant Calling → Freebayes. The wizard consists of 3 pages and lets you define the input and output options as well as the analysis parameters (Figure 2, Figure 3 and Figure 4).
Input
- BAM files: alignment files in BAM format. To obtain them, you must align FASTQ files using a DNA-Seq Alignment Strategy, like BWA (highly recommended) or Bowtie 2.
- Reference Genome: FASTA file with the reference genome.
- Group Experiment File (optional): tab-delimited file with sample names in one column and population names in another. If this file is provided, the population-based Bayesian inference model is partitioned by population.
Make sure that read alignment was executed using the same reference genome as the one that is used here as input.
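As an illustration of the Group Experiment File described above, the following minimal Python sketch writes a tab-delimited file body from a sample-to-population mapping. The sample and population names are hypothetical, and the column order (sample first, population second) and the absence of a header line are assumptions, since the description above does not fix them.

```python
import csv
import io


def write_population_file(sample_to_population, handle):
    """Write one 'sample<TAB>population' line per sample (no header assumed)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample, population in sample_to_population.items():
        writer.writerow([sample, population])


# Hypothetical sample and population names, for illustration only.
groups = {"sample_A": "pop1", "sample_B": "pop1", "sample_C": "pop2"}

# An in-memory buffer is used here; pass an open text file to save to disk.
buffer = io.StringIO()
write_population_file(groups, buffer)
population_text = buffer.getvalue()
print(population_text, end="")
```

To save the result to disk, pass a file opened in text mode instead of the in-memory buffer.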
Configuration 1
On this page, you can set the Picard preprocessing step and the ploidy parameters for Freebayes.
- Remove Duplicates: check this option for Whole Genome Sequencing or Whole Exome Sequencing data in order to remove PCR duplicates. For GBS or RADSeq datasets, this option is not recommended.
- Samples with Mixed Ploidy: check this option if you want to perform variant calling on samples with different ploidy (e.g., a diploid and a hexaploid sample).
- Ploidy: sets the species ploidy for the analysis. This option is only enabled when mixed-ploidy variant calling is not selected.
- Copy Number Variation File: this text file consists of two columns with no header. The first column contains the sample name of each individual (i.e., the BAM file name without the ".bam" extension), and the second column contains the copy number (just a number, e.g., 1 for haploids, 2 for diploids, 3 for triploids, etc.).
- Calculate Genotype Quality: Genotype Quality in Freebayes is the likelihood that a genotype is correct. Although it is recommended to leave this option enabled, consider disabling it when performing polyploid variant calling (especially with hexaploids) and with several samples (more than 10).
- Minimum Alternate Fraction: require at least this fraction of observations supporting an alternate allele within a single individual in order to evaluate the position.
- Minimum Alternate Count: require at least this number of observations supporting an alternate allele within a single individual in order to evaluate the position.
- Minimum Alternate Quality Sum: require at least this sum of the qualities of the observations supporting an alternate allele within a single individual in order to evaluate the position.
In polyploid variant calling, it is recommended to set higher thresholds for Min. Alternate Fraction and Count. Alternatively, Min. Alternate Quality Sum can be raised too, which may be more flexible.
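The Copy Number Variation File described above (sample name, then copy number, no header) can be generated from a list of BAM file names. The sketch below is a minimal example under stated assumptions: the BAM file names and ploidies are hypothetical, and a tab is assumed as the column separator because the description does not specify one.

```python
from pathlib import Path


def cnv_map_lines(bam_to_copy_number):
    """Return 'sample<TAB>copy_number' lines, one per BAM file.

    The sample name is the BAM file name without the '.bam' extension.
    """
    lines = []
    for bam_name, copies in bam_to_copy_number.items():
        file_name = Path(bam_name).name
        if file_name.endswith(".bam"):
            file_name = file_name[: -len(".bam")]
        lines.append(f"{file_name}\t{int(copies)}")
    return lines


# Hypothetical BAM files: one diploid and one hexaploid sample.
cnv_lines = cnv_map_lines({"diploid_1.bam": 2, "data/hexaploid_1.bam": 6})
print("\n".join(cnv_lines))
```

Writing `"\n".join(cnv_lines)` to a text file gives a two-column file in the layout described for this option.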
Configuration 2
On this page, other parameters for Freebayes can be set.
- Minimum Mapping Quality: exclude alignments from the analysis if their mapping quality is lower than this value.
- Minimum Base Quality: exclude alleles from the analysis if their supporting base quality is lower than this value.
- Minimum Allele Quality Sum: exclude alleles from the analysis if the sum of the base qualities of the supporting observations is lower than this value.
- Minimum Allele Mapping Quality Sum: exclude alleles from the analysis if the sum of the mapping qualities of the alignments of the supporting observations is lower than this value.
- Mismatch Base Quality: minimum base quality that a mismatching base must have in order to be counted as a mismatch.
- Minimum Coverage: minimum coverage required to process a site.
- Maximum Coverage: if the coverage of a sample exceeds this value, it is downsampled to this level.
- Use Mapping Quality: use mapping quality of alleles when calculating data likelihoods.
- P-value: report sites only if the probability that there is a polymorphism at the site is greater than this value. Note that post-filtering is generally recommended over the use of this parameter.
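The post-filtering recommended above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the tool: it keeps header lines and drops records whose QUAL (the sixth column of a VCF record) is below a chosen threshold. The example records and the threshold (an error probability of 0.01, i.e., a Phred score of about 20) are assumptions chosen for illustration.

```python
import math


def phred_from_probability(p_error):
    """Phred-scale a probability: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def filter_by_qual(vcf_lines, min_qual):
    """Keep header lines and records whose QUAL is at least min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth VCF column
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Hypothetical VCF content, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t35.2\t.\tDP=20",
    "chr1\t250\t.\tC\tT\t4.1\t.\tDP=18",
]

min_qual = phred_from_probability(0.01)  # assumed threshold, about 20
filtered = filter_by_qual(vcf_lines, min_qual)
print("\n".join(filtered))
```

Filtering after the run leaves the full call set available, so the threshold can be changed without repeating the variant calling.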
Output
- Set Name for VCF: VCF filename.
- Directory to Save the VCF: directory to save the VCF file.
Results
Variant Calling has the following outputs:
- VCF file with all the found variants.
- Report with summary details:
  - Information about the resulting VCF: information about the types of variant found and the number of alleles per variant.
  - Adjusted parameters: as you might want to repeat the variant calling with other parameters, it is important to keep this table for reproducibility.
If you repeat the Variant Calling analysis with BCFtools, keep in mind that Freebayes reports MNPs separately from SNPs, whereas BCFtools does not and records each MNP as several separate SNPs. This difference is of little practical importance.
- Distribution charts of different quality variables found in the VCF. These charts can help you decide how to filter the VCF afterwards:
  - Depth Histogram: in Freebayes, Depth (the DP field) is the total read depth at the locus, that is, the number of reads covering the site, whether they support the variant, the reference nucleotide, or other variants.
  - Proportion 'Quality/Depth' Histogram: the quality column of VCF files is the Phred-scaled probability that the site has no variant. Nevertheless, it is better to rely on this quality normalized by depth.
  - Mapping Quality in Alternate Alleles Histogram: the MQM value in the INFO field of a Freebayes VCF file is the mean mapping quality of the reads supporting the alternate allele.
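The quality normalized by depth mentioned above can be computed per record as sketched below. This is a minimal example assuming the ratio is QUAL divided by the DP value of the INFO column; the example record is hypothetical, and the exact formula used for the report chart is not stated here.

```python
def info_to_dict(info_field):
    """Parse a VCF INFO string such as 'DP=30;AF=0.5' into a dictionary."""
    result = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        result[key] = value  # flag entries without '=' get an empty value
    return result


def quality_by_depth(vcf_record):
    """Return QUAL / DP for one VCF record, or None if DP is zero."""
    fields = vcf_record.rstrip("\n").split("\t")
    qual = float(fields[5])                      # QUAL column
    depth = int(info_to_dict(fields[7])["DP"])   # DP entry of the INFO column
    return qual / depth if depth > 0 else None


# Hypothetical record, for illustration only.
record = "chr1\t100\t.\tA\tG\t60.0\t.\tDP=30"
ratio = quality_by_depth(record)
print(ratio)
```

Collecting this ratio over all records gives the values that a Quality/Depth histogram would summarize.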