Segmental duplications are two or more regions of the genome, each at least 1 kb in length, that share >90% sequence similarity. They are important to analyze because of their clinical significance in human disease. They also present challenges for variant calling from next-generation sequencing (NGS) data, because a short read can map to more than one reference location.
The DRAGEN Virtual Long-Read Detection (VLRD) Pipeline uses an advanced algorithm to call variants in segmental duplications from short-read sequence data. It has much greater accuracy in segmental duplications than standard variant callers because it jointly calls all regions that are similar. During mapping and alignment of NGS data, DRAGEN VLRD analyzes all reads, including those with low MAPQ scores, which are common in segmental duplications because the regions are so similar. The DRAGEN VLRD Pipeline then solves for the four most likely haplotypes that originate in these regions of interest and proceeds to variant calling.
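To illustrate why reads in segmental duplications receive low MAPQ scores, the following minimal sketch scores one read against two candidate reference copies. It is a toy model, not the DRAGEN VLRD algorithm. The function names, the uniform per-base error rate, the flat prior, and the example sequences are all assumptions made for illustration.

```python
import math

def placement_posteriors(read, candidates, error_rate=0.01):
    """Posterior probability that `read` came from each candidate sequence.

    Toy model (assumption): ungapped comparison, one uniform per-base
    error rate, and a flat prior over candidates. Each candidate must be
    the same length as the read.
    """
    likelihoods = []
    for ref in candidates:
        mismatches = sum(1 for a, b in zip(read, ref) if a != b)
        matches = len(read) - mismatches
        likelihoods.append(
            (1 - error_rate) ** matches * (error_rate / 3) ** mismatches
        )
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

def mapq_like(best_posterior):
    """Phred-scaled probability that the best placement is wrong."""
    p_wrong = max(1.0 - best_posterior, 1e-6)  # floor avoids log10(0)
    return round(-10 * math.log10(p_wrong))

# Hypothetical 10-base read and two reference copies.
read = "ACGTACGTAC"
identical_copies = ["ACGTACGTAC", "ACGTACGTAC"]
diverged_copies = ["ACGTACGTAC", "ACGTACGTAT"]  # copies differ at the last base

ambiguous = placement_posteriors(read, identical_copies)
resolved = placement_posteriors(read, diverged_copies)
print(ambiguous, mapq_like(max(ambiguous)))
print(resolved, mapq_like(max(resolved)))
```

When the two copies are identical over the read, each placement is equally likely and the Phred-scaled score is very low. A single distinguishing base is enough to favor one copy strongly. A caller that discards low-MAPQ reads would lose the ambiguous reads entirely, which is why considering them jointly matters in these regions.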
DRAGEN VLRD Pipeline
The DRAGEN VLRD Pipeline accepts FASTQ, BAM, or CRAM input and produces a VLRD-specific VCF file. During mapping and alignment, the DRAGEN VLRD algorithm does not filter out reads with the low MAPQ scores typically found in sequences containing segmental duplications. All reads are considered jointly, and each is assigned a location based on maximum likelihood estimates. During variant calling, a BED file is used to delineate duplications that are >400 bp. The VLRD algorithm is then run on the duplicated regions and calls SNP and INDEL variants, which are written to the VLRD-specific VCF file.
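The >400 bp criterion can be sketched as a simple length filter over BED intervals. This example assumes a plain three-column BED layout (chromosome, 0-based start, exclusive end). The layout of the BED file that DRAGEN VLRD actually uses is not described here, and the function name and sample coordinates are hypothetical.

```python
def regions_longer_than(bed_lines, min_length=400):
    """Return (chrom, start, end) tuples for intervals longer than min_length.

    Assumes standard BED coordinates: 0-based start, exclusive end,
    so the interval length is end - start.
    """
    kept = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if end - start > min_length:
            kept.append((chrom, start, end))
    return kept

# Hypothetical intervals, for illustration only.
example_bed = [
    "# chrom start end",
    "chr1\t1000\t1400",   # 400 bp: not strictly greater than 400, dropped
    "chr1\t5000\t5401",   # 401 bp: kept
    "chr2\t20000\t21500", # 1500 bp: kept
]
print(regions_longer_than(example_bed))
```

The comparison is strict (`>`), matching the ">400 bp" wording, so an interval of exactly 400 bp is excluded.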
DRAGEN VLRD Pipeline Results
ROC curves comparing DRAGEN VLRD with the GATK variant caller for two homologous regions. Regions in the hs37d5 reference genome with >98% similarity were analyzed. Random variants were introduced into the genome based on real sample data provided by 10x Genomics, and synthetic 100 bp paired-end reads were generated from the reference. Results show that the DRAGEN VLRD Variant Caller has better sensitivity and accuracy in duplications than GATK.
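For readers unfamiliar with how a ROC curve for a variant caller is built, here is a minimal sketch. It assumes each call has a quality score and has already been labeled true or false against the introduced (truth) variants. The function name and the example calls are hypothetical and do not reproduce the data behind the curves described above.

```python
def roc_points(calls, n_truth):
    """Return (false_positives, sensitivity) points for a variant caller.

    calls:   iterable of (quality, is_true) pairs, one per variant call,
             where is_true marks a call that matches a truth variant.
    n_truth: total number of truth variants in the analyzed regions.

    Calls are visited from highest to lowest quality, which is equivalent
    to lowering the quality threshold one call at a time.
    """
    points = []
    true_pos = false_pos = 0
    for _, is_true in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos / n_truth))
    return points

# Hypothetical calls: (quality score, matches a truth variant?)
example_calls = [(50, True), (40, True), (30, False), (20, True)]
print(roc_points(example_calls, n_truth=4))
# → [(0, 0.25), (0, 0.5), (1, 0.5), (1, 0.75)]
```

A caller whose curve reaches higher sensitivity at the same number of false positives is the better performer, which is the kind of comparison the ROC curves make between DRAGEN VLRD and GATK.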