Detailed evaluation of cancer sequencing pipelines in different microenvironments and heterogeneity levels

The importance of next-generation sequencing (NGS) in cancer research is rising as this key technology becomes more accessible to researchers. The sequence data produced by NGS technologies must be processed by several bioinformatics algorithms within a pipeline to convert raw data into meaningful information. Mapping and variant calling are the two main steps of these analysis pipelines, and many algorithms are available for each step. Detailed benchmarking of these algorithms in different scenarios is therefore crucial for the efficient use of sequencing technologies. In this study, we compared the performance of twelve pipelines (three mapping algorithms combined with four variant discovery algorithms), run with recommended settings, in capturing single nucleotide variants. Across heterogeneity levels in both real and simulated samples, we observed significant discrepancy in variant calls among the tested pipelines, with overall high specificity and low sensitivity. In addition to evaluating the pipelines individually, we constructed combinations of pipelines and tested their performance. In these analyses, certain pipelines complemented each other much better than others, and the combinations outperformed individual pipelines. This suggests that adhering to a single pipeline is not optimal for cancer sequencing analysis and that sample heterogeneity should be considered in algorithm optimization.
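As a rough illustration of how combinations of pipelines might be scored, the sketch below treats each pipeline's output as a set of called variant sites and evaluates the union of every pair against a known truth set. This is a minimal sketch under stated assumptions, not the procedure used in the study: the pipeline names, the toy call sets, the choice of union as the combination rule, and the total number of assessed sites (`n_sites`) are all hypothetical.

```python
from itertools import combinations


def sensitivity(calls, truth):
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return len(calls & truth) / len(truth)


def specificity(calls, truth, n_sites):
    """Fraction of non-variant sites left uncalled: TN / (TN + FP)."""
    false_positives = len(calls - truth)
    negatives = n_sites - len(truth)
    return 1 - false_positives / negatives


def best_union_pair(pipeline_calls, truth, n_sites):
    """Return the pair of pipelines whose combined (union) call set scores
    highest on sensitivity, using specificity to break ties."""
    def score(pair):
        merged = pipeline_calls[pair[0]] | pipeline_calls[pair[1]]
        return (sensitivity(merged, truth), specificity(merged, truth, n_sites))
    return max(combinations(pipeline_calls, 2), key=score)


# Hypothetical toy data: variants are (chromosome, position) tuples.
truth = {("chr1", p) for p in (100, 200, 300, 400, 500)}
pipeline_calls = {
    "mapper1+caller1": {("chr1", 100), ("chr1", 200)},
    "mapper1+caller2": {("chr1", 100), ("chr1", 200), ("chr1", 999)},
    "mapper2+caller1": {("chr1", 300), ("chr1", 400)},
}
n_sites = 1000  # assumed number of assessed sites

best = best_union_pair(pipeline_calls, truth, n_sites)
merged = pipeline_calls[best[0]] | pipeline_calls[best[1]]
print(best, sensitivity(merged, truth), specificity(merged, truth, n_sites))
```

In this toy example, two pipelines with largely overlapping calls gain little from being combined, while two pipelines that recover different true variants complement each other. An intersection rule could be substituted for the union to trade sensitivity for specificity.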
