Characterizing microsatellite polymorphisms using assembly-based and mapping-based tools

Microsatellite polymorphism has always been a challenge for genome assembly and sequence alignment due to sequencing errors, short read lengths, and the high incidence of polymerase slippage in microsatellite regions. Although microsatellite variations carry valuable information, their characterization has not gained enough attention to become a routine step in genome sequence analysis pipelines. After the completion of the 1000 Genomes Project, which aimed to establish the most detailed genetic variation catalog for humans, the consortium released only two microsatellite prediction sets generated by two tools. Many other large research efforts have failed to shed light on microsatellite variations. We evaluated the performance of three different local assembly methods in three different experimental settings, focusing on genotype-based performance, coverage impact, and preprocessing including flanking regions. All these experiments supported our initial expectations about assembly. We also demonstrate that overlap-layout-consensus (OLC)-based assembly methods show higher sensitivity in microsatellite variant calling than a de Bruijn graph-based approach. We conclude that assembly with OLC is the better method for genotyping microsatellites. Our pipeline is available at https://github.com/gulfemd/STRAssembly.
