Characterizing microsatellite polymorphisms using assembly-based and mapping-based tools
Microsatellite polymorphism has long been a challenge for genome assembly and sequence alignment because of sequencing errors, short read lengths, and the high incidence of polymerase slippage in microsatellite regions. Although microsatellite variations carry valuable information, they have not received enough attention to become a routine step in genome sequence analysis pipelines. After the completion of the 1000 Genomes Project, which aimed to establish the most detailed catalog of human genetic variation, the consortium released only two microsatellite prediction sets, generated by two tools. Many other large research efforts have likewise failed to shed light on microsatellite variations. We evaluated the performance of three local assembly methods in three experimental settings, focusing on genotype-based performance, the impact of coverage, and preprocessing that includes flanking regions. The results of all three experiments supported our initial expectations about assembly. We also demonstrate that overlap-layout-consensus (OLC)-based assembly methods achieve higher sensitivity in microsatellite variant calling than a de Bruijn graph-based approach. We conclude that OLC-based assembly is the better method for genotyping microsatellites. Our pipeline is available at https://github.com/gulfemd/STRAssembly.
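As a toy illustration of one commonly cited reason a k-mer graph can struggle with microsatellites, the sketch below compares two hypothetical alleles that differ only in the number of CA repeats. It is not taken from the pipeline above; the sequences, the function names, and the values of k are assumptions chosen for illustration. When k is shorter than the repeat tract, both alleles produce the same set of k-mers, so the node set of a de Bruijn graph alone cannot tell them apart. Once k is large enough to span the tract and both flanks, the sets differ.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_nodes(seq, k):
    """Distinct k-mers of seq: the node set of a k-mer de Bruijn graph."""
    return set(kmers(seq, k))


# Hypothetical flanking sequences and alleles (illustrative only).
left, right = "ACGTTGCA", "TGGCATCG"
allele_10 = left + "CA" * 10 + right   # 10 repeat units
allele_12 = left + "CA" * 12 + right   # 12 repeat units

# k shorter than the repeat tract: the node sets are identical,
# so the repeat count is lost in the graph's node set.
collapsed = de_bruijn_nodes(allele_10, 5) == de_bruijn_nodes(allele_12, 5)

# k long enough to span the tract plus both flanks: the node sets differ.
resolved = de_bruijn_nodes(allele_10, 31) != de_bruijn_nodes(allele_12, 31)

print("k=5 node sets identical:", collapsed)
print("k=31 node sets differ:", resolved)
```

Real de Bruijn assemblers also use k-mer multiplicities and read threading, so this sketch only shows the ambiguity in the bare node set. An overlap-based method works with whole reads, so a read that spans the tract and both flanks keeps the repeat count intact.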