Evaluation of genome scaffolding tools using pooled clone sequencing

DNA sequencing technologies hold great promise for generating information that will help scientists understand how the genome affects human health and organismal evolution. Generating raw genome sequence data has become cheaper and faster, but also more error-prone, and assembling such data into high-quality finished genome sequences remains challenging. Many genome assembly tools are available, but they differ in performance and in their final output. More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. Here we evaluate the accuracy of several genome scaffolding algorithms using two different types of data generated from the genome of the same human individual: whole genome shotgun sequencing (WGS) and pooled clone sequencing (PCS). We observe that using PCS data can yield better assemblies than using WGS data alone. However, current scaffolding algorithms are designed only for WGS data, and developing PCS-aware scaffolding algorithms remains an open problem.
