Automatic characterization of copy number polymorphism using high throughput sequencing

Genome structural variation, broadly defined as alterations longer than 50 bp, is an important source of genetic variation among humans, including variants that cause complex diseases such as autism, developmental delay, and schizophrenia. Although there has been considerable progress in characterizing structural variation since the beginning of the 1000 Genomes Project, one form of structural variation, segmental duplications (SDs), has remained largely understudied in large cohorts. This is mostly because SDs cannot be accurately discovered using the alignment files generated with standard read mapping tools. Instead, they can only be found when multiple map locations are considered for each read. Only a single algorithm is currently available for SD discovery, and it comprises various tools and scripts that are not portable and are difficult to use. Additionally, this algorithm relies on a priori information about regions in which no structural variation has been discovered across a large number of genomes. Therefore, there is a need for fully automated, portable, and user-friendly tools to make SD characterization a part of routine genome analyses. Here we introduce such an algorithm and its efficient implementation, called mrCaNaVaR, which aims to fill this gap in the genome analysis toolbox.
