Onur ÇAKIRGÖZ, Süleyman SEVİNÇ

Organization of Variation Based Personal Genetic Data with Relational Database

Relational databases are widely used in hospitals and clinics to store patient records and laboratory test results. As sequencing technologies have advanced, sequencing costs have fallen considerably. At the same time, the number of personalized medicine applications is growing steadily, and with it the volume of personal genetic data that must be stored and queried. Although relational databases are well suited to storing patient records and test results, additional designs and solutions are needed to store personal genetic data efficiently. This study proposes a novel solution for integrating variation-based personal genetic data into a relational database. As part of this solution, formats were developed for both non-structural and structural variation types, and compression algorithms were applied. The proposed method was tested on real data from 2504 individuals published by the 1000 Genomes Project. The analyses showed that the proposed method requires far less space than is needed to store the raw sequence data.

Keywords: relational database, data format, genetic data, variation
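
To make the idea of variation-based storage concrete, the following is a minimal sketch, not the format developed in the paper. It assumes that each person's differences from a reference genome are kept as (position, reference allele, alternate allele) records, serialized as text, compressed with zlib, and stored as one BLOB per person and chromosome in SQLite. The table name `person_variants`, the text encoding, the person ID, and the sample values are all hypothetical, and only simple non-structural variations are covered.

```python
import sqlite3
import zlib


def encode_variants(variants):
    """Serialize (position, ref, alt) tuples as text and compress with zlib.

    Hypothetical encoding for illustration only: "pos:ref>alt" joined by ";".
    """
    text = ";".join(f"{pos}:{ref}>{alt}" for pos, ref, alt in sorted(variants))
    return zlib.compress(text.encode("ascii"))


def decode_variants(blob):
    """Invert encode_variants: decompress and parse back into tuples."""
    text = zlib.decompress(blob).decode("ascii")
    if not text:
        return []
    variants = []
    for item in text.split(";"):
        pos, change = item.split(":", 1)
        ref, alt = change.split(">", 1)
        variants.append((int(pos), ref, alt))
    return variants


# In-memory database so the sketch has no side effects.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person_variants (
        person_id  TEXT NOT NULL,
        chromosome TEXT NOT NULL,
        variants   BLOB NOT NULL,
        PRIMARY KEY (person_id, chromosome)
    )
    """
)

# Made-up variations for a made-up person; not real project data.
sample = [(1001, "A", "G"), (2050, "T", "TA"), (3300, "CG", "C")]
conn.execute(
    "INSERT INTO person_variants VALUES (?, ?, ?)",
    ("P0001", "1", encode_variants(sample)),
)

row = conn.execute(
    "SELECT variants FROM person_variants WHERE person_id = ? AND chromosome = ?",
    ("P0001", "1"),
).fetchone()
restored = decode_variants(row[0])
```

Storing only the differences from a reference, rather than the full sequence, is what keeps the per-person footprint small in a design of this kind; the compression step reduces it further, at the cost of having to decompress a BLOB before individual variations can be inspected.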
