Performance Comparison of Data Mining Methods in the Diagnosis of Complex Diseases

Genome-wide association studies (GWAS) produce data that are both large in volume and high-dimensional. Data mining methods are therefore needed to relate genetic profiles to diseases and, from there, to build diagnostic models. This study used two data sets: a melanoma data set of 1025 cases and 531 controls, and a multi-ethnic prostate cancer data set of 2325 cases and 2350 controls. The SNP profiles associated with these diseases were analyzed with several data mining methods, including Decision Trees, Naive Bayes, and Support Vector Machines. For both diseases, the support vector machine gave the best performance among the methods used, reaching an accuracy of 75.68% on the prostate cancer data set and 78.6% on the melanoma data set.

Keywords:

data mining, decision tree, support vector machine, naive Bayes, cancer, genome-wide association
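
To put the reported accuracies in context, the short sketch below computes the majority-class baseline for each data set, that is, the accuracy a trivial classifier would reach by always predicting the larger class. The case and control counts and the reported accuracies are taken from the abstract; the baseline comparison itself is an illustration added here and is not described as part of the study, and the names `DataSet`, `majority_baseline`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Counts and reported SVM accuracy as stated in the abstract."""
    name: str
    cases: int
    controls: int
    reported_accuracy: float  # percent

    def majority_baseline(self) -> float:
        """Accuracy (percent) of always predicting the larger class."""
        total = self.cases + self.controls
        return 100.0 * max(self.cases, self.controls) / total


# Figures from the abstract; nothing else about the data is assumed.
DATA_SETS = [
    DataSet("melanoma", cases=1025, controls=531, reported_accuracy=78.6),
    DataSet("prostate cancer", cases=2325, controls=2350, reported_accuracy=75.68),
]


def summarize(ds: DataSet) -> tuple[str, float, float]:
    """Return (name, baseline %, reported accuracy minus baseline in points)."""
    baseline = ds.majority_baseline()
    return ds.name, round(baseline, 2), round(ds.reported_accuracy - baseline, 2)


if __name__ == "__main__":
    for ds in DATA_SETS:
        name, baseline, margin = summarize(ds)
        print(f"{name}: baseline {baseline}%, reported {ds.reported_accuracy}%, "
              f"margin {margin} points")
```

Because the melanoma data set is imbalanced (about two cases per control) while the prostate cancer data set is nearly balanced, the same headline accuracy corresponds to a different margin over the trivial baseline in each case.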
