Performance Comparison of Data Mining Methods in the Diagnosis of Complex Diseases

Genome-wide association studies (GWAS) produce data that are both large in volume and high-dimensional, so data mining methods are needed to relate genotype profiles to diseases and to build diagnostic models from those relations. This study used two data sets: a melanoma data set of 1025 cases and 531 controls, and a multi-ethnic prostate cancer data set of 2325 cases and 2350 controls. The disease-associated SNP profiles were analyzed with several data mining methods, including Decision Trees, Naive Bayes, and Support Vector Machines. For both diseases, the support vector machine gave the best performance of the methods tested, reaching an accuracy of 75.68% on the prostate cancer data set and 78.6% on the melanoma data set.
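To put the reported accuracies in context, the following minimal sketch computes the majority-class baseline for each data set, that is, the accuracy of a trivial classifier that always predicts the larger group. It uses only the case and control counts and the accuracy figures given above. The function name `majority_baseline` is hypothetical, and the comparison assumes the reported values are plain classification accuracy measured on data with the same class balance as the full data sets, which the abstract does not state.

```python
def majority_baseline(cases: int, controls: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(cases, controls) / (cases + controls)


# (cases, controls, reported accuracy) taken from the abstract.
datasets = {
    "melanoma": (1025, 531, 0.786),
    "prostate cancer": (2325, 2350, 0.7568),
}

for name, (cases, controls, reported) in datasets.items():
    baseline = majority_baseline(cases, controls)
    # Gain over the trivial baseline; an assumption-laden comparison,
    # since the evaluation protocol is not given in the abstract.
    gain = reported - baseline
    print(f"{name}: baseline={baseline:.4f} reported={reported:.4f} gain={gain:.4f}")
```

Because the melanoma data set is imbalanced while the prostate cancer data set is nearly balanced, the same accuracy value corresponds to a different margin over the trivial baseline in each case.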

