An Application of the Feature Selection Method Based on Pairwise Correlation for Diagnosis of Ovarian Cancer with Machine Learning

Many machine learning classification problems are high-dimensional, so efficient and effective feature selection algorithms are needed to identify the most relevant features in a dataset. Because gene data inherently contains a very large number of features, it is frequently used in feature selection studies. The literature also shows that gene selection plays a significant role in cancer detection. Ovarian cancer is one of the cancer types for which treatment is highly successful when the disease is caught early. This study therefore aims to select the genes that are most descriptive for cancer diagnosis, using a publicly available ovarian cancer dataset. Feature selection for classification was performed with a method based on pairwise correlation, which is very new in the literature. First, feature selection was applied and the 38 genes most descriptive of cancer were identified. Classification was then carried out with eight different classification algorithms. The lowest accuracy, 96.44%, was obtained with the Extra Tree classifier, while the highest accuracy, 100%, was obtained with the Multi-Layer Perceptron, Stochastic Gradient Descent, Logistic Regression, and Support Vector Machine classifiers. Although many studies on feature selection exist in the literature, this study is the first application of this method and is therefore expected to contribute to the literature.
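To make the idea of correlation-driven feature selection concrete, the following is a minimal sketch of a generic correlation filter. It is not the pairwise-correlation method applied in the study, whose details are not given here. It simply ranks features by their absolute Pearson correlation with the class label and skips any feature that is almost collinear with one already kept. The function names, the `max_redundancy` threshold, and the synthetic data are all assumptions made for illustration; the study itself keeps 38 genes, whereas this toy example keeps 3.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(columns, labels, k, max_redundancy=0.9):
    """Greedy filter (illustrative only): most label-correlated features first,
    skipping any feature whose |correlation| with an already selected feature
    reaches max_redundancy. Returns the indices of at most k columns."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: relevance[j], reverse=True)
    selected = []
    for j in order:
        if len(selected) == k:
            break
        if all(abs(pearson(columns[j], columns[s])) < max_redundancy for s in selected):
            selected.append(j)
    return selected


# Synthetic stand-in data (hypothetical, not the ovarian cancer dataset).
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
informative = [y + rng.gauss(0, 0.1) for y in labels]        # tracks the label
duplicate = [v + rng.gauss(0, 0.01) for v in informative]    # near copy of column 0
noise = [[rng.gauss(0, 1) for _ in labels] for _ in range(5)]
columns = [informative, duplicate] + noise

selected = select_features(columns, labels, k=3)
print("selected column indices:", selected)
```

In this sketch only one of the two nearly identical columns survives, which is the usual motivation for looking at correlations between features and not only between each feature and the label. The selected columns would then be passed to whichever classifier is being evaluated.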
