Analysis and price prediction of secondhand vehicles in Türkiye with big data and machine learning techniques

The secondhand vehicle market in Türkiye has always been active. Analyzing how common attributes such as brand, model, and fuel type are in this market, and how strongly each affects price, makes this information usable. Because prices vary with a vehicle's attributes, machine learning techniques can be used to predict them and to help users set a price when buying or selling a vehicle. Price prediction is a regression (function estimation) task in data mining. Because the number of secondhand vehicles is large, the analyses in this study were carried out on big data systems, namely Apache Spark and its machine learning library. Linear regression, decision tree regression, random forest regression, gradient-boosted tree (GBT) regression, and isotonic regression were used for price prediction. Random forest performed best, with an RMSE of 21435.09 and an R² of 0.887. Statistical tests of whether the RMSE and R² values obtained with random forest differ significantly from those of the other algorithms led to the conclusion that the random forest results are better. This is attributed to the algorithm's training over multiple decision trees, its flexibility, and its strong hyperparameters.

Keywords:

Apache Spark, Big data, Regression algorithms
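
The abstract compares the algorithms by RMSE (lower is better) and R² (closer to 1 is better). The following is a minimal sketch of how these two metrics are computed from actual and predicted prices. It uses plain Python rather than Apache Spark, and the function names and the four price values are assumptions made up for illustration; they are not taken from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, invented for this example (not data from the study).
actual_prices = [100_000.0, 150_000.0, 200_000.0, 250_000.0]
predicted_prices = [110_000.0, 140_000.0, 210_000.0, 240_000.0]

print(f"RMSE = {rmse(actual_prices, predicted_prices):.2f}")
print(f"R2   = {r2(actual_prices, predicted_prices):.3f}")
```

RMSE is expressed in the same unit as the price, so its magnitude depends on the price scale of the data set, whereas R² is unitless. This is one reason to report both when comparing regression models.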
