Zülfü ALANOĞLU, M. Ali AKCAYOL

Seed URL Selection and Performance Analysis in Web Crawlers: A Comprehensive Review

The Web is a data repository containing the many kinds of information published on the Internet. The structures that hold this information and are connected to one another by hyperlinks are called web pages. Web crawlers are programs that traverse the Web by following the hyperlinks on web pages and download the pages they reach. The performance of a search engine therefore depends on the performance of its web crawler. The performance metrics, scope, and seed URL selection methods of web crawlers are the most important factors affecting that performance. This study presents a comprehensive review and analysis of the performance, scope, and seed URL usage methods of web crawlers, which we classify into six categories: general, focused, incremental, hidden, mobile, and distributed. In addition, the performance measurements reported for each crawler in various studies are compared.
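As an illustration of why seed URL selection affects scope, the following minimal sketch runs a breadth-first crawl over a small in-memory link graph and reports the fraction of pages reached. The graph `LINK_GRAPH`, the function names `crawl` and `coverage`, and the `example` host names are all hypothetical and are not taken from any of the reviewed crawlers. No network access is performed, and "downloading" a page simply means visiting its entry in the graph.

```python
from collections import deque

# Hypothetical toy link graph (an assumption for illustration):
# each page URL maps to the URLs it links to.
LINK_GRAPH = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/2"],
    "a.example/2": ["a.example/"],
    "b.example/": ["b.example/1"],
    "b.example/1": [],
}


def crawl(seeds, link_graph, max_pages=None):
    """Breadth-first crawl from the seed URLs; returns pages in visit order."""
    frontier = deque()
    seen = set()
    for url in seeds:
        # Ignore seeds that are unknown to the graph or repeated.
        if url in link_graph and url not in seen:
            seen.add(url)
            frontier.append(url)

    visited = []
    while frontier and (max_pages is None or len(visited) < max_pages):
        url = frontier.popleft()
        visited.append(url)  # stands in for downloading the page
        for link in link_graph[url]:
            # Follow only hyperlinks to known pages not yet queued.
            if link in link_graph and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


def coverage(seeds, link_graph):
    """Fraction of all pages in the graph reached from the given seeds."""
    return len(crawl(seeds, link_graph)) / len(link_graph)


if __name__ == "__main__":
    print(coverage(["a.example/"], LINK_GRAPH))
    print(coverage(["a.example/", "b.example/"], LINK_GRAPH))
```

In this toy graph the pages under `b.example` are not linked from `a.example`, so a crawl seeded only with `a.example/` cannot reach them, while adding `b.example/` as a second seed reaches every page. Real crawlers face the same effect at much larger scale, which is one reason seed selection is treated as a performance factor.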

Keywords:

Web crawlers, Web pages, Scope expansion, Seed URL

