A fast text similarity measure for large document collections using multireference cosine and genetic algorithm

One of the critical factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate DND documents from the index, a search engine needs a swift and reliable DND text document detection system. Traditional approaches to this problem, such as brute force comparisons or simple hash-based algorithms, are not suitable as they are not scalable and are not capable of detecting near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced, which is fast, scalable, and reliable and needs less storage space. The proposed method is examined on standard text document datasets such as CiteseerX, Enron, Gold Set of Near-duplicate News Articles, and other similar datasets. The results are promising and comparable with the best cutting-edge algorithms considering accuracy and performance. The proposed method is based on the idea of using reference texts to generate signatures for text documents. The novelty of this paper is the use of genetic algorithms to generate better reference texts.

___

  • [1] Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the web. Computer Networks and ISDN Systems. 1997; 29 (8-13): 1157-1166.
  • [2] Fetterly D, Manasse M, Najork M. On the evolution of clusters of near-duplicate web pages. Journal of Web Engineering 2003; 2 (4): 228-246.
  • [3] Henzinger M. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Las Vegas, NV, USA; 2006. pp. 284-291.
  • [4] Manku GS, Jain A, Das Sarma A. Detecting near-duplicates for web crawling. In: Proceedings of the 16th International Conference on World Wide Web; Las Vegas, NV, USA; 2007. pp. 141-150.
  • [5] Xiao C, Wang W, Lin X, Yu JX, Wang G. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS) 2011; 36 (3): 15.
  • [6] Conrad JG, Guo XS, Schriber CP. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management; Las Vegas, NVa, USA; 2003. pp. 443-452.
  • [7] Theobald M, Siddharth J, Paepcke A. Spotsigs: Robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Las Vegas, NV, USA; 2008. pp. 563-570.
  • [8] Mohammadi H, Nikoukaran A. Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections. arXiv preprint arXiv:181003099. 2018.
  • [9] Ntoulas A, Cho J, Olston C. What’s new on the web? The evolution of the web from a search engine perspective. In: Proceedings of the 13th International Conference on World Wide Web; Las Vegas, NV, USA; 2004. pp. 1-12.
  • [10] Croft WB, Metzler D, Strohman T. Search Engines: Information Retrieval in Practice. New York, NY, USA: Addison-Wesley Reading, 2010.
  • [11] Koppula HS, Leela KP, Agarwal A, Chitrapura KP, Garg S et al. Learning URL patterns for webpage de-duplication. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining; Las Vegas, NV, USA; 2010. pp. 381-390.
  • [12] Bar-Yossef Z, Keidar I, Schonfeld U. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web 2009; 3 (1): 3.
  • [13] Dasgupta A, Kumar R, Sasturkar A. De-duping URLs via rewrite rules. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Las Vegas, NV, USA; 2008. pp. 186-194.
  • [14] Shivakumar N, Garcia-Molina H. SCAM: A copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries; Austin, TX; 1995. pp. 1-4.
  • [15] Manber U. Finding similar files in a large file system. Usenix Winter 1994; 94: 1-10.
  • [16] Heintze N, Labs B. Scalable document fingerprinting. In: 1996 USENIX Workshop on Electronic Commerce; Oakland, CA, USA; 1996. pp. 1-12.
  • [17] Brin S, Davis J, Garcia-Molina H. Copy detection mechanisms for digital documents. In: ACM Special Interest Group on Management of Data (SIGMOD) of the Association for Computing Machinery Record; Las Vegas, NV, USA; 1995. pp. 398-409.
  • [18] Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing; Las Vegas, NV, USA; 1998. pp. 604-613.
  • [19] Schleimer S, Wilkerson DS, Aiken A. Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data; Las Vegas, NV, USA; 2003. pp. 76-85.
  • [20] Chowdhury A, Frieder O, Grossman D, McCabe MC. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems 2002; 20 (2): 171-191.
  • [21] Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data; Las Vegas, NV, USA; 2004. pp. 743-754.
  • [22] Charikar MS. Similarity estimation techniques from rounding algorithms. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing; Las Vegas, NV, USA; 2002. pp. 380-388.
  • [23] Pamulaparty L, Rao CG. A novel approach to perform document clustering using effectiveness and efficiency of simhash. International Journal of Engineering and Advanced Technology 2013; 2 (3): 312-315.
  • [24] Hajishirzi H, Yih Wt, Kolcz A. Adaptive near-duplicate detection via similarity learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval; Las Vegas, NV, USA; 2010. pp. 419-426.
  • [25] Lin J. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval; Las Vegas, NV, USA; 2009. pp. 155-162.
  • [26] Vernica R, Carey MJ, Li C. Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data; Las Vegas, NV, USA; 2010. pp. 495-506.
  • [27] Pamulaparty L, Rao CG, Rao MS. A near-duplicate detection algorithm to facilitate document clustering. International Journal of Data Mining & Knowledge Management Process 2014; 4 (6): 39.
  • [28] Varol C, Hari S. Detecting near-duplicate text documents with a hybrid approach. Journal of Information Science 2015; 41 (4): 405-414.
  • [29] Zhang X, Yao Y, Ji Y, Fang B. Effective and fast near duplicate detection via signature-based compression metrics. Mathematical Problems in Engineering 2016; 2016: 1-12.
  • [30] Cilibrasi R, Vitányi PM. Clustering by compression. IEEE Transactions on Information Theory 2005; 51 (4): 1523-1545.
  • [31] Li M, Paul MP. An Introduction to Kolmogorov Complexity and Its Applications. New York, NY, USA: Springer Publishing Company, 2008.
  • [32] Liu H, Hu Z, Mian A, Tian H, Zhu X. A new user similarity model to improve the accuracy of collaborative filtering. Knowledge-Based Systems 2014; 56: 156-166.
  • [33] Ide N. The American National Corpus: Then, now, and tomorrow. In: Selected Proceedings of the 2008 HC-SNet Workshop on Designing an Australian National Corpus; Australia; 2009. pp. 108-113.
  • [34] Lang K. Newsweeder: Learning to filter netnews. In: Machine Learning Proceedings; Tahoe City, CA, USA; 1995. pp. 331-339.
  • [35] Cho J, Garcia-Molina H, Haveliwala T, Lam W, Paepcke A et al. Stanford WebBase components and appfcations. ACM Transactions on Internet Technology 2006; 6 (2): 153-186.