New metrics for clustering of identical products over imperfect data

This paper introduces the concept of product identity-clustering, together with new similarity metrics and new performance metrics for web-crawled products. Product identity-clustering is defined here as the clustering of identical products, e.g., for price comparison purposes. Products crawled blindly from web sources, e.g., online marketplaces, come in different description formats, so the features describing the same product differ in both number and representation. This yields imperfect feature vectors: the vectors are not uniform in length or structure, contain features of various data types (numeric, categorical), and have unknown structures. Furthermore, the product information usually contains redundant, missing, or faulty data, which are regarded as noise here. Product identity-clustering becomes challenging when the vectors' metadata are not known in advance and the feature vectors are both imperfect and noisy. In this paper, the product identity-clustering concept is introduced as a new mining metric in e-commerce. Novel similarity metrics are then introduced to improve on the product identity-clustering performance of legacy metrics. Finally, novel performance metrics are proposed to measure the performance of identity-clustering algorithms. Using these metrics, the legacy similarity metrics (Euclidean, cosine, etc.) are compared with the proposed similarity metrics. The results show that legacy metrics do not successfully discriminate identical web-crawled products and that the proposed metrics perform better on the product identity-clustering problem.
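To make the notion of imperfect feature vectors concrete, the following is a minimal illustrative sketch and not the similarity metrics proposed in this paper. It assumes each crawled product is a key-value record whose keys (the metadata) cannot be trusted, and it swaps in a plain Jaccard overlap of value tokens as a stand-in similarity. The record contents and the names `value_tokens` and `jaccard` are hypothetical.

```python
import re

def value_tokens(record):
    """Collect lowercased alphanumeric tokens from all feature values.

    Keys are ignored on purpose: the vector structure is assumed unknown,
    so only the values are compared (an assumption of this sketch).
    """
    tokens = set()
    for value in record.values():
        tokens.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return tokens

def jaccard(a, b):
    """Jaccard overlap of the value tokens of two product records."""
    ta, tb = value_tokens(a), value_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical records: the same product from two sources, plus a near miss.
shop_a = {"brand": "Acme", "model": "X200", "color": "Black", "weight_g": 450}
shop_b = {"title": "ACME X200 black", "warranty": "2 years"}
shop_c = {"brand": "Acme", "model": "X300", "color": "Black"}

same_product = jaccard(shop_a, shop_b)       # 3 shared tokens out of 6
different_product = jaccard(shop_a, shop_c)  # 2 shared tokens out of 5
```

The records differ in length, keys, and data types, so a Euclidean or cosine distance cannot be applied to them without first aligning the features. The narrow margin between the two scores also hints at why identical and merely similar products are hard to separate.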

