Sohail ASGHAR, Ijaz HUSSAIN

Incremental author name disambiguation using author profile models and self-citations

Author name ambiguity in bibliographic databases (BDs) such as DBLP is a challenging problem that degrades information retrieval quality, citation analysis, and proper attribution to authors. It occurs when several authors share the same name (homonyms) or when one author publishes under several name variants (synonyms). Most existing research disambiguates the whole bibliographic database at once whenever new citations are added. This is time-consuming and discards the effects of any manual disambiguation. Only a few incremental author name disambiguation methods have been proposed, and they produce fragmented clusters, which lowers their accuracy. This paper proposes CAND, a method that uses author profile models and self-citations for incremental author name disambiguation. CAND introduces name indices that improve overall system response by comparing newly inserted references to the indexed author clusters. Author profile models are generated for the existing authors in the BDs and help disambiguate newly inserted references. A comparator function is proposed to resolve incremental author name ambiguity; it uses the strongest bibliometric features, such as coauthors, titles, author profile models, and self-citations. Two real-world data sets, one from Arnetminer and the other from BDBComp, are used to validate CAND's performance. Experimental results show that CAND performs better overall than existing state-of-the-art incremental author name disambiguation methods.
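
The incremental idea described in the abstract can be illustrated with a minimal Python sketch. It is not CAND itself: the blocking key (last name plus first initial), the Jaccard overlaps, the equal weights, the rule that a self-citation is decisive, and the threshold of 0.2 are all assumptions made for illustration, and names such as `name_key`, `AuthorCluster`, `score`, and `insert_reference` are hypothetical. The author profile here is reduced to the set of coauthors and title words seen so far.

```python
from collections import defaultdict


def name_key(name):
    """Blocking key: lowercase last name plus first initial (an assumed scheme)."""
    parts = name.lower().replace(".", " ").split()
    return f"{parts[-1]}_{parts[0][0]}"


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class AuthorCluster:
    """Running profile of one author: coauthors, title words, and paper ids."""

    def __init__(self):
        self.coauthors, self.title_words, self.paper_ids = set(), set(), set()

    def add(self, ref):
        self.coauthors |= ref["coauthors"]
        self.title_words |= ref["title_words"]
        self.paper_ids.add(ref["id"])


def score(ref, cluster):
    """Hypothetical comparator: a self-citation is treated as decisive,
    otherwise average the coauthor and title-word overlap."""
    if ref["cites"] & cluster.paper_ids:
        return 1.0
    return 0.5 * jaccard(ref["coauthors"], cluster.coauthors) + 0.5 * jaccard(
        ref["title_words"], cluster.title_words
    )


def insert_reference(index, ref, threshold=0.2):
    """Assign ref to the best cluster under its name key, or open a new one."""
    candidates = index[name_key(ref["author"])]
    best = max(candidates, key=lambda c: score(ref, c), default=None)
    if best is None or score(ref, best) < threshold:
        best = AuthorCluster()
        candidates.append(best)
    best.add(ref)
    return best


# Name index: blocking key -> list of author clusters sharing that key.
index = defaultdict(list)
```

Only the clusters stored under the new reference's key are compared, so an insertion never touches the rest of the database, and clusters that were corrected by hand stay as they are. A reference that cites a paper already in a cluster joins that cluster even when its coauthors and title words do not overlap, which is one way self-citations could reduce fragmentation.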

