Estimating spatiotemporal focus of documents using entropy with PMI

Many text documents are spatiotemporal in nature, i.e. the contents of a document can be mapped to a specific time period or location. For example, a news article about the French Revolution can be mapped to the year 1789 as its time and France as its place. Identifying the time period and location associated with a document can be useful for various downstream applications such as document reasoning or spatiotemporal information retrieval. In this paper, temporal entropy with pointwise mutual information (PMI) is proposed to estimate the temporal focus of a document. PMI is used to measure the association of words with time expressions. Moreover, a word’s temporal entropy is used as a weight on its association with a time point, and the single time point with the highest overall score is chosen as the focus time of the document. The proposed method is generic in the sense that it can also be applied to spatial focus estimation of documents. In the case of spatial entropy with PMI, PMI is used to calculate the association between words and place entities. The effectiveness of our proposed methods for spatiotemporal focus estimation is evaluated on diverse datasets of text documents. The experimental evaluation confirms the superiority of our proposed temporal and spatial focus estimation methods.
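The scoring idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact formulas, so the weighting scheme (one minus normalized entropy), the treatment of unseen word-time pairs, and all function names (`train`, `pmi`, `entropy_weight`, `focus_time`) are assumptions made for illustration. The toy co-occurrence data is invented.

```python
import math
from collections import Counter


def train(pairs):
    """Build count tables from (word, time_point) co-occurrences.

    Assumption: each pair stands for a word seen in the same context
    as a normalized time expression (e.g. a year).
    """
    joint = Counter(pairs)
    word_totals, time_totals = Counter(), Counter()
    for (word, time_point), count in joint.items():
        word_totals[word] += count
        time_totals[time_point] += count
    total = sum(joint.values())
    return joint, word_totals, time_totals, total


def pmi(word, time_point, model):
    """Pointwise mutual information log2(p(w,t) / (p(w) p(t)))."""
    joint, word_totals, time_totals, total = model
    count = joint.get((word, time_point), 0)
    if count == 0:
        return 0.0  # assumption: unseen pairs contribute nothing
    return math.log2(count * total / (word_totals[word] * time_totals[time_point]))


def entropy_weight(word, model):
    """Assumed weight: 1 - H(word) / log2(#time points).

    Words spread evenly over time get weight near 0; words tied to a
    single time point get weight 1.
    """
    joint, word_totals, time_totals, _ = model
    if word_totals[word] == 0:
        return 0.0
    if len(time_totals) < 2:
        return 1.0
    entropy = 0.0
    for time_point in time_totals:
        count = joint.get((word, time_point), 0)
        if count:
            p = count / word_totals[word]
            entropy -= p * math.log2(p)
    return 1.0 - entropy / math.log2(len(time_totals))


def focus_time(doc_words, model):
    """Return the time point with the highest entropy-weighted PMI sum."""
    _, _, time_totals, _ = model
    scores = {
        t: sum(entropy_weight(w, model) * pmi(w, t, model) for w in doc_words)
        for t in time_totals
    }
    return max(scores, key=scores.get)


# Invented toy data: (word, year) co-occurrences.
pairs = (
    [("bastille", 1789)] * 3
    + [("revolution", 1789)] * 2
    + [("revolution", 1917)] * 2
    + [("lenin", 1917)] * 3
)
model = train(pairs)
print(focus_time(["bastille", "revolution"], model))
```

In this toy corpus "revolution" is split evenly between the two years, so its weight is zero and the decision rests on "bastille". Replacing the time points with place entities gives the analogous spatial sketch.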
