Estimating spatiotemporal focus of documents using entropy with PMI
Many text documents are spatiotemporal in nature, i.e., the contents of a document can be mapped to a specific time period or location. For example, a news article about the French Revolution can be mapped to the year 1789 as its time and France as its place. Identifying the time period and location associated with a document can be useful for various downstream applications such as document reasoning or spatiotemporal information retrieval. In this paper, temporal entropy with pointwise mutual information (PMI) is proposed to estimate the temporal focus of a document. PMI is used to measure the association of words with time expressions. Moreover, a word’s temporal entropy is used as a weight on its association with a time point, and the single time point with the highest overall score is chosen as the focus time of the document. The proposed method is generic in the sense that it can also be applied to spatial focus estimation of documents. In the case of spatial entropy with PMI, PMI is used to calculate the association between words and place entities. The effectiveness of our proposed methods for spatiotemporal focus estimation is evaluated on diverse datasets of text documents. The experimental evaluation confirms the superiority of our proposed temporal and spatial focus estimation methods.
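The scoring idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact formulas, so the co-occurrence counting, the entropy-to-weight mapping (one minus normalized entropy), the weighted-sum score, the toy corpus, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def build_counts(corpus):
    """Count words, time points, and their co-occurrences.

    corpus: iterable of (words, time_points) pairs, one per context
    (for example, one sentence). This input shape is an assumption.
    """
    word_count, time_count, pair_count = Counter(), Counter(), Counter()
    n = 0
    for words, times in corpus:
        n += 1
        ws, ts = set(words), set(times)
        word_count.update(ws)
        time_count.update(ts)
        pair_count.update((w, t) for w in ws for t in ts)
    return word_count, time_count, pair_count, n

def pmi(w, t, word_count, time_count, pair_count, n):
    """Standard PMI: log( p(w, t) / (p(w) * p(t)) ); 0.0 if never co-occurring."""
    joint = pair_count[(w, t)]
    if joint == 0:
        return 0.0
    return math.log((joint * n) / (word_count[w] * time_count[t]))

def entropy_weight(w, time_count, pair_count):
    """Assumed weighting: 1 - H(w) / log(T), where H(w) is the entropy of
    the word's distribution over the T known time points. Words tied to a
    single time point get weight 1; evenly spread words get weight near 0."""
    counts = [pair_count[(w, t)] for t in time_count if pair_count[(w, t)] > 0]
    total = sum(counts)
    if total == 0 or len(time_count) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts)
    return 1.0 - h / math.log(len(time_count))

def focus_time(doc_words, word_count, time_count, pair_count, n):
    """Score each time point by the entropy-weighted sum of PMI values of
    the document's words, then return the highest-scoring time point."""
    scores = {}
    for t in time_count:
        scores[t] = sum(
            entropy_weight(w, time_count, pair_count)
            * pmi(w, t, word_count, time_count, pair_count, n)
            for w in set(doc_words)
        )
    return max(scores, key=scores.get), scores

# Hypothetical toy corpus: each entry pairs context words with a year.
corpus = [
    (["revolution", "bastille", "king"], [1789]),
    (["revolution", "bastille"], [1789]),
    (["revolution", "tsar"], [1917]),
    (["king", "treaty"], [1815]),
]
counts = build_counts(corpus)
doc = ["bastille", "revolution", "king"]
best, scores = focus_time(doc, *counts)
print(best)  # 1789 on this toy corpus
```

Under the same assumptions, the spatial variant would replace the years with place entities (for example, "France") and leave the rest of the scoring unchanged.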