Investigation of Luhn’s claim on information retrieval

In this study, we show how Luhn’s claim about the degree of importance of a word in a document can be related to information retrieval. His basic idea is transformed into z-scores, which serve as term weights for modeling term frequency (tf) within documents. The Luhn-based models presented in this paper serve as the TF component of the proposed TF × IDF weighting schemes. The resulting term weighting functions are applied to the TREC-6, -7, and -8 databases. The experimental results support Luhn’s claim: mean average precision (MAP) is high for terms whose frequencies lie around the mean term frequency within a document. On the other hand, a weighting that sharply discriminates between the importance of low or high frequencies and that of medium frequencies degrades retrieval performance. Therefore, a weighting scheme (TF) that is directly proportional to tf is likely to achieve high retrieval performance, provided that it optimally reflects the differences in importance across tf values and optimally eliminates terms with high frequencies.
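
To make the z-score idea concrete, the following is a minimal sketch and not the weighting functions evaluated in the study. It assumes that a term's z-score is computed from the mean and population standard deviation of the term frequencies within a single document. The function names `z_scores`, `bell_tf`, and `tf_idf`, the Gaussian-shaped TF component, and the plain logarithmic IDF are all hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def z_scores(tokens):
    """Map each term to the z-score of its frequency within one document.

    Assumption: mean and (population) standard deviation are taken over
    the frequencies of the distinct terms in this document.
    """
    tf = Counter(tokens)
    counts = list(tf.values())
    mean = sum(counts) / len(counts)
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    if std == 0:
        # All terms occur equally often; no term deviates from the mean.
        return {term: 0.0 for term in tf}
    return {term: (c - mean) / std for term, c in tf.items()}


def bell_tf(z):
    """Hypothetical TF component that peaks at the mean frequency (z = 0)
    and decays for both rare and very frequent terms."""
    return math.exp(-0.5 * z * z)


def tf_idf(tokens, doc_freq, n_docs):
    """Combine the hypothetical TF component with a plain log IDF.

    doc_freq maps a term to the number of documents containing it;
    both doc_freq and n_docs are illustrative inputs.
    """
    return {
        term: bell_tf(z) * math.log(n_docs / doc_freq[term])
        for term, z in z_scores(tokens).items()
    }


if __name__ == "__main__":
    doc = ["the"] * 3 + ["retrieval"] * 2 + ["luhn"]
    df = {"the": 100, "retrieval": 10, "luhn": 2}
    for term, weight in tf_idf(doc, df, n_docs=100).items():
        print(term, round(weight, 3))
```

A bell-shaped TF of this kind discriminates strongly between medium and low or high frequencies, which is the type of weighting the abstract reports as degrading retrieval performance. A variant that grows with tf while damping very frequent terms would be closer to the conclusion stated above.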
