Design of information retrieval experiments: the sufficient topic set size for providing an adequate level of confidence

In the current design of information retrieval (IR) experiments, a sample of 50 topics is generally considered large enough for dependable system evaluation. This article gives a detailed, formal explanation of how the second fundamental theorem of probability, the central limit theorem, can be used to estimate the sufficient size of a topic sample. Using past Text Retrieval Conference (TREC) data, the study finds that, on average, 50 topics are sufficient to provide a confidence level at or above 95% when the null hypothesis of equal population mean average precision (MAP), H0, is rejected for two IR systems whose observed difference in MAP is 0.035 or more; previous empirical research, in contrast, suggests a difference in MAP of 0.05 or more. The study also shows that, for individual system pairs, the sample size required to provide 95% confidence in a declared significance may range from as few as 10 topics to as many as 722. For the design of IR experiments, it therefore agrees with the common view that relying on average figures as a rule of thumb may well be misleading.
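
To illustrate the kind of calculation the central limit theorem makes possible, the following is a minimal sketch, not the article's own procedure. It assumes that per-topic differences in average precision between two systems are independent draws with a known standard deviation, and that the sample mean difference is approximately normal. Under those assumptions, the smallest number of topics n for which a difference d stands out at a given confidence level satisfies n >= (z * s / d)^2. The function name `required_topics` and the standard deviation used in the usage lines are hypothetical and chosen for illustration only.

```python
import math
from statistics import NormalDist


def required_topics(min_diff, sd_diff, confidence=0.95):
    """Normal-approximation estimate of the topic sample size.

    min_diff   -- smallest difference in MAP worth detecting (assumed input)
    sd_diff    -- standard deviation of per-topic AP differences (assumed known)
    confidence -- two-sided confidence level, strictly between 0 and 1
    """
    if min_diff <= 0 or sd_diff <= 0:
        raise ValueError("min_diff and sd_diff must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Smallest integer n with z * sd_diff / sqrt(n) <= min_diff.
    return math.ceil((z * sd_diff / min_diff) ** 2)


# Hypothetical standard deviation of 0.10, for illustration only.
n_large_diff = required_topics(0.05, 0.10)
n_small_diff = required_topics(0.035, 0.10)
```

With the spread held fixed, a smaller target difference requires more topics, and a system pair with a larger spread in per-topic differences requires more topics for the same target. This is consistent with the abstract's point that a single average figure can mislead for individual system pairs.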

Turkish Journal of Electrical Engineering and Computer Science
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK