Abstract or Description
Evaluating retrieval systems in a controlled environment with a large set of topics has been the core paradigm in the information retrieval community. Voorhees and Buckley proposed estimating the reliability of retrieval experiments by calculating the probability of reaching a wrong effectiveness judgment between two retrieval systems across two retrieval experiments; this paper calls that probability the Retrieval Experiment Error Rate (REER). They showed how topic set size affects the reliability of retrieval experiments. However, their REER model was justified empirically, without a derivation from statistical principles. We fill this gap and show that REER can indeed be derived from statistical principles. Based on the derived model, we can explain why a successful experiment design depends on factors including a sufficient number of topics, a large enough difference in measurement scores between systems, and a homogeneous distribution of retrieval scores across topics and systems, which reduces the variance of the score differences.
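The dependence on these three factors can be illustrated with a minimal sketch. It is not the derivation from the paper. It assumes, purely for illustration, that per-topic score differences between the two systems are independent and normally distributed with a fixed mean and standard deviation, and that the two experiments use independent topic sets of equal size. The function names and parameters (`wrong_sign_probability`, `disagreement_rate`, `mean_diff`, `std_diff`, `n_topics`) are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wrong_sign_probability(mean_diff: float, std_diff: float, n_topics: int) -> float:
    """Probability that one experiment ranks the two systems in the wrong order.

    Assumption (not from the paper): the mean score difference over n_topics
    topics is normal with mean mean_diff and standard deviation
    std_diff / sqrt(n_topics).
    """
    z = abs(mean_diff) * math.sqrt(n_topics) / std_diff
    return normal_cdf(-z)


def disagreement_rate(mean_diff: float, std_diff: float, n_topics: int) -> float:
    """Probability that two independent experiments disagree on the better system.

    Exactly one of the two experiments must produce the wrong ordering.
    """
    p = wrong_sign_probability(mean_diff, std_diff, n_topics)
    return 2.0 * p * (1.0 - p)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from any experiment.
    for n in (10, 25, 50, 100):
        print(n, disagreement_rate(mean_diff=0.02, std_diff=0.1, n_topics=n))
```

Under these assumptions the disagreement rate falls as the number of topics grows, as the mean score difference grows, or as the standard deviation of the score differences shrinks, which matches the qualitative claims in the abstract.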
Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '05, 637-638.