I had the very good fortune to get to know Cyril Cleverdon towards the end of his distinguished career as Librarian at the Cranfield Institute of Technology, where he carried out his invaluable work in creating and promoting the Cranfield Projects on information retrieval performance. These Projects formed the basis for the TREC events in the USA. At the time we first met, in the early 1970s, computers were still somewhat on the distant horizon (especially in the UK), but his insights into the fundamental aspects of information retrieval performance most certainly catalysed my move away from chemistry and metallurgy and towards information science.
I was therefore especially interested to read a paper by Professor Justin Zobel (University of Melbourne) in the Summer issue of SIGIR Forum entitled “When Measurement Misleads: The Limits of Batch Assessment of Retrieval Systems”.
To quote from the abstract of this excellent 12-page open-access paper:
“The discipline of information retrieval (IR) has a long history of examination of how best to measure performance. In particular, there is an extensive literature on the practice of assessing retrieval systems using batch experiments based on collections and relevance judgements.
However, this literature has only rarely considered an underlying principle: that measured scores are inherently incomplete as a representation of human activity, that is, there is an innate gap between measured scores and the desired goal of human satisfaction. There are separate challenges such as poor experimental practices or the shortcomings of specific measures, but the issue considered here is more fundamental – straightforwardly, in batch experiments the human-machine gap cannot be closed. In other disciplines, the issue of the gap is well recognised and has been the subject of observations that provide valuable perspectives on the behaviour and effects of measures and the ways in which they can lead to unintended consequences, notably Goodhart’s law and the Lucas critique. Here I describe these observations and argue that there is evidence that they apply to IR, thus showing that blind pursuit of performance gains based on optimisation of scores, and analysis based solely on aggregated measurements, can lead to misleading and unreliable outcomes.”
To jump to the end of the paper (very lightly edited):
“Researchers – and reviewers! – should be alert to the following:
- Achievement of an improved score does not mean that the method is improved.
- Choice of effectiveness measures that match the aim of the research should be part of the design of the experiment, not an afterthought.
- Results that are based on optimisation to a particular measure should be verified independently of the measure.
- Understanding of accuracy, distribution, and individual variation are critical to accurate interpretation of experimental results.
- A collection of individual cases should not be treated as a homogeneous, consistent, or uniformly distributed whole, unless they have been shown to have these properties.”
Whether you are an IR academic or a search practitioner, this paper should be essential reading!