There was a remarkable amount of interest in the publication of a belated obituary of Karen Spärck Jones in the New York Times on 9 January 2019. I was delighted to see that the University of Cambridge obituary for Karen (published at the time of her death in 2007) has now been updated with the NYT reference. It never fails to surprise me how few people appreciate the history of information retrieval and search, both its technical development and, even more, the people whose insights transformed our views of how to use computers to manage language.
The term ‘information retrieval’ was coined by Calvin Mooers in a paper to the March 1950 meeting of the Association for Computing Machinery at Rutgers University, New Brunswick, N. J. Arguably the first detailed account of how computers could be used in information retrieval was a Master’s thesis by Philip Bagley released in 1951, entitled Electronic Digital Machines for High Speed Information Searching. The late 1940s and the 1950s also saw the beginnings of computational linguistics. Progress was slow but steady over the rest of the decade, leading to two very important papers being published in 1959.
The first of these was a paper by Calvin Mooers in the 1959 Proceedings of the Western Joint Computer Conference. Entitled The Next Twenty Years in Information Retrieval: Some Goals and Predictions, it contains many insights on the role of computers and information retrieval that are still relevant today.
“When we speak of information retrieval, we are really thinking about the use of machines in information retrieval. The purpose of using machines here, as in other valid applications, is to give the machines some of the tasks connected with recorded information that are most burdensome and unsuited to performance by human beings. At all times, it is important to remember that it is the human customer who uses the information-retrieval system who must be served, and not the machine. It makes a difference who is served, and this little matter is sometimes forgotten in computer projects.”
Another timeless prediction is:
“When a customer comes to an information retrieval system, he comes in a state of ignorance. After all, he needs information. Thus, his problem of knowing how to specify pieces of information that are unknown to him is a severe one. For one thing, the vocabulary of the retrieval system, and the usages of the terms in the system, may be slightly different from the language that he is used to. For another thing, upon seeing some of the information emitted according to his own retrieval prescription, he may decide that an entirely different prescription should be used. In short, the customer definitely needs help in using a machine information retrieval system, and this help should be provided by the machine.”
Of course, one cannot begin to talk about information retrieval without talking about relevance. The term was first proposed in 1958 by Brian Vickery, just one of many UK academics who have made significant contributions to information retrieval over the years. The early history of the concept has been very well presented by Stefano Mizzaro, who reviewed 130 research papers published up to 1995. This history gives a sense of just some of the challenges we face in optimizing ‘relevance’ in the ranking of search results.
I would like to highlight the paper by Maron and Kuhns that was published in 1959 as a working paper from the Bunker Ramo Corporation, which at that time was developing systems for the US defence intelligence organisations. The title was “On Relevance, Probabilistic Indexing and Information Retrieval” and in my view it was the first attempt to move away from trying to achieve an exact match of query terms in a Boolean query against the index to a set of documents.
The abstract reads:
“The notion of relevance is taken as the key concept in the theory of information retrieval and a comparative concept of relevance is explicated in terms of the theory of probability. The resulting technique called “Probabilistic Indexing,” allows a computing machine, given a request for information, to make a statistical inference and derive a number (called the “relevance number”) for each document, which is a measure of the probability that the document will satisfy the given request. The result of a search is an ordered list of those documents which satisfy the request ranked according to their probable relevance.”
Maron later described how the theory came together:
“When I finally put it all together I had figured out how the weights could be interpreted as precisely defined probabilities and I had a probabilistic theory of information retrieval – probabilistic indexing and output ranking according to computed values of probability of relevance. I felt a great sense of excitement and quickly wrote up an internal document for senior R-W management. That document was dated August 1958. Shortly thereafter I urged my old friend Lary Kuhns to join me at R-W. He did and for the next year we worked together to further develop, clarify and expand the new theory of information retrieval called ‘Probabilistic Indexing’. Lary Kuhns was an outstanding logician and mathematician and he made important contributions by proposing various measures of correlation between index terms by means of which a search query could be expanded. And he showed how probabilistically indexed documents could be viewed as weighted vectors and how the ‘distance’ or similarity between documents could be measured and used to extend and expand a search.”
I love Maron’s comment on “a great sense of excitement”, something that is rarely seen in an academic paper. I am surprised that the Elsevier reviewers did not ask for a quantification of this excitement.
This is a contribution about the history of information retrieval, and I’m not going to comment further on the development of relevance as a concept. If you want to look further into the concept, then the book “The Notion of Relevance in Information Science: Everybody Knows What Relevance Is. But What is it Really?” by Tefko Saracevic is a good place to start. You may also be interested in a series of blog posts I published in 2017 on the history of enterprise search from 1948.