Book Review: Information Retrieval: The Early Years

It’s time to make space on your bookshelf again, as quite a few interesting IR books have been published this year. This is a brief review of one of them: “Information Retrieval: The Early Years”, written by Donna Harman.

Harman can rightly be described as one of the pioneers of IR who has influenced advances in the field for decades. In this book, she provides an interesting overview of the research works that played a significant role in the development of modern search engines.

In the introductory chapter, she clarifies why she is the best person to write this monograph: having witnessed the first magic happening in the 1960s as a member of Gerard Salton’s research lab at Cornell, she became more actively involved with information retrieval in the 1980s, when she built the renowned TIPSTER test collection. In the 1990s, she then went on to initiate the TREC conference series. While she summarised the success story of TREC in her 2005 book, this monograph now provides us with a wider picture of advances over the years.

Research advances are presented in chronological order, with each chapter summarising the main efforts of one decade. She starts by briefly describing best practice in the pre-1960s era: from the time when library card catalogues were considered an innovation, through the impact of Vannevar Bush’s vision (‘As We May Think’) of how information could be stored, to the first efforts to index and abstract documents using computers, including works by Luhn and others. She refers to this as ‘Indexing Wars: Round 1’.

This indexing war continued in the 1960s, when IR was receiving more attention. To illustrate this boost, the chapter dedicated to this decade is titled ‘Full Steam Ahead’. The main research efforts at that time were on automatic indexing and abstracting, the development of test collections and evaluation methodologies (most notably the Cranfield tests), and computational advances that allowed for the development of the first search systems, such as the SMART system at Cornell.

Chapter 4 is dedicated to the 1970s. Harman sees this as a time of consolidation in which the main research efforts were on improving search effectiveness and developing several operational systems, but also on developing theories. Examples include Spärck Jones’ work on term frequency properties based on relevance, the introduction of F1 as an evaluation metric, and Robertson’s work on the probabilistic theory of relevance weighting.

The 1980s saw a continuation of efforts that started in the preceding decade, e.g., with Robertson, Croft and colleagues focusing on probabilistic models, but also saw new specialised areas emerging. While the main emphasis so far had been on measuring the effectiveness of systems using test collections, Belkin promoted further work on incorporating the user in the search process, which also led to further work on rethinking user interfaces. This decade also saw the successful start of online databases such as library catalogues, indexing services, and full-text and numeric databases.

The commercial success of these early online systems was very small compared to what we could then observe from the 1990s onwards. Harman outlines this decade in Chapter 6 of her monograph. The arrival of the Internet led to an explosion of applications and novel research directions. The book concludes with a brief reflection on how much the world has changed and how omnipresent IR technology is nowadays.

In summary, I consider this to be an excellent book, as it provides us with the opportunity to ‘catch up’ with the history of, and prior work on, information retrieval. In addition, the book is a treasure chest full of references to research papers that have impacted research on IR for decades. One drawback is that it does not cover the more recent advances that took place during the past twenty years. The good news is that there will be other books that cover this time period in detail. For example, Ferro and Peters have edited a book on ‘Information Retrieval Evaluation in a Changing World’ to celebrate the twentieth anniversary of the CLEF conference. Kando and colleagues are currently working on a similar book to summarise research efforts at NTCIR. Their book is expected to appear in 2020.

About Frank Hopfgartner

Frank Hopfgartner is Senior Lecturer in Data Science at the University of Sheffield.
