Winter 2022

In the Winter issue

November 2021 was a very busy time for IRSG. The Search Solutions Conference took place virtually on 24 November. You can read reports on the conference and on the two tutorials in this issue. There is also a list of the BCS Search Industry Awards that were announced at the Conference. Continuing the ‘awards’ theme, the winner of the 2021 Karen Spärck Jones award is Dr Ivan Vulić (University of Cambridge). IRSG is a partner with UKeiG in the selection of the Strix award. The Strix Award Lecture this year was given by Professor Ian Ruthven.

The IRSG AGM took place immediately after the Conference, and there have been a number of changes to the list of Committee members. These changes are also mentioned by IRSG Chairman Udo Kruschwitz in his review of 2021 and IRSG plans for 2022. These include ECIR 2022 in Stavanger in April and a series of eight lectures in February and March which will offer a relatively low-technology introduction to information retrieval. The concept was developed by ISKO UK, which has invited IRSG to give some of the lectures. IRSG members can attend free of charge.

The opening paper at Search Solutions was given by Professor Katriina Byström (Oslo Metropolitan University), who has been working on a project led by Professor Marianne Lykke (Aalborg University) analysing how a major biotech company makes use of enterprise search. As far as I am aware this is the first-ever paper that does a deep dive into organisation-wide use of enterprise search, and I have added a commentary on the outcomes based on my own experience over the last two decades. The other feature article is about the challenges of extracting text content from PDF files. This is much more challenging than most search managers appreciate, and I am very grateful to Tim Allison, a consultant at the NASA Jet Propulsion Laboratory, for explaining why text extraction from PDFs is so difficult and how the process should be managed.

Also in this issue are a book review of a narrative account of the development of web search and the very useful list of events compiled by Andy MacFarlane, one of which is the ACM/IEEE Joint Conference on Digital Libraries in June.

And finally, I offer some thoughts about the lack of recent textbooks on IR at either undergraduate or postgraduate level.

From Udo Kruschwitz – BCS IRSG Chairman

Happy 2022!

Happy New Year everybody! It is a pleasure for me to contribute a few words in my role as chair of the BCS Information Retrieval Specialist Group. This is the second year of my term, and we are looking back at another turbulent year. Who would have thought that in autumn 2021 we would have to revert to online conferences, meetings and teaching (apart from the experts, that is)? As a result we also had to make our annual Search Solutions event fully virtual. This also applied to our annual general meeting (AGM), which was co-located with Search Solutions.

The AGM serves a number of purposes. One is to reflect on what we have been doing over the past year. The general impression was that we had achieved quite a bit despite the circumstances and the vastly reduced budget, including a revamped BCS IRSG Web presence, a project initiated and conducted under the supervision of our very own editor and immediate past Vice-Chair Martin White. We also looked back at two highly successful conferences, ECIR 2021 and Search Solutions 2020, both of which ran in fully virtual mode.

The AGM also marks the election of a new committee. Our constitution defines two-year terms for committee members. As a result, some of the posts are up for election every year. I encourage you to read what our new (and old) Secretary reports on this in a piece elsewhere in this issue. You can of course, as always, find more details of the AGM in the (draft) minutes, which can be found on our group’s homepage.

So what’s new? Well, we are pleased to welcome a number of new faces to the committee! And we are looking forward to some fresh ideas in the coming year and beyond. When I say new faces, what this actually means is that we were lucky to recruit some well-established IR researchers and practitioners. This includes Graham McDonald, Annalina Caputo and Tony Russell-Rose, who joins us again after a short break. While it is great to welcome new committee members of such calibre, we also had to wave goodbye to a number of colleagues, some of them having served the committee for decades. Here I am talking about Stefan Rüger and Andy MacFarlane (with 24 years of continuous service on the committee!). But we equally value the lasting contributions made by Martin White, Dyaa Albakour and João Magalhães. Thanks to all of them!

As we start 2022 we are again faced with plenty of uncertainty. While ACM CHIIR 2022 has already been announced as a virtual event, ECIR 2022 is still planned to take place in hybrid format. ECIR 2023 in Dublin will then be back with real people, real talks and real reception drinks (famous last words). Speaking of ECIR, one of the hotly debated topics last year was how to achieve open access for ECIR papers. This is something we will explore in detail this year.

+++ And a final word +++ hot off the press +++ on another activity we are privileged to organise on an annual basis, the Karen Spärck Jones Award. +++ And the winner of the 2021 award is +++ Ivan Vulić of the University of Cambridge! Congratulations, very well deserved! +++ We are all excited about the keynote talk at ECIR in April … +++

On that note I will get back to preparing my lectures, and I hope that you all stay healthy and have a great start to the new year. See you around in 2022!

 

Search Solutions 2021 – conference report

Looking back at the ten papers and two panel discussions at the BCS IRSG Search Solutions 2021 conference in November, a very strong theme emerged, with most of the papers providing variations on it. The theme at its most basic was that ‘one size does not fit all’, especially when it comes to the user interface. The underlying issues were presented by Professor Katriina Byström (Oslo Metropolitan University), who started by summarising the work she had undertaken with Sanna Kumpulainen (Tampere University) on vertical and horizontal relationships amongst task-based information needs. Katriina then moved on to summarise a research project in which she had been working with Professor Marianne Lykke (project leader), Ann Bygholm and Louise Søndergaard (Department of Communications and Psychology at Aalborg University) that had looked in great detail at the use made of enterprise search in a biotechnology company. This study has now been published in the Journal of Documentation and on the AAU web site. One of the important outcomes was that searching for people was by far the dominant use of the SharePoint search application. In my opinion this is a very important piece of research, and I have commented further on some of the outcomes and implications in this issue of Informer.

Read more…

Search Solutions 2021 – Tutorials report

As has been the custom, the main Search Solutions 2021 event was again preceded by a day of tutorials. Two tutorials were successfully delivered on the day: an introduction to Natural Language Processing and a “Practitioners’ Evaluation Roundtable” (which was not a tutorial in the strict sense). Haiming Liu, Ingo Frommholz and Jochen Leidner reflect on the scope and outcomes of the tutorials.

Read more…

BCS Search Industry Awards 2021

At the BCS IRSG Search Solutions 2021 Conference on 24 November there was a session during which the BCS Search Industry Awards for 2021 were presented by Tony Russell-Rose on behalf of the IRSG Committee. The Conference was once again (and hopefully for the last time!) virtual so there was no opportunity to applaud the winners.

This year there were multiple nominations for all the categories. The nominations were reviewed by three judges with a wide range of experience:

  • Paul Clough, Head of Data Science at Peak Indicators, and Professor of Search & Analytics at University of Sheffield
  • Agnes Molnar, Founder & Managing Consultant, Search Explained, based in Budapest, Hungary
  • Paul Cleverley, Founder of Infoscience Technology Ltd and Visiting Professor of Information Science and Technology, Robert Gordon University, Aberdeen

Read more…

Microsoft-BCS/BCS-IRSG Karen Spärck Jones Award 2021 for Dr Ivan Vulić

Dr Ivan Vulić from the University of Cambridge has been named as the winner of the Microsoft-BCS/BCS-IRSG Karen Spärck Jones Award 2021.

The award, which is sponsored by Microsoft Research, was created in 2008 to remember Professor Karen Spärck Jones FBA, a leading researcher in information retrieval at what was at the time called the Cambridge “Mathematical Laboratory”. In a landmark 1972 paper, she discovered the utility of inverse document (collection) frequency (IDF), an important factor now standard in ranked document retrieval as used by most search systems.
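For readers who have not met IDF before, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common textbook form log(N / n_t), where N is the number of documents in the collection and n_t is the number containing the term, which is not necessarily the exact formulation in the 1972 paper. The function name and the tiny example collection are hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / n_t) for a term, or 0.0 if no document contains it.

    `documents` is a list of sets of terms. Assumption: the plain
    textbook form of IDF with a natural logarithm and no smoothing.
    """
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        # Avoid division by zero for terms absent from the collection.
        return 0.0
    return math.log(n_docs / n_containing)


# Hypothetical four-document collection, each document reduced to a set of terms.
docs = [
    {"enterprise", "search", "pdf"},
    {"search", "engine"},
    {"search", "ranking", "pdf"},
    {"search", "strix"},
]

for term in ("search", "pdf", "strix"):
    print(term, round(inverse_document_frequency(term, docs), 3))
```

A term that appears in every document ("search" here) scores zero, while a term found in only one document ("strix") scores highest, which is why rare terms carry more weight when documents are ranked.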

Read more…

Strix Lecture 2021 – 9 December 2021

The Tony Kent Strix Annual Memorial Lecture was held on 9 December. The lecture was given by the 2020 Strix Award winner Ian Ruthven (Professor, University of Strathclyde). He described his research journey over the past 20 years, and attendees learned that many of his contributions came through collaborative efforts with many researchers in the IR community, to whom he expressed much gratitude. His talk was titled “Google’s what you use when Alexa doesn’t know the answer, Uncle Ian”.

Several lessons from his research journey particularly stand out in my mind. First, the probability ranking principle has heavily influenced many of the systems we have in production now and the ranking of their results. Additionally, searchers generally don’t like reading as much as they do searching; for example, they are much more likely to briefly skim a result than to read the page they visit. Ian also suggested that consideration be given to a new generation of systems that handle the expression of problems rather than queries, as query expression is difficult.

Read more…

IRSG 2021-2022 AGM Committee Elections

The IRSG Annual General Meeting (AGM) took place immediately after Search Solutions 2021, where the committee election results were announced.  The full list of current committee members is now available on the IRSG governance page.

Incoming committee members are Tony Russell-Rose (Vice-Chair), Graham McDonald (Ordinary Committee Member) and Annalina Caputo (ECIR 2023 Committee Member). Existing Committee members who have transitioned to new positions are Steven Zimmerman (Secretary), Yashar Moshfeghi (Inclusion Officer) and Jochen Leidner (Events Coordinator). Full details of the BCS IRSG committee are provided on our governance page.

We wish to thank outgoing committee members Stefan Rüger (Past-Chair), Martin White (Vice-Chair), Andy MacFarlane (Committee Member), Dyaa Albakour (Committee Member) and João Magalhães (ECIR 2021 Committee Member) for their service. In particular, we wish to thank Andy MacFarlane, who served in various roles on the committee for 25 consecutive years.

As is the unfortunate trend with many BCS committee elections, all positions were uncontested. As a reminder, our elections take place every autumn, so please watch our governance page and the IR listserv, to which you can subscribe, for future election announcements. We are very keen to have new members on our committee.

Steve Zimmerman (Honorary Secretary)

ECIR 2022 10-14 April – the latest news

The 44th European Conference on Information Retrieval (ECIR’22) will be held in Stavanger, Norway, between April 10 and 14, 2022. It will be the northern-most location in the history of ECIR, with the Norwegian fjord region offering some of the most majestic and spectacular scenery in the world. With its intimate and mostly pedestrian centre, the city is a truly unique destination, surrounded by mountains, beaches and the sea.

The conference is organized by the Information Access and Artificial Intelligence group at the Department of Computer and Electrical Engineering at the University of Stavanger. The IAI group’s mission is to build novel information retrieval and web mining solutions for next generation search engines and intelligent systems, by combining data-driven approaches with human-centered design. The group plays a key role in two of the national AI centers by leading and contributing to work packages on language technology, personalization, and media content production and analysis. It also has a growing network of international collaborations, including top industry players (such as Google and Bloomberg) and academic institutions (e.g., Carnegie Mellon University, University of Amsterdam, and L3S Research Center).

Read more…

Exploring Information Retrieval – A series of eight ISKO UK / BCS IRSG IR workshops

Following requests from attendees of the ISKO UK KO-ED classes (part of the programme Knowledge Organization Education), ISKO UK ran two workshops in the autumn: the first one on vocabulary control with Sylvie Davis and the second one on the UDC with Aida Slavic. The last of the three workshops, “Practical Thesaurus Construction” with Stella Dextre Clarke, takes place in January 2022.

KO-ED events continue in February and March 2022 with a series of weekly lectures entitled “Exploring Information Retrieval”. The series is organised by Aida Slavic and Martin White in collaboration with the Information Retrieval Specialist Group (IRSG) of the British Computer Society (BCS). These eight lectures will provide an overview of the fundamental concepts, techniques and applications of Information Retrieval. With this programme ISKO UK and BCS IRSG aim to emphasise the central role of IR in all aspects of information work. The target audience includes information professionals who would like to gain or refresh their knowledge of IR as well as students and researchers of any discipline where information retrieval plays a significant role.

The speakers from IRSG are Andy MacFarlane, Tony Russell-Rose, Ingo Frommholz and Martin White, together with Karen Blakeman, the leading UK web search specialist.

These lectures will take place on Thursdays at 6.30pm and will be free for all ISKO and BCS members.

At last – an academic deep dive into an enterprise search case study

The opening presentation at Search Solutions 2021 in November was given by Professor Katriina Byström (Oslo Metropolitan University), who summarised the outcomes of research she had carried out with Professor Marianne Lykke, Ann Bygholm and Louise Bak Søndergaard (Department of Communications and Psychology at Aalborg University) on enterprise search use across an organisation. The paper was published in the Journal of Documentation in late December under the title ‘The role of historical and contextual knowledge in enterprise search’. The title is accurate but gives no hint of the scope, quality and implications of the outcomes of the project. There is an open access version on the Aalborg University site.

In my opinion it is one of the landmark papers in enterprise search, not just for the outcomes themselves but for the implications for enterprise search research and management. The extended analysis of the study results is outstanding but I would like to add some of my own comments from a practitioner/consultant perspective.

Read more…

Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction

(Note from the Editor: PDF files are so ubiquitous in business and in academia that few people give any thought to the problems that arise in extracting text from a PDF to incorporate into a search index. Tim Allison is a consultant working at the NASA Jet Propulsion Laboratory and has been at the forefront of understanding the causes of the problems and finding solutions. As you will see, this very detailed analysis contains a substantial number of images, and so I have taken the decision to publish the introduction in HTML but then provide a link to the full paper as (ironically!) a PDF file.)

Tim began working in natural language processing in the early 2000s. Since the early 2010s, he has focused on content/metadata extraction (and evaluation), advanced search and relevance tuning. Tim is the founder of Rhapsode Consulting LLC, and he currently works as a data scientist at NASA’s Jet Propulsion Laboratory, California Institute of Technology. Tim is also a member of numerous Apache software projects, including Apache Tika, PDFBox, POI, OpenNLP and Lucene/Solr. He holds a Ph.D. in Classical Studies, and he started his career as a professor of Latin and ancient Greek.

Disclaimer

“The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. Copyright 2021 California Institute of Technology©. U.S. Government sponsorship acknowledged.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

The author would like to thank Peter Wyatt, Chief Technology Officer of the PDF Association, and other colleagues for their feedback on this article. All errors and omissions are the author’s.

The following represents the viewpoints of the author and does not represent the funding agencies or reviewers.”

Introduction

The Portable Document Format (PDF) is one of the most common document file formats used in industry, academia and government. PDF comprises a significant component of files on the internet (see “PDF’s Popularity Online”). For non-technical users, PDF files may seem straightforward and largely reliable. However, in practice, PDF files present a rich set of challenges for tools that extract text to enable search or other natural language processing tasks. The goal of this article is to offer a general overview of some of the challenges in extracting text from PDFs for technically-oriented people who may be new to PDF. Specifically, this paper is intended for those who process PDF “in the wild”, which is to say, developers or development teams which do not have control over the generation of the PDFs they are processing. For those who are able to influence how the PDFs they process are generated, we encourage focusing on the final section of this article.

The full text of the 16-page article can be downloaded from OverviewOfTextExtractionFromPDFs

 

Book review – Invisible Search and Online Search Engines

(A note from the Editor. This 161pp book was published in 2019 but recently became open access. I must have missed it first time around!)

The reviewer

Cass Zhixue Zhao

I am a postdoctoral researcher in the Department of Computer Science, University of Sheffield. I obtained my PhD from the Information School, University of Sheffield. My PhD study focused on hate speech detection and model bias in deep learning. I was also a part-time research assistant during my PhD study, working for an NIHR-funded project to develop a search engine for public health research. In addition to the research areas mentioned above, my other research interests include disinformation diffusion and explainable machine learning.

The review

Invisible Search and Online Search Engines will be an excellent start for new learners and researchers in the area of information retrieval, search engines and librarianship. It also works well as an exciting book for readers interested in search engines and searching behaviours online nowadays. It not only gives a broad overview of online searching but also incorporates a review of the history of information retrieval and librarianship.

The history is told in a story-telling tone rather than a textbook tone. After finishing the book, you will have a general idea of how different concepts and areas, such as librarianship, information seeking, information retrieval and information behaviour, are linked. You will have your own idea of why a certain area in search has evolved in the way it has.

Read more…

ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022

The annual ACM/IEEE Joint Conference on Digital Libraries (JCDL) is the primary international event for the inter- and multi-disciplinary community of academics and practitioners in digital libraries coming from computer, information and social sciences, and other related disciplines such as information retrieval. JCDL encompasses the many meanings of the term digital libraries, including notions of managing, operating, developing, curating, evaluating, or utilising collections of data/information/knowledge in various domains. The 2022 conference is hybrid, taking place in Cologne, Germany and online from 20-24 June. The call for research papers has recently been extended to 21 January.

And finally….from the Editor

During the closing panel session of the Search Solutions conference in November, Professor Iadh Ounis (University of Glasgow) highlighted the need to keep graduate and undergraduate courses updated in line with the very rapid developments in IR theory, development and practice. Around this time there was a Twitter thread about what might be the best textbooks to read to gain an introduction to IR. The most mentioned book in the replies was Introduction to Information Retrieval (Manning, Raghavan and Schütze 2010). Also published in 2010 was Search Engines – Retrieval in Practice (Croft, Metzler and Strohman), and two years earlier Information Retrieval – Implementing and Evaluating Search Engines (Büttcher, Clarke and Cormack) was published. As far as I am aware the only more recent book has been Text Data Management and Analysis (Zhai and Massung), which was published in 2015. In the same year the second edition of my book on Enterprise Search was published by O’Reilly, followed by Searching the Enterprise (Kruschwitz and Hull 2017).

Read more…