Autumn 2021

In the Autumn 2021 issue

I’ve given a lot of prominence to the forthcoming Search Solutions 2021 event on 23/24 November, with Tutorials on the Tuesday and the Conference on the Wednesday. Apart from one tutorial it will (sadly) be a virtual event. The IRSG AGM will take place at the end of the Conference, and a call for nominations to the Committee is included in this issue. Over the last few months the IRSG web site has been cleansed and migrated into the BCS Group template. We are working on a new template for Informer, but that may not be visible until early 2022. ECIR 2022 is now gathering momentum. It will be held on-site in Stavanger on 10-14 April but there will also be support for virtually-attending delegates.

The two feature articles this month are on the newly-announced IR Anthology, with some 60,000 research papers on information retrieval, and on the impact on IT budgets of moving search from on-premise to cloud platforms. There is a brief note on the publication of a history of the Institute of Information Scientists (1958-2002) and a review of an excellent book by Susan Walsh on classifying and fixing dirty data. And finally an alert to a profile I am writing on the life and achievements of G. Malcolm Dyson (1902-1978), a brilliant British chemist who transformed the fortunes of Chemical Abstracts Service whilst acting as Director of Research from 1959 to 1962. The issue closes with a comprehensive list of forthcoming conferences.

Much of this issue has been authored by me and that should not be the case. I’d be delighted to have contributions from members of IRSG about (just as examples):

  • Research projects you are working on
  • Conferences you have participated in
  • Departments that you are proud of
  • Visions of the future you would like to test out
  • People who have inspired you to take IR seriously
  • Books you have enjoyed (or perhaps not enjoyed!) reading
  • Applications you have developed
  • Problems you are facing and would welcome solutions to
  • Problems you have solved which may have a wider application

Search Solutions 2021 23/24 November

We had hoped to run the Search Solutions 2021 Conference and Tutorials on-site at the BCS London HQ. However, Covid-related constraints on the number of delegates that could be accommodated meant that last month we decided to go virtual with the Conference (which worked well last year) and with one of the tutorials. As you will see, the one-day tutorial will be held at the BCS London HQ.

Details of both events follow but this is the Eventbrite registration link.

The Tutorials 23 November

Tutorial 1 Overview of Natural Language Processing

Michael Oakes

BCS, Ground Floor, 25 Copthall Avenue, London, EC2R 7BP

This tutorial will give an overview of Natural Language Processing, which is the computer processing of human-produced speech and text. The textbook “Speech and Language Processing” by Daniel Jurafsky and James H Martin will be used as a basis for the tutorial. The levels we will cover are morphology (shapes of subword units), phonology (pronunciation of subword units), spelling checkers, automatic assignment of grammatical classes to words, relations among words, parsing with context-free grammars, meaning representations, word sense disambiguation, pragmatics (language above the sentence level) and a brief introduction to machine translation.

We expect attendees to gain an overview of the field of Natural Language Processing. Lecture-style presentations will be interspersed with practical exercises in which we carry out the actions of the computer with pen and paper.

Schedule

10:00am: Overview of Natural Language Processing

10:30am: Regular Expressions and Finite State Automata

11:00am: Speech Processing

12:00noon: Dealing with Spelling Errors

12:30pm: Automatic Part-of-Speech Tagging

2:00pm: Syntax: A Context-Free Grammar for English

2:30pm: Semantic Representations

3:00pm: Discourse Analysis

4:00pm: Machine Translation

4:30pm: Questions and Answers

5:00pm: End

Tutorial 2 – Practitioners’ Evaluation Roundtable – A virtual tutorial

Ingo Frommholz and Jochen Leidner

Information systems that are deployed in production settings and used operationally by hundreds or thousands of users are typically more complex than systems developed in academic research, which makes them much harder to evaluate. However, not evaluating a system is not a viable option, as it corresponds to “flying blindly” – the positive or negative impact of any change would remain unknown. As a consequence, many practitioners come up with their own protocols for assessing system quality in terms of the relevance of rankings given a query. In the academic world, several initiatives such as TREC, MediaEval or CLEF are striving to provide benchmarks and datasets to make different solutions and algorithms comparable to each other for some specified task. A further example is Kaggle. While BCS Search Solutions has in the past been successful in transferring knowledge among practitioners on the one hand, and between academics and practitioners on the other, we think evaluation is a topic that requires more attention. While we think there is no “one size fits all” solution, we also believe that there should be an exchange of ideas, solutions and experiences when it comes to evaluating information and search systems in an enterprise environment.

Instead of a full tutorial, we think the topic of evaluation needs to be driven by the participants. Hence we will conduct a round-table discussion (in lieu of a tutorial) at the upcoming BCS Search Solutions. Our aim is to provide an open forum where practitioners can share methods, metrics, challenges, and tricks of the trade with their peers. After a short introductory presentation that emphasises the importance of IR evaluation and sketches its history to set the scene and align participants, the format is one of free discussion without moderation. A human recorder will take notes, which may be published in a suitable venue (e.g. SIGIR Forum or BCS Informer) if findings emerge that are worth preserving.

3.00pm: A brief history and introduction of IR systems evaluation – Ingo Frommholz & Jochen Leidner

3.45pm: Discussion & Lightning talks: Methods, metrics, challenges: how do practitioners evaluate their systems so far? – All participants

4.45pm: Discussion/Breakout Groups: Evaluation in “real-world” environments – All participants

5.30pm: Discussion of results/wrap up – All participants

6.00pm: Closing

The Conference 24 November

This year the format of the conference is based around paired papers (with a couple of exceptions) on specific themes, so that attendees can get two different perspectives on each theme. Each pair of papers will then be followed by a Q&A session with both speakers.

Incorporated into the agenda will be the presentation of the BCS Search Industry Awards (organized by Tony Russell-Rose), one of which will be the SS 2021 Best Paper award which (for obvious reasons!) comes right at the end of the conference.

There will be two panel sessions at the end of the conference. The first of these will be a panel of some speakers and session chairs reflecting on what they have heard and learned during the conference. The second will be a panel of invited panelists who will be asked to suggest what the themes for the SS2022 conference should be.

Once the final session is completed the AGM will take place. This will be open to all attendees but voting is of course only open to members of the Information Retrieval Specialist Group.

Inevitably attendees will come in and out of the event during the day, which is why each session starts on the hour so there is no excuse for missing a session that is of particular interest. We hope to make recordings of the presentations available but that may not be the case for every presentation, so please do not assume that you can miss a presentation and catch up later!

09.00 Formulating and treating information needs at work

Professor Katriina Byström, Department of Archivistics, Library and Information Science, Oslo Metropolitan University

10.00 Training for IR and data science

Professor Paul Clough, Information School, University of Sheffield and Peak Indicators

Olivia Foulds, Department of Computer Science, University of Strathclyde

11.00 Identifying and addressing misinformation

Dr. Andy MacFarlane, City, University of London

Dr. David Corney, NLP Engineer, FullFact

12.00 Searching the enterprise

Steve Sale, Search and Taxonomy Architect, AstraZeneca

John Western, Regional VP, Yext

13.00 Break

14.00 Systematic searching

Drs. Ing Rene Spijker, Academic Medical Centre, University of Amsterdam

BCS Search Industry Awards

15.00 Digital asset management

Tim Gollins, Head of Preservation and Information Management, National Records of Scotland

Theresa Regli, Consultant

16.00 Panel sessions

What have we learned today?

What are the priorities for 2022?

Search Solutions 2021 Best Paper Award

17.00 BCS IRSG AGM

Call for nominations to the BCS IRSG Committee

The BCS Information Retrieval Specialist Group invites nominations for the following positions:

– Vice-Chair

– Secretary

– Inclusion Officer

– Six ordinary members of the committee

Read more…

IRSG web site revisions

The new BCS IRSG web site has been up and running for a couple of months now. The web team at the BCS HQ were a pleasure to work with, and I’d like to thank Simon Curd and Fiona James for their patience and expertise in converting my suggestions into the BCS Group template. There are a few pages which need a polish but overall it seems to be working well.

I’d like to draw your attention to the Resources page, which has been completely revised.

Read more…

ECIR 2022 Stavanger 10-14 April 2022

The 44th European Conference on Information Retrieval will take place on site in Stavanger, Norway. Sessions will also be streamed for delegates who are not able to travel to Norway. There has been a very good response to the invitations to all the sections of this conference. The conference team has set up an excellent web site and you can track developments via Twitter at https://twitter.com/ecir2022

IR and ACL Anthologies

The US equivalent of IRSG is SIGIR, which publishes its Forum newsletter every six months. This is always a very good read and you do not have to be a member of SIGIR to read it. One of the feature articles in the June issue (which only came online in October!) is an introduction to the IR Anthology, a recently released structured collection of almost 60,000 research papers. The well-established ACL Anthology currently hosts 71,505 papers on the study of computational linguistics and natural language processing.

A description of the genesis and structure of the ACL Anthology is now somewhat out-of-date but remains an excellent introduction to the concept of an anthology of research papers. The IR Anthology has been established by Martin Potthast (Leipzig University), Benno Stein (Bauhaus-Universität, Weimar) and Matthias Hagen (Martin-Luther-Universität Halle-Wittenberg).

Read more…

Big Information and big budgets

The concept of Big Data has been around for some time. John Mashey at Silicon Graphics is usually credited with inventing the term in a presentation he gave in 1998. Without doubt big data is very difficult to manage and the demand for people with data science skills never seems to slow down. However, much less attention has been paid to Big Information and the equivalent need for information scientists, a term invented in 1958 by Jason Farradane.

In early October a group of investigative journalists released the Pandora Papers. The revelations came from a very large tranche of documents: 2.94 terabytes of data in all, comprising 11.9 million records and documents dating back to the 1970s. I would recommend a very good article in Wired UK which provides a substantial amount of information on how the information in these documents was surfaced and analyzed.

Read more…

History of the Institute of Information Scientists 1958-2002

Over the last two years I have been working with Dr. Sandra Ward and Professor Charles Oppenheim on writing a history of the Institute of Information Scientists. The IIS was founded in 1958, largely due to the vision and commitment of Jason Farradane and the support of G. Malcolm Dyson. The IIS merged with (or was taken over by) the Library Association in 2002. The archives of the IIS, such as the minutes of Council and Committees and the minutes of the AGM, have vanished, so we had to compile the history by reading through back issues of Inform, the newsletter of the IIS, and related publications.

The IIS played a very important role in supporting the early promotion of the technology and applications of text retrieval, launching a series of very well-attended Text Retrieval conferences. If you would like to get a sense of the software applications that were available pre-Google and Microsoft there is a good 1994 review article in the Journal of Information Science, which was the journal of the IIS.

The 60,000-word history can be downloaded from a dedicated web site. This history is very much a work in progress. We encourage comments about errors and omissions and will then revise the document ahead of publication in 2022, to mark 20 years since the IIS disappeared.

Book Review ‘Between the spreadsheets’ Susan Walsh

The full title of this book is Between the Spreadsheets – Classifying and Fixing Dirty Data. What a superb title! It makes you smile before you even open the book. At last there is a book that focuses on content quality and does so in a very practical way. Susan Walsh (aka The Classification Guru) is an information entrepreneur who somewhat accidentally fell into the business of sorting out messy data. At the heart of Susan’s methodology is COAT, which focuses on data Consistency, Organisation, Accuracy and Trustworthiness. Having spent much of this year working on an e-commerce search project I can confirm that even market-leading e-commerce companies are at the mercy of poor-quality data generated by suppliers. The company had to depend on suppliers paying attention to data quality and yet in search after search rogue products were presented purely as a result of inconsistent and often incoherent codes being applied to products.

The chapter headings define the scope as The Dangers of Dirty Data, Supplier Normalisation, Taxonomies, Spend Data Classification, Basic Data Cleansing, a Dirty Data Maturity Model and Data Horror Stories.

Read more…

Events Autumn 2021

Note: Due to the COVID-19 crisis some events have been cancelled, postponed or will be run virtually. We have provided information on each of the events with the current status at the time of writing. Please check the URL of the event for further details.

Read more…

And finally….

I suspect that the name G. Malcolm Dyson in the History of the IIS item above will be unfamiliar to anyone who has not been in chemical information retrieval for quite a number of decades. Dyson developed a linear notation for organic chemical compounds in 1946, initially with a view to supporting the use of punched cards to retrieve information. In 1959 Dyson was Research Director at Chemical Abstracts Service and had started working with H.P. Luhn (IBM) on using a computer to handle the searching process, even if the 1401 computer only had 8k of core memory. The cheminformatics research at the Information School, University of Sheffield, can trace its origins back to Emeritus Professor Michael Lynch, who worked at CAS (initially with Dyson) from 1961 to 1965.

Read more…