Course: Applications of Text Processing

Dr. Michael Oakes


  1. Natural Language Processing. While conventional information retrieval is restricted to the "bag of words" model, natural language is much richer than that. We will introduce the concept of linguistic levels above lexis (individual words), such as parts of speech, syntax and parsers, semantics, and pragmatics. In the light of this, we will discuss why the bag of words model still works as well as it does, and the ways in which techniques from information retrieval can be used in linguistic processing.
  2. Clustering and Classification. Clustering comprises a family of algorithms which can automatically assign entities (such as documents) to categories, which may be newly discovered in the process. Classification is the process of assigning entities to their correct categories. We will both look at the theory of different clustering algorithms and get some hands-on experience with a statistical programming language. Texts can be clustered by topic, genre or writing style.
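
As a rough illustration of the two topics above (not the tutorial's own material), the following Python sketch builds bag-of-words vectors, compares them with cosine similarity, and groups texts with a very simple single-pass "leader" clustering. The function names, the whitespace tokeniser and the 0.5 threshold are all assumptions chosen for the example; the practical sessions may well use different algorithms and a different language.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_cluster(texts, threshold=0.5):
    """Single-pass clustering (hypothetical helper for this sketch).

    Each text joins the first cluster whose leader (first member) is at
    least `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`.
    """
    vectors = [bag_of_words(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "the dog bites the man",
    "the man bites the dog",
    "stock prices fell sharply",
    "stock prices rose sharply",
]
clusters = leader_cluster(docs)
```

The first two texts mean different things but have identical bag-of-words vectors, because word order is discarded; that is the kind of information the higher linguistic levels capture. The sketch nevertheless separates the two topics, which hints at why the model works as well as it does for topical tasks.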


The aim is to cover two important applications of text processing other than core Information Retrieval, to enable participants to select which language processing techniques might be useful for their own work in the area.


The half-day tutorial will be structured as follows:

  1. Levels of language and ambiguity in language (lecture)
  2. Stemming rules and part of speech taggers (pen and paper practical)
  3. Semantics and discourse level phenomena (lecture)
  4. Break
  5. Text classification: feature selection and learning methods (lecture)
  6. Clustering (pen and paper practical)
  7. Evaluation of text classifiers.
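
The final item, evaluation of text classifiers, is commonly done with precision, recall and the F1 score. The Python sketch below computes them for one category; the function name and the toy labels are invented for illustration, and the tutorial itself may cover other measures.

```python
def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F1 for one category (hypothetical helper).

    `gold` and `predicted` are equal-length lists of category labels;
    `target` is the category being evaluated.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, made up for this example.
gold = ["sport", "sport", "finance", "sport", "finance"]
predicted = ["sport", "finance", "finance", "sport", "sport"]
p, r, f1 = precision_recall_f1(gold, predicted, "sport")
```

Here two of the three texts predicted as "sport" are correct, and two of the three true "sport" texts are found, so precision, recall and F1 all come to two thirds.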


My Ph.D. was in Information Retrieval (search engine technology). My previous research assistant posts were in automatic sentence alignment of English, French and Spanish telecommunications texts, automatic summarisation of journal articles about agriculture, and automatic classification of news feeds about the pharmaceuticals industry. While at Sunderland, I have supervised seven Ph.D. students who have now completed theses in Information Retrieval, most recently Naveed Anwar, who worked on the data mining of audiology patient records, and Nandita Tripathi, who worked on the automatic classification of news articles and web services.
My own research has been in corpus linguistics, e.g. discovering differences between the types of English used throughout the world. I recently wrote an article on disputed authorship, plagiarism software and spam filters for the Oxford Handbook of Computational Linguistics, and edited a book "Quantitative Methods in Translation Studies" with Meng Ji at the University of Tokyo. I recently completed the EU-funded VITALAS project on a multi-media search engine. I am a committee member for the Information Retrieval Specialist Group of the British Computer Society, and a reviewer for the European Conference on Information Retrieval (ECIR).
This year and last year I have taught courses at the University on search engine technology, forensic linguistics, medical statistics and decision support systems. I have also given lectures on medical statistics externally to Information Analysts at the Teesside NHS Trust in Middlesbrough, and to trainee psychiatrists at Roseberry Park Hospital in Middlesbrough.
