Informer
Newsletter of the BCS Information Retrieval Specialist Group

Mining search logs for usage patterns (pt 2)

By Tony Russell-Rose on November 3, 2014


In a previous post I discussed some initial investigations into the use of unsupervised learning techniques (i.e. clustering) to identify usage patterns in web search logs. As you may recall, we had some initial success in finding interesting patterns of user behaviour in the AOL log, but when we tried to extend this and replicate a previous study of the Excite log, things started to go somewhat awry. In this post, we investigate these issues, present the results of a revised procedure, and reflect on what they tell us about searcher behaviour.

So to recap, last time we got to the point where we’d applied the expectation maximization algorithm to a sample of 10,000 sessions from the AOL log, and were hoping to replicate the findings from Dietmar Wolfram’s 2008 study. But our results were quite different: three clusters instead of four, and some very different patterns. Moreover, our results weren’t even internally consistent: a further three samples of 10,000 sessions produced widely different outcomes (7, 10 and 10 clusters respectively). Even increasing the sample size to 100,000 seemed to make little difference (despite the suggestion in Wolfram’s paper that subsets of 50k to 64k sessions should produce stable clusters).
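
For readers who want to experiment with this kind of analysis themselves, here is a minimal sketch of EM-style clustering over per-session features, using scikit-learn’s GaussianMixture as a stand-in for the Weka EM clusterer used in this study. The file name and feature columns are hypothetical placeholders rather than the exact six features from Wolfram’s paper.

```python
# Minimal sketch: EM-style clustering of per-session features.
# Assumes a CSV of pre-computed session features; the file name and
# column names are illustrative placeholders, not the actual data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

sessions = pd.read_csv("aol_session_features.csv")   # hypothetical file
features = ["queries_per_session", "terms_per_query", "term_popularity",
            "pages_viewed", "unique_terms", "repeated_terms"]

X = StandardScaler().fit_transform(sessions[features])

# GaussianMixture is fitted with the EM algorithm; here the number of
# components is fixed, whereas Weka's EM can also choose it automatically.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
sessions["cluster"] = gmm.predict(X)

# Per-cluster feature means and sizes, analogous to the profiles in the charts.
print(sessions.groupby("cluster")[features].mean())
print(sessions["cluster"].value_counts(normalize=True))
```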

So why are we seeing such different results? One interpretation, of course, is that these results are an authentic reflection of changes in user behaviour due to differences in context (e.g. a different search engine, time period, demographic, etc.). But before we explore that possibility, we should take steps to discount the effect of other confounding factors. For example, is our data truly representative of the population? Taking a further sample of sessions is a relatively straightforward test of this, and indeed, applying EM to a fresh sample of 10,000 sessions produced the following 4 clusters [note that I have changed the display order of the features to facilitate comparison with Wolfram’s results, and simplified their names]:

[Figure: Expectation Maximization using Wolfram’s 6 features on 10,000 sessions from AOL]

This outcome seems to offer some interesting insights, but again, it fails to repeat across the other samples, which produced 5, 6 and 7 clusters respectively. Moreover, increasing the sample size to 100,000 also fails to produce a stable result, with 7, 13, 6 and 6 clusters across four iterations.

But let’s pause for a moment and examine the pattern in more detail. There is something very odd happening with term popularity now: we see a small cluster (just 3% of the sessions) where this feature seems to be something of an outlier, compressing the remaining traces into a narrow band. Indeed, the phenomenon becomes even more pronounced when we take a sample of 100,000 sessions:

[Figure: Expectation Maximization applied to 100,000 sessions]

Perhaps this is an artefact of the clustering algorithm? Let’s try XMeans instead (a variant of kMeans in which the value of k is determined by the algorithm). In this iteration, XMeans finds a local optimum at k=10, so the number of clusters is different. But the overall pattern, with a small cluster (1% of sessions) representing outlier values for term popularity, is again clearly visible:

[Figure: XMeans (k<=10) applied to 100,000 sessions]
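
XMeans itself is a Weka clusterer with no direct scikit-learn equivalent, but its behaviour (letting the data choose k) can be roughly approximated by fitting kMeans over a range of k and keeping the best-scoring value. A hedged sketch, reusing the standardised feature matrix X from the earlier snippet, and using silhouette score rather than XMeans’ BIC-driven cluster splitting:

```python
# Rough stand-in for XMeans: fit kMeans for k = 2..10 and keep the k with
# the best silhouette score (XMeans proper splits clusters using BIC).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # sample_size keeps the silhouette computation tractable on 100k sessions
    score = silhouette_score(X, labels, sample_size=10000, random_state=0)
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k} (silhouette = {best_score:.3f})")
```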

So something else must be at play. It turns out that there is indeed an artefact in the data which is causing this. Long story short, there are a small number of sessions which contain just a single query, consisting solely of the character ‘-’. Precisely why they are there is a matter for speculation: they may have been the default query in some popular search application, or an artefact of some automated service or API, etc. We’ll probably never know. But sessions like these, along with other robot-generated sessions, aren’t generally helpful when trying to understand human behavioural patterns, so they are best removed prior to analysis. Of course, there are no 100% reliable criteria for differentiating robot traffic from human traffic, and what should be removed is a matter for judgement, often on a case-by-case basis. In this case, including these single-character queries appears to be counter-productive.
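
As an illustration of this pre-filtering step, the sketch below drops sessions whose only query is the single character ‘-’. The file and column names are hypothetical, and a real cleaning pass would typically apply further robot-detection heuristics as well.

```python
# Hedged sketch of the pre-filtering step: drop sessions whose only query
# is the single character '-'. File and column names are placeholders.
import pandas as pd

queries = pd.read_csv("aol_queries.csv")   # one row per query: session_id, query

per_session = queries.groupby("session_id")["query"].agg(list)
suspect_ids = per_session[per_session.apply(lambda qs: qs == ["-"])].index

clean = queries[~queries["session_id"].isin(suspect_ids)]
print(f"removed {len(suspect_ids)} single '-' sessions")
```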

So now, with a new sample of 100,000 sessions excluding these outlier queries, we see EM produce the following output:

[Figure: Expectation Maximization applied to a new sample of 100,000 sessions]

This pattern is much more stable, with four iterations producing 7, 7, 7 and 9 clusters respectively. At this point we can start to speculate on what these patterns may be telling us. For example:

  • Cluster 6 appears to be a group of users that engage in longer sessions, with many queries and many page views (clicks), but few repeating terms
  • Cluster 4 appears to be a smaller group who seem to specialise in relatively long but popular queries (an odd combination!), also with few repeating terms
  • Cluster 3 appears to be a relatively large group who make greater use of repeated terms, but are otherwise relatively unengaged (with shorter sessions and fewer page views)

And so on. Evidently, the patterns above are somewhat hard to interpret due to the larger number of clusters and lines on the chart. So what would happen if we tried to determine the optimum number ourselves, rather than letting XMeans find one for us? One way of investigating this is to specify different values for k a priori and see how the within-cluster sum of squared errors (which Weka calculates as part of its output) varies with k. For example, varying k from 2 to 10 gives us the following result:

[Figure: Sum of squared errors by k]
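
In scikit-learn the within-cluster sum of squared errors is exposed as kMeans ‘inertia’, so the elbow analysis can be sketched as follows (again reusing the feature matrix X from the first snippet):

```python
# Elbow analysis: within-cluster sum of squared errors (inertia) for k = 2..10.
from sklearn.cluster import KMeans

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Plotting these values against k and looking for 'elbows' suggests
# candidate values for the number of clusters.
```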

As we can see, there is an ‘elbow’ around k=4 and another around k=7. This implies that either of these two values may be a good choice for a local optimum. We’ve already seen the output for k=7 (the optimum that XMeans found), so now let’s try kMeans with k=4:

[Figure: kMeans (k=4) applied to 100,000 sessions]
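
For completeness, the four-cluster solution can be reproduced and profiled in the same way; the sketch below reuses X, sessions and features from the earlier snippets and simply reports per-cluster feature means and sizes, mirroring the profiles plotted above.

```python
# kMeans with k fixed at 4, plus per-cluster feature means and sizes.
from sklearn.cluster import KMeans

km4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
sessions["cluster_k4"] = km4.labels_
print(sessions.groupby("cluster_k4")[features].mean())
print(sessions["cluster_k4"].value_counts(normalize=True).round(3))
```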

This time the groups are somewhat easier to differentiate. For example, we might infer that:

  • Cluster 3 represents a baseline or ‘generic’ cluster of users who hover around the average for all measures
  • Cluster 4 represents a relatively large group of users who engage in longer sessions (with more queries and page views) but are diverse in their interests, with few repeated terms
  • Cluster 1 represents a smaller group who are the converse to cluster 4, engaging in shorter sessions but with more repeated terms
  • Cluster 2 represents a tiny group (2%) of users who are similar to cluster 1 but focus on highly popular queries

Evidently, there are other ways we could analyse this data, and there are other ways we could interpret the output. In fact, I hope to write more about search log analysis in the coming weeks, taking advantage of a new source of data, which should further validate the methodology and allow us to explore some very different behaviour patterns. But for now, let’s draw some of the threads together and review what we’ve learnt.

Conclusions

  • Replicate to validate: As researchers, our instincts are to explore the unknown, to solve the unsolvable, and to favour novelty over repetition. But sometimes it befits us to focus on replication: by applying new techniques to old data, we validate our methodology and build a more reliable baseline for our own experimental work.
  • Features describe, but behaviours explain: It’s tempting to select features based on whatever a particular data source offers, and include as many as possible in the learning process. But not all are equally useful, and some can indeed ‘drown out’ the influence of more important signals. So rather than starting from what the data can offer, identify the information seeking behaviours you’d like to explore, and try to find the features that most closely align with them.
  • There is no ‘right answer’: As in many investigations of naturalistic phenomena, there is a tendency to look for patterns that make sense or in some way align with our expectations. But those expectations themselves are a subjective, social construct. The fact that we can produce multiple interpretations of the same data underlines the need for a common perspective when comparing patterns in search logs, and to apply recognised models of information seeking behaviour when interpreting the outputs.

 

About Tony Russell-Rose

Tony Russell-Rose is founder of 2Dsearch (https://www.2dsearch.com), a start-up applying artificial intelligence, natural language processing and data visualisation to create the next generation of professional search tools. He is also director of UXLabs, a research and design studio specialising in complex search and information access applications. He has served as vice-chair of the BCS Information Retrieval group and chair of the CIEHF Human-Computer Interaction group. Previously Tony has led R&D teams at Canon, Reuters, Oracle, HP Labs and BT Labs. He is author of "Designing the Search Experience" (Elsevier, 2013) and publishes widely on IR, HCI and NLP.
