Visualizing search strategies

According to the IDC whitepaper “The High Cost of Not Finding Information”, knowledge workers spend 2.5 hours per day searching for information. Whether they eventually find what they are looking for or give up and make a sub-optimal decision, both outcomes carry a high cost. The recruitment industry, for example, relies on Boolean search as the foundation of the candidate sourcing process, and yet finding candidates with appropriate skills and experience remains an ongoing challenge. Similarly, patent agents rely on accurate prior art search as the foundation of their due diligence process, and yet infringement suits are being filed at a rate of more than 10 a day due to the later discovery of prior art that their original search tools missed.

What these professions have in common is a need to develop search strategies that are accurate, repeatable and transparent. The traditional solution to this problem is to use line-by-line query builders which require the user to enter Boolean strings that may then be combined to form a multi-line search strategy:

[Figure: WHO ICTRP search builder]

However, such query builders typically offer limited support for error checking or query optimization, and their output is often compromised by errors and inefficiencies. In this post, we review three early but highly original and influential alternatives, and discuss their contribution to contemporary issues and design challenges.
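To make the idea of a multi-line search strategy concrete, here is a minimal Python sketch of how numbered lines might be combined into a single Boolean string, with two basic error checks of the kind such builders often lack. The `#n` reference syntax, the function name `expand_strategy` and the sample terms are assumptions made for illustration; they are not taken from any particular query builder.

```python
import re


def expand_strategy(lines):
    """Expand a multi-line strategy into one Boolean string.

    `lines` maps a line number to a Boolean string. A token such as `#1`
    refers to an earlier line (hypothetical syntax, for illustration only).
    """
    if not lines:
        raise ValueError("the strategy has no lines")

    expanded = {}
    for number in sorted(lines):
        text = lines[number]

        # Check 1: parentheses must balance within the line.
        if text.count("(") != text.count(")"):
            raise ValueError(f"line {number} has unbalanced parentheses")

        def substitute(match):
            ref = int(match.group(1))
            # Check 2: a line may only refer to lines already expanded,
            # which rules out undefined, forward and self references.
            if ref not in expanded:
                raise ValueError(f"line {number} refers to undefined line #{ref}")
            return f"({expanded[ref]})"

        expanded[number] = re.sub(r"#(\d+)", substitute, text)

    # The last line is treated as the overall strategy.
    return expanded[max(expanded)]


strategy = {
    1: "nurse OR nursing",
    2: 'burnout OR "occupational stress"',
    3: "#1 AND #2",
}
print(expand_strategy(strategy))
# (nurse OR nursing) AND (burnout OR "occupational stress")
```

Even checks this simple catch mistakes that would otherwise surface only as a silently wrong result set.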

Alternative approaches

The application of data visualization to search query formulation can offer significant benefits, such as fewer zero-hit queries, improved query comprehension, and better support for exploration of an unfamiliar database. An early example of such an approach is that of Anick et al. (1989), who developed a system that could parse natural language queries and represent them using a “Query Reformulation Workspace”. Although early work, this system introduced a number of key design ideas:

  • The query was represented as a set of ‘tiles’ on a visual canvas, which could be (re)arranged by direct manipulation
  • Query elements could be made ‘active’ or ‘inactive’
  • The layout was read left to right: tiles that overlapped vertically were ORed, and those that did not were ANDed

For example, the natural language query ‘Copying backup savesets from tape under ~5.0’ would be represented as follows:

[Figure: Query Reformulation Workspace (Anick et al., 1989)]

and the Boolean semantic interpretation (shown in the lower half) would be:

(“copy” AND “BACKUP saveset” AND “tape” AND (“~5.0” OR “version 5.0”)).

The set of results retrieved was defined as all those documents that contained some combination of terms from any possible left-to-right path through the chart. Crucially, the user was at liberty to re-arrange those tiles to reformulate the expression, and to activate or deactivate alternative elements to optimise the query. In addition, the system offered support for integration with thesauri and it also displayed the number of hits in the lower left corner of each tile. These are remarkably prescient ideas, and themes to which we return in our own work.
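The tile semantics described above can be sketched in a few lines of Python. This is a simplified model and not the original implementation: it assumes the workspace has already been reduced to an ordered list of columns, where tiles in the same column overlap vertically (ORed) and successive columns are ANDed. The names `Tile`, `to_boolean` and `paths` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Tile:
    term: str
    active: bool = True


def to_boolean(columns):
    """Build the Boolean reading of a workspace: OR within a column, AND across."""
    groups = []
    for column in columns:
        terms = [f'"{tile.term}"' for tile in column if tile.active]
        if not terms:
            continue  # a column with no active tiles drops out of the query
        groups.append(terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


def paths(columns):
    """Enumerate every left-to-right path through the active tiles."""
    active = [[tile.term for tile in column if tile.active] for column in columns]
    return list(product(*[terms for terms in active if terms]))


workspace = [
    [Tile("copy")],
    [Tile("BACKUP saveset")],
    [Tile("tape")],
    [Tile("~5.0"), Tile("version 5.0")],
]

print(to_boolean(workspace))
# "copy" AND "BACKUP saveset" AND "tape" AND ("~5.0" OR "version 5.0")
print(len(paths(workspace)))
# 2

# Deactivating a tile reformulates the query without retyping it.
workspace[2][0].active = False
print(to_boolean(workspace))
# "copy" AND "BACKUP saveset" AND ("~5.0" OR "version 5.0")
```

Each path corresponds to one conjunction of terms, so the retrieved set is the union of the documents matching any path.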

In subsequent work, Fishkin and Stone (1995) investigated the application of direct manipulation techniques to the problem of database query formulation, using a system of ‘lenses’ to refine and filter the data. Lenses could be combined by stacking them and applying a suitable operator, such as AND or OR. For example, a user could search a database of US census data to find cities that have high salaries (the upper filter) AND low taxes (the lower filter):

[Figure: stacked lenses filtering US census data (Fishkin and Stone, 1995)]

Moreover, these lenses could be combined to create compound lenses, and hence support the encapsulation of queries of arbitrary complexity. This is a further theme to which we return in our own work.
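One way to picture stacking and compound lenses is as composable filters. The Python sketch below is an illustration under stated assumptions and not a reconstruction of Fishkin and Stone's system: the `Lens` class, the thresholds and the three city records are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    name: str
    test: Callable[[dict], bool]

    def __and__(self, other):
        # Stacking two lenses with AND yields a new (compound) lens.
        return Lens(f"({self.name} AND {other.name})",
                    lambda record: self.test(record) and other.test(record))

    def __or__(self, other):
        return Lens(f"({self.name} OR {other.name})",
                    lambda record: self.test(record) or other.test(record))

    def apply(self, records):
        return [record for record in records if self.test(record)]


# Invented records; the values are placeholders, not census figures.
cities = [
    {"name": "City A", "salary": 72000, "tax_rate": 0.04},
    {"name": "City B", "salary": 68000, "tax_rate": 0.09},
    {"name": "City C", "salary": 41000, "tax_rate": 0.03},
]

high_salary = Lens("high salary", lambda c: c["salary"] >= 60000)
low_taxes = Lens("low taxes", lambda c: c["tax_rate"] <= 0.05)

stacked = high_salary & low_taxes
print(stacked.name, [c["name"] for c in stacked.apply(cities)])
# (high salary AND low taxes) ['City A']

# A compound lens is itself a lens, so it can be stacked again.
very_low_taxes = Lens("very low taxes", lambda c: c["tax_rate"] <= 0.03)
compound = stacked | very_low_taxes
print(compound.name, [c["name"] for c in compound.apply(cities)])
# ((high salary AND low taxes) OR very low taxes) ['City A', 'City C']
```

Because a compound lens has the same interface as a simple one, queries of arbitrary depth can be built without any new machinery.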

A further influential work is that of Jones (1998), who reflected upon the difficulties that users experience in dealing with Boolean logic, noting in particular the disconnect between query specification and result browsing and the inefficiency caused by a lack of feedback regarding the effectiveness of individual terms. He proposed an alternative in which concepts are expressed using a Venn diagram notation combined with integrated query result previews. Queries could be formulated by overlapping objects within the workspace to create intersections and disjunctions, and subsets could be selected to facilitate execution of subcomponents of an overall query:

[Figure: Venn diagram query workspace (Jones, 1998)]

Crucially, Jones noted that although the representation offered a degree of universality of expression, the semantic interpretation would necessarily need to be tied to that of the particular collection being searched, and thus independent adapters would be required for each such database. This is also a theme to which we return in our own work.
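The combination of Venn diagram regions and result previews maps naturally onto set operations. The following Python sketch assumes each concept has already been resolved to a set of matching document IDs; the function name `venn_regions`, the labels and the IDs are hypothetical and serve only to illustrate the idea.

```python
def venn_regions(hits_a, hits_b):
    """Split two overlapping result sets into the three regions of a Venn diagram."""
    return {
        "A only": hits_a - hits_b,
        "A AND B": hits_a & hits_b,
        "B only": hits_b - hits_a,
    }


# Invented document IDs for two concepts placed on the workspace.
visualization = {1, 2, 3, 4}
query_formulation = {3, 4, 5}

regions = venn_regions(visualization, query_formulation)

# A result preview: the number of hits shown inside each region.
preview = {label: len(docs) for label, docs in regions.items()}
print(preview)
# {'A only': 2, 'A AND B': 2, 'B only': 1}

# Selecting a subset of regions executes just that part of the query.
selected = regions["A AND B"] | regions["B only"]
print(sorted(selected))
# [3, 4, 5]
```

Showing a count in every region gives immediate feedback on how much each term contributes, which addresses the lack of feedback that Jones identified.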

In summary

In this short piece we have briefly reviewed some of the challenges involved in articulating complex search strategies and Boolean expressions, and studied three early but highly original alternative approaches. Given the decade in which these systems were developed (the first of which pre-dates the web by several years), this is extraordinary work, offering design insights and principles of enduring value. In our next post, we’ll review some of the more recent approaches, and reflect on how their design ideas and insights may be used to address contemporary search challenges.

About Tony Russell-Rose

Tony Russell-Rose is founder of 2dSearch (https://www.2dsearch.com), a start-up applying artificial intelligence, natural language processing and data visualisation to create the next generation of professional search tools. He is also director of UXLabs, a research and design studio specialising in complex search and information access applications. He has served as vice-chair of the BCS Information Retrieval group and chair of the CIEHF Human-Computer Interaction group. Previously Tony has led R&D teams at Canon, Reuters, Oracle, HP Labs and BT Labs. He is author of "Designing the Search Experience" (Elsevier, 2013) and publishes widely on IR, HCI and NLP.

2 responses to “Visualizing search strategies”

  1. Michael Upshall

    Great article, Tony!

    Firstly, your complaints about Boolean are spot on. However, in my experience, the real problem is not so much the inadequacy of Boolean but the unwillingness of information professionals to try using something else.

    At UNSILO we developed a tool that does this kind of searching in a rather different way, using concepts (we call it UNSILO Classify, even though it isn’t a classification tool, more a collection-building tool). However, when I presented this to healthcare professionals, and then to patent experts, they were unwilling to forgo the claimed accuracy and replicability of Boolean. Their fear is that they might miss something (i.e. recall has to be very high), and they look at me with blank incomprehension when I tell them that Boolean is ultimately a string-based search, and so will miss things. The example I give is that if you search for “kidney disease”, then no amount of stemming or wildcards will ever find “renal disease”. In other words, you need some kind of synonym or concept searching, and a tool based on unsupervised machine learning, such as the UNSILO tool, will identify synonyms and related terms.

    Nonetheless, the users remain resolutely unconvinced. They just aren’t interested in looking at new types of search strategies. As I see it, the problem is more one of users and their attitudes than the search itself. Because Boolean is “replicable”, goes the argument, it is therefore more reliable. By “replicable”, they mean that a search can be saved and run again at any time, and if it were run against the same corpus, the same results would be obtained. These criteria would be equally valid for the search tools you describe, I would imagine – so it’s not a valid objection!
