According to the IDC whitepaper “The High Cost of Not Finding Information”, knowledge workers spend 2.5 hours per day searching for information. Whether they eventually find what they are looking for or give up and make a sub-optimal decision, both outcomes carry a high cost. The recruitment industry, for example, relies on Boolean search as the foundation of the candidate sourcing process, and yet finding candidates with appropriate skills and experience remains an ongoing challenge. Similarly, patent agents rely on accurate prior art search as the foundation of their due diligence process, and yet infringement suits are filed at a rate of more than ten a day due to the later discovery of prior art that their original search tools missed.
What these professions have in common is a need for search strategies that are accurate, repeatable and transparent. The traditional solution is the line-by-line query builder, which requires the user to enter Boolean strings that are then combined to form a multi-line search strategy.
However, such query builders typically offer limited support for error checking or query optimization, and their output is often compromised by errors and inefficiencies. In this post, we review three early but highly original and influential alternatives, and discuss their contribution to contemporary issues and design challenges.
Alternative approaches
The application of data visualization to search query formulation can offer significant benefits, such as fewer zero-hit queries, improved query comprehension, and better support for exploration of an unfamiliar database. An early example of such an approach is that of Anick et al. (1989), who developed a system that could parse natural language queries and represent them using a “Query Reformulation Workspace”. Although an early effort, the system introduced a number of key design ideas:
- The query was represented as a set of ‘tiles’ on a visual canvas, which could be (re)arranged by direct manipulation
- Query elements could be made ‘active’ or ‘inactive’
- The layout had a left-to-right reading, with tiles that overlapped vertically being ORed and those that did not being ANDed
For example, the natural language query ‘Copying backup savesets from tape under ~5.0’ would be represented as follows:
[Figure: the example query represented as a set of tiles in the Query Reformulation Workspace]
and the Boolean semantic interpretation (shown in the lower half) would be:
(“copy” AND “BACKUP saveset” AND “tape” AND (“~5.0” OR “version 5.0”)).
The set of results retrieved was defined as all those documents that contained some combination of terms from any possible left-to-right path through the chart. Crucially, the user was at liberty to re-arrange those tiles to reformulate the expression, and to activate or deactivate alternative elements to optimise the query. In addition, the system offered support for integration with thesauri and it also displayed the number of hits in the lower left corner of each tile. These are remarkably prescient ideas, and themes to which we return in our own work.
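To make the tile semantics concrete, here is a minimal sketch in Python of how such a layout might map to a Boolean expression. The Tile structure, column representation and function names are our own illustrative assumptions, not Anick et al.’s implementation:

```python
# Minimal sketch of the tile semantics: each column holds vertically
# overlapping tiles (ORed together); successive columns are ANDed.
# Tiles can be toggled active/inactive, mirroring the workspace.

from dataclasses import dataclass

@dataclass
class Tile:
    term: str
    active: bool = True

def to_boolean(columns: list[list[Tile]]) -> str:
    """Render a left-to-right tile layout as a Boolean expression."""
    clauses = []
    for column in columns:
        terms = [f'"{t.term}"' for t in column if t.active]
        if not terms:
            continue  # a fully deactivated column drops out of the query
        clauses.append(terms[0] if len(terms) == 1
                       else "(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)

# The worked example from the text:
query = [
    [Tile("copy")],
    [Tile("BACKUP saveset")],
    [Tile("tape")],
    [Tile("~5.0"), Tile("version 5.0")],  # vertically overlapping tiles
]
print(to_boolean(query))
# "copy" AND "BACKUP saveset" AND "tape" AND ("~5.0" OR "version 5.0")
```

Deactivating a tile simply removes its term from the rendered expression, which is what makes the workspace a tool for interactive query optimisation rather than a static display.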
In subsequent work, Fishkin and Stone (1995) investigated the application of direct manipulation techniques to the problem of database query formulation, using a system of ‘lenses’ to refine and filter the data. Lenses could be combined by stacking them and applying a suitable operator, such as AND or OR. For example, a user could search a database of US census data to find cities that have high salaries (the upper filter) AND low taxes (the lower filter).
Moreover, these lenses could be combined to create compound lenses, and hence support the encapsulation of queries of arbitrary complexity. This is a further theme to which we return in our own work.
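By way of analogy (Fishkin and Stone’s system was a graphical tool; the field names and thresholds below are invented for illustration), a lens can be modelled as a predicate over a record, stacking as AND/OR composition, and a compound lens as simply another predicate:

```python
# An analogy to the lens design: each lens is a predicate over a row,
# stacking corresponds to AND/OR composition, and a compound lens is
# itself just another predicate. Field names and thresholds are invented.

from typing import Callable

Lens = Callable[[dict], bool]

def AND(*lenses: Lens) -> Lens:
    return lambda row: all(lens(row) for lens in lenses)

def OR(*lenses: Lens) -> Lens:
    return lambda row: any(lens(row) for lens in lenses)

# Two simple lenses over hypothetical census records:
high_salary: Lens = lambda row: row["median_salary"] > 40_000
low_taxes: Lens = lambda row: row["tax_rate"] < 0.05

# Stacking the two lenses with AND, as in the worked example above:
compound = AND(high_salary, low_taxes)

cities = [
    {"name": "A", "median_salary": 55_000, "tax_rate": 0.04},
    {"name": "B", "median_salary": 30_000, "tax_rate": 0.03},
]
print([c["name"] for c in cities if compound(c)])  # ['A']
```

Because a compound lens has the same type as a simple one, composition nests to arbitrary depth, which is precisely the encapsulation property noted above.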
A further influential work is that of Jones (1998), who reflected upon the difficulties that users experience in dealing with Boolean logic, noting in particular the disconnect between query specification and result browsing and the inefficiency caused by a lack of feedback regarding the effectiveness of individual terms. He proposed an alternative in which concepts are expressed using a Venn diagram notation combined with integrated query result previews. Queries could be formulated by overlapping objects within the workspace to create intersections and disjunctions, and subsets could be selected to facilitate execution of subcomponents of an overall query:
[Figure: a query formulated as overlapping regions in Jones’s Venn diagram workspace, with integrated result previews]
Crucially, Jones noted that although the representation offered a degree of universality of expression, the semantic interpretation would necessarily need to be tied to that of the particular collection being searched, and thus independent adapters would be required for each such database. This is also a theme to which we return in our own work.
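The underlying semantics can be sketched as ordinary set operations over document identifiers, with hit counts serving as the result previews Jones describes. The postings lists below are invented for illustration:

```python
# Sketch of Venn-style query regions as set operations over document IDs,
# with hit counts as 'result previews'. The postings lists are invented.

postings = {
    "archive": {1, 2, 3, 5},
    "network": {2, 3, 4, 6},
    "protocol": {3, 5, 6},
}

# Overlapping two objects in the workspace creates an intersection (AND);
# enclosing them in a common region corresponds to disjunction (OR):
both = postings["archive"] & postings["network"]
either = postings["archive"] | postings["network"]

# Any subset can be selected and executed as a subcomponent of the query:
sub = both & postings["protocol"]

# Per-component hit counts give the term-level feedback whose absence
# Jones criticised in conventional Boolean interfaces:
for label, docs in [
    ("archive", postings["archive"]),
    ("archive AND network", both),
    ("archive OR network", either),
    ("(archive AND network) AND protocol", sub),
]:
    print(f"{label}: {len(docs)} hits")
```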
In summary
In this piece we have briefly reviewed some of the challenges involved in articulating complex search strategies and Boolean expressions, and studied three early but highly original alternative approaches. Given the decade in which these systems were developed (the first of which pre-dates the web itself), this is extraordinary work, offering design insights and principles of enduring value. In our next post, we’ll review some of the more recent approaches, and reflect on how their design ideas and insights may be used to address contemporary search challenges.
Great article, Tony!
Firstly, your complaints about Boolean are spot on. However, in my experience, the real problem is not so much the inadequacy of Boolean but the unwillingness of information professionals to try using something else.
At UNSILO we developed a tool that can do this kind of searching in a rather different way, using concepts (we call it UNSILO Classify, even though it isn’t really a classification tool so much as a collection-building tool). However, when I presented this to healthcare professionals, and then to patent experts, they were unwilling to forgo the claimed accuracy and replicability of Boolean. Their fear is that they might miss something (i.e. recall has to be very high), and they look at me with blank incomprehension when I tell them that Boolean is ultimately a string-based search, and so will miss things. The example I give is that if you search for “kidney disease”, no amount of stemming or wildcards will ever find “renal disease”. In other words, you need some kind of synonym or concept searching, and a tool based on unsupervised machine learning, such as the UNSILO tool, will identify synonyms and related terms.
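To make that point concrete, here is a toy sketch of the kind of expansion I mean; the synonym table is invented for illustration, and a real concept-based tool would derive such relationships automatically rather than from a hand-built list:

```python
# Toy sketch: expanding a query phrase into an OR-group over synonyms,
# so that a search for "kidney disease" also matches "renal disease".
# The synonym table is invented; a concept-based tool would derive it.

synonyms = {
    "kidney disease": ["renal disease", "nephropathy"],
}

def expand(phrase: str) -> str:
    """Rewrite a phrase as an OR-group over its known synonyms."""
    variants = [phrase] + synonyms.get(phrase, [])
    if len(variants) == 1:
        return f'"{phrase}"'
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"

print(expand("kidney disease"))
# ("kidney disease" OR "renal disease" OR "nephropathy")
```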
Nonetheless, the users remain resolutely unconvinced. They just aren’t interested in looking at new types of search strategies. As I see it, the problem is more one of users and their attitudes than the search itself. Because Boolean is “replicable”, goes the argument, it is therefore more reliable. By “replicable”, they mean that a search can be saved and run again at any time, and if it were run against the same corpus, the same results would be obtained. These criteria would be equally valid for the search tools you describe, I would imagine – so it’s not a valid objection!
Thanks Michael. It is indeed true that some professions can be conservative in their outlook, almost to the point where they will defend the approaches they’ve learned in the face of evidence to the contrary. Partly this is human nature, but partly it’s also that some individuals feel disintermediated when presented with automated approaches, and (understandably) feel the need to defend their expertise. And a further proportion have become (to a degree) ‘institutionalized’ in their thinking.
But I personally wouldn’t ‘blame’ people for this attitude. I see the enlightenment process as a gradual evolution, rather than a series of events (or sales pitches :). I believe it could take years to persuade users of line-by-line string editing tools to embrace alternative, more scalable and robust representations and processes. But I do think it will eventually change, for all the same reasons that people don’t write code using tools like Notepad.