Monday 14 October 2013

Searching with Precision



What features do we expect from our Search Engines? How much do these features contribute to the effectiveness and accessibility of large data collections such as the genealogical ones?

I’ve heard it said that a data collection is only as good as the search engine providing access to it. This is not totally true, as we’ll see soon, but certainly a simplistic or inadequate search engine can render a data collection useless. It should be a concern, then, that there is so much variation between the search tools that we’re offered. Although I will mention a few providers as illustrations of certain features, I don’t intend this post to be a blow-by-blow comparison of the providers. Instead, I want to explore the pros and cons of various search features – some common, some uncommon, and some positively rare.

It doesn’t seem that long ago that a ‘search’ was something we did in a word processor. You’d enter a single word and it would search for the next occurrence, either forwards or backwards. There would usually be an option for a case-blind match (where ‘A’ and ‘a’ are equivalent, etc.), and if it was really advanced then an additional option for accent-blind matching (where ‘Á’ and ‘A’ are equivalent, etc.). Most word processors now find all the hits in one sweep and then let you navigate between them, but the principle is similar.

With the advent of the World Wide Web, we found ourselves part of an environment with many, many documents (i.e. HTML pages) and the concept of a Web Search Engine became familiar to us all. However, the same principles applied to any computerised document retrieval system and so the concept of a Search Engine actually predated the Web.

There are two broad classes of document searches that you may not have thought about:
  • Documents represented solely by indexed meta-data. That meta-data may be keywords or phrases used to categorise the content and would typically be available where the content is non-textual (e.g. scans, video, binary data) or is non-digitised text. For instance, in a library, a book may be represented by its title, author, subject, etc. We’re also familiar with this when we search for a census page or a service record since only selected details from the content will have been extracted and indexed. Searching here will only locate the document itself, not the specific references inside it.
  • Documents with a predominantly textual content that has been digitised. Examples include digitised books and newspapers, as well as Web pages. Such text can be searched for the required references but we need to frame the context in order to ensure that we get useful references.

The goal of any search engine is to retrieve a set of documents matching some given criteria. The situation with digitised text (the second case above) is more complicated because you need to identify the matches (i.e. “hits”) that have an appropriate context in free-form text. For instance, simply searching the Web for “Nottingham” might generate more hits than you could read through in a lifetime, and so you need to whittle it down. A typical approach is to specify multiple words or phrases. Originally, the convention was to find documents with any of those multiple words (i.e. word1 OR word2 …) but this is often too relaxed to be useful. Google was one of the first search engines to buck the trend and default to finding documents with all of the words (i.e. word1 AND word2 …). More on this later though.
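The difference between those two defaults can be sketched with a toy inverted index. The documents and the index-building code below are my own illustrative inventions, not any particular engine’s implementation: any-of is a set union of the posting lists, all-of is their intersection.

```python
# Toy documents; the ids and text are invented for illustration.
docs = {
    1: "robin hood of nottingham",
    2: "the sheriff of nottingham",
    3: "robin hood and little john",
}

# Build an inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def any_of(*words):
    """OR semantics: the union of the posting sets."""
    return set().union(*(index.get(w, set()) for w in words))

def all_of(*words):
    """AND semantics: the intersection of the posting sets."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(any_of("robin", "nottingham"))   # documents 1, 2 and 3
print(all_of("robin", "nottingham"))   # only document 1
```

The all-of result is immediately more focused, which is why it became the more useful default as collections grew.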

When searching textual documents, there are effectively two separate goals: to find all the documents matching your criteria, and then to identify the matching parts of each document so that the end-user can select the appropriate ones. In other words, it is accepted that locating the documents themselves is unlikely to be exact because it’s so hard to specify a precise context. Let’s call these document-hits and word-hits for convenience.

Once the document-hits have been obtained, you might be presented with a list of them with a synopsis of the word-hits for each one. This alone can be problematic since that synopsis typically displays only the first, or first few, word-hits. I don’t think I’ve actually used a search engine where the distinction between subject-words and context-words is made but there is a difference. For instance, I once wanted to search for references to the surname Jesson in the context of the city of Nottingham[1]. Putting the two words into the search tool gave them equal significance, but I was only interested in “Jesson” word-hits, not “Nottingham”; the second word was to help define a context. The result was that the synopsis showed only Nottingham word-hits since there were many more of those than the Jesson word-hits. I was then forced to look through every document-hit individually but unfortunately this particular interface did not highlight the word-hits within each document – a very basic requirement when searching newspaper print!

When I say ‘word’, I implicitly mean phrases too. The use of quotation marks is commonly available to delimit an exact phrase, or to treat a word exactly with no alternative spellings or similar words. For instance, if I searched Google for STEMMA then it would also show hits for stimma, but if I searched for “STEMMA” (including the quotes) then this would not happen.

So is a choice of any-of or all-of enough when we’re entering multiple words? The answer has to be ‘No’ since you may want to exclude some words, or define some more complex combination of words. Search engines usually provide a specific syntax, called a query language, for precisely expressing a search. Unfortunately, though, end-users are mostly considered incapable of handling such a syntax and so we often get a dumbed-down interface using a form-fill. For instance, the London Gazette offers a fairly typical form providing separate search boxes for all-of, any-of, and an exact-phrase. But what happens when you want to find two separate exact-phrases, or one pair of words OR another pair of words? These are easy to express in a query language, e.g.

            “exact phrase 1” AND “exact phrase 2”
            (word1 AND word2) OR (word3 AND word4)

The problem is not so much that end-users can’t handle a query language. It’s just that there is no universal syntax, and not all features are supported by all search engines.
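As a sketch of what those two query shapes actually ask for, here is how they might be evaluated directly against document text, without a full query-language parser. The documents and helper functions are invented for illustration:

```python
# Toy documents; ids and text are illustrative assumptions.
docs = {
    1: "notice of marriage at st mary",
    2: "notice of death at st mary",
    3: "birth announcement st peter",
}

def matches_phrase(text, phrase):
    # An exact phrase must appear verbatim in the text.
    return phrase in text

def has_all(text, *words):
    # All of the given words must appear as whole words.
    return all(w in text.split() for w in words)

# "exact phrase 1" AND "exact phrase 2"
hits1 = {d for d, t in docs.items()
         if matches_phrase(t, "notice of") and matches_phrase(t, "st mary")}

# (word1 AND word2) OR (word3 AND word4)
hits2 = {d for d, t in docs.items()
         if has_all(t, "notice", "marriage") or has_all(t, "birth", "peter")}

print(hits1)  # {1, 2}
print(hits2)  # {1, 3}
```

Neither query can be expressed through three separate all-of/any-of/exact-phrase boxes, which is exactly the limitation of the form-fill approach.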

The AND and OR, used here, are referred to as Boolean, or logical, operators. Sometimes AND is represented by a special character such as ‘+’, and OR by ‘|’, but there is no standard. Along with AND/OR, the next most common Boolean operator is NOT (or ‘-’) to exclude a word. XOR is very rarely available, possibly because only a few people even know what the semantics of an “Exclusive OR” operation are. Google used to have a ‘+’ operator but it was a monadic operator (in contrast to the dyadic AND) and was the opposite of their ‘-’ operator used to exclude a given word. This was withdrawn in 2011 because it conflicted with the use of ‘+’ as a prefix to identify a person in Google+. The operator was considered redundant because Google uses an all-of approach – or it used to. Around 2009, it started using a more probabilistic approach where the operator was almost ignored, except to order the hits with the more likely ones appearing first. This meant that the ‘+’ operator was actually essential if you wanted to force the inclusion of particular words, and its deprecation caused an outcry amongst users. Interestingly, the dyadic AND operator isn’t listed in any Google help that I’ve seen, and yet it certainly affects a query. It’s hard to prove exactly how it is being interpreted but, from a cursory glance, it appears to be doing what I would expect.

Less common, but equally important, is a class of operators called proximity or adjacency operators. These are only available through query languages, which is probably why they are less well known. They allow you to search for words that are close together, and so are potentially more powerful than a Boolean operator for establishing a search context. The Times Digital Archive used to use a NEAR operator (e.g. w1 NEAR w2), although this is directionless and with no specific range. Google’s ‘*’, when used between words, amounts to the same thing. In contrast, the Gale newspaper archive allows both a direction and a range to be specified, e.g. ‘word1 W6 word2’ to find word2 within 6 words after word1, while ‘word1 N6 word2’ is the directionless equivalent. For newspapers, the importance of these cannot be over-stressed since lists of notices (e.g. police ones) or advertisements are rarely digitised as separate articles. That often means that your separate word-hits are scattered across most of a broadsheet page, and occasionally across multiple pages. This, in turn, wastes your time as you have to visit many more document-hits manually. The British Newspaper Archive is one of the newspaper archives that do not provide such operators, and they did not consider them necessary when I approached them.
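The directed (W) and directionless (N) operators can be sketched over token positions as follows. The tokeniser, sample text, and function names here are my own assumptions, not Gale’s implementation:

```python
def positions(tokens, word):
    """All positions at which a word occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def within(tokens, w1, w2, n, directed=False):
    """True if w2 occurs within n words of w1 (after it, if directed).

    directed=True models 'w1 Wn w2'; directed=False models 'w1 Nn w2'.
    """
    for i in positions(tokens, w1):
        for j in positions(tokens, w2):
            gap = j - i if directed else abs(j - i)
            if 0 < gap <= n:
                return True
    return False

# Invented snippet in the style of a newspaper notice.
tokens = "sale by auction jesson and son auctioneers nottingham".split()

print(within(tokens, "jesson", "nottingham", 6, directed=True))   # True: W6 match
print(within(tokens, "nottingham", "jesson", 6, directed=True))   # False: wrong direction
print(within(tokens, "nottingham", "jesson", 6))                  # True: N6 ignores direction
```

A real engine would evaluate this against a positional index rather than raw tokens, but the range and direction tests are the essence of the operators.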

Another common feature is wildcard characters that allow some variability within a word. For instance, ‘*’ to represent zero-or-more characters, ‘?’ to represent exactly one unspecified character, and ‘!’ to represent an optional unspecified character. A search for pigment* might match pigment, pigments, pigmentation, etc., and a search for colo!r might match both color (American) and colour (British). These are very useful where there are alternative spellings as in US/UK differences or historical spellings.
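A common way to implement such wildcards is to translate them into regular expressions. The mapping below – including the rarer ‘!’ convention – is a sketch under my own assumptions, not any provider’s actual syntax:

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into an anchored regular expression.

    '*' -> zero-or-more characters; '?' -> exactly one character;
    '!' -> an optional single character (an assumed, non-standard convention).
    """
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "!":
            out.append(".?")
        else:
            out.append(re.escape(ch))   # treat everything else literally
    return re.compile("^" + "".join(out) + "$")

rx = wildcard_to_regex("colo!r")
print(bool(rx.match("color")))          # True  (American spelling)
print(bool(rx.match("colour")))         # True  (British spelling)

rx2 = wildcard_to_regex("pigment*")
print(bool(rx2.match("pigmentation")))  # True
print(bool(rx2.match("pig")))           # False
```

Anchoring the expression with ‘^’ and ‘$’ matters: without it, ‘colo!r’ would also match inside longer words such as ‘discolored’.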

Some search tools try to help by introducing variants of what you had specified, such as the Google “similar words” example above. Google also had a tilde (‘~’) operator that explicitly included synonyms of a given word (see below). In our genealogical tools, a search may add name variants such as “Antony” and “Tony” when looking for “Anthony”, or common abbreviations of a name. Some tools try to infer some semantics by recognising phrases such as ‘where is’ or ‘what is’. There is a modern trend to return more than you had asked for, just in case you were overly specific. This can be frustrating to someone who knows exactly what they’re looking for, and is one of the causes of the Lilliputian arguments over the search tools of the big providers Ancestry and findmypast. Ancestry’s search is often criticised for returning much more than what you want (although mostly prioritised) and findmypast for being too rigid and precise.

Another way that software hopes to improve our ability to search text is by the addition of semantic meta-data, such as RDF, hidden in the text, and this is a cornerstone of the Semantic Web. For instance, not just having the name of a person or place written as amorphous text, and not simply marking it as the name of a person or place, but actually adding extra information such as contact details, biographical details, postal-address information, etc. Although this may have serious issues when applied to historical data (see Semantic Tagging of Historical Data), it will introduce another problem: our searches need to be a lot more specific in order to take advantage of that meta-data during a search. It would be unrealistic for a generic search tool to achieve this with a form-fill, so does that mean it will require bigger query languages? Do you see the dichotomy here? On one hand our searches are made more probabilistic and less precise, and on the other hand we will need to frame a more specific context to take advantage of semantic meta-data.

The fact that there is no standardised query language is not unexpected. You would have more luck herding cats than getting agreement on that. Part of the problem, though, is that not all search engines are implemented with the same level of sophistication. Hence, they cannot all perform the same types of complex search. As end-users, we all have to take some responsibility for not using the so-called advanced features, or even asking for them when they’re not there. How many people reading this simply chuck a bunch of words into a search box and hope for the best? One of the reasons that Google dropped its tilde (‘~’) operator was that so few people used it[2] and it was expensive to maintain.

How many people are frustrated by the inability to be precise in a search? How many people feel that the probabilistic approach to searching, where any attempt at precision is only partially honoured, just dilutes their effectiveness? Do many people use query languages when they’re available? How many people have had situations where a census search needed to specify a lot more than just a person’s details, or an address’s details? Findmypast has both an address-search and a person-search for their census collections, but they cannot be combined, e.g. finding anyone with the surname “Proctor” on a given street. How much more basic can a requirement be?


[1] This was using the findmypast interface to the British Newspaper Archive database.
[2] Google Kills Tilde Search Operator, Jennifer Slegg, June 25, 2013. http://searchenginewatch.com/article/2277383/Google-Kills-Tilde-Search-Operator
