What features do we expect from our Search Engines? How much
do these features contribute to the effectiveness and accessibility of large
data collections such as the genealogical ones?
I’ve heard it said that a data collection is only as good as
the search engine providing access to it. This is not totally true, as we’ll
see soon, but certainly a simplistic or inadequate search engine can render a
data collection useless. It should be a concern, then, that there is so much
variation between the search tools that we’re offered. Although I will mention
a few providers as illustrations of certain features, I don’t intend this post
to be a blow-by-blow comparison of the providers. Instead, I want to explore
the pros and cons of various search features – some common, some uncommon,
and some positively rare.
It doesn’t seem that long ago that a ‘search’ was something
we did in a word processor. You’d enter a single word and it would search for
the next occurrence, either forwards or backwards. There would usually be an
option for a case-blind match (where ‘A’ and ‘a’ are equivalent, etc.), and if
it was really advanced then an additional option for accent-blind (where ‘Á’
and ‘A’ are equivalent, etc.). Word processors mostly now find all the hits in
one sweep and then let you navigate between them, but the principle is
similar.
With the advent of the World Wide Web, we found ourselves
part of an environment with many, many documents (i.e. HTML pages) and the
concept of a Web
Search Engine became familiar to us all. However, the same principles
applied to any computerised document retrieval system and so the concept of a Search Engine
actually predated the Web.
There are two broad classes of document searches that you
may not have thought about:
- Documents represented solely by indexed meta-data. That meta-data may be keywords or phrases used to categorise the content and would typically be available where the content is non-textual (e.g. scans, video, binary data) or is non-digitised text. For instance, in a library, a book may be represented by its title, author, subject, etc. We’re also familiar with this when we search for a census page or a service record since only selected details from the content will have been extracted and indexed. Searching here will only locate the document itself, not the specific references inside it.
- Documents with a predominantly textual content that has been digitised. Examples include digitised books and newspapers, as well as Web pages. Such text can be searched for the required references but we need to frame the context in order to ensure that we get useful references.
The goal of any search engine is to retrieve a set of
documents matching some given criteria. The situation with digitised text (the
2nd case above) is more complicated because you need to identify the
matches (i.e. “hits”) that have an appropriate context in free-form text. For
instance, simply searching the Web for “Nottingham” might generate more hits
than you could read through in a lifetime, and so you need to whittle it down.
A typical approach is to specify multiple words or phrases. Originally, the
convention was to find documents with any of those multiple words (i.e. word1
OR word2 …) but this is often too relaxed to be useful. Google was one of the first
search engines to buck the trend and default to finding documents with all of
the words (i.e. word1 AND word2 …). More on this later though.
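The difference between the two defaults can be sketched in a few lines of Python; the documents and search terms here are invented purely for illustration:

```python
# Any-of (OR) versus all-of (AND) matching over a tiny invented collection.
docs = {
    "d1": "nottingham lace market history",
    "d2": "jesson family of nottingham",
    "d3": "jesson genealogy notes",
}
terms = ["jesson", "nottingham"]

def any_of(text):            # word1 OR word2 ...
    words = text.split()
    return any(t in words for t in terms)

def all_of(text):            # word1 AND word2 ...
    words = text.split()
    return all(t in words for t in terms)

print([d for d, text in docs.items() if any_of(text)])  # → ['d1', 'd2', 'd3']
print([d for d, text in docs.items() if all_of(text)])  # → ['d2']
```

The OR default returns every document mentioning either word; the AND default cuts the hits down to documents mentioning both, which is usually what you meant.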
When searching textual documents, there are effectively two
separate goals: to find all the documents matching your criteria, and then to
identify the matching parts of each document so that the end-user can select
the appropriate ones. In other words, it is accepted that locating the
documents themselves is unlikely to be exact because it’s so hard to specify a
precise context. Let’s call these document-hits and word-hits for convenience.
Once the document-hits have been obtained, you might be
presented with a list of them with a synopsis of the word-hits for each one.
This alone can be problematic since that synopsis typically displays only the
first, or first few, word-hits. I don’t think I’ve actually used a search
engine where the distinction between subject-words and context-words is made,
but there is a difference. For instance, I once wanted to search for references
to the surname Jesson in the context of the city of Nottingham[1].
Putting the two words into the search tool gave them equal significance, but I
was only interested in “Jesson” word-hits, not “Nottingham”; the second word
was to help define a context. The result was that the synopsis showed only
Nottingham word-hits since there were many more of those than the Jesson
word-hits. I was then forced to look through every document-hit individually but
unfortunately this particular interface did not highlight the word-hits within
each document – a very basic requirement when searching newspaper print!
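The subject-word versus context-word idea can be made concrete with a small sketch: both words must appear for the document to qualify, but only the subject word’s positions should drive the synopsis. The document text below is invented for illustration:

```python
# Subject vs context words: both must appear, but the synopsis should list
# only the subject-word hits, not the far more numerous context-word ones.
doc = ("Nottingham markets report. Mr Jesson of Nottingham attended. "
       "Nottingham assizes listed.")

subject, context = "Jesson", "Nottingham"

words = doc.replace(".", "").split()
hits = []
# The document qualifies only if both the subject and context words appear...
if subject in words and context in words:
    # ...but we report positions of the subject word alone.
    hits = [i for i, w in enumerate(words) if w == subject]
print(hits)  # → [4]
```

A synopsis built from `hits` would show the single “Jesson” reference rather than burying it under three “Nottingham” ones.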
When I say ‘word’, I implicitly mean phrases too. The use of
quotation marks is commonly available to delimit an exact phrase, or to treat a
word exactly with no alternative spellings or similar words. For instance, if I
searched Google for STEMMA then it would also show hits for stimma, but if I
searched for “STEMMA” (including the quotes) then this would not happen.
So is a choice of any-of or all-of enough when we’re
entering multiple words? The answer has to be ‘No’ since you may want to
exclude some words, or define some more complex combination of words. Search
engines usually provide a specific syntax, called a query language, for
precisely expressing a search. Unfortunately, though, end-users are mostly considered
incapable of handling such a syntax and so we often get a dumbed-down interface
using a form-fill. For instance, the London
Gazette offers a fairly typical form providing separate search boxes for
all-of, any-of, and an exact-phrase. But what happens when you want to find two
separate exact-phrases, or one pair of words OR another pair of words? These
are easy to express in a query language, e.g.
“exact phrase 1” AND “exact phrase 2”
(word1 AND word2) OR (word3 AND word4)
The problem is not so much that end-users can’t handle a
query language. It’s just that there is no universal syntax, and not all
features are supported by all search engines.
The AND and OR, used here, are referred to as Boolean or
logical operators. Sometimes AND is represented by a special character such as
‘+’, and OR by ‘|’, but there is no standard. Along with AND/OR, the next most
common Boolean operator is NOT (or ‘-’) to exclude a word. XOR is very rarely
available, possibly because only a few people even know what the semantics of
an “Exclusive OR”
operation are. Google used to have a ‘+’ operator but it was a monadic operator
(in contrast to AND) and was the opposite of their ‘-’ operator used to exclude
a given word. This was withdrawn in 2011 because it conflicted with the use of
the ‘+’ as a prefix to identify a person in Google+. The operator was
considered redundant because Google uses an all-of approach – or it used to.
Around 2009, it started using a more probabilistic approach where the
operator was almost ignored, except to order the hits with the more likely ones
appearing first. This meant that the ‘+’ operator was actually essential if you
wanted to force the inclusion of particular words, and the deprecation caused
an outcry amongst users. Interestingly, the dyadic AND operator isn’t listed on
any Google help that I’ve seen, and yet it certainly affects a query. It’s hard
to prove exactly how it is being interpreted but from a cursory glance it appears
to be doing what I would expect.
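If you think of each word as having a set of matching documents (a “posting list”), the Boolean operators are just set operations. The posting lists below are invented for illustration, with XOR included for completeness:

```python
# Boolean operators expressed as set operations over matching document ids.
hits_w1 = {"d1", "d2", "d3"}            # documents containing word1
hits_w2 = {"d2", "d3", "d4"}            # documents containing word2
all_docs = {"d1", "d2", "d3", "d4", "d5"}

print(sorted(hits_w1 | hits_w2))        # OR  (union)                 → ['d1', 'd2', 'd3', 'd4']
print(sorted(hits_w1 & hits_w2))        # AND (intersection)          → ['d2', 'd3']
print(sorted(all_docs - hits_w2))       # NOT word2 (complement)      → ['d1', 'd5']
print(sorted(hits_w1 ^ hits_w2))        # XOR (exactly one of the two)→ ['d1', 'd4']
```

Seen this way, XOR (“symmetric difference”) is no harder to implement than the others; its rarity is down to demand, not difficulty.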
Less common, but equally important, are a class of operators
called proximity or adjacency operators. These are only available through query
languages, which is probably why they are less well known. They allow you to search
for words that are close together, and so they are potentially more powerful than a
Boolean operator for establishing a search context. The Times Digital Archive
used to use a NEAR operator (e.g. w1 NEAR w2) although this is directionless
and with no specific range. Google’s ‘*’, when used between words, amounts to
the same thing. In contrast, the Gale
newspaper archive allows both a direction and a range to be specified, e.g. ‘word1
W6 word2’ to find word2 within 6 words after word1, and ‘word1 N6 word2’
is the directionless equivalent. For newspapers, the importance of these cannot
be over-stressed since lists of notices (e.g. police ones) or advertisements
are rarely digitised as separate articles. That often means that your separate
word-hits are scattered across most of a broadsheet page, and occasionally across
multiple pages. This, in turn, wastes your time as your have to visit many more
document-hits manually. The British Newspaper
Archive is one of the newspaper archives that do not provide such
operators, and they did not consider them necessary when I approached them.
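Under the hood, these operators just compare word positions. Here is a minimal sketch of the directional ‘Wn’ and directionless ‘Nn’ semantics described above, using invented notice text:

```python
# Gale-style proximity operators: 'w1 Wn w2' means w2 occurs within n words
# AFTER w1; 'w1 Nn w2' is the directionless equivalent.
def positions(words, term):
    """All indexes at which a term occurs."""
    return [i for i, w in enumerate(words) if w == term]

def w_op(words, w1, w2, n):
    """w1 Wn w2: some w2 within n words after some w1."""
    return any(0 < j - i <= n
               for i in positions(words, w1)
               for j in positions(words, w2))

def n_op(words, w1, w2, n):
    """w1 Nn w2: some w1 and w2 within n words of each other, either order."""
    return any(0 < abs(j - i) <= n
               for i in positions(words, w1)
               for j in positions(words, w2))

text = "sale notice john jesson late of nottingham deceased".split()
print(w_op(text, "jesson", "nottingham", 6))  # → True  (3 words apart, after)
print(w_op(text, "nottingham", "jesson", 6))  # → False (wrong direction)
print(n_op(text, "nottingham", "jesson", 6))  # → True  (direction ignored)
```

On a broadsheet page digitised as one huge article, this positional test is what stops two word-hits half a page apart from counting as a match.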
Another common feature is wildcard characters that allow
some variability within a word. For instance, ‘*’ to represent zero-or-more
characters, ‘?’ to represent exactly one unspecified character, and ‘!’ to
represent an optional unspecified character. A search for pigment* might match
pigment, pigments, pigmentation, etc., and a search for colo!r might match both
color (American) and colour (British). These are very useful where there are
alternative spellings as in US/UK differences or historical spellings.
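Wildcard patterns like these translate straightforwardly into regular expressions. The sketch below assumes the conventions given above (note that ‘!’ for an optional character is taken from this post and is not a common standard) and lower-case patterns:

```python
import re

# Translate a wildcard pattern into an anchored regular expression:
#   '*' → zero-or-more letters, '?' → exactly one, '!' → an optional one.
def wildcard_to_regex(pattern):
    out = []
    for ch in pattern:
        if ch == "*":
            out.append("[a-z]*")
        elif ch == "?":
            out.append("[a-z]")
        elif ch == "!":
            out.append("[a-z]?")
        else:
            out.append(re.escape(ch))   # literal characters pass through
    return re.compile("^" + "".join(out) + "$")

rx = wildcard_to_regex("colo!r")
print(bool(rx.match("color")))          # → True  (American spelling)
print(bool(rx.match("colour")))         # → True  (British spelling)
print(bool(rx.match("colossus")))       # → False

rx2 = wildcard_to_regex("pigment*")
print(bool(rx2.match("pigmentation")))  # → True
```

A real search engine would apply this per indexed word rather than compiling a regex per query, but the matching semantics are the same.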
Some search tools try to help by introducing variants of
what you had specified, such as the Google “similar words” example above.
Google also had a tilde (‘~’) operator that explicitly included synonyms of a
given word (see below). In our genealogical tools, a search may add name
variants such as “Antony” and “Tony” when looking for “Anthony”, or common
abbreviations of a name. Some tools try to infer some semantics by recognising
phrases such as ‘where is’ or ‘what is’. There is a modern trend to return more
than you had asked for, just in case you were overly specific. This can be
frustrating to someone who knows exactly what they’re looking for, and is one
of the causes of the Lilliputian arguments over the search tools of the big
providers Ancestry and findmypast. Ancestry’s search is often
criticised for returning much more than what you want (although mostly
prioritised) and findmypast for being too rigid and precise.
Another way that software hopes to improve our ability to
search text is by the addition of semantic meta-data, such as RDF,
hidden in the text, and this is a cornerstone of the Semantic Web. For
instance, not just having the name of a person or place written as amorphous
text, and not simply marking it as the name of a person or place, but actually
adding extra information such as contact details, biographical details, postal-address
information, etc. Although this may have serious issues when applied to
historical data (see Semantic
Tagging of Historical Data), it will introduce another problem: our
searches will need to be much more specific in order to take advantage of that
meta-data. It would be unrealistic for a generic search tool to
achieve this with a form-fill, so does that mean it will require bigger query
languages? Do you see the dichotomy here? On one hand our searches are made
more probabilistic and less precise, and on the other hand we will need to
frame a more specific context to take advantage of semantic meta-data.
The fact that there is no standardised query language is not
unexpected. You would have more luck herding cats than getting agreement on that.
Part of the problem, though, is that not all search engines are implemented
with the same level of sophistication. Hence, they cannot all perform the same
types of complex search. As end-users, we all have to take some responsibility
for not using the so-called advanced features, or even asking for them when
they’re not there. How many people reading this simply chuck a bunch of words
into a search box and hope for the best? One of the reasons that Google dropped
its tilde (‘~’) operator was that so few people used it[2]
and it was expensive to maintain.
How many people are frustrated by the inability to be
precise in a search? How many people feel that the probabilistic approach to
searching, where any attempt at precision is only partially honoured, just
dilutes the effectiveness of their searches? Do many people use query languages when they’re
available? How many people have had situations where a census search needed to
specify a lot more than just a person’s details, or an address’s details?
Findmypast has both an address-search and a person-search for their census
collections, but they cannot be combined, e.g. finding anyone with the surname
“Proctor” on a given street. How much more basic can a requirement be?
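The missing combined search is trivial once both criteria can apply to the same record. A sketch over invented census rows (the names and streets are not real data):

```python
# The combined search findmypast lacks: anyone with a given surname on a
# given street, expressed as a single filter over census records.
census = [
    {"surname": "Proctor", "street": "Castle Gate"},
    {"surname": "Jesson",  "street": "Castle Gate"},
    {"surname": "Proctor", "street": "Low Pavement"},
]

hits = [r for r in census
        if r["surname"] == "Proctor" and r["street"] == "Castle Gate"]
print(hits)  # → [{'surname': 'Proctor', 'street': 'Castle Gate'}]
```

It is one AND across two indexed fields; that the person-search and address-search forms cannot express it is an interface limitation, not a technical one.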
[2] Google Kills Tilde Search Operator, Jennifer Slegg, June 25, 2013.
http://searchenginewatch.com/article/2277383/Google-Kills-Tilde-Search-Operator