This post is a follow-up on a familiar subject that James Tanner
recently revisited at Returning
to the issue of standardized place names and dates: that of software trying
to standardise place names and, to a lesser extent, dates.
Figure 1 - Tanner Street, London SE1.[1]
This is a subject that James has mentioned before, and I
generally agree with him. However, I wanted to follow up on the subject, rather
than indulge in blig-blog[2],
because I fear that readers may conflate some different concepts under
this same heading. My post is therefore about normalisation rather than
standardisation (using my normal UK spelling).
His basic point is that some software forces you to select
one specific, standardised name for a place at the expense of the more
relevant, and probably more accurate, one found in the record itself. He makes
the very clear observation that:
“Genealogically important records
are generally created as a result of the occurrence of an event. Such events
occur in a particular and very specific geographic location.”
I agree entirely with this! Source information was recorded
as a result of some event, and so has a specific time-and-place context that
must be captured accurately. I belong to the school of thought that place names
should be recorded as they appear in the sources, and not changed to some close
alternative or to some modern equivalent. However, I am also a software developer
who believes in the use of place-hierarchies (see my previous post at A
Place For Everything), so how do I reconcile the two?
The underlying issue is that places may have multiple names.
These may be concurrent or the result of some name change in the history of
that place, but that does not mean that older or less-used names are invalid.
The extent and boundaries of a place may have changed during its existence, and
it may even have been divided, or merged with somewhere else, and so a modern
name may be totally inappropriate. Finally, the parent place — that
larger-scale bounding place — may have similarly changed over time.
Part of the solution requires that a place be recognised as
a real-world notion that physically exists — just like a person — and not
simply as a name. This then allows the same place entity (as represented in
your data) to be referenced by any of its names, and obviates the need for a
standardised reference. The history of the place, its various names, its
location and extent, any boundary or name changes, associated images,
documents, and maps, would all be held as part of that single place-entity by
the software, with those alternative references all pointing to that same
collection of information. We take this approach for granted when dealing with
people, and their alternative formal or informal names, but it equally applies
to any named entity.
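To make the idea concrete, here is a minimal sketch (in Python, and not any particular product's data model) of a place entity that can be reached via any of its attested names; the class names and fields are purely illustrative.

    from dataclasses import dataclass, field

    @dataclass
    class PlaceEntity:
        """A single real-world place, however many names it has carried."""
        key: str                                        # internal identifier, never shown to users
        names: list[str] = field(default_factory=list)  # all attested names, past and present
        notes: str = ""                                 # history, boundary changes, maps, images, ...

    # Index every known name to the one entity it refers to.
    index: dict[str, PlaceEntity] = {}

    city = PlaceEntity(key="P1", names=["Saint Petersburg", "Petrograd", "Leningrad"])
    for name in city.names:
        index[name.casefold()] = city

    # Different period names all resolve to the same collection of information.
    assert index["petrograd"] is index["leningrad"]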
STEMMA adopted this approach right from its inception. It
was refined in V2.2 to allow non-hierarchical connections between places,
such as when a place was divided, or neighbouring places were merged (see Related
Entities). More recently, it was recognised that this enhancement also made
it possible to support alternative types of place hierarchy, such as
geographical, administrative, census & civil registration, or
ecclesiastical, and to cross-link them when relevant. An example might be when
a registration district overlaps a particular ecclesiastical parish. They are
different places, with independent hierarchies, but their locations overlap.
If you want to relate a civil birth registration to someone’s baptism then this
type of correlation is very useful.
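As a rough sketch of what such cross-linking might look like (illustrative Python, not STEMMA's actual representation, and with invented place names), a place could carry parents in more than one hierarchy type plus non-hierarchical relations to other places:

    from dataclasses import dataclass, field

    @dataclass
    class Place:
        name: str
        # Parents keyed by hierarchy type: geographical, administrative,
        # census/civil-registration, ecclesiastical, and so on.
        parents: dict[str, "Place"] = field(default_factory=dict)
        # Non-hierarchical relations, e.g. "overlaps", "divided-from", "merged-into".
        related: list[tuple[str, "Place"]] = field(default_factory=list)

    county = Place("Someshire")                                   # hypothetical names throughout
    parish = Place("St Mary (parish)", parents={"ecclesiastical": county})
    district = Place("Northtown (registration district)", parents={"civil": county})

    # Cross-link the independent hierarchies where the two extents overlap,
    # so a civil birth registration can be related to a baptism in the parish.
    parish.related.append(("overlaps", district))
    district.related.append(("overlaps", parish))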
Another part of the solution is a bilateral approach to the recording
of information from a source, whether it’s a name, a date, or any piece of
data. This means recording both the original form, verbatim, and a separate
normalised version of it, if possible (see Is That
a Fact?). As well as supporting places with an uncertain identification,
this also means that you retain the exact name used in the source, including
any spelling errors and transcription issues, but can still connect it to an
appropriate place entity in your data. In effect, it is the place entity that is
standardised rather than the place name.
I recently dealt with a case of this in my previous
blog-post at My
Ancestor Changed Their Surname. A woman was recorded with the exact same
birth place in three successive census returns (1851, 1861, and 1871):
“Barkworth, Lincolnshire”. To my knowledge, there isn’t, and never has been, a
place with this name, but there is an East/West Barkwith nearby. This approach
means that I can retain the exact spelling used by the census enumerators, and
make a tentative association with the standardised place entity representing
Barkwith.
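In code terms, this bilateral recording might look something like the following sketch (Python again; the PlaceEntity class echoes the earlier one, and the identifiers are hypothetical):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PlaceEntity:            # as in the earlier sketch, trimmed down
        key: str
        names: list[str]

    @dataclass
    class PlaceReference:
        """One place citation as taken from a single source."""
        verbatim: str                         # exactly as written, errors and all
        entity: Optional[PlaceEntity] = None  # normalised link, if an identification can be made
        certainty: str = "unknown"            # e.g. "certain", "tentative", "unknown"

    barkwith = PlaceEntity(key="P2", names=["East Barkwith"])   # hypothetical entity key

    # The enumerators wrote "Barkworth, Lincolnshire"; that spelling is kept
    # verbatim, while a tentative association points at the Barkwith entity.
    birthplace = PlaceReference(
        verbatim="Barkworth, Lincolnshire",
        entity=barkwith,
        certainty="tentative",
    )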
A similar argument applies to dates, too. It has been
suggested that software can parse, and so decode, any written date once it has
been transcribed. This may be possible for Gregorian dates, assuming that
they’re written in English (or some other known language), and that they don’t
employ an ambiguous, all-numeric representation, but that’s not going to help
in many real-life situations. The recorded date may have uncertain characters
in it, or it may be an informal or relative expression of the date. For
instance, cases such as “last Sunday” or “two days before my grandmother’s
birthday” are also dates but they require a human to decode them by using
context from elsewhere, such as the date of recording/publication or the
identification of the author.
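The same bilateral idea works for dates: keep the verbatim text and, separately, a normalised value that is only filled in once the necessary context is known. A small sketch, with an assumed recording date purely for illustration:

    from dataclasses import dataclass
    from datetime import date, timedelta
    from typing import Optional

    @dataclass
    class DateReference:
        verbatim: str                      # exactly as written in the source
        normalised: Optional[date] = None  # only once it can be decoded

    # A well-formed Gregorian date in English can be normalised mechanically.
    easy = DateReference(verbatim="25 Dec 1871", normalised=date(1871, 12, 25))

    # "last Sunday" needs external context -- here an assumed recording date --
    # before any normalised value can be attached.
    recorded_on = date(1871, 12, 28)                  # hypothetical date of recording
    days_back = (recorded_on.weekday() - 6) % 7 or 7  # weekday(): Monday=0 ... Sunday=6
    relative = DateReference(verbatim="last Sunday",
                             normalised=recorded_on - timedelta(days=days_back))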
So this is good for our recorded data, but what about the
user-interface (UI) that software products or Web sites present to us? I know
that Ancestry, for instance, has a drop-down list of standardised place names
in its search forms, and that alternative names or spellings are not available. One
reason for their exclusion may be the sheer size of the resulting list, but it
would be quite possible, in principle, to include all accepted variants. In
Ancestry’s favour, you can still search on an unknown name — such as the one
presented above — and it will perform a textual search rather than a known-place
search. Contrast this with findmypast, which has recently adopted a similar
approach to place names to Ancestry’s. It also presents a drop-down list of
standardised place names, say, in its census search forms, but if you enter an
unknown one then it is simply ignored. In fact, if you try to re-edit your
search criteria then you will see (at the time of writing) that your
non-standard name has vanished, discarded by the form. This is
interesting because when you report a census transcription error to findmypast,
the form now includes duplicate fields for “Birth town”, “Birth county”, and “Birth
place”, with one set carrying the suffix “as transcribed”. It looks as though they
would like to support the same search modes as Ancestry, but their UI
currently makes this impossible.
One obvious question that I haven’t answered here is why we
need a normalised version at all. Well, if you want your software to do
something useful, rather than simply help you to put names and dates on a tree
(which is rather mundane and trivially easy from a software perspective), then
it needs to understand certain contextual data. It needs a normalised,
computer-readable representation to work with. If you want to do a
proximity-based search then it needs to understand which places are near to
other places, or are within other places. If you want to present information
using a map then it needs to know the location and boundary extent of the
referenced places — things that can be looked up for known place entities.
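For instance, a proximity or containment test becomes a simple walk up the normalised hierarchy. A minimal sketch, using a single administrative hierarchy with invented parent links:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Place:
        name: str
        parent: Optional["Place"] = None   # bounding place in one hierarchy

    def is_within(place: Place, ancestor: Place) -> bool:
        """True if `place` lies somewhere inside `ancestor`."""
        node: Optional[Place] = place
        while node is not None:
            if node is ancestor:
                return True
            node = node.parent
        return False

    england = Place("England")
    lincolnshire = Place("Lincolnshire", parent=england)
    east_barkwith = Place("East Barkwith", parent=lincolnshire)

    # A search scoped to Lincolnshire can include events recorded at East Barkwith,
    # whatever spelling the source itself used for the place.
    assert is_within(east_barkwith, lincolnshire)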
A similar situation exists with dates. If you want to
perform a search between two date limits then it needs to understand what lies
within that temporal range and what lies outside of it. If you want to present
a timeline then it needs to know what sequence your events occurred in.
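Both operations fall out directly once a normalised date sits alongside the verbatim one; the events below are invented for illustration:

    from datetime import date

    # Hypothetical events, each keeping its verbatim date and a normalised one.
    events = [
        {"what": "baptism",  "verbatim": "Jany 5th 1852",  "when": date(1852, 1, 5)},
        {"what": "census",   "verbatim": "30 March 1851",  "when": date(1851, 3, 30)},
        {"what": "marriage", "verbatim": "Midsummer 1853", "when": date(1853, 6, 24)},
    ]

    # Range search: only the normalised form makes the comparison meaningful.
    start, end = date(1851, 1, 1), date(1852, 12, 31)
    in_range = [e for e in events if start <= e["when"] <= end]

    # Timeline: sort on the normalised date, but display the verbatim one.
    for e in sorted(events, key=lambda e: e["when"]):
        print(e["when"].isoformat(), e["what"], "(recorded as", e["verbatim"] + ")")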
James concludes with a suggestion blaming the issue on
programmers and developers, and it’s here that I’m afraid I have to disagree:
“My opinion is that the main
reason for this issue of standardization involves the desires of the
programmers to regularize their data for search purposes”
Outside of very small teams, programmers have little say
in the functionality of a product. Gone are the days when computer-illiterate
managers would delegate full responsibility to a programming team. Large
organisations, in particular, have product managers whose responsibility it is
to meet market needs, and who — depending on how modern they are — have some
synergy with a software architect.
In summary, place entities have to be standardised because
they represent something in the real world, but not so with place names.
Any number of place names may reference the same place entity. If some software
component doesn’t accommodate that then it implies a deep lack of appreciation
somewhere in that chain of responsibility.
[1] Picture displayed by
permission of Fashion and Textile Museum, London SE1 3XF (www.ftmlondon.org).
[2] My own whimsical term
for when two blogs go back-to-back responding to each other. Analogous to
ping-pong.