Tuesday, 23 December 2014

Returning to Normalised Names and Dates



This post is a follow-up on a familiar subject that James Tanner recently revisited in “Returning to the issue of standardized place names and dates”: software that tries to standardise place names and, to a lesser extent, dates.

Figure 1 - Tanner Street, London SE1.[1]

This is a subject that James has mentioned before, and I generally agree with him. However, I wanted to follow up on the subject, rather than indulge in blig-blog[2], because I fear that readers may conflate some different concepts under this same heading. My post is therefore about normalisation rather than standardisation (using my normal UK spelling).

His basic point is that some software forces you to select one specific, standardised name for a place at the expense of the more relevant, and probably more accurate, one found in the record itself. He makes the very clear observation that:

“Genealogically important records are generally created as a result of the occurrence of an event. Such events occur in a particular and very specific geographic location.”

I agree entirely with this! Source information was recorded as a result of some event, and so has a specific time-and-place context that must be captured accurately. I belong to the school of thought that place names should be recorded as they appear in the sources, and not changed to some close alternative or to some modern equivalent. However, I am also a software developer who believes in the use of place-hierarchies (see my previous post at A Place For Everything) so how do I reconcile the two?

The underlying issue is that places may have multiple names. These may be concurrent or the result of some name change in the history of that place, but that does not mean that older or less-used names are invalid. The extent and boundaries of a place may have changed during its existence, and it may even have been divided, or merged with somewhere else, and so a modern name may be totally inappropriate. Finally, the parent place — that larger-scale bounding place — may have similarly changed over time.

Part of the solution requires that a place be recognised as a real-world notion that physically exists, just like a person, and not simply as a name. This then allows the same place entity (as represented in your data) to be referenced by any of its names, and obviates the need for a standardised reference. The history of the place, its various names, its location and extent, any boundary or name changes, associated images, documents, and maps, would all be held as part of that single place entity by the software, with those alternative references all pointing to that same collection of information. We take this approach for granted when dealing with people, and their alternative formal or informal names, but it applies equally to any named entity.
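To make the distinction between a place and its names concrete, here is a minimal sketch in Python. It is not taken from any particular product or data model: the class names (Place, PlaceName, PlaceIndex), the key, and the example names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceName:
    """One name by which a place is, or was, known."""
    text: str
    note: str = ""  # free-text qualifier, e.g. "older name" (hypothetical field)


@dataclass
class Place:
    """A real-world place, identified by a key rather than by any one name."""
    key: str
    names: list[PlaceName] = field(default_factory=list)


class PlaceIndex:
    """Resolves any known name to the place entity (or entities) it refers to."""

    def __init__(self) -> None:
        self._by_name: dict[str, list[Place]] = {}

    def add(self, place: Place) -> None:
        # Every name, current or historical, points at the same entity.
        for name in place.names:
            self._by_name.setdefault(name.text.casefold(), []).append(place)

    def lookup(self, text: str) -> list[Place]:
        # A list is returned because different places can share a name.
        return self._by_name.get(text.casefold(), [])


# Illustrative data only: the key and both names are invented.
example = Place(
    key="pExampleTown",
    names=[PlaceName("Example Town"), PlaceName("Exampleton", note="older name")],
)
index = PlaceIndex()
index.add(example)
```

Looking up either "Example Town" or "Exampleton" returns the same entity, so neither name has to be promoted to a standard form.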

STEMMA adopted this approach right from its inception. It was refined in V2.2 to allow connections between places which were not hierarchical, such as when a place was divided, or neighbouring places were merged (see Related Entities). More recently, it was recognised that this enhancement also made it possible to support alternative types of place hierarchy, such as geographical, administrative, census & civil registration, or ecclesiastical, and to cross-link them when relevant. An example might be when a registration district overlaps a particular ecclesiastical parish. They are different places, and with independent hierarchies, but their locations overlap. If you want to relate a civil birth registration to someone’s baptism then this type of correlation is very useful.
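The idea of independent hierarchies with cross-links can be sketched as follows. This is only an illustration in Python, not STEMMA's actual representation: the Place class, the link helper, the "overlaps" label, and all of the keys are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # compare by identity; avoids recursing through the links
class Place:
    key: str
    hierarchy: str                       # e.g. "civil-registration", "ecclesiastical"
    parent: Optional["Place"] = None     # bounding place within the same hierarchy
    related: List[Tuple[str, "Place"]] = field(default_factory=list)

    def ancestors(self) -> List["Place"]:
        """Walk up this place's own hierarchy, nearest parent first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def related_by(self, relation: str) -> List["Place"]:
        """Places connected to this one by the given non-hierarchical relation."""
        return [place for rel, place in self.related if rel == relation]


def link(a: Place, b: Place, relation: str) -> None:
    """Record a symmetric, non-hierarchical relation between two places."""
    a.related.append((relation, b))
    b.related.append((relation, a))


# Illustrative data only: two separate hierarchies, cross-linked at one point.
county = Place("pExampleCounty", "civil-registration")
district = Place("pExampleDistrict", "civil-registration", parent=county)
diocese = Place("pExampleDiocese", "ecclesiastical")
parish = Place("pExampleParish", "ecclesiastical", parent=diocese)
link(district, parish, "overlaps")
```

Starting from the parish of a baptism, related_by("overlaps") would suggest which registration district to search for the corresponding birth. A directional relation, such as one place being divided into several, would need a pair of labels rather than the single symmetric one used here.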

Another part of the solution is a bilateral approach to the recording of information from a source, whether it’s a name, a date, or any piece of data. This means recording both the original form, verbatim, and a separate normalised version of it, if possible (see Is That a Fact?). As well as supporting places with an uncertain identification, this also means that you retain the exact name used in the source, including any spelling errors and transcription issues, but can still connect it to an appropriate place entity in your data. In effect, it is the place entity that is standardised rather than the place name.

I recently dealt with a case of this in my previous blog-post at My Ancestor Changed Their Surname. A woman was recorded with the exact same birth place in three successive census returns (1851, 1861, and 1871): “Barkworth, Lincolnshire”. To my knowledge, there isn’t, and never has been, a place with this name, but there was an East/West Barkwith nearby. This approach means that I can retain the exact spelling used by the census enumerators, and make a tentative association with the standardised place entity representing Barkwith.
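A record of this kind might be sketched as below. The PlaceReference class, its field names, and the key "pBarkwith" are hypothetical, chosen only to show the verbatim text and the tentative link sitting side by side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceReference:
    """A place as mentioned in one source: original text plus optional link."""
    verbatim: str                     # exactly as written in the source
    place_key: Optional[str] = None   # key of the place entity, if identified
    tentative: bool = False           # True when the identification is uncertain


def describe(ref: PlaceReference) -> str:
    """Summarise a reference without ever altering the verbatim text."""
    if ref.place_key is None:
        return f'"{ref.verbatim}" (unidentified)'
    qualifier = "possibly " if ref.tentative else ""
    return f'"{ref.verbatim}" ({qualifier}{ref.place_key})'


# The three census returns from the example above; "pBarkwith" is an invented key.
census_refs = {
    year: PlaceReference("Barkworth, Lincolnshire", "pBarkwith", tentative=True)
    for year in (1851, 1861, 1871)
}
```

The enumerators' spelling is never overwritten, and the association with Barkwith can later be confirmed or withdrawn without touching the transcription.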

A similar argument applies to dates, too. It has been suggested that software can parse, and so decode, any written date once it has been transcribed. This may be possible for Gregorian dates, assuming that they’re written in English (or some other known language), and that they don’t employ an ambiguous, all-numeric representation, but that’s not going to help in many real-life situations. The recorded date may have uncertain characters in it, or it may be an informal or relative expression of the date. For instance, cases such as “last Sunday” or “two days before my grandmother’s birthday” are also dates but they require a human to decode them by using context from elsewhere, such as the date of recording/publication or the identification of the author.
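The all-numeric ambiguity is easy to demonstrate. The following sketch uses Python's standard datetime.strptime with two assumed formats (day-first and month-first); the function name candidate_readings and the sample strings are my own, for illustration only.

```python
from datetime import date, datetime

# Two common readings of an all-numeric date; other orderings exist.
FORMATS = {"day-first": "%d/%m/%Y", "month-first": "%m/%d/%Y"}


def candidate_readings(text: str) -> dict[str, date]:
    """Return every reading of the text that forms a valid calendar date."""
    readings: dict[str, date] = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this reading
    return readings


ambiguous = candidate_readings("03/04/1871")    # two valid readings
unambiguous = candidate_readings("23/12/2014")  # only day-first is valid
relative = candidate_readings("last Sunday")    # no reading at all
```

The first string yields both 3 April and 4 March 1871, so software alone cannot choose between them, and the relative expression yields nothing. In both cases the verbatim text has to be kept, with any normalised value supplied by a person who knows the context.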

So this is good for our recorded data, but what about the user-interface (UI) that software products or Web sites present to us? I know that Ancestry, for instance, has a drop-down list of standardised place names in its search forms, and that alternative names or spellings are not available. One reason for their exclusion may be the sheer size of the resulting list, but it would be quite possible, in principle, to include all accepted variants. In Ancestry’s favour, you can still search on an unknown name, such as the one presented above, and it will perform a textual search rather than a known-place search. Contrast this with findmypast, which has recently adopted an approach to place names similar to Ancestry’s. It also presents a drop-down list of standardised place names, say, in its census search forms, but if you enter an unknown one then it is simply ignored. In fact, if you tried to re-edit your search criteria then you would see (at the time of writing) that your non-standard name has vanished, discarded by the form. This is interesting because when you report a census transcription error to findmypast, the form now includes duplicate fields for “Birth town”, “Birth county”, and “Birth place”, with one set including the suffix “as transcribed”. It looks like they would like to support the same type of search modes as Ancestry, but their UI currently makes this impossible.

One obvious question that I haven’t answered here is why we need a normalised version at all. Well, if you want your software to do something useful, rather than simply help you to put names and dates on a tree (which is rather mundane and trivially easy from a software perspective), then it needs to understand certain contextual data. It needs a normalised, computer-readable representation to work with. If you want to do a proximity-based search then it needs to understand which places are near to other places, or are within other places. If you want to present information using a map then it needs to know the location and boundary extent of the referenced places — things that can be looked up for known place entities.
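As a rough illustration of a proximity-based search, the sketch below treats each place entity as a single point and uses the haversine great-circle formula. The LocatedPlace class, the keys, and the coordinates are invented, and a real implementation would also need boundary extents rather than just points.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; adequate for a rough proximity search


@dataclass(frozen=True)
class LocatedPlace:
    key: str
    lat: float  # degrees north
    lon: float  # degrees east


def distance_km(a: LocatedPlace, b: LocatedPlace) -> float:
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = radians(a.lat), radians(b.lat)
    dphi = phi2 - phi1
    dlam = radians(b.lon - a.lon)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def places_within(origin, places, radius_km):
    """All other places whose point location lies within the given radius."""
    return [
        p for p in places
        if p is not origin and distance_km(origin, p) <= radius_km
    ]


# Invented coordinates, chosen only to make the distances easy to check.
origin = LocatedPlace("pOrigin", 0.0, 0.0)
near = LocatedPlace("pNear", 0.0, 0.1)   # roughly 11 km from the origin
far = LocatedPlace("pFar", 0.0, 1.0)     # roughly 111 km from the origin
nearby = places_within(origin, [origin, near, far], radius_km=25)
```

None of this is possible with a bare name: the coordinates belong to the place entity, and any of its names can lead to them.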

A similar situation exists with dates. If you want to perform a search between two date limits then it needs to understand what lies within that temporal range and what lies outside of it. If you want to present a timeline then it needs to know what sequence your events occurred in.
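Both operations can be sketched in a few lines once each event carries an optional normalised date alongside its verbatim text. The Event class, the function names, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Event:
    label: str
    date_verbatim: str                       # as written in the source
    date_normalised: Optional[date] = None   # computer-readable form, if known


def between(events: List[Event], start: date, end: date) -> List[Event]:
    """Events whose normalised date falls within the inclusive range."""
    return [
        e for e in events
        if e.date_normalised is not None and start <= e.date_normalised <= end
    ]


def timeline(events: List[Event]) -> List[Event]:
    """Dated events in chronological order; undated ones are left out."""
    dated = [e for e in events if e.date_normalised is not None]
    return sorted(dated, key=lambda e: e.date_normalised)


# Invented events; the third has a verbatim date that nobody has decoded yet.
events = [
    Event("Event B", "12th June 1861", date(1861, 6, 12)),
    Event("Event A", "3 Apr 1851", date(1851, 4, 3)),
    Event("Event C", "last Sunday"),
]
ordered = [e.label for e in timeline(events)]
in_1860s = [e.label for e in between(events, date(1860, 1, 1), date(1869, 12, 31))]
```

Event C is still held in the data with its original wording; it simply cannot take part in the search or the timeline until someone supplies a normalised date for it.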

James concludes with a suggestion blaming the issue on programmers and developers, and it’s here that I’m afraid I have to disagree:

“My opinion is that the main reason for this issue of standardization involves the desires of the programmers to regularize their data for search purposes”

Outside of very small teams, programmers have little say in the functionality of a product. Gone are the days when computer-illiterate managers would delegate full responsibility to a programming team. Large organisations, in particular, have product managers whose responsibility it is to meet market needs, and, depending on how modern they are, have some synergy with a software architect.

In summary, place entities have to be standardised because they are representing something in the real world, but not so with place names. Any number of place names may reference the same place entity. If some software component doesn’t accommodate that then it implies a deep lack of appreciation somewhere in that chain of responsibility.



[1] Picture displayed by permission of Fashion and Textile Museum, London SE1 3XF (www.ftmlondon.org).
[2] My own whimsical term for when two blogs go back-to-back responding to each other. Analogous to ping-pong.