Monday 3 February 2014

One Name to Rule Them All

Yes, that old chestnut! How would we handle people having multiple names — or even no name at all — if we had better software?

Back in A Place For Everything, I explained the difference between a place and its name. This had to be said because far too many people don’t make the distinction, and so don’t recognise that a name is just one of several possible properties for identifying a real thing (irrespective of whether it still exists now). The industry’s preoccupation with family trees (i.e. lineage, and hence genealogy in its literal sense), at the expense of history, is probably at the root of this.

You might expect, therefore, that there would be less excuse in the context of a person. Surely, we all appreciate that the person and their name, or names, are two different things. Well, apparently not! I still see questions asking how to find someone’s “real name”, as though everyone has just one unique name by which they can be referenced and indexed. The people asking then run into difficulties when they know someone exists but they don’t have a reliable or complete name, or the person was never assigned a name, or their name was shared by a close relative. When you factor in the many reasons for someone changing their name, or having multiple concurrent names, it is easy to see why that assumption causes confusion.

Now I admit that the existence of an actual person corresponding to some name found in a source might be harder to verify than the case of an actual place, and this is acknowledged to some extent through the use of the persona concept when recording data. At some point, though, it will be associated with a “conclusion person” entity in your data, and that entity will have many properties (e.g. date of birth), of which their personal names are simply one instance.

From a software perspective, a personal name cannot be the key that defines a Person entity since it is neither unique nor fixed. So what functionality are we looking for from a person’s name(s)?

  • To record their preferred epithets.
  • To use as one of several keys in identifying a person from a source.
  • To use as a title or label in reports, charts, etc.

At first sight, you may be thinking that these are all the same — and in some products they are — but there are fundamental differences. The preferred name is not the same as the accepted variations of it, and the annotation used for display may not even be a name at all.

Let me start by using my own name as an illustration. Although I am known in most circles as Tony Proctor, I was assigned the given names Anthony Charles at birth. Now I have never changed my name but even this simple case means I have several alternatives by which I might be referenced:

In other words, I have a full name, and an accepted diminutive form, but several variations that could still refer to me. If I had more than one middle name, or I had changed my name, or I had variations in my native language, or separate stage/professional names, then you could imagine this diagram becoming a lot more complex.

Now you might be about to say that initialisms are obvious and can be deduced from the given names, if they’re known of course. If so, then you’re thinking of English-speaking, Western conventions. Initials are not applicable to logogram (or ideogram) based languages. Also, whilst we accept their use in modern Latin-based languages, it would be a gross generalisation to assume that all alphabet-based languages, modern or ancient, use this custom in personal names. People of other cultures who have adopted Romanised versions of their native names may also not use initials.

Whether someone changed their name through marriage, deed poll, at a point of immigration, or when entering a different phase of their life, there will be a date associated with that change, and possibly a significant life-event to which it should be connected. Sometimes those names are mutually exclusive (changing from one to another) and sometimes they run concurrently, but those dates primarily describe the preferred names rather than the accepted variations. This means that the accepted variations may still be used in sources long after one of those life-events.

STEMMA® handles names by dividing them into groups, with each group having a name type and any relevant date ranges. Each group consists of a series of accepted name variants, each represented by a sequence of tokens, and a separate set of canonical names that represent the preferred versions. The same scheme is used for places as well as for people. When looking up a person (or place) by name, each of the accepted variations is compared, in sequence, against the required name.
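
The grouping and lookup described above can be pictured with a minimal Python sketch. This is not STEMMA’s actual schema or syntax: the class names NameGroup and Person, the field names, the normalise and matches helpers, the "birth" type label, and the sample variants are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class NameGroup:
    """One group of names sharing a type and an optional date range (hypothetical)."""
    name_type: str                   # illustrative label, e.g. "birth" or "married"
    canonical: list[str]             # preferred forms of the name
    variants: list[list[str]]        # accepted variations, each a sequence of tokens
    start: Optional[date] = None     # when the group came into use, if known
    end: Optional[date] = None       # when it stopped being preferred, if known


@dataclass
class Person:
    key: str                         # opaque identifier; the name is never the key
    name_groups: list[NameGroup] = field(default_factory=list)


def normalise(name: str) -> list[str]:
    """Split a name into lower-cased tokens, ignoring full stops after initials."""
    return name.casefold().replace(".", " ").split()


def matches(person: Person, required: str) -> bool:
    """Compare each accepted variation, in sequence, against the required name."""
    wanted = normalise(required)
    for group in person.name_groups:
        for variant in group.variants:
            if [token.casefold() for token in variant] == wanted:
                return True
    return False


# Illustrative data only; the variants are examples, not a definitive list.
tony = Person("p1", [
    NameGroup(
        name_type="birth",
        canonical=["Anthony Charles Proctor"],
        variants=[
            ["Anthony", "Charles", "Proctor"],
            ["Tony", "Proctor"],
            ["A", "C", "Proctor"],
        ],
    ),
])

print(matches(tony, "tony proctor"))
print(matches(tony, "A. C. Proctor"))
print(matches(tony, "Charles Proctor"))
```

Note that the date range sits on the group and not on the individual variants, which is one way of reflecting the point that accepted variations may outlive the life-event that changed the preferred name.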

What about evidential variations? This is probably the most common cause of confusion, although it’s no different, say, to handling a range of birth years found in different census sources. The years may differ because details were provided by someone else, or ages were rounded up/down for census purposes, or the birth was on a different side of the census day, etc, but it doesn’t mean that the person had multiple birth dates. With names, just as with other personal data, the evidential forms have to be recorded, but separately from the conclusion forms. The diagram below illustrates this with an example event supported by two sources, both of which have misspellings of my name (surname in first case, and given name in second case). In STEMMA, the evidential forms are associated with the Event-to-Source link, as explained in Evidence and Where to Stick It, whereas the conclusion forms are part of the Person entity.
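
The separation of evidential and conclusion forms can be sketched in Python as follows. This is a rough illustration under stated assumptions and not STEMMA’s real structure: the Source, SourceLink and Event classes, the name_as_recorded field, and the two misspellings are invented here, chosen only so that the first source gets the surname wrong and the second gets the given name wrong.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    title: str


@dataclass
class SourceLink:
    """Hypothetical Event-to-Source link carrying the name exactly as written."""
    source: Source
    name_as_recorded: str            # evidential form, kept verbatim


@dataclass
class Event:
    description: str
    person_key: str                  # points at the conclusion Person entity
    links: list[SourceLink] = field(default_factory=list)


# Conclusion forms live with the Person, keyed by an opaque identifier.
conclusion_names = {"p1": "Anthony Charles Proctor"}

# One event supported by two sources, each with its own (invented) misspelling.
event = Event("example event", "p1", [
    SourceLink(Source("Source A"), "Tony Procter"),    # surname misspelt
    SourceLink(Source("Source B"), "Toney Proctor"),   # given name misspelt
])


def evidential_forms(ev: Event) -> list[str]:
    """Collect the names as the sources recorded them, without correcting them."""
    return [link.name_as_recorded for link in ev.links]


print(evidential_forms(event))
print(conclusion_names[event.person_key])
```

Neither misspelling is ever copied into the person’s own names; each stays attached to the source that produced it.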

So what about identifying a person on a computer display, or in a genealogical report? In the context of variant spellings, Elizabeth Shown Mills advocates picking a common spelling and using it consistently[1]. For a woman’s maiden name, it’s very common to place it in parentheses, such as Sarah (Smith) Jones. Where there are other cases of alternative names then there may be different conventions, such as separating them with a slash (or solidus, ‘/’), or specifying an “aka” (also-known-as) in parentheses. When a name is ambiguous, either because it has been used in several generations, or because it was also used by a deceased sibling, then you might add the year of birth in parentheses. If a child didn’t live long enough to be given a name, or you simply don’t know it, then you may still want to identify it in a report.

All these cases that are more than simple variations of spelling have the same issue: the annotation is no longer a personal name. In our software, we should never store some display annotation and call it a personal name. This issue has been covered in excellent detail by Tamura Jones[2]. What is needed is a separate title/label field specifically for display purposes, and STEMMA provides such a field using the <Title> element in both the Person and Place entities.
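
As a minimal sketch of keeping the display label apart from the names, consider the Python below. Only the existence of a <Title> element comes from STEMMA; the Person class, its fields, the display_label helper, its fallback order, and the sample people (including the year 1872) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    key: str                                   # opaque identifier
    canonical_names: list[str] = field(default_factory=list)
    title: Optional[str] = None                # display label; not a personal name


def display_label(person: Person) -> str:
    """Pick the text for a report or chart without touching the stored names."""
    if person.title:
        return person.title
    if person.canonical_names:
        return person.canonical_names[0]
    # No title and no known name: fall back to a placeholder built from the key.
    return f"[unnamed: {person.key}]"


plain = Person("p1", ["Anthony Charles Proctor"])
married = Person("p2", ["Sarah Jones"], title="Sarah (Smith) Jones")
infant = Person("p3", title="Unnamed infant (1872)")

for p in (plain, married, infant):
    print(display_label(p))
```

The infant has an empty list of names and yet can still be labelled in a report, and the parenthesised maiden name never ends up stored as though it were a name in its own right.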

GEDCOM made a fair stab at handling multiple names. Version 5.5.1 supported multiple names for a given person, and even had a range of name types that could be used to distinguish them. GEDCOM names are generally unstructured sequences of tokens — which is good for generality — but this version also had an optional PERSONAL_NAME_PIECES description which allowed the name tokens to be categorised, albeit with a warning that most systems will not use this alternative form. It wasn’t until the draft GEDCOM XML 6.0 specification that individual names were given a NAME_PRINCIPAL_FORM which was roughly equivalent to STEMMA’s canonical names, but there was still no title/label facility.

Several products do accommodate multiple names for a single person, although their separation of preferred versus accepted variants, names of different types, names over different time spans, titles/labels for display purposes, and evidential variants is not as organised as in STEMMA. By attempting to take short-cuts, our software may deprive genealogists of the flexibility they need, and effectively corrupt our recorded histories. The following is an excellent quote from a genealogist on a Usenet Newsgroup[3]:

There is no excuse for software that pretends 80% of the people in the
world are wrong about their own names.

I wish I’d said that!

[1] Elizabeth Shown Mills, "Re: [TGF] Surname spelling variants", Transitional-Genealogists-Forum-L, message dated 24 Jun 2011 ( : accessed 3 Feb 2014).
[2] Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 ( : accessed 22 Nov 2013); also his previous works: “The Lnu Family Mystery”, Modern Software Experience, 11 Aug 2013 ( : accessed 22 Nov 2013); “Unk is a Real Name”, Modern Software Experience, 10 Aug 2013 ( : accessed 22 Nov 2013).
[3] Wes Groleau, soc.genealogy.britain Usenet Newsgroup, 15 Dec 2013.
