Yet another subject where there is little or no agreement.
Let me try and explain some of the many issues with personal names, and with other
types of name, and then present my own approach to handling them.
This is probably one of the most likely areas for trapping
the unwary with insular attitudes or limited knowledge of other cultures. We so
desperately want to record our names as
we know them rather than as we see
them that we may fail to consider the bigger picture. Most people reading
this will have names consisting of one or more given names (the parts chosen to
distinguish members of a family) and a single surname (the inherited part).
English-speaking people sometimes select one of their middle
names as their preferred given name, rather than the norm of selecting the
first one. However, this is far from unusual in, say, Germany where one of the
given names (the Rufname, or “call
name”) — which may be the second or third one — is identified as the primary
one. Hence, the concept of a first name
and middle names is inappropriate for
them.
If we’re lucky then we may have Honorifics expressing esteem
or respect. In English-language names, these may be academic titles (e.g. Dr.
or Prof.), honorific prefixes (e.g. the honourable, or his holiness), honorific
titles (e.g. Sir, Lord, Dame, Lady), or post-nominal letters (e.g. VC, OBE,
PhD). These are mostly either prefixes or postfixes[1].
Another type of postfix is a generational title (e.g. Jr., Sr., I, II, III,
etc.), although the Irish equivalent is actually infix as opposed to either
prefix or postfix (e.g. Seán Óg Ó Súilleabháin).
Spanish-speaking people often have two or more surnames, but
even English-speaking people may have double-barrelled or hyphenated surnames.
In German, a family may have a second surname, preceded by the word vulgo (meaning “so-called” or “also
known as”), in order to show their association with a farm or other property.
Such a vulgo name may change, therefore, when that family moves.
While there is a lot of variation so far, it’s still
possible to describe distinct cultural patterns. Every so often, someone
suggests having the flexibility to store the precisely categorised parts of
their (usually Western, English-speaking) names, and of “foreign names”,
together in their software. If they have some knowledge of software development
then they may be suggesting that Object Orientation (OO) can help to treat
those different patterns in a consistent way. However, let’s look at some more
variations.
Traditional Chinese names can use something called a Generational name to identify members of
a particular generation, including siblings, cousins, etc. There is no Western
equivalent of this custom.
Many cultures employ ‘name particles’, analogous to grammatical
particles, to separate the various parts of their names. For instance: “von”,
“van”, “der”, “de [la]”, “d′”, “the”, “[son] of”, “mc”, “mac”,
“Ó”, “Ní”, “Nic”, “Mhic”,
“Bean”, “Ui”, “y”, etc. These may occur almost
anywhere, and their behaviour under case conversion and sorting is culturally
dependent.
Then there’s the important case that all genealogists should
be aware of: the patronym and matronym[2].
These are surnames based on the given name of a male or female ancestor,
respectively. For instance: son of William (now Williamson, or Wilson), van Dijk, Nic Dhòmhnaill, Nikolayevich.
The OO advocates would suggest ‘no problem’, but what is the
practicality and the ultimate goal of categorising every single token in a
personal name, and then rigidly representing that classification in digital
storage?
Personal names, as described here, haven’t always existed.
At one time, people would have been given an epithet based on their occupation
(e.g. William the thatcher), their origin (or topoanthroponym, e.g. Robin of
Loxley), or some other attribute (e.g. Little John). Even now, we may encounter
epithetic titles such as Earl of Huntingdon, or Henry VIII. This is where many
software schemes start to break down, and these cases are usually given scant
consideration on the basis that few researchers can trace their lineage back
that far, or can reliably identify titled ancestors.
Indeed, an ancestor’s identification may have been just a
single-word mononym, as opposed to a multi-word polynym, so how can you
categorise that? I have pointed out previously that Native Americans typically
have unstructured names, and that they may have different names assigned at
different phases of their lives. My point here is that the particularly
diverse cultural origins within the US are not simply the product of latter-day
immigration, and that they will eventually affect many researchers.
What I’ve briefly described here are structural differences in personal names. These may vary from those
name structures that we take for granted in the West, through other structures
that we’re less familiar with, to having no discernable structure at all. Little
wonder that the design of STEMMA® makes a case for handling names as simple,
uncategorised sequences of tokens — multiple names being just alternative
sequences — but more on that later.
In a previous post, One
Name to Rule Them All, I explained about the many types and forms of name
that a person may be identified by in practice, and how that is a different set
to their preferred identifications. It also explained the relationship to
evidential forms (with their possible misspellings, transcription errors, and
informality) and to the labels that we, as researchers, want to identify them
by in our reports or charts.
In contrast to the purely structural differences, there are
a number of considerations that might be described as processing differences, including the following:
- How they’re sorted.
- Behaviour under case-conversion.
- Behaviour under capitalisation.
- How they’re compared.
- Handling of initials.
- Handling inheritance.
In the West, we commonly replace middle names with their
initials, and sometimes our forenames too, but this is not a universal option. Initials
are not applicable to logogram (or ideogram) based languages. Also, whilst we
accept their usage in some modern Latin-based languages, it would be a gross
generalisation to assume that all Latin-based languages, or indeed all alphabet-based
languages, modern and ancient, use this custom in personal names. Even people
of other cultures who have adopted Romanised versions of their native names may
not use initials.
Another issue with initials involves the case where we know
an initial but not what it stands for. As with any abbreviations, these must be
represented with a trailing period in order to prevent ambiguity, even in the
single-letter case. There are common cases where a name may contain a
single-letter non-initial, such as the Irish Ó (or O-fada, meaning from)
and the Spanish y (meaning and). An English-speaking example would be the renowned geologist J Harlen Bretz whose first name was “J” and not a “J.” initial.
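The trailing-period rule can be sketched in a few lines. This is only an illustration under my own assumptions: the function names `is_initial` and `matches_token` are hypothetical, and the matching rule (an initial matches any token beginning with the same letter) is one plausible reading rather than a defined behaviour.

```python
def is_initial(token: str) -> bool:
    """Treat a token as an initial only if it is one letter plus a trailing period."""
    return len(token) == 2 and token[0].isalpha() and token[1] == "."


def matches_token(recorded: str, candidate: str) -> bool:
    """Compare a recorded name token with a candidate token.

    An initial such as "J." matches any candidate starting with that letter.
    A bare single letter such as "J" or "y" is a complete token in its own
    right, so it must match exactly.
    """
    if is_initial(recorded):
        return candidate[:1].casefold() == recorded[0].casefold()
    return recorded == candidate


print(is_initial("J."), is_initial("J"), is_initial("y"))
print(matches_token("J.", "John"), matches_token("J", "John"))
```

Without the period convention, the “J” of J Harlen Bretz would be indistinguishable from an abbreviation of some longer name.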
If we were searching for Frederick and some text contained
frederick or FREDERICK then we would still expect a match; this is called
case-blind. If the text contained Frédérick then we may still expect a match;
this is called accent-blind. These are quite common ways of performing a
textual match in software, but lesser-known is that Unicode makes specific recommendations about which composed and decomposed
forms should be equivalent: http://www.unicode.org/reports/tr15/.
A composed form involves one Unicode character and a decomposed form involves
two or more Unicode characters. For instance, the Angstrom sign (U+212B, Å) should
match the combination Latin-A (U+0041, A) plus Combining-ring-above (U+030A,
°), as well as a Latin-A-with-ring-above (U+00C5, Å). This is normally achieved
by normalising each piece of text to its lowest common denominator (e.g.
lower-cased, no diacritical marks, and decomposed forms) and then comparing the
results using a standard match.
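As a minimal sketch of that normalisation, Python's standard `unicodedata` module can decompose text (NFD, as defined in the Unicode report cited above), after which the combining marks can be dropped and the case folded. The helper names `match_key` and `canonically_equal` are my own, and stripping every combining mark is an assumption that will be too aggressive for some languages.

```python
import unicodedata


def match_key(text: str) -> str:
    """Reduce text to a case-blind, accent-blind key for comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (accents, rings, etc.) left behind by decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def canonically_equal(a: str, b: str) -> bool:
    """True if two strings are the same once both are fully decomposed."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(match_key("Frédérick") == match_key("FREDERICK"))
print(canonically_equal("\u212B", "A\u030A"))   # Angstrom sign vs. A + combining ring
print(canonically_equal("\u212B", "\u00C5"))    # Angstrom sign vs. A-with-ring-above
```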
Now if we’re sorting
a mixture of text from different locales then we have another problem that
tends to get ignored by software people: culturally preferred sort orders.
Although there is an international sort order, it is basically just a
convenience for software people as it relies on the numeric character codes.
However, different cultures want to sort their characters in slightly different
ways. This issue was encountered by the SQL standard when Unicode text columns
were introduced since it made its column-specific “collation sequences” all but
useless. In effect, sort orders should be selected by the application, dependent
upon the current end-user, and not implied by the data itself.
Sorting and collation are troublesome in many ways. For instance, some cultures sort on their given names rather than their surname, and the position of those parts is similarly dependent upon culture. When a name includes multiple surnames, as in the Spanish-speaking world, then the sorting may attach priority to either of them depending on the person’s location. Also, any name particles may be considered significant (i.e. involved in the sort) or ignored during the sorting. Finally, the ideographic characters in Japanese names can be pronounced in different ways, and if the sorting is to reflect the way that the name is spoken then additional information is usually required to assist the sorting. In summary, there are two pieces of information required for correct sorting: the sorted representation (e.g. ‘surname, given-names’ in English) and a possible overriding “sort as” instruction when one-or-more tokens do not sort according to simple text rules.
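Those two pieces of information might be combined as follows. This is a sketch only: the class `ListedName` and its field names are hypothetical, the sample data is illustrative, and the plain case-folded string comparison merely stands in for whatever culturally aware collation the application would select for the current end-user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListedName:
    listing: str                   # sorted representation, e.g. "surname, given-names"
    sort_as: Optional[str] = None  # optional override for tokens that sort unusually

    def sort_key(self) -> str:
        # Use the override when present, otherwise the listing form itself.
        return (self.sort_as or self.listing).casefold()


names = [
    # Assumes a convention in which the particle "van" is ignored when sorting.
    ListedName("van Dijk, Jan", sort_as="Dijk, Jan"),
    ListedName("Evans, Mary"),
    ListedName("Bretz, J Harlen"),
]

ordered = [n.listing for n in sorted(names, key=ListedName.sort_key)]
print(ordered)
```

The name is still displayed as “van Dijk, Jan”; only its position in the listing is affected by the override.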
Case conversion is not something I recommend, despite it
being commonplace, since the specific choice of character case may be important
in a given language (e.g. Irish), or there may be no duality for a given character
(e.g. the German eszett, ß).
Even capitalisation — normally considered to be the uppercasing of the initial
letter, as with English proper nouns — is problematic. Sometimes it may be the
first two characters (e.g. O’Connor), or the second character (e.g. the Irish
hUiginn), or something more exotic such as deShannon, deSouza, or diCaprio (all
of which may incur an unwanted initial capitalisation). See Letter
Case and Capitalisation,
respectively.
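A short demonstration shows how easily generic string functions damage such names. It uses only Python's built-in string methods; the surname “Weiß” is simply an illustrative example containing an eszett.

```python
samples = ["deShannon", "hUiginn", "O'Connor", "diCaprio"]

# Naive "capitalisation": lower-case everything, then upper-case the first letter.
naive = {name: name.lower().capitalize() for name in samples}

# Case conversion is not always reversible: Python's default mapping turns the
# eszett into "SS" when upper-casing, and lower-casing cannot restore it.
upper = "Weiß".upper()
round_trip = upper.lower()

for original, mangled in naive.items():
    print(f"{original!r} -> {mangled!r}")
print(f"'Weiß' -> {upper!r} -> {round_trip!r}")
```

Every one of the sample names loses its original casing, which is why I prefer to store and display names exactly as entered.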
Lastly, there is name inheritance. In cultures where there
is an inherited part of a personal name — which isn’t true of all of them —
then it may be via the father’s line (patrilineal), or the mother’s line
(matrilineal), or both in the Spanish-speaking world. The inherited part may be
a surname or a given name (as in patronyms) but in Russia it is common to have
both a surname and a patronym. Even in cultures where we think we recognise a
simple case of a name being inherited from the father, the way in which that
name is represented may depend on the sex of the child. In other words, we can
never assume that it is simply tacked on. In marriage, it may be normal in some
cultures for the woman to not take the man’s family name, but this has also become
a life-style choice in many Western cases. The man may take the woman’s name,
or they may both take a hybrid name. The essential fact here is that there are
no rules. There are just conventions, and these will depend on the culture or
social group involved.
As well as wanting to adopt a portable approach to personal
names, and so avoid trying to taxonomise the non-taxonomical, I also wanted
STEMMA to adopt the same approach for both place names and group names. This
isn’t as wild as it first sounds. If you abandon any formalised structural
differences, then you find that all of the processing differences except
‘inheritance’ (see below) are also common. I take it as obvious that all of
these entity types also share the common requirement of supporting alternative
names — possibly in different languages — and linking the name changes to
specific dates or events.
In order to describe the STEMMA approach — which is still
evolving[3] —
I want to avoid simply showing code and use a schematic representation instead.
Personal names are represented by a series of time-dependent descriptions for
each distinct name, as follows:
The optional From
and To fields may be dates or Events
at which the name came into use or was (officially) no longer used, and the Name Type field may be something like “Maiden”
or “Adopted”. More important are the Canonical
Names section, which contains the preferred renderings of this name, and
the Match Sequences section, which
may contain additional matching instructions.
Note that this same structure is also used for the names of
places and of groups. The only difference is the vocabulary used for the Name Type field.
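For readers who think more easily in code, the description above might be modelled roughly as below. This is not STEMMA syntax: the class and field names are my own hypothetical choices, the From and To fields are simplified to plain dates (events are not modelled), and the sample person is invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class NameRecord:
    canonical_names: Dict[str, str]        # mode (e.g. "Listing") -> preferred rendering
    name_type: Optional[str] = None        # e.g. "Maiden" or "Adopted"
    valid_from: Optional[date] = None      # when the name came into use, if known
    valid_to: Optional[date] = None        # when the name stopped being used, if known
    match_sequences: List[str] = field(default_factory=list)

    def in_use_on(self, when: date) -> bool:
        """A missing bound is treated as open-ended."""
        after_start = self.valid_from is None or self.valid_from <= when
        before_end = self.valid_to is None or when <= self.valid_to
        return after_start and before_end


# Invented example: one person with a maiden name and a later married name.
records = [
    NameRecord({"Listing": "Smith, Jane"}, name_type="Maiden",
               valid_to=date(1900, 6, 1)),
    NameRecord({"Listing": "Brown, Jane"}, valid_from=date(1900, 6, 2)),
]

current = [r.canonical_names["Listing"] for r in records if r.in_use_on(date(1910, 1, 1))]
print(current)
```

The same record shape would serve for a place or a group, with only the vocabulary of `name_type` changing.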
The mode of usage for the first three canonical names is
fairly obvious. The Listing mode is used for ordered listings of names, and may
be supplemented by a separate “sort as” instruction for the problem cases
mentioned above. The match-sequences may specify very simple parsing instructions
for accepting name variants beyond the canonical ones. This will use the
following notation here:
- Name[i] - simple name token, e.g. Tony.
- {name, ...}[i] - mandatory selection from alternative tokens.
- [name, ...][i] - optional selection from alternative tokens.
The optional ‘i’ superscript (shown here as a trailing [i]) indicates that
initials are appropriate for the respective tokens.
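To make the notation concrete, here is a small matcher for sequences of that shape. It is a sketch under my own assumptions, not the STEMMA implementation: the `Slot` class, the `matches` function, and the sample sequence `{Anthony, Tony}[i] [James][i] Smith` are all hypothetical, and tokens are compared case-blind for simplicity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Slot:
    alternatives: Sequence[str]  # one token, or several alternative tokens
    optional: bool = False       # True for [..], False for {..} or a simple token
    initials: bool = False       # True when the 'i' marker is present

    def accepts(self, token: str) -> bool:
        for alt in self.alternatives:
            if token.casefold() == alt.casefold():
                return True
            # An initial is the first letter followed by a period.
            if self.initials and token.casefold() == alt[0].casefold() + ".":
                return True
        return False


def matches(slots: List[Slot], tokens: List[str]) -> bool:
    """True if the tokens satisfy the slots in order, skipping optional slots."""
    if not slots:
        return not tokens
    head, rest = slots[0], slots[1:]
    if tokens and head.accepts(tokens[0]) and matches(rest, tokens[1:]):
        return True
    return head.optional and matches(rest, tokens)


# {Anthony, Tony}[i] [James][i] Smith
sequence = [
    Slot(["Anthony", "Tony"], initials=True),
    Slot(["James"], optional=True, initials=True),
    Slot(["Smith"]),
]

for candidate in (["Tony", "Smith"], ["A.", "J.", "Smith"], ["Smith"]):
    print(candidate, matches(sequence, candidate))
```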
Let’s look at a trivial example:
Now STEMMA’s name handling has been accused of being
cumbersome and verbose but let me explain its layered approach. At run-time,
when the data is loaded, the name information is used to create a simple parse
tree using the normalised (see above) tokens. Developer Note: It turns out that this can be stored economically by
using token indices, into an “atom table”, but a local table (for the current
person) is just as effective as a global table (for the whole tree). Despite
the commonality of surnames, etc., the shorter local indices take up considerably
less space and may be packed more densely without data alignment issues. The
match-sequences section feeds the generation of the parse tree, but note that
it is a simple representative form and so not significant in terms of
repetitions or parse efficiency. The canonical names are also part of this
feed, in conjunction with the match-sequences, and so if we know the relevant
personal name style (see below) then the match-sequences are only required to
express cases beyond the canonical ones. The above example can be simplified,
therefore, by omitting all the explicit match-sequences.
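The idea of a local atom table in the Developer Note can be illustrated as follows. This is only a toy sketch: the `AtomTable` class and `encode_names` function are hypothetical names, and packing the indices into `bytes` simply assumes that one person has fewer than 256 distinct tokens.

```python
from typing import Dict, List, Tuple


class AtomTable:
    """Maps each distinct token to a small integer index."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}
        self.atoms: List[str] = []

    def intern(self, token: str) -> int:
        if token not in self._index:
            self._index[token] = len(self.atoms)
            self.atoms.append(token)
        return self._index[token]


def encode_names(names: List[str]) -> Tuple[AtomTable, List[bytes]]:
    """Encode one person's (already normalised) names against a local table."""
    table = AtomTable()
    encoded = [bytes(table.intern(tok) for tok in name.split()) for name in names]
    return table, encoded


table, encoded = encode_names(["tony smith", "anthony smith", "t. smith"])
print(table.atoms)
print([list(seq) for seq in encoded])
```

Because the table is local to one person, each index fits comfortably in a single byte, whereas indices into a tree-wide table would need to be wider.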
Furthermore, each of the main subject entities (Person,
Place, and Group) has a shorter mechanism for specifying a Semi-Formal
canonical name in the very simplest of cases: PersonalName, PlaceName, and GroupName,
respectively.
In other words, the STEMMA approach has been designed
bottom-up; starting with what is required for the in-memory parse tree, and
then working up to a simplified and practical representation within the data
files. The intermediate representations are not always required but their
availability gives the flexibility and power of expression when it is needed.
I have mentioned a name
style in this article, but I am still looking for an acceptable vocabulary.
The style of name (and hence the rules for sorting, initials, inheritance, etc.)
obviously depends on the relevant culture or social group, but these must
include historical ones as well as modern-day ones. The computer locale is
inadequate for this, as is a simple language identifier (ISO 639) or country
identifier (ISO 3166). My own name style, for instance, is very common but
terms such as English-speaking, Anglo-Saxon, or Anglo-American do not
adequately describe the group using this style, or the actual conventions associated
with the style.
GEDCOM also handled names as unstructured lists of tokens,
albeit with the family name enclosed between slashes. It supported multiple
names per person, and even a NAME_TYPE record to categorise them, e.g. as
maiden, married, or immigrant. V5.5 introduced an optional PERSONAL_NAME_PIECES
description to allow the individual name tokens to be typed, e.g. as given
name, surname, etc. However, V5.5.1 — the last official specification —
contained a warning that this wasn’t portable. The STEMMA approach is
considerably more powerful than either of the GEDCOM schemes, but has a certain
level of compatibility with its original scheme. I hope that my research has
indicated how that general direction is the more portable, both between different
name styles and between the names of different entity types.
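For comparison, the original GEDCOM convention of slashes around the family name can be read with a few lines of code. This is a simplified sketch of my own, assuming at most one slash-delimited part; the function name `split_gedcom_name` is hypothetical and the sample values are illustrative.

```python
import re
from typing import List, Optional, Tuple


def split_gedcom_name(value: str) -> Tuple[List[str], Optional[str]]:
    """Return the plain token list and the slash-marked family name, if any."""
    found = re.search(r"/([^/]*)/", value)
    if not found:
        return value.split(), None
    surname = found.group(1).strip() or None
    # Rebuild the name without the slashes, keeping the tokens in order.
    plain = value[:found.start()] + " " + found.group(1) + " " + value[found.end():]
    return plain.split(), surname


print(split_gedcom_name("William Lee /Parry/"))
print(split_gedcom_name("Little John"))
```

The token list is what carries across to an uncategorised scheme; the slashes add just one piece of typing on top of it.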
[1] Although suffix and postfix are usually treated as synonyms — both as nouns and
as verbs — I prefer to use the less-common postfix
as it was directly modelled on prefix
and so more accurately expresses the opposite condition.
[2] The Oxford English Dictionary (and many others) declares patronym
to be a noun, as expected, but patronymic
to be both a noun and an adjective. Interestingly, it presents a similar dual
usage for matronymic but doesn’t list
matronym, despite it being in common
use and listed in other dictionaries. I do not know the etymology of this but
using the –ic derivation as a noun really grates on the ear. I make no apologies, therefore, for reserving
the –ic forms as adjectives, and thus being consistent with words such as:
acronymic, toponymic, antonymic, eponymic, metonymic, and homonymic.