When you record such things as a name, age, occupation,
place-of-birth, etc., do you refer to them as ‘facts’ or something else? Are
they held as simple text values in your database? Have you thought about the
true nature of those data items?
As usual in the digital side of genealogy, we have a
plethora of alternative terms for the same thing, and ambiguous interpretations
of the more common terms. Genealogists are encouraged to refer to these data
items as ‘facts’, although I have already made the point in “Evidence
and Where to Stick It” that their facticity is dependent upon the source
from which they came. A number of software developers prefer the term ‘PFACT’,
which stands for property, fact, attribute, characteristic, or trait. However,
this is squandering five perfectly good words – each with distinct meanings in
normal usage – and so reducing the possibility of any of them being given
distinct genealogical uses. I will be employing the more generic STEMMA® term
of ‘Properties’ in this post.
So, what is a Property? You might say that it is an item of
evidence[1]
taken from a given source of information. This is a fair description, but as
soon as you acknowledge that a Property is “an extracted and summarised item of information” then
a number of issues have to be considered and solved for their digital
representation. What I’m about to present is my own approach since, to date, I’m not
aware of any product that tackles all of these issues.
Foremost amongst the issues – and yet rarely discussed in the
context of Properties – is the difference between what was written and your
interpretation of it. Although this is a fundamental part of supporting evidence and conclusion, or E&C, I
need to clarify that, here, this is purely the analysis and interpretation of
each item rather than building them into any proof argument; that being a
separate phase. For instance, if a place name has been misspelled, or is hard
to read, then you need to record it as it was written (indicating any uncertain
characters) together with your interpretation of what it should have been. In
effect, each Property has two distinct values: the recorded one, including any
transcription anomalies, and the interpreted one. As with any form of
conclusion-making, you’ll also need a way to add any explanatory notes, and possibly
add some level of confidence in your result. I will come back to this duality
of Properties in a moment.
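
As a rough illustration of this duality, here is a minimal Python sketch. It is not STEMMA syntax, and every name in it (`PropertyValue`, `recorded`, `interpreted`, the 0-to-1 confidence scale, and the sample place name) is an assumption made purely for the example. Marking individual uncertain characters is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyValue:
    """One Property holding both its recorded and its interpreted form."""
    name: str                           # illustrative Property name
    recorded: str                       # text exactly as written in the source
    interpreted: Optional[str] = None   # what we conclude it should have been
    note: str = ""                      # explanation of the interpretation
    confidence: Optional[float] = None  # assumed scale: 0.0 (none) to 1.0 (certain)

    def differs(self) -> bool:
        """True when the interpretation departs from the recorded text."""
        return self.interpreted is not None and self.interpreted != self.recorded


# Invented sample: a misspelled place name kept as written, plus our reading of it.
place = PropertyValue(
    name="BirthPlace",
    recorded="Notingham",
    interpreted="Nottingham",
    note="Single 't' in the original; no other candidate place fits.",
    confidence=0.9,
)
```

Keeping both values side by side means the transcription is never overwritten by the conclusion, and software can still flag every case where the two disagree.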
All Property values are implicitly associated with a
particular time and place. For instance, someone’s name may have changed during
their life, and someone’s age will certainly have changed over time. STEMMA
copes with this because the Properties are associated with specific Event-to-Person
connections[2] in
the data, and the Event entity implicitly provides a relevant date for the interpretation
and applicability of the value.
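
The idea of hanging Properties on Event-to-Person connections can be sketched as follows. This is a simplified model under my own assumptions rather than STEMMA's actual schema: the `Event`, `PersonLink`, and `timeline` names, the keys, and the sample person are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Event:
    key: str
    when: date  # the date that gives the attached Properties their context


@dataclass
class PersonLink:
    """A connection between one Person and one Event, carrying Properties."""
    person_key: str
    event: Event
    properties: Dict[str, str] = field(default_factory=dict)


def timeline(links: List[PersonLink], person_key: str, prop: str) -> List[Tuple[date, str]]:
    """Return (event date, value) pairs for one Property, oldest first."""
    found = [
        (link.event.when, link.properties[prop])
        for link in links
        if link.person_key == person_key and prop in link.properties
    ]
    return sorted(found)


# Invented sample data: the same person under two different names over time.
baptism = Event("eBaptism", date(1850, 4, 7))
marriage = Event("eMarriage", date(1872, 9, 14))
census = Event("eCensus1881", date(1881, 4, 3))
links = [
    PersonLink("pMary", census, {"Name": "Mary Brown", "Age": "31"}),
    PersonLink("pMary", baptism, {"Name": "Mary Smith"}),
    PersonLink("pMary", marriage, {"Name": "Mary Smith", "Age": "22"}),
]
names = timeline(links, "pMary", "Name")
```

Because the date comes from the Event rather than from the Property itself, no value needs its own date stamp, and a change of name or age simply shows up as a different value on a later connection.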
Another issue to consider is the nature of the Property. Is
it the name of something (e.g. a person or place), a description (e.g. cause of
death), a date, or a measure of something (e.g. age, height, weight)? This is
termed its data-type. Its importance
lies with the interpreted value (rather than the written value), which
should be computer-readable in order to make the most use of it. Whilst I
acknowledge that some may object to this statement, let me try to make
a number of observations to justify it.
For the simple expedient of consistency checking, software
needs to know whether a value should be textual, numeric (integer or real), or
a date. More than this, though, a value such as a date can be used in a
timeline, and an age can be used to derive dates and to separate events, so
their values should be accessible to software. In the case of a person or place
reference, these can be linked (using some type of pointer mechanism) to the
corresponding Person or Place entity in the data. That linkage, which is as
much a conclusion as the interpreted value of any date, is required in order to
allow you to follow the reference to the entity’s details. However, the duality
of the Property values doesn’t require you to change the name from how it was
recorded at that time. Finally, in certain cases, a Property may have a
representation that doesn’t correspond to a value in the normal sense, either
because the written form was undecipherable or it had a special meaning. For
instance, the use of “Full Age” for a young married couple, or “Unknown”,
“N/A”, or “LNU” for an unknown name, are special non-values. There’s a golden
rule that you do not record anything in a name field that isn’t actually a name[3].
Being able to distinguish the recorded form from an interpreted form avoids
this issue.
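
To make the data-type and non-value points concrete, here is a hedged Python sketch of a consistency check. The `DataType` and `NonValue` enumerations, their members, and the `TypedProperty` class are assumptions of mine for illustration; they are not taken from STEMMA or from any existing product.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Union


class DataType(Enum):
    TEXT = "text"
    INTEGER = "integer"
    REAL = "real"
    DATE = "date"
    PERSON_REF = "person-ref"  # key of a Person entity
    PLACE_REF = "place-ref"    # key of a Place entity


class NonValue(Enum):
    """Interpretations that are deliberately not a value of the data-type."""
    UNDECIPHERABLE = "undecipherable"
    UNKNOWN = "unknown"          # e.g. "Unknown", "LNU"
    NOT_APPLICABLE = "n/a"
    SPECIAL = "special"          # e.g. "Full Age"


Interpreted = Union[str, int, float, date, NonValue]

# Python types accepted for each data-type (references are held as string keys).
_EXPECTED = {
    DataType.TEXT: (str,),
    DataType.INTEGER: (int,),
    DataType.REAL: (int, float),
    DataType.DATE: (date,),
    DataType.PERSON_REF: (str,),
    DataType.PLACE_REF: (str,),
}


@dataclass
class TypedProperty:
    data_type: DataType
    recorded: str             # as written, never altered
    interpreted: Interpreted  # machine-readable value or an explicit non-value

    def is_consistent(self) -> bool:
        """A non-value is always acceptable; otherwise the type must match."""
        if isinstance(self.interpreted, NonValue):
            return True
        return isinstance(self.interpreted, _EXPECTED[self.data_type])


surname = TypedProperty(DataType.TEXT, "LNU", NonValue.UNKNOWN)
age = TypedProperty(DataType.INTEGER, "Full Age", NonValue.SPECIAL)
good_date = TypedProperty(DataType.DATE, "3rd May 1871", date(1871, 5, 3))
bad_date = TypedProperty(DataType.DATE, "3rd May 1871", "1871-05-03")  # still just text
```

Here “LNU” stays in the recorded form where it belongs, while the interpreted form says plainly that the name is unknown, so nothing that isn't a name ever reaches a name field.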
If a Property is a measure of something, such as a height or
weight, then the interpreted value needs to identify the units. In all but one
case, it is debatable whether or not software will want to make use of these
units themselves as opposed to simply distinguishing values held in different
units. That exception involves the age of a person. Ages are normally recorded
in years, but ages in months, weeks, or even days, are quite common for infant
deaths. These may also be fractional rather than integer values, e.g. “3 ½
weeks”.
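
An age with explicit units might be handled along the following lines. This is only a sketch: the `Age` class, the unit names, and the average lengths used for months and years are my own assumptions, and the resulting birth date is an estimate rather than a conclusion.

```python
import math
from dataclasses import dataclass
from datetime import date, timedelta
from fractions import Fraction

# Days per unit. Months and years use average lengths, which is an
# approximation chosen for this example.
_DAYS_PER_UNIT = {
    "days": Fraction(1),
    "weeks": Fraction(7),
    "months": Fraction(3044, 100),   # about 30.44 days
    "years": Fraction(36525, 100),   # 365.25 days
}


@dataclass(frozen=True)
class Age:
    value: Fraction  # exact, so "3 ½" is held as 7/2 rather than 3.5
    unit: str        # one of the keys of _DAYS_PER_UNIT

    def approx_days(self) -> int:
        """Whole days represented by this age, rounded down."""
        return math.floor(self.value * _DAYS_PER_UNIT[self.unit])

    def estimated_birth(self, event_date: date) -> date:
        """Subtract the age from the date of the event that recorded it."""
        return event_date - timedelta(days=self.approx_days())


infant = Age(Fraction(7, 2), "weeks")  # recorded as "3 ½ weeks"
birth = infant.estimated_birth(date(1875, 6, 25))  # invented burial date
```

Holding the unit alongside the value is what lets software treat “3 ½ weeks” and “3 years” differently when deriving dates, instead of comparing bare numbers.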
Some Properties are necessarily multi-valued. The most
obvious case is a Role (i.e. the part a Person plays in an Event). For
instance, a witness at a wedding may also have been a relative of either the bride
or the groom. A computer representation must accommodate multiple values, and
support the duality for each instance.
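
A multi-valued Role could be represented roughly as below, with each value keeping its own recorded and interpreted forms. The class names, keys, and role labels are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoleValue:
    recorded: Optional[str]  # as written; None if inferred rather than written
    interpreted: str         # normalised role label (illustrative vocabulary)
    note: str = ""


@dataclass
class EventPersonLink:
    person_key: str
    event_key: str
    roles: List[RoleValue] = field(default_factory=list)

    def has_role(self, role: str) -> bool:
        """True if any of this person's roles is interpreted as `role`."""
        return any(r.interpreted == role for r in self.roles)


# Invented sample: a wedding witness who was also the bride's sister.
link = EventPersonLink(
    person_key="pAnn",
    event_key="eWedding",
    roles=[
        RoleValue(recorded="Witnes", interpreted="Witness"),
        RoleValue(recorded=None, interpreted="SisterOfBride",
                  note="Relationship inferred from other sources."),
    ],
)
```

Using a list of small value objects, rather than a single text field, is one way to let a person hold several roles in the same Event without losing the as-written form of any of them.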
It would be folly to try and enumerate all possible
Properties in advance of them being used. Different researchers, different
sources, and different cultures, may all result in unanticipated Properties having
to be recorded. What is required, therefore, is a scheme that allows custom
Properties to be freely defined without some onerous, centralised registration
process, and yet still allows those custom Properties to be loaded by any
compliant product. This is certainly possible but it is such a widespread
requirement – applying to many types, subtypes, and other sets of named values
– that I plan to write about it separately.
If you’re still with me then you’re probably about to say
‘this is way too complicated Tony’. Before you finish preparing your response,
though, consider these points:
- We cannot assume that a recipient of your data has access to the same online images, and the T&Cs that you’ve checked probably prohibit you from sharing your images. Also, if you’re one of the minority who still visit archives, etc., then the originals may be locked away, and not copiable or online at all. In other words, our transcriptions can be invaluable. Hence, if we take shortcuts with those transcriptions – even for mere Properties – and assume that we know what the author meant without recording things verbatim (or even literatim), or fail to mention crossings-out and other annotations, then we’re diluting that effort and short-changing some later recipient.
- Do we want our genealogy products to simply record what we type in? If so then we might as well just use a word-processor. Providing more detail, and making it machine-readable, means that our products can work with the data to provide such things as analysis and consistency checking.
I’ll close by providing some links to a couple of worked examples
in STEMMA for any code-junkies: “Transcription Anomalies” and “Census
Roles”. Between them, these deal with many of the cases discussed here,
including transcription anomalies, spelling errors, clarifications, and
mis-recorded information.
[1] If anyone wants to
comment that the evidence in any given source is more than a set of discrete
values then I entirely agree. There is usually much context and information
that cannot be distilled down to simple values. What we’re discussing here is
just the digested pieces of information that many genealogists store in their
databases, but also acknowledging that this alone is not fully representative.
[2] For historical
references to places, the corresponding STEMMA Properties would be associated with Event-to-Place connections.
[3] This issue is covered
in excellent detail by Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml
: accessed 22 Nov 2013); also his previous works: “The Lnu Family Mystery”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/TheLnuFamilyMystery.xhtml
: accessed 22 Nov 2013); “Unk is a Real Name”, Modern Software Experience, 10 Aug 2013 (http://www.tamurajones.net/UnkIsARealName.xhtml
: accessed 22 Nov 2013).