No, not that sort of date! Calendar dates are a crucial part
of historical research — including genealogy — but how well do we understand
them? Is there more to their representation than a mere distinction between accurate and approximate?
A calendar is simply a mechanism by which a given culture
records the passing of the days. I will try to restrict this article to the
Gregorian calendar that we use every day, although the basic principles can be
applied to any calendar.
The Gregorian calendar has a selection of units that may be
used in conjunction to express a given date, as illustrated below:
The pattern shown underneath each form is how it should be
represented numerically according to the ISO 8601 standard, and the
yearly-quarters pattern is shown in brackets since the ISO standard doesn’t
currently address that form (see "Is the ISO Date Standard Bad?").
Most genealogical dates try to describe a given day.
Providing the actual time of an event is quite rare, but references to larger
units are not so rare. When mentioning “last week”, or “the sixties”, or “19th
Century”, then the implication is that the whole of that period is being
referenced; not merely one particular day somewhere within it. Each of those ISO
patterns may be truncated to express a date representing some of those cases,
such as yyyy-mm or just yyyy. The proposed yyyy-Qq representation already
describes a period greater than one day (i.e. three months), and it would have
a very good use for certain record types. The GRO
indexes of civil registrations for vital events in England & Wales are
compiled on a quarterly basis, and that means that no finer-grained
representation would be appropriate when citing the date of their entries. STEMMA
refers to this concept as the granularity
of the date reference, and it roughly corresponds to the GEDCOM concept of a
date-period.
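As a minimal sketch of the idea of granularity, the following Python maps a truncated date string to the whole period that it references. The function name period_of and the returned labels are hypothetical, and the yyyy-Qq pattern is the form proposed in this article rather than anything defined by ISO 8601 or by STEMMA.

```python
import calendar
import re
from datetime import date

# Hypothetical patterns, tried in order; yyyy-Qq is the article's proposed
# quarter form, not part of ISO 8601.
PATTERNS = [
    ("day", re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")),
    ("month", re.compile(r"^(\d{4})-(\d{2})$")),
    ("quarter", re.compile(r"^(\d{4})-Q([1-4])$")),
    ("year", re.compile(r"^(\d{4})$")),
]


def period_of(text):
    """Return (granularity, first_day, last_day) for a truncated date string."""
    for granularity, pattern in PATTERNS:
        m = pattern.match(text)
        if m is None:
            continue
        year = int(m.group(1))
        if granularity == "day":
            day = date(year, int(m.group(2)), int(m.group(3)))
            return granularity, day, day
        if granularity == "month":
            month = int(m.group(2))
            last = calendar.monthrange(year, month)[1]
            return granularity, date(year, month, 1), date(year, month, last)
        if granularity == "quarter":
            first_month = 3 * (int(m.group(2)) - 1) + 1
            last_month = first_month + 2
            last = calendar.monthrange(year, last_month)[1]
            return granularity, date(year, first_month, 1), date(year, last_month, last)
        # Otherwise the whole year is being referenced.
        return granularity, date(year, 1, 1), date(year, 12, 31)
    raise ValueError(f"unrecognised date form: {text!r}")
```

A quarterly index entry such as period_of("1881-Q2") then yields the whole of April to June 1881, rather than pretending to identify one day within it.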
This is a subtle semantic difference from an approximate
date, but it is the latter that we’re more familiar with. We commonly have a day-based
date that we believe falls between some upper and lower limits — one of which
could be unknown in the general case (i.e. including before or after some
threshold). STEMMA refers to this concept as imprecision, and it roughly corresponds to the GEDCOM concept of a
date-range.
In fact, imprecision also applies to dates with a
granularity greater than one day, and the first table at Date
Margins shows how a ±margin is interpreted in conjunction with different
granularities by STEMMA. The following diagram uses lumen and penlumen[1]
to visually illustrate how equality
is interpreted as ‘having some overlap’, whilst the degree of the overlap may
be used to rank date matches.
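The notion of equality as 'having some overlap' can be sketched in a few lines of Python. Each date is assumed here to be a simple (start, end) pair of inclusive day values, and the scoring rule (shared days divided by the length of the shorter range) is only an illustrative assumption; it is not taken from STEMMA.

```python
from datetime import date


def overlap_days(a, b):
    """Number of days shared by two inclusive (start, end) ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, (end - start).days + 1)


def may_be_equal(a, b):
    """Two imprecise dates are treated as equal if they overlap at all."""
    return overlap_days(a, b) > 0


def match_score(a, b):
    """Assumed ranking: shared days over the length of the shorter range."""
    shorter = min((a[1] - a[0]).days, (b[1] - b[0]).days) + 1
    return overlap_days(a, b) / shorter
```

For example, the ranges 1 to 10 Jan 1881 and 6 to 20 Jan 1881 share five days, so they would count as a possible match but would rank below a pair where one range lies wholly inside the other.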
Another concept that is used with less-than-known dates is uncertainty. The difference between
uncertainty and imprecision concerns how sure you are of a date value or of a
date range. For instance, saying “I think he was born in 1878” would be a case
of uncertainty whereas saying “He was born during 1876–1880” would be a case of
imprecision. STEMMA doesn’t address this concept in the date notation, but it
can attach an attribute of Surety=certainty%
to the datum. By contrast, the US Library of Congress Extended Date Time Format
(EDTF) contains specific
syntax for representing each of these cases. It uses a suffix of ‘~’ (tilde) to
indicate imprecision and ‘?’ to indicate uncertainty; both of which may be
combined. For instance:
- 2000-06? Possibly June 2000, but not definitely.
- 1974~ Approximately the year 1974.
- 1974?~ Approximately 1974 but even that is uncertain.
These are examples of their Level-1 specification, but in Level-2
these suffixes may be applied to the individual parts of a date.
- 2004?-06-11 Uncertain year (month & day known).
- 2004-06~-11 Year and month are approx. (day known).
- 2004-(06)?-11 Uncertain month (year and day known).
- 2004-06-(11)~ Day is approximate (year & month known).
- 2004-(06)?~ Month is approx. and uncertain (year known).
- 2004-(06-11)? Month and day uncertain (year known).
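To make the whole-date suffixes concrete, here is a small Python sketch that separates a trailing '?' or '~' from the date itself. It covers only the Level-1 style of suffix shown in the first list above, using the syntax as quoted in this article; the parenthesised Level-2 forms are deliberately not handled, and the function name is hypothetical.

```python
def parse_level1_qualifiers(text):
    """Split trailing '?' and '~' qualifiers off a date string.

    Returns (date_part, uncertain, approximate).
    """
    date_part = text.rstrip("?~")
    suffix = text[len(date_part):]
    return date_part, "?" in suffix, "~" in suffix
```

Keeping the two flags separate from the date string means that the remaining date part can still be handled by whatever parsing and comparison logic is used for unqualified dates.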
The EDTF has comprehensive mechanisms for handling partial dates,
but I believe their mechanism for handling uncertainty in the digits of a date
(e.g. 19uu-12-uu) is actually misplaced as part of its specification. This
should not be date-specific and is encroaching on the bigger requirement of a
standard representation for uncertain characters during transcription.
One area of confusion is that although there are distinct
reasons for the different notational schemes, the schemes themselves are
sometimes indistinct. For instance, there’s the humanly-readable notation which
generally uses a c./ca. prefix (for circa,
meaning “about”) for approximate dates, and an en-dash for date ranges (e.g.
1852–1855). It may also use word prefixes such as before or after. Looking
at the GEDCOM support for dates shows that some of this humanly-readable
notation has crept into an essentially computer-readable notation. Its
date-range term uses prefix operators of AFT and BEF, and an infix operator of
BET. Its date-approximated term uses prefix operators of ABT, CAL, and EST. The
primary example of a computer-readable notation would be the ISO 8601 standard.
Although it may be acceptable to employ the ISO notation in a document, some
style guides indicate that the truncated numeric forms may be ambiguous to the
reader. For instance, 1910-11 with a hyphen would be an ISO representation of
November 1910, but 1910–11 (using an en-dash) would be a date range of 1910 to
1911. This ambiguity would not arise if the schemes were used in their
appropriate contexts.
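The following Python sketch shows how much rests on that single separator character. It assumes that a hyphen always means the ISO year-month form and that an en-dash always means an abbreviated year range within the same century; both the function name and those rules are assumptions made for illustration.

```python
import re

EN_DASH = "\u2013"
SHORT_FORM = re.compile(r"(\d{4})([-" + EN_DASH + r"])(\d{2})")


def read_short_form(text):
    """Read 'yyyy-nn' (hyphen) as a month, or the en-dash form as a year range."""
    m = SHORT_FORM.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognised form: {text!r}")
    year, separator, tail = int(m.group(1)), m.group(2), int(m.group(3))
    if separator == "-":
        if not 1 <= tail <= 12:
            raise ValueError(f"not a valid month: {tail}")
        return ("month", year, tail)
    # En-dash: treat the two digits as the end year within the same century.
    end_year = year - year % 100 + tail
    if end_year <= year:
        raise ValueError("range end must follow its start (no century rollover here)")
    return ("range", year, end_year)
```

With the hyphen, the string for 1910 and 11 is read as November 1910; with the en-dash, the same digits are read as 1910 to 1911.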
Even in the context of computer-readable notation, there are
distinct goals that separate the different schemes. The EDTF notation is an
expressive representation, designed to capture the full details of incomplete,
approximate, or uncertain dates, and may therefore be more applicable to
transcription. The W3CDTF
format — which is a restricted subset of the ISO standard, employing the format
yyyy-mm-dd — is a comparative representation. By that, I mean that any two
dates in that representation are comparable, and all such dates would form a totally ordered set in
mathematical terms. The ability to compare dates efficiently is essential for
timelines and for date searches, and the general ability (for any data-type)
underpins many types of software index, such as the B-tree. It’s worth noting that
the different numeric ISO forms, highlighted above, are individually comparative
but not together. For instance, the yyyy-mm-dd form cannot be directly sorted
with the yyyy-Www form, and this was one of the driving forces for STEMMA
implementing its own computer-readable notation; one that ensured all
granularities were inclusively comparative (see Date
Value).
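A short Python sketch illustrates the point. Fixed-width yyyy-mm-dd strings sort chronologically as plain text, but a yyyy-Www string does not sort sensibly against them, so one possible workaround (an assumption here, and not how STEMMA does it) is to convert the week form to a calendar date before comparing. The helper names are hypothetical; date.fromisocalendar requires Python 3.8 or later.

```python
from datetime import date


def chronological(day_strings):
    """Plain text sorting is chronological for fixed-width yyyy-mm-dd strings."""
    return sorted(day_strings)


def week_start(text):
    """Monday of an ISO week written as 'yyyy-Www'."""
    year, week = text.split("-W")
    return date.fromisocalendar(int(year), int(week), 1)


# As text, any week form sorts after every day form of the same year,
# because 'W' compares greater than any digit.
text_order_is_wrong = "1881-W01" > "1881-12-31"

# Converted to a common type, the comparison gives the expected answer.
date_order_is_right = week_start("1881-W01") < date(1881, 12, 31)
```

Both flags evaluate to True, which shows the two ISO forms being individually sortable but not sortable together until they share a representation.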
The ability to compare dates is also a requirement when both
imprecision and granularity are present together. Rather than encoding
imprecision in the date string, STEMMA uses its date notation to separately describe
the start and end of the associated date range. This avoids encumbering the
core notation while making it easy to implement comparisons in terms of the range
end-points. The second table at Date
Comparisons shows how STEMMA interprets comparison operators such as
less-than-or-equal in this situation.
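As a rough illustration of comparing in terms of range end-points, the Python below models an imprecise date as a start and an end day, and defines two possible readings of an ordering test. These readings are assumptions for the sake of the example and do not reproduce the STEMMA table; the class and function names are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class DateRange(NamedTuple):
    start: date  # earliest possible day
    end: date    # latest possible day


def definitely_before(a, b):
    """True only if every day in a precedes every day in b."""
    return a.end < b.start


def possibly_le(a, b):
    """True if some day in a is on or before some day in b."""
    return a.start <= b.end
```

A birth of 1876 to 1880 is definitely before a marriage in the second quarter of 1901, whereas a birth of 1876 to 1880 and a census age implying 1878 can only be said to be possibly ordered either way.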
Indicating that a date falls before, after, or between other
dates is called a temporal constraint.
These obviously have their uses when implementing the concept of imprecision,
but they are less appropriate between dates that both have some real-world
significance. If you roughly knew, for instance, the dates of someone’s birth
and baptism, then it would be inappropriate to express a temporal constraint to
indicate that the latter is greater than the former. It’s inappropriate because
the underlying semantics would have been lost. What is needed is an event constraint which indicates that
their baptism follows their birth, and this topic was briefly discussed back in
Eventful
Genealogy – Part II. More recently, the topic of representing the birth
order of a family’s children was discussed on the FHISO TSC-public mailing list at Birth
Order. In the situation where their birth dates were unknown, it was
suggested that a Family record could
implicitly order them. This may be true, but a proper event constraint is a much
more general concept, and one purposely designed to express those semantics. It
could even be applied between twins whose birth dates are identical but
whose birth order is nevertheless known.
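One way to picture an event constraint is as an ordering relation between the events themselves, with no dates involved at all. The Python sketch below uses the standard graphlib module (Python 3.9 or later) to derive a consistent sequence from such constraints. The event names and the dictionary layout are invented for illustration and are not drawn from STEMMA or from the FHISO discussion.

```python
from graphlib import TopologicalSorter


def event_order(constraints):
    """Return one sequence of events consistent with the given constraints.

    constraints maps each event to the set of events that must precede it.
    A contradictory set of constraints raises graphlib.CycleError.
    """
    return list(TopologicalSorter(constraints).static_order())


# Hypothetical twins with the same birth date but a known birth order.
twins = {
    "baptism of Ann": {"birth of Ann"},
    "birth of Ben": {"birth of Ann"},
    "baptism of Ben": {"birth of Ben"},
}
order = event_order(twins)
```

The resulting order always places the birth of Ann before both her baptism and the birth of Ben, even though no date comparison could separate the two births.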
If we want to take an extreme view of imprecision then we
have to discuss the concept of probability distributions. Simply saying that
something occurred during 1881–1885 doesn’t indicate whether 1881 is more or
less likely than 1883 (i.e. mid-range); it simply describes a flat distribution
of the likelihood. I believe that in most cases like this one, we could indicate
one date that would be the statistical mode (i.e. the
most common or likely value) of the distribution, but specifying and utilising
distribution curves would be impractical in my opinion.
An interesting take on this may be found in recent research undertaken
in Verona, Italy, to look at supporting fuzzy dates on their SITAVR information
system.[2]
Their research considers basic aspects of fuzzy dates, calendars, fuzzy
temporal constraint networks (FTCN), and probability distributions. Those
distributions are of a trapezoidal nature, and so require only four defining
values rather than a full curve. Although the report may be very academic, it’s
worth reading since the justification is the real-world archaeological data in
their SITAVR system; much of which is subjective, estimated, or imprecise.
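To show why four values are enough, here is the generic trapezoidal function written in Python. The parameter names a, b, c, and d are my own labels for the four defining values, and I am assuming the usual textbook shape (rising from a to b, flat from b to c, falling from c to d); the report may define its distributions differently.

```python
def trapezoid(x, a, b, c, d):
    """Relative likelihood of x for a trapezoid with a <= b <= c <= d.

    The result is 0.0 outside a..d, 1.0 on the plateau b..c, and changes
    linearly on the two slopes.
    """
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge
```

Setting a equal to b and c equal to d gives the flat distribution implied by a plain range such as 1881–1885, while widening the slopes lets the years at the edges count as possible but less likely.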
In conclusion, there are distinct reasons for the different date
notations, and we should keep them in focus so that we don’t confuse them:
- Computer-readable. These notations may record details of transcription issues (e.g. uncertain characters) or the uncertainty of a claimed date. I would contend that these are both general requirements that should apply to any datum — including numbers and text — and not just dates. For a decipherable date, they will also represent details of granularity and imprecision, both of which must be represented in a way that facilitates efficient comparison, sorting, and searching.
- Humanly-readable. The traditional notations we use in written works rarely go into great detail regarding the possibilities or the levels of surety. To produce a humanly-readable version of a computer notation, one option might be to generate the nearest traditional form and use a footnote, or an interactive pop-up or right-click equivalent, to supplement it with the greater detail.
The jury may be out as regards the level of detail required
in our notations, and whether imprecision should consider variable likelihoods
(i.e. some type of probability distribution). However, in constructing such a
notation, we must remain sure of whether it’s designed for humans or for
computer software, and whether the issues being addressed are specific to dates
or are a general consideration for any type of datum.
[1] Coined
from Latin paene (“almost”) and lumen
(“light”). Analogous to umbra and penumbra for shadow.
[2] Alberto
Belussi and Sara Migliorini, "Modeling Time in Archaeological Data: the
Verona Case Study", report to Dipartimento di Informatica Università degli
Studi di Verona, Apr 2014, Verona University (http://www.univr.it/documenti/AllegatiOA/allegatooa_40675.pdf
: accessed 29 Jan 2015).