Thursday, 29 January 2015

Warm Fuzzy Dates

No, not that sort of date! Calendar dates are a crucial part of historical research — including genealogy — but how well do we understand them? Is there more to their representation than a mere distinction between accurate and approximate?

A calendar is simply a mechanism by which a given culture records the passing of the days. I will try to restrict this article to the Gregorian calendar that we use every day, although the basic principles can be applied to any calendar.

The Gregorian calendar has a selection of units that may be used in conjunction to express a given date, as illustrated below:

Structure of Gregorian date units, and the associated ISO numeric patterns

The pattern shown underneath each form is how it should be represented numerically according to the ISO 8601 standard, and the yearly-quarters pattern is shown in brackets since the ISO standard doesn’t currently address that form (see Is the ISO Date Standard Bad?).

Most genealogical dates try to describe a given day. Providing the actual time of an event is quite rare, but references to larger units are not so rare. When mentioning “last week”, or “the sixties”, or “19th Century”, then the implication is that the whole of that period is being referenced; not merely one particular day somewhere within it. Each of those ISO patterns may be truncated to express a date representing some of those cases, such as yyyy-mm or just yyyy. The proposed yyyy-Qq representation already describes a period greater than one day (i.e. three months), and it would have a very good use for certain record types. The GRO indexes of civil registrations for vital events in England & Wales are compiled on a quarterly basis, and that means that no finer-grained representation would be appropriate when citing the date of their entries. STEMMA refers to this concept as the granularity of the date reference, and it roughly corresponds to the GEDCOM concept of a date-period.
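
To make granularity concrete, here is a minimal Python sketch that expands a truncated date string into the first and last day of the period it covers. The function name period_of is my own, the yyyy-Qq form is the proposed (non-ISO) pattern mentioned above, and none of this is STEMMA's actual implementation.

```python
import calendar
import re
from datetime import date


def period_of(text: str) -> tuple[date, date]:
    """Return (first day, last day) of the period named by a truncated date.

    Handles yyyy, yyyy-mm, yyyy-mm-dd and the proposed yyyy-Qq form only.
    """
    if m := re.fullmatch(r"(\d{4})", text):
        year = int(m[1])
        return date(year, 1, 1), date(year, 12, 31)
    if m := re.fullmatch(r"(\d{4})-Q([1-4])", text):
        year, quarter = int(m[1]), int(m[2])
        first_month = 3 * quarter - 2          # Q1 -> Jan, Q2 -> Apr, ...
        last_month = first_month + 2
        last_day = calendar.monthrange(year, last_month)[1]
        return date(year, first_month, 1), date(year, last_month, last_day)
    if m := re.fullmatch(r"(\d{4})-(\d{2})", text):
        year, month = int(m[1]), int(m[2])
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text):
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    raise ValueError(f"unsupported date form: {text!r}")


# A GRO index entry for the June quarter of 1881 covers April to June.
print(period_of("1881-Q2"))
```

The point of returning a pair is that a date with coarse granularity is the whole period, not an unknown day within it.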

This is a subtle semantic difference from an approximate date, but it is the latter that we’re more familiar with. We commonly have a day-based date that we believe falls between some upper and lower limits — one of which could be unknown in the general case (i.e. including before or after some threshold). STEMMA refers to this concept as imprecision, and it roughly corresponds to the GEDCOM concept of a date-range.

In fact, imprecision also applies to dates with a granularity greater than one day, and the first table at Date Margins shows how a ±margin is interpreted in conjunction with different granularities by STEMMA. The following diagram uses lumen and penlumen[1] to visually illustrate how equality is interpreted as ‘having some overlap’, whilst the degree of the overlap may be used to rank date matches.
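
The idea of equality as 'having some overlap' can be sketched in a few lines of Python. The ranking formula below (overlap divided by the length of the shorter range) is purely my own illustrative assumption, as are the function names; STEMMA's Date Margins table should be consulted for its real rules.

```python
from datetime import date

DateRange = tuple[date, date]  # (first day, last day), both inclusive


def overlap_days(a: DateRange, b: DateRange) -> int:
    """Number of days shared by two inclusive ranges (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, (end - start).days + 1)


def fuzzy_equal(a: DateRange, b: DateRange) -> bool:
    """Two imprecise dates 'match' when their ranges overlap at all."""
    return overlap_days(a, b) > 0


def match_score(a: DateRange, b: DateRange) -> float:
    """Hypothetical ranking: share of the shorter range that is overlapped."""
    shorter = min((a[1] - a[0]).days, (b[1] - b[0]).days) + 1
    return overlap_days(a, b) / shorter


a = (date(1880, 1, 1), date(1880, 1, 10))
b = (date(1880, 1, 6), date(1880, 1, 20))
print(fuzzy_equal(a, b), match_score(a, b))  # True 0.5
```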

Another concept that is used with less-than-known dates is uncertainty. The difference between uncertainty and imprecision concerns how sure you are of a date value or of a date range. For instance, saying “I think he was born in 1878” would be a case of uncertainty whereas saying “He was born during 1876–1880” would be a case of imprecision. STEMMA doesn’t address this concept in the date notation, but it can attach an attribute of Surety=certainty% to the datum. By contrast, the US Library of Congress Extended Date Time Format (EDTF) contains specific syntax for representing each of these cases. It uses a suffix of ‘~’ (tilde) to indicate imprecision and ‘?’ to indicate uncertainty; both of which may be combined. For instance:

  • 2000-06?                   Possibly June 2000, but not definitely.
  • 1974~                        Approximately the year 1974.
  • 1974?~                      Approximately 1974 but even that is uncertain.

These are examples of their Level-1 specification, but in Level-2 these suffixes may be applied to the individual parts of a date.

  • 2004?-06-11              Uncertain year (month & day known).
  • 2004-06~-11              Year and month are approx. (day known).
  • 2004-(06)?-11            Uncertain month (year and day known).
  • 2004-06-(11)~            Day is approximate (year & month known).
  • 2004-(06)?~               Month is approx. and uncertain (year known).
  • 2004-(06-11)?            Month and day uncertain (year known).
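
The whole-date (Level-1) suffixes are easy to separate from the date itself, as the following Python sketch shows. It follows the draft syntax exactly as quoted in this article, deliberately ignores the Level-2 parenthesised forms, and uses a function name of my own choosing.

```python
def split_edtf_level1(text: str) -> tuple[str, bool, bool]:
    """Split trailing '?' and '~' flags from a whole-date EDTF string.

    Returns (core date, uncertain, approximate). Level-2 forms such as
    2004-(06)?-11 are not handled here.
    """
    core = text
    uncertain = approximate = False
    # Peel flags off the right-hand end, in whichever order they appear.
    while core and core[-1] in "?~":
        if core[-1] == "?":
            uncertain = True
        else:
            approximate = True
        core = core[:-1]
    return core, uncertain, approximate


for example in ("2000-06?", "1974~", "1974?~"):
    print(example, split_edtf_level1(example))
```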

The EDTF has comprehensive mechanisms for handling partial dates, but I believe their mechanism for handling uncertainty in the digits of a date (e.g. 19uu-12-uu) is actually misplaced as part of its specification. This should not be date-specific and is encroaching on the bigger requirement of a standard representation for uncertain characters during transcription.

One area of confusion is that although there are distinct reasons for the different notational schemes, the schemes themselves are sometimes indistinct. For instance, there’s the humanly-readable notation which generally uses a c./ca. prefix (for circa, meaning “about”) for approximate dates, and an en-dash for date ranges (e.g. 1852–1855). It may also use word prefixes such as before or after. Looking at the GEDCOM support for dates shows that some of this humanly-readable notation has crept into an essentially computer-readable notation. Its date-range term uses prefix operators of AFT and BEF, and an infix operator of BET. Its date-approximated term uses prefix operators of ABT, CAL, and EST. The primary example of a computer-readable notation would be the ISO 8601 standard. Although it may be acceptable to employ the ISO notation in a document, some style guides indicate that the truncated numeric forms may be ambiguous to the reader. For instance, 1910-11 with a hyphen would be an ISO representation of November 1910, but 1910–11 (using an en-dash) would be a date range of 1910 to 1911. This ambiguity would not arise if the schemes were used in their appropriate contexts.
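
As a rough illustration of how those GEDCOM operators divide into the two terms, the Python sketch below classifies a date value by its leading keyword. It is only a toy: the function name and return shape are my own, it leaves the date text itself unparsed, and it ignores the rest of the GEDCOM date grammar.

```python
import re

# GEDCOM date-range prefixes and date-approximated prefixes.
RANGE_OPS = {"BEF": "before", "AFT": "after"}
APPROX_OPS = {"ABT": "about", "CAL": "calculated", "EST": "estimated"}


def classify_gedcom_date(value: str) -> tuple[str, list[str]]:
    """Return (kind, operands) for a GEDCOM date value."""
    text = value.strip()
    # BET is the infix case: BET <date> AND <date>
    between = re.fullmatch(r"BET\s+(.+?)\s+AND\s+(.+)", text)
    if between:
        return "between", [between[1], between[2]]
    keyword, _, rest = text.partition(" ")
    if keyword in RANGE_OPS:
        return RANGE_OPS[keyword], [rest.strip()]
    if keyword in APPROX_OPS:
        return APPROX_OPS[keyword], [rest.strip()]
    return "exact", [text]


print(classify_gedcom_date("BET 1852 AND 1855"))  # ('between', ['1852', '1855'])
print(classify_gedcom_date("ABT 1878"))           # ('about', ['1878'])
```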

Even in the context of computer-readable notation, there are distinct goals that separate the different schemes. The EDTF notation is an expressive representation, designed to capture the full details of incomplete, approximate, or uncertain dates, and may therefore be more applicable to transcription. The W3CDTF format — which is a restricted subset of the ISO standard, employing the format yyyy-mm-dd — is a comparative representation. By that, I mean that any two dates in that representation are comparable, and all such dates would form a totally ordered set in mathematical terms. The ability to compare dates efficiently is essential for timelines and for date searches, and the general ability (for any data-type) underpins many types of software index, such as the B-tree. It’s worth noting that the different numeric ISO forms, highlighted above, are individually comparative but not together. For instance, the yyyy-mm-dd form cannot be directly sorted with the yyyy-Www form, and this was one of the driving forces for STEMMA implementing its own computer-readable notation; one that ensured all granularities were inclusively comparative (see Date Value).
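
A short Python experiment shows the problem. Strings in the yyyy-mm-dd form sort chronologically as plain text, but adding a yyyy-Www string breaks that property. The sort_key function is a hypothetical normalisation of my own (each value is mapped to the first day it covers); it is not STEMMA's Date Value notation.

```python
from datetime import date

# Within one form, plain string order is chronological order.
days = ["1910-11-05", "1881-04-03", "1910-02-28"]
print(sorted(days))   # ['1881-04-03', '1910-02-28', '1910-11-05']

# Mixing forms breaks this: 'W' sorts after every digit, so the week
# lands after every day of its year regardless of when it falls.
mixed = ["2015-W05", "2015-01-29", "2015-01-01"]
print(sorted(mixed))  # ['2015-01-01', '2015-01-29', '2015-W05']


def sort_key(text: str) -> date:
    """Map either form to the first day it covers (illustrative only)."""
    if "-W" in text:
        year, week = text.split("-W")
        return date.fromisocalendar(int(year), int(week), 1)  # Monday
    return date.fromisoformat(text)


# Week 5 of 2015 begins on Monday 26 January, before the 29th.
print(sorted(mixed, key=sort_key))  # ['2015-01-01', '2015-W05', '2015-01-29']
```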

The ability to compare dates is also a requirement when both imprecision and granularity are present together. Rather than encoding imprecision in the date string, STEMMA uses its date notation to separately describe the start and end of the associated date range. This avoids encumbering the core notation while making it easy to implement comparisons in terms of the range end-points. The second table at Date Comparisons shows how STEMMA interprets comparison operators such as less-than-or-equal in this situation.
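
To show why end-points make comparison simple, here is a small Python sketch. The two interpretations of 'before' below are one plausible reading that I am assuming for illustration; they are not taken from STEMMA's comparison table, and the class and function names are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class DateRange(NamedTuple):
    start: date  # earliest possible day
    end: date    # latest possible day


def definitely_before(a: DateRange, b: DateRange) -> bool:
    """Every day in a precedes every day in b."""
    return a.end < b.start


def possibly_before(a: DateRange, b: DateRange) -> bool:
    """At least one day in a precedes at least one day in b."""
    return a.start < b.end


born = DateRange(date(1876, 1, 1), date(1880, 12, 31))     # "1876-1880"
census = DateRange(date(1881, 4, 3), date(1881, 4, 3))     # a single day
print(definitely_before(born, census))  # True
print(possibly_before(census, born))    # False
```

Each test needs only two comparisons of ordinary dates, whatever the granularity or imprecision of the original values.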

Indicating that a date falls before, after, or between other dates is called a temporal constraint. These obviously have their uses when implementing the concept of imprecision, but they are less appropriate between dates that both have some real-world significance. If you roughly knew, for instance, the dates of someone’s birth and baptism, then it would be inappropriate to express a temporal constraint to indicate that the latter is greater than the former. It’s inappropriate because the underlying semantics would have been lost. What is needed is an event constraint which indicates that their baptism follows their birth, and this topic was briefly discussed back in Eventful Genealogy – Part II. More recently, the topic of representing the birth order of a family’s children was discussed on the FHISO TSC-public mailing list at Birth Order. In the situation where their birth dates were unknown, it was suggested that a Family record could implicitly order them. This may be true, but a proper event constraint is a much more general concept, and one purposely designed to express those semantics. It could even be applied between twins, whose birth dates are identical but whose birth order is nevertheless known.
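
One way to picture event constraints is as a graph of 'follows' relationships that needs no dates at all. The Python sketch below uses the standard graphlib module (Python 3.9 or later) to derive a consistent ordering; the event labels and the order_events function are invented for illustration and do not reflect any FHISO or STEMMA data model.

```python
from graphlib import CycleError, TopologicalSorter


def order_events(follows: dict[str, set[str]]) -> list[str]:
    """Return the events in an order that respects every constraint.

    `follows` maps each event to the set of events it must come after.
    Contradictory constraints raise graphlib.CycleError.
    """
    return list(TopologicalSorter(follows).static_order())


constraints = {
    "baptism of Ann": {"birth of Ann"},
    # Twins born on the same day, but the birth order is known:
    "birth of Ben": {"birth of Ann"},
    "burial of Ann": {"baptism of Ann"},
}
print(order_events(constraints))

try:
    order_events({"birth": {"baptism"}, "baptism": {"birth"}})
except CycleError:
    print("contradictory constraints")
```

The twins case has no solution as a temporal constraint, since the two dates are equal, yet it is expressed here without difficulty.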

If we want to take an extreme view of imprecision then we have to discuss the concept of probability distributions. Simply saying that something occurred during 1881–1885 doesn’t indicate whether 1881 is more or less likely than 1883 (i.e. mid-range); it simply describes a flat distribution of the likelihood. I believe that in most cases like this one, we could indicate one date that would be the statistical mode (i.e. the most common or likely value) of the distribution, but specifying and utilising distribution curves would be impractical in my opinion.

An interesting take on this may be found in recent research undertaken in Verona, Italy, to look at supporting fuzzy dates on their SITAVR information system.[2] Their research considers basic aspects of fuzzy dates, calendars, fuzzy temporal constraint networks (FTCN), and probability distributions. Those distributions are of a trapezoidal nature, and so require only four defining values rather than a full curve. Although the report may be very academic, it’s worth reading since the justification is the real-world archaeological data in their SITAVR system; much of which is subjective, estimated, or imprecise.
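
To show why four values are enough, here is the generic trapezoid shape as a Python function. This is the textbook form rather than anything copied from the Verona report, the parameter names a to d are my own, and the result is a relative likelihood between 0 and 1 rather than a normalised probability.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Relative likelihood of x for a trapezoid defined by a <= b <= c <= d.

    Zero outside [a, d], one on the plateau [b, c], linear in between.
    """
    if not a <= b <= c <= d:
        raise ValueError("expected a <= b <= c <= d")
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge


# "Most likely 1882-1884, but possibly as early as 1880 or as late as 1886."
for year in (1879, 1881, 1883, 1885):
    print(year, trapezoid(year, 1880, 1882, 1884, 1886))
```

Setting a equal to b and c equal to d gives the flat distribution of a simple date range, so the ordinary case is just a special trapezoid.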

In conclusion, there are distinct reasons for the different date notations, and we should keep them in focus so that we don’t confuse them:

  • Computer-readable. These notations may record details of transcription issues (e.g. uncertain characters) or the uncertainty of a claimed date. I would contend that these are both general requirements that should apply to any datum — including numbers and text — and not just dates. For a decipherable date, they will also represent details of granularity and imprecision, both of which must be represented in a way that facilitates efficient comparison, sorting, and searching.
  • Humanly-readable. The traditional notations we use in written works rarely go into great detail regarding the possibilities or the levels of surety. To produce a humanly-readable version of a computer notation, one alternative might be to generate the nearest traditional form and use a footnote, or an interactive pop-up or right-click equivalent, to supplement it with the greater detail.

The jury may be out as regards the level of detail required in our notations, and whether imprecision should consider variable likelihoods (i.e. some type of probability distribution). However, in constructing such a notation, we must remain sure of whether it’s designed for humans or for computer software, and whether the issues being addressed are specific to dates or are a general consideration for any type of datum.

[1] Coined from Latin paene (“almost”) and lumen (“light”). Analogous to umbra and penumbra for shadow.
[2] Alberto Belussi and Sara Migliorini, "Modeling Time in Archaeological Data: the Verona Case Study", report to Dipartimento di Informatica Università degli Studi di Verona, Apr 2014, Verona University ( : accessed 29 Jan 2015).