Thursday, 29 January 2015

Warm Fuzzy Dates

No, not that sort of date! Calendar dates are a crucial part of historical research — including genealogy — but how well do we understand them? Is there more to their representation than a mere distinction between accurate and approximate?

A calendar is simply a mechanism by which a given culture records the passing of the days. I will try to restrict this article to the Gregorian calendar that we use every day, although the basic principles can be applied to any calendar.

The Gregorian calendar has a selection of units that may be used in conjunction to express a given date, as illustrated below:

Structure of Gregorian date units, and the associated ISO numeric patterns

The pattern shown underneath each form is how it should be represented numerically according to the ISO 8601 standard, and the yearly-quarters pattern is shown in brackets since the ISO standard doesn’t currently address that form (see Is the ISO Date Standard Bad?).

Most genealogical dates try to describe a given day. Providing the actual time of an event is quite rare, but references to larger units are not so rare. When mentioning “last week”, or “the sixties”, or “19th Century”, then the implication is that the whole of that period is being referenced; not merely one particular day somewhere within it. Each of those ISO patterns may be truncated to express a date representing some of those cases, such as yyyy-mm or just yyyy. The proposed yyyy-Qq representation already describes a period greater than one day (i.e. three months), and it would have a very good use for certain record types. The GRO indexes of civil registrations for vital events in England & Wales are compiled on a quarterly basis, and that means that no finer-grained representation would be appropriate when citing the date of their entries. STEMMA refers to this concept as the granularity of the date reference, and it roughly corresponds to the GEDCOM concept of a date-period.
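To make the idea of granularity concrete, here is a minimal Python sketch that expands a truncated date string into the first and last day it covers. The function name `granular_range` is hypothetical, the yyyy-Qq form is the proposed (non-ISO) quarter pattern mentioned above, and this is not STEMMA's own notation or implementation.

```python
import calendar
import re
from datetime import date

# Accepts yyyy, yyyy-mm, yyyy-mm-dd, and the proposed (non-ISO) yyyy-Qq form.
_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?|-Q([1-4]))?")

def granular_range(text):
    """Return (first_day, last_day) for the whole period the string denotes."""
    m = _PATTERN.fullmatch(text)
    if not m:
        raise ValueError(f"unsupported date form: {text!r}")
    year = int(m.group(1))
    if m.group(4):                          # yyyy-Qq: a three-month quarter
        first_month = 3 * (int(m.group(4)) - 1) + 1
        last_month = first_month + 2
    elif m.group(2):                        # yyyy-mm or yyyy-mm-dd
        first_month = last_month = int(m.group(2))
    else:                                   # yyyy alone: the whole year
        first_month, last_month = 1, 12
    if m.group(3):                          # day granularity: a single day
        day = date(year, first_month, int(m.group(3)))
        return day, day
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)
```

For a GRO index entry cited as 1881-Q2, this yields 1 April 1881 to 30 June 1881, which is the whole quarter rather than an unknown day within it.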

This is a subtle semantic difference from an approximate date, but it is the latter that we’re more familiar with. We commonly have a day-based date that we believe falls between some upper and lower limits — one of which could be unknown in the general case (i.e. including before or after some threshold). STEMMA refers to this concept as imprecision, and it roughly corresponds to the GEDCOM concept of a date-range.

In fact, imprecision also applies to dates with a granularity greater than one day, and the first table at Date Margins shows how a ±margin is interpreted in conjunction with different granularities by STEMMA. The following diagram uses lumen and penlumen[1] to visually illustrate how equality is interpreted as ‘having some overlap’, whilst the degree of the overlap may be used to rank date matches.
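The 'having some overlap' interpretation can be sketched in a few lines of Python. This assumes each date has already been reduced to an inclusive (start, end) pair of days; the function names are hypothetical, and ranking by the raw number of shared days is only one possible measure of overlap, not necessarily the one STEMMA uses.

```python
from datetime import date

def overlap_days(a, b):
    """Number of days shared by two inclusive (start, end) date ranges."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, (end - start).days + 1)

def matches(a, b):
    """Two imprecise dates are treated as 'equal' if they share any day."""
    return overlap_days(a, b) > 0

def rank(target, candidates):
    """Matching candidates, ordered by shared days with target (best first)."""
    hits = [c for c in candidates if matches(target, c)]
    return sorted(hits, key=lambda c: overlap_days(target, c), reverse=True)
```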

Another concept that is used with less-than-known dates is uncertainty. The difference between uncertainty and imprecision concerns how sure you are of a date value or of a date range. For instance, saying “I think he was born in 1878” would be a case of uncertainty whereas saying “He was born during 1876–1880” would be a case of imprecision. STEMMA doesn’t address this concept in the date notation, but it can attach an attribute of Surety=certainty% to the datum. By contrast, the US Library of Congress Extended Date Time Format (EDTF) contains specific syntax for representing each of these cases. It uses a suffix of ‘~’ (tilde) to indicate imprecision and ‘?’ to indicate uncertainty; both of which may be combined. For instance:

  • 2000-06?                   Possibly June 2000, but not definitely.
  • 1974~                        Approximately the year 1974.
  • 1974?~                      Approximately 1974 but even that is uncertain.

These are examples of their Level-1 specification, but in Level-2 these suffixes may be applied to the individual parts of a date.

  • 2004?-06-11              Uncertain year (month & day known).
  • 2004-06~-11              Year and month are approx. (day known).
  • 2004-(06)?-11            Uncertain month (year and day known).
  • 2004-06-(11)~            Day is approximate (year & month known).
  • 2004-(06)?~               Month is approx. and uncertain (year known).
  • 2004-(06-11)?            Month and day uncertain (year known).
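As a rough illustration of the Level-1 suffixes only, the following Python sketch separates a trailing ‘?’ and/or ‘~’ from the date itself. The function name `split_qualifier` is hypothetical, and the sketch deliberately ignores the Level-2 forms above, where the suffixes attach to individual parts of the date.

```python
def split_qualifier(text):
    """Split a Level-1 style string into (date_part, uncertain, approximate).

    Only trailing '?' and '~' characters are considered; the per-component
    Level-2 forms such as 2004-(06)?-11 are out of scope for this sketch.
    """
    date_part = text.rstrip("?~")
    suffix = text[len(date_part):]
    return date_part, "?" in suffix, "~" in suffix
```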

The EDTF has comprehensive mechanisms for handling partial dates, but I believe their mechanism for handling uncertainty in the digits of a date (e.g. 19uu-12-uu) is actually misplaced as part of its specification. This should not be date-specific and is encroaching on the bigger requirement of a standard representation for uncertain characters during transcription.

One area of confusion is that although there are distinct reasons for the different notational schemes, the schemes themselves are sometimes indistinct. For instance, there’s the humanly-readable notation which generally uses a c./ca. prefix (for circa, meaning “about”) for approximate dates, and an en-dash for date ranges (e.g. 1852–1855). It may also use word prefixes such as before or after. Looking at the GEDCOM support for dates shows that some of this humanly-readable notation has crept into an essentially computer-readable notation. Its date-range term uses prefix operators of AFT and BEF, and an infix operator of BET. Its date-approximated term uses prefix operators of ABT, CAL, and EST. The primary example of a computer-readable notation would be the ISO 8601 standard. Although it may be acceptable to employ the ISO notation in a document, some style guides indicate that the truncated numeric forms may be ambiguous to the reader. For instance, 1910-11 with a hyphen would be an ISO representation of November 1910, but 1910–11 (using an en-dash) would be a date range of 1910 to 1911. This ambiguity would not arise if the schemes were used in their appropriate contexts.
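The split between GEDCOM's date-range and date-approximated operators can be shown with a small Python sketch. It only inspects the leading keyword and makes no attempt to parse the date itself; the function name `classify` and the returned labels are assumptions made for illustration.

```python
# Operator keywords as listed above for the two GEDCOM terms.
RANGE_OPERATORS = {"AFT", "BEF", "BET"}
APPROX_OPERATORS = {"ABT", "CAL", "EST"}

def classify(value):
    """Label a GEDCOM date value by its leading operator keyword, if any."""
    parts = value.split(maxsplit=1)
    keyword = parts[0].upper() if parts else ""
    if keyword in RANGE_OPERATORS:
        return "range"
    if keyword in APPROX_OPERATORS:
        return "approximated"
    return "plain"
```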

Even in the context of computer-readable notation, there are distinct goals that separate the different schemes. The EDTF notation is an expressive representation, designed to capture the full details of incomplete, approximate, or uncertain dates, and may therefore be more applicable to transcription. The W3CDTF format — which is a restricted subset of the ISO standard, employing the format yyyy-mm-dd — is a comparative representation. By that, I mean that any two dates in that representation are comparable, and all such dates would form a totally ordered set in mathematical terms. The ability to compare dates efficiently is essential for timelines and for date searches, and the general ability (for any data-type) underpins many types of software index, such as the B-tree. It’s worth noting that the different numeric ISO forms, highlighted above, are individually comparative but not together. For instance, the yyyy-mm-dd form cannot be directly sorted with the yyyy-Www form, and this was one of the driving forces for STEMMA implementing its own computer-readable notation; one that ensured all granularities were inclusively comparative (see Date Value).
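The point about forms being individually comparative but not together can be demonstrated in Python. Strings in the yyyy-mm-dd form sort correctly as plain text, but mixing in a yyyy-Www string does not, because the letter ‘W’ sorts after every digit. The `sort_key` helper below is a hypothetical workaround that maps each string to the first day it covers; it is not STEMMA's Date Value notation.

```python
from datetime import date

def sort_key(text):
    """First day covered by a yyyy-mm-dd or yyyy-Www string."""
    if "-W" in text:
        year, week = text.split("-W")
        # Monday of the given ISO week (requires Python 3.8 or later).
        return date.fromisocalendar(int(year), int(week), 1)
    return date.fromisoformat(text)

mixed = ["2015-02-01", "2015-W03"]

# Plain text ordering puts 1 February before week 3 (12-18 January).
naive = sorted(mixed)

# Ordering by the first covered day restores chronological order.
by_start = sorted(mixed, key=sort_key)
```

Here `naive` leaves the February date first, whereas `by_start` places 2015-W03 first.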

The ability to compare dates is also a requirement when both imprecision and granularity are present together. Rather than encoding imprecision in the date string, STEMMA uses its date notation to separately describe the start and end of the associated date range. This avoids encumbering the core notation while making it easy to implement comparisons in terms of the range end-points. The second table at Date Comparisons shows how STEMMA interprets comparison operators such as less-than-or-equal in this situation.

Indicating that a date falls before, after, or between other dates is called a temporal constraint. These obviously have their uses when implementing the concept of imprecision, but they are less appropriate between dates that both have some real-world significance. If you roughly knew, for instance, the dates of someone’s birth and baptism, then it would be inappropriate to express a temporal constraint to indicate that the latter is greater than the former. It’s inappropriate because the underlying semantics would have been lost. What is needed is an event constraint which indicates that their baptism follows their birth, and this topic was briefly discussed back in Eventful Genealogy – Part II. More recently, the topic of representing the birth order of a family’s children was discussed on the FHISO TSC-public mailing list at Birth Order. In the situation where their birth dates were unknown, it was suggested that a Family record could implicitly order them. This may be true but a proper event constraint is a much more general concept, and one purposely designed to express those semantics. It could even be applied between twins when their birth dates are identical but their birth order was known to be otherwise.
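One way to picture an event constraint is as an ordering relation between events that carries no dates at all. The Python sketch below uses the standard library's `graphlib` module (Python 3.9 or later) to derive a consistent sequence from 'earlier, later' pairs. The function name `order_events` and the event labels are hypothetical, and this is not how STEMMA represents such constraints.

```python
from graphlib import TopologicalSorter

def order_events(constraints):
    """Return events in an order consistent with (earlier, later) pairs.

    No dates are involved: only the stated 'follows' relationships.
    Contradictory constraints raise graphlib.CycleError.
    """
    sorter = TopologicalSorter()
    for earlier, later in constraints:
        sorter.add(later, earlier)      # 'later' must come after 'earlier'
    return list(sorter.static_order())
```

The same mechanism would cover the twins case: a pair such as ("birth of twin A", "birth of twin B") records their order even though both births share one calendar date.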

If we want to take an extreme view of imprecision then we have to discuss the concept of probability distributions. Simply saying that something occurred during 1881–1885 doesn’t indicate whether 1881 is more or less likely than 1883 (i.e. mid-range); it simply describes a flat distribution of the likelihood. I believe that in most cases like this one, we could indicate one date that would be the statistical mode (i.e. the most common or likely value) of the distribution, but specifying and utilising distribution curves would be impractical in my opinion.

An interesting take on this may be found in recent research undertaken in Verona, Italy, to look at supporting fuzzy dates on their SITAVR information system.[2] Their research considers basic aspects of fuzzy dates, calendars, fuzzy temporal constraint networks (FTCN), and probability distributions. Those distributions are of a trapezoidal nature, and so require only four defining values rather than a full curve. Although the report may be very academic, it’s worth reading since the justification is the real-world archaeological data in their SITAVR system; much of which is subjective, estimated, or imprecise.
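A trapezoidal distribution needs only four values: the outer limits and the inner plateau. As a hedged Python illustration (the parameter names a, b, c, d and the example years are my own assumptions, not taken from the Verona report), the degree to which a given year fits could be computed like this:

```python
def trapezoid(x, a, b, c, d):
    """Degree (0.0 to 1.0) to which x fits a trapezoid with a <= b <= c <= d.

    Values outside [a, d] score 0, values on the plateau [b, c] score 1,
    and the two sloping sides are interpolated linearly.
    """
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)        # rising edge between a and b
    return (d - x) / (d - c)            # falling edge between c and d
```

With hypothetical limits of 1880, 1882, 1884, and 1886, the year 1883 scores 1.0 while 1881 and 1885 each score 0.5, in contrast to the flat likelihood implied by a bare range.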

In conclusion, there are distinct reasons for the different date notations, and we should keep them in focus so that we don’t confuse them:

  • Computer-readable. These notations may record details of transcription issues (e.g. uncertain characters) or the uncertainty of a claimed date. I would contend that these are both general requirements that should apply to any datum — including numbers and text — and not just dates. For a decipherable date, they will also represent details of granularity and imprecision, both of which must be represented in a way that facilitates efficient comparison, sorting, and searching.
  • Humanly-readable. The traditional notations we use in written works rarely go into great detail regarding the possibilities or the levels of surety. To produce a humanly-readable version of a computer notation, one alternative might be to generate the nearest traditional form and use a footnote, or an interactive pop-up or right-click equivalent, to supplement it with the greater detail.

The jury may be out as regards the level of detail required in our notations, and whether imprecision should consider variable likelihoods (i.e. some type of probability distribution). However, in constructing such a notation, we must remain sure of whether it’s designed for humans or for computer software, and whether the issues being addressed are specific to dates or are a general consideration for any type of datum.

[1] Coined from Latin paene (“almost”) and lumen (“light”). Analogous to umbra and penumbra for shadow.
[2] Alberto Belussi and Sara Migliorini, "Modeling Time in Archaeological Data: the Verona Case Study", report to Dipartimento di Informatica Università degli Studi di Verona, Apr 2014, Verona University ( : accessed 29 Jan 2015).

Saturday, 17 January 2015

Hierarchical Sources

Some interrelated topics to be discussed in this article: What is a hierarchical source? How does it relate to a hierarchical arrangement and to provenance? Is one hierarchy enough? Does it affect our citations? Does it affect digital organisation?

You may be thinking that these are unrelated topics but let’s just begin with a question commonly posed in genealogical forums and mailing lists: ‘how do I organise my media files?’. This usually translates into ‘how do I name my files?’ or ‘how do I arrange the corresponding folder hierarchy?’, and an example may be found at: How should I Organise My Digital Documents?.

Most people have at least tried to organise their digital artefacts (i.e. document files and media files) by surname, and then realised how impractical that is. For instance, should a marriage certificate be organised by the groom’s or the bride’s surname? Whose surname do you use for a group photograph that includes several generations of relatives and in-laws? What do you do when you have inherited a photograph of ‘woman holding a baby’ but you haven’t yet formed a positive identification?

Back in May 2013, Sarah Ashley presented her solution at Organizing Your Genealogical Documents. This was to use a source-based scheme where the documents were each assigned sequential 4-digit identifiers while being scanned. Different computer folders were then used to store the different categories of material, such as vital events, newspaper articles, photographs, census pages, etc., and the individual files were named using the corresponding identifiers. In the following June, Louis Kessler presented an improved version of this at Source Based Document Organization where he suggested a hierarchical organisation, and also the use of the GEDCOM REFN tag (defined as: “A description or number used to identify an item for filing, storage, or other reference purposes”) for linking to the relevant documents.

These are good schemes but neither one suggests how those flat or hierarchical identifiers should be allocated, nor whether (when material was copied from some external source) there should be any relationship to external cataloguing of the original or an online version. Also, in the case of physical rather than digital artefacts, how should someone deal with a collection donated by or inherited from another family member?

The answer to these issues can be found in archival science. Archivists have been doing this for years, and they have international standards and a well-established vocabulary. In particular, provenance is a core principle of archival science, and it has two fundamental concepts: respect des fonds — basically grouping records according to their creator, or fonds — and original order — basically maintaining the same record order as that of their creator. In effect, I’m suggesting that we should manage our physical and digital artefacts as a micro-archive.

In May 2013, Sue Adams produced an excellent description of this approach on her Family Folklore blog at Provenance of a Personal Collection – Archival Accession, Arrangement and Description. She explained that archival arrangement places all the items at positions in a hierarchy reflecting their provenance, usage, and physical structure. An archival description would then provide information that served to identify, manage, locate and explain the archival materials at each level.[1] This definition is important because the resulting catalogue should identify information about an item rather than information within an item, and so not express any analysis or conclusions — more on this later.

The International Standard Archival Description (ISAD) defines a model for the levels in a hierarchical arrangement, and this includes fonds, sub-fonds, series, sub-series, files, and items[2]; items being the lowest level. A fonds (silent ‘d’ and ‘s’) is a term for a grouping of documents that have been naturally accumulated by an individual, family, or organisation as a result of their normal activities or work. This replaces much of the usage of the older term collection which is now reserved for groupings that have been assembled rather than created. The difference is effectively whether the grouping relates to a common provenance rather than a common characteristic.

Following Sue’s lead, I’ll select an example using a reference to a page in the 1901 census of England[3], as held by The National Archives of the UK (TNA): piece 3191, folio 125, page 19. Their guide to citing their documents and catalogues presents the following general document-reference formats:[4]

dept-code  series / piece
dept-code  series / piece / item

The census folio and page number are actually internal identifiers for those items, and so are relevant to a citation for some piece of information but not to the cataloguing of the associated source. The result might be something like:

RG 13/3191, f.125, p.19

Note that their recommendations involve the folio abbreviations f./ff. rather than the fo./fos. ones that some readers may be familiar with.
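The separation between the catalogue reference and the internal folio and page identifiers can be sketched in Python as two small formatting functions. The function names are hypothetical, and the layout simply follows the formats and example shown above.

```python
def tna_reference(dept, series, piece, item=None):
    """Catalogue reference: 'dept-code series/piece' with an optional item."""
    ref = f"{dept} {series}/{piece}"
    if item is not None:
        ref += f"/{item}"
    return ref

def census_citation(dept, series, piece, folio, page):
    """Append the internal folio and page identifiers to the reference."""
    return f"{tna_reference(dept, series, piece)}, f.{folio}, p.{page}"
```

Calling `census_citation("RG", 13, 3191, 125, 19)` reproduces the reference given above.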

Evidence Explained covers the use of citations for multi-levelled archival arrangements, and remarks that “Your citation should follow the practice of the archive whose material you are using”.[5] However, it also warns that this may lead to conflicting styles when dealing with international sources, such as whether elements should be sequenced large-to-small or small-to-large.[6]

It has been suggested, more than once, that digital images should contain elements of meta-data that detail their provenance, and that this would greatly help when people have downloaded otherwise-untraceable images from online sources. The ubiquitous copying and downloading of digital images makes it nigh on impossible to know where they first came from, and who they should be attributed to. There is nothing technically impossible about this. For instance, the XMP meta-data design applies to several image and document formats, and it does so without hindering applications that read them. It uses namespaces to make it applicable to any number of distinct meta-data sets, and it even has an international standard: ISO 16684-1:2012. Unfortunately, XMP is still a registered trademark of Adobe Systems Inc., and this has probably limited its take-up. One of the safer (i.e. more portable) alternatives is something called sidecar files, where the meta-data is held separately from the associated data by using a second file with a related name.
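A sidecar file needs nothing more than a naming convention and a simple serialisation. The following Python sketch assumes a convention of appending ‘.json’ to the artefact's file name and storing the meta-data as JSON; both choices, and the function names, are assumptions for illustration rather than any established standard.

```python
import json
from pathlib import Path

def sidecar_path(artefact):
    """Related name for the meta-data file, e.g. 0042.jpg -> 0042.jpg.json."""
    artefact = Path(artefact)
    return artefact.with_name(artefact.name + ".json")

def write_sidecar(artefact, metadata):
    """Store meta-data beside the artefact without touching the artefact."""
    target = sidecar_path(artefact)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target
```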

There is a good case for using something like XMP in this image-copying scenario, but it becomes less useful for your micro-archive because (a) it will likely contain physical as well as digital artefacts, and (b) the meta-data will be applicable to different units in the hierarchy, and not simply to the lowest-level items. Where a specific arrangement of the artefacts has been created then it is more common to use a meta-data database. However, STEMMA’s approach is to use its own file format to create machine-readable archival descriptions. Its files are plain-text, and its Resource entities may be used to describe each of the levels of a hierarchical arrangement. By incorporating its inheritance mechanism, this allows the description, provenance, access control, and any amount of meta-data to be provided for each unit in a single text file that can be loaded and referenced by other STEMMA files.

STEMMA has two main entity types relevant to this discussion:

  • Resource — a representation of a digital or physical artefact, or a combination of these such as when you have a scan of an original letter in your possession, or a digital photograph of a set of medals.
  • Citation — despite its name, this is a generalised reference to sources or information held elsewhere. It includes the location of information within a source, as well as the location of a source itself, and even allows for the representation of attribution.

Both of these entities share a mechanism of parameterisation. This means that they can both define a number of named parameters, each having its own specific data-type. These can be used in a similar way to the REFN tag, mentioned above, but with significant advantages. That GEDCOM tag allows only for a single amorphous code; a code that cannot be decomposed or reverse-engineered. Having the elements of a hierarchical reference in separate parameters allows them to be used in a more powerful fashion. For instance, returning to Sue’s example, the initial levels of her arrangement might be described using the following entities:

<Resource Name='rRWC' Abstract='1'>
    <Title>Raymond Walter Coulson (1922-1997) collection</Title>
        <Param Name='Lev1'>CWC</Param>
        <Param Name='File'/>
        <Param Name='Folder1'>collections/${Lev1}</Param>

    Papers, photographs, correspondence, memorabilia and probate
    documents of Raymond Walter Coulson of 322 Aston Hall Road,
    Aston, Birmingham, who died intestate on 24 May 1997
</Resource>

<Resource Name='rRWC_Probate' Abstract='1'>
    <Title>Probate file</Title>
    <BaseResourceLnk Name='rRWC'/>
        <Param Name='Lev2' Type='Integer'>1</Param>
        <Param Name='Folder2'>${Folder1}/${Lev2}</Param>

    Compiled by [my dad], administrator for the estate of Raymond
    Walter Coulson, between May 1997 and January 1998
</Resource>

These effectively construct a hierarchy of Resource entities describing the various units in that arrangement. An individual file, such as a marriage certificate, could then be specified through its file name and the Resource entity representing that archival unit. Yes, the folder names could have been hard-coded, and the Resource entities crafted independently of each other, but using the inheritance mechanism introduced in Genealogical Inheritance makes it more flexible and maintainable. Each Resource inherits an accumulated set of parameters from the higher levels.
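To show how such inherited parameters might accumulate, here is a minimal Python sketch that mimics the two Resource entities with plain dictionaries and expands the ${...} references using the standard library's `string.Template`. The dictionary layout and the `resolve` function are assumptions made for illustration; STEMMA's actual resolution rules are defined by its specification, not by this code.

```python
from string import Template

# Hypothetical in-memory stand-ins for the two Resource entities.
resources = {
    "rRWC": {
        "base": None,
        "params": {"Lev1": "CWC", "Folder1": "collections/${Lev1}"},
    },
    "rRWC_Probate": {
        "base": "rRWC",
        "params": {"Lev2": "1", "Folder2": "${Folder1}/${Lev2}"},
    },
}

def resolve(name, resources):
    """Accumulate parameters from the top of the base chain downwards."""
    chain = []
    while name is not None:
        chain.append(resources[name])
        name = resources[name]["base"]
    params = {}
    for resource in reversed(chain):            # highest level first
        for key, value in resource["params"].items():
            # Each value may refer to any parameter already defined.
            params[key] = Template(value).substitute(params)
    return params
```

Resolving 'rRWC_Probate' gives a Folder2 value of collections/CWC/1, so changing Lev1 in the top-level entity would automatically relocate every unit beneath it.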

The parameter mechanism is a general-purpose tool, and may be used to add specific items of meta-data that you want to separate out of the associated archival description. One of the developers independently writing software around the STEMMA specification recently presented me with a related question. He was transferring photographic slides to a digital organisation, and wanted to know how to deal with dates written on each slide frame. Since parameters can be defined freely, I pointed out that a Resource parameter could be defined for this purpose with a specific data-type of ‘Date’. In a Citation entity, the parameters may be used to define citation elements; those discrete values that would be later formatted into a traditional reference-note citation.

Since both Resource and Citation entities share this parameterisation mechanism then it is also possible to pass parameters from one to another. Imagine, for instance, that the citation for the aforementioned census page had parameters for the piece, folio, and page. If you had a local image copy of it then it could be located using the same parameter values, either substituted into a file name or a folder hierarchy. They could even be used to interrogate a Web site in order to summon the census image on demand (see ‘rCensusImage’ example at Resource).[7]

We’ve mentioned the hierarchy of a source inherent in its archival arrangement, but are there any other examples of a source hierarchy? Well, the chain of data provenance when we cite a source (that is, the relationship between records and the individuals or organisations that have created, maintained, reproduced, transcribed, indexed, or otherwise modified them) also constitutes a hierarchy. When we cite a derivative source, such as an online edition, or some database, then we usually cite the source of the source in a secondary fashion.[8] Provenance also applies to specific information as well as to a source or source data. A common example is when we’re citing an author who is citing other works; ones they have consulted but which we haven’t. We may feel that the provenance of the information is important to our case, but we cannot directly cite what we haven’t consulted. This scenario is covered in some detail by Evidence Explained[9], but consider the case where an author hasn’t cited their source, but we believe we have identified an earlier version of their claim or statement. This may be very important to our case, especially if there are subtle differences, but a simple comment in a reference note may be insufficient to encompass our justification and reasoning.

An important point here is that these forms of hierarchy are facets of the real world, and not some subjective notion that software might decide to support or ignore. This issue was recently discussed on the FHISO TSC-Public mailing list starting at Filing Sources. STEMMA’s Citation entity was endowed with two types of hierarchical linkage: ParentCitationLnk, in order to model provenance (see Cite Seeing), and BaseCitationLnk, in order to model the structure of groupings and the structure within a given source (see Genealogical Inheritance).

So, both local materials and consulted materials held elsewhere, including any associated digital images of them, can be represented using some combination of Resource and Citation entities. The core genealogical data is where we would analyse those materials and form our conclusions, and that will necessarily require links or citations to those materials. However, materials should never be catalogued according to such conclusions since they may change. If you’re cataloguing a photograph of ‘woman holding a baby’, or a painting of ‘a cracked vase with daisies’, then it must be independent of opinions or conclusions. Even their archival description must only record what we know about the materials rather than something we’ve determined from their contents. Our core genealogical data will also need to reference these materials from multiple points, and in different ways — something that renders a simple name-based arrangement redundant.

This article is placing great emphasis on both our sources and the local artefacts in our own micro-archives, but why? Isn’t one arrangement as good as another? Why do we need to be concerned with provenance, or with the arrangement used by some archive? The answer to this would be obvious to an archivist, or to an historian, but less so to most genealogists. The problem is that the majority of genealogy — and especially where it involves online family trees — is people-centric. The pursuit very often boils down to that of searching for a person’s name, or the vital events of a named person, and since the results will mostly come from online data — data that is deliberately keyed on personal names — then it has some consequences:

  • The source of the information is an afterthought. Although some Web sites allow a researcher to tag data with links to their relevant online content, that is merely an electronic bookmark (in the form of a URL) and not a real citation.
  • Even when a researcher references a source, it is only in the context of a citation. The belief that the data is the answer, as opposed to the source contains a clue, means that any reasoning for the making of a considered argument is being short-circuited.

Contrast this with the way someone might approach historical research, where individual sources are assimilated and relevant items analysed and correlated with information from other sources. That style of research begins with a source rather than with a name. The origin, nature, and quality of the source are then very important factors during its analysis.

This same point was recently raised by Jan Murphy on the FHISO TSC-Public mailing list at Format for Raw Source Content, and I’ll leave you with her own words, which were designed to keep the software mindset focused on the real world:

I hate to keep arguing this point over and over again, but we are looking at documents and other source material.  We are not looking at people.  We are looking at sources, most of which (but not all) contain names. 

A lot of beginning researchers, including many of the people in the Genealogy Do-Over group, struggle to learn how to cite their sources, and why? Because if you work in a people-centric system the sources are always an afterthought.

[1] CBPS - Sub-Committee on Descriptive Standards, "ISAD(G): General International Standard Archival Description - Second edition", International Council on Archives (ICA) ( : accessed 16 Jan 2015); attached document CBPS_2000_Guidelines_ISAD(G)_Second-edition_EN.pdf; glossary, p.10, s.v. “archival description”.
[2] “Model of the levels of arrangement of a fonds”, ISAD(G), appendix A-1, p.36.
[3] Whereas these TNA census references apply to England & Wales, they do not apply to Scotland. Scotland has its own system (see and this has caused issues for sites such as findmypast that try to provide a UK-wide search form. The Ancestry equivalent only solicits criteria such as piece/folio/page when specifically searching, say, the census of England, but findmypast currently solicits them in all UK cases, whether relevant or not. See Chris Paton’s views on this at FindmyPast - Scottish censuses.
[4] "Citing documents in The National Archives", The National Archives of the UK (TNA) ( : accessed 16 Jan 2015).
[5] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace, 2nd ed. (Baltimore, Maryland: Genealogical Pub. Co., 2009), pp.116–119.
[6] E. S. Mills, sec.3.3 “International Differences”.
[7] The idea of a reliable, non-internal URL for summoning the image of a particular census page is a nice idea that could help when sharing data with friends and relatives, or between researcher and clients, without the paranoia associated with T&Cs or copyright. Although the recipient would need a subscription to the site, the idea could be adopted by other providers to create a sort of genealogical "Open URL" variation. It would be quite easy for them to offer because they already have form-fill functionality that achieves the same type of lookup. However, the idea is strangely ignored.
[8] E. S. Mills, p.180 under “Citing the Source of a Source”.
[9] E. S. Mills, sec.2.21 “Citing the Source of a Source”.