Saturday, 17 January 2015

Hierarchical Sources

Some interrelated topics to be discussed in this article: What is a hierarchical source? How does it relate to a hierarchical arrangement and to provenance? Is one hierarchy enough? Does it affect our citations? Does it affect digital organisation?

You may be thinking that these are unrelated topics, but let’s begin with a question commonly posed in genealogical forums and mailing lists: ‘how do I organise my media files?’ This usually translates into ‘how do I name my files?’ or ‘how do I arrange the corresponding folder hierarchy?’, and an example may be found at How should I Organise My Digital Documents?

Most people have at least tried to organise their digital artefacts (i.e. document files and media files) by surname, and then realised how impractical that is. For instance, should a marriage certificate be organised by the groom’s or the bride’s surname? Whose surname do you use for a group photograph that includes several generations of relatives and in-laws? What do you do when you have inherited a photograph of ‘woman holding a baby’ but you haven’t yet formed a positive identification?

Back in May 2013, Sarah Ashley presented her solution at Organizing Your Genealogical Documents. This was to use a source-based scheme where the documents were each assigned sequential 4-digit identifiers while being scanned. Different computer folders were then used to store the different categories of material, such as vital events, newspaper articles, photographs, census pages, etc., and the individual files were named using the corresponding identifiers. In the following June, Louis Kessler presented an improved version of this at Source Based Document Organization where he suggested a hierarchical organisation, and also the use of the GEDCOM REFN tag (defined as: “A description or number used to identify an item for filing, storage, or other reference purposes”) for linking to the relevant documents.

These are good schemes, but neither one suggests how those flat or hierarchical identifiers should be allocated, nor whether (when material was copied from some external source) there should be any relationship to the external cataloguing of the original or of an online version. Also, in the case of physical rather than digital artefacts, how should someone deal with a collection donated by, or inherited from, another family member?

The answer to these issues can be found in archival science. Archivists have been doing this for years, and they have international standards and a well-established vocabulary. In particular, provenance is a core principle of archival science, and it has two fundamental concepts: respect des fonds — basically grouping records according to their creator, or fonds — and original order — basically maintaining the same record order as that of their creator. In effect, I’m suggesting that we should manage our physical and digital artefacts as a micro-archive.

In May 2013, Sue Adams produced an excellent description of this approach on her Family Folklore blog at Provenance of a Personal Collection – Archival Accession, Arrangement and Description. She explained that archival arrangement places all the items at positions in a hierarchy reflecting their provenance, usage, and physical structure. An archival description would then provide information that served to identify, manage, locate and explain the archival materials at each level.[1] This definition is important because the resulting catalogue should identify information about an item rather than information within an item, and so not express any analysis or conclusions — more on this later.

The General International Standard Archival Description, ISAD(G), defines a model for the levels in a hierarchical arrangement, and this includes fonds, sub-fonds, series, sub-series, files, and items[2]; items being the lowest level. A fonds (silent ‘d’ and ‘s’) is a term for a grouping of documents that have been naturally accumulated by an individual, family, or organisation as a result of their normal activities or work. This replaces much of the usage of the older term collection, which is now reserved for groupings that have been assembled rather than created. The difference is effectively whether the grouping relates to a common provenance rather than a common characteristic.

Following Sue’s lead, I’ll select an example using a reference to a page in the 1901 census of England[3], as held by The National Archives of the UK (TNA): piece 3191, folio 125, page 19. Their guide to citing their documents and catalogues presents the following general document-reference formats:[4]

dept-code  series / piece
dept-code  series / piece / item

The census folio and page number are actually internal identifiers for those items, and so are relevant to a citation for some piece of information but not to the cataloguing of the associated source. The result might be something like:

RG 13/3191, f.125, p.19

Note that their recommendations involve the folio abbreviations f./ff. rather than the fo./fos. ones that some readers may be familiar with.
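To make the split between cataloguing and citation concrete, here is a minimal Python sketch that pulls such a reference apart. The class name, field names, and the regular expression are my own illustrative assumptions, not anything defined by TNA, and the pattern only handles the single-folio, single-page form shown above.

```python
import re
from typing import NamedTuple, Optional


class TnaCensusRef(NamedTuple):
    """Hypothetical container for the parts of a TNA-style reference."""
    dept_code: str          # e.g. "RG"
    series: int             # e.g. 13
    piece: int              # e.g. 3191
    folio: Optional[int]    # internal identifier, relevant to a citation
    page: Optional[int]     # internal identifier, relevant to a citation

    def catalogue_ref(self) -> str:
        """Return only the part that identifies the catalogued source."""
        return f"{self.dept_code} {self.series}/{self.piece}"


# Assumed layout: "dept-code series/piece[, f.N][, p.N]"
_PATTERN = re.compile(
    r"^\s*(?P<dept>[A-Z]+)\s+(?P<series>\d+)\s*/\s*(?P<piece>\d+)"
    r"(?:\s*,\s*f\.\s*(?P<folio>\d+))?"
    r"(?:\s*,\s*p\.\s*(?P<page>\d+))?\s*$"
)


def parse_ref(text: str) -> TnaCensusRef:
    """Split a reference string into its separate elements."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised reference: {text!r}")
    folio = match.group("folio")
    page = match.group("page")
    return TnaCensusRef(
        match.group("dept"),
        int(match.group("series")),
        int(match.group("piece")),
        int(folio) if folio else None,
        int(page) if page else None,
    )


ref = parse_ref("RG 13/3191, f.125, p.19")
print(ref.catalogue_ref())   # the catalogue-level part
print(ref.folio, ref.page)   # the citation-level parts
```

Keeping the folio and page as separate optional fields mirrors the point made above: they locate information within the source, while the department code, series, and piece identify the source itself.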

Evidence Explained covers the use of citations for multi-levelled archival arrangements, and remarks that “Your citation should follow the practice of the archive whose material you are using”.[5] However, it also warns that this may lead to conflicting styles when dealing with international sources, such as whether elements should be sequenced large-to-small or small-to-large.[6]

It has been suggested, more than once, that digital images should contain elements of meta-data that detail their provenance, and that this would greatly help when people have downloaded otherwise-untraceable images from online sources. The ubiquitous copying and downloading of digital images makes it nigh on impossible to know where they first came from, and to whom they should be attributed. There is nothing technically difficult about embedding such meta-data. For instance, XMP meta-data can be embedded in several image and document formats without hindering the applications that read them. It uses namespaces to make it applicable to any number of distinct meta-data sets, and it even has an international standard: ISO 16684-1:2012. Unfortunately, XMP is still a registered trademark of Adobe Systems Inc., and this has probably limited its take-up. One of the safer (i.e. more portable) alternatives is something called sidecar files, where the meta-data is held separately from the associated data by using a second file with a related name.
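As a rough illustration of the sidecar idea, the following Python sketch stores provenance details in a second file whose name is derived from the image's name. The ".meta.json" suffix, the use of JSON, and the meta-data keys are all assumptions made for this example; there is no single agreed convention implied here.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(image: Path) -> Path:
    """Derive the sidecar name, e.g. '0042.jpg' -> '0042.jpg.meta.json'.

    The suffix is a hypothetical convention chosen for this sketch.
    """
    return image.with_name(image.name + ".meta.json")


def write_sidecar(image: Path, metadata: dict) -> Path:
    """Write the meta-data beside the image, leaving the image untouched."""
    target = sidecar_path(image)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target


def read_sidecar(image: Path) -> dict:
    """Load the meta-data previously stored for an image."""
    return json.loads(sidecar_path(image).read_text(encoding="utf-8"))


# Made-up provenance details, purely for demonstration.
metadata = {
    "obtained_from": "inherited family album",
    "scanned": "2015-01-16",
    "attribution": "unknown photographer",
}

with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "0042.jpg"
    image.write_bytes(b"")            # stand-in for real image data
    write_sidecar(image, metadata)
    loaded = read_sidecar(image)

print(loaded["obtained_from"])
```

The weakness of this approach is visible in the code: nothing ties the two files together except their names, so renaming or copying the image alone silently loses the provenance.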

There is a good case for using something like XMP in this image-copying scenario, but it becomes less useful for your micro-archive because (a) it will likely contain physical as well as digital artefacts, and (b) the meta-data will be applicable to different units in the hierarchy, and not simply to the lowest-level items. Where a specific arrangement of the artefacts has been created then it is more common to use a meta-data database. However, STEMMA’s approach is to use its own file format to create machine-readable archival descriptions. Its files are plain-text, and its Resource entities may be used to describe each of the levels of a hierarchical arrangement. By incorporating its inheritance mechanism, this allows the description, provenance, access control, and any amount of meta-data to be provided for each unit in a single text file that can be loaded and referenced by other STEMMA files.

STEMMA has two main entity types relevant to this discussion:

  • Resource — a representation of a digital or physical artefact, or a combination of these such as when you have a scan of an original letter in your possession, or a digital photograph of a set of medals.
  • Citation — despite its name, this is a generalised reference to sources or information held elsewhere. It includes the location of information within a source, as well as the location of a source itself, and even allows for the representation of attribution.

Both of these entities share a mechanism of parameterisation. This means that they can both define a number of named parameters, each having its own specific data-type. These can be used in a similar way to the REFN tag, mentioned above, but with significant advantages. That GEDCOM tag allows only for a single amorphous code; a code that cannot be decomposed or reverse-engineered. Having the elements of a hierarchical reference in separate parameters allows them to be used in a more powerful fashion. For instance, returning to Sue’s example, the initial levels of her arrangement might be described using the following entities:

<Resource Name='rRWC' Abstract='1'>
    <Title>Raymond Walter Coulson (1922-1997) collection</Title>
        <Param Name='Lev1'>CWC</Param>
        <Param Name='File'/>
        <Param Name='Folder1'>collections/${Lev1}</Param>

    Papers, photographs, correspondence, memorabilia and probate
    documents of Raymond Walter Coulson of 322 Aston Hall Road,
    Aston, Birmingham, who died intestate on 24 May 1997
</Resource>

<Resource Name='rRWC_Probate' Abstract='1'>
    <Title>Probate file</Title>
    <BaseResourceLnk Name='rRWC'/>
        <Param Name='Lev2' Type='Integer'>1</Param>
        <Param Name='Folder2'>${Folder1}/${Lev2}</Param>

    Compiled by [my dad], administrator for the estate of Raymond
    Walter Coulson, between May 1997 and January 1998
</Resource>

These effectively construct a hierarchy of Resource entities describing the various units in that arrangement. An individual file, such as a marriage certificate, could then be specified through its file name and the Resource entity representing that archival unit. Yes, the folder names could have been hard-coded, and the Resource entities crafted independently of each other, but using the inheritance mechanism introduced in Genealogical Inheritance makes it more flexible and maintainable. Each Resource inherits an accumulated set of parameters from the higher levels.
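The accumulation of parameters can be imitated in a few lines of Python. This is only a sketch of the general idea, not STEMMA's actual resolution rules: the Resource class here is a hypothetical stand-in, and I am assuming that parameters are resolved in declaration order, with each value able to refer to anything already defined at its own level or inherited from a higher one.

```python
from string import Template
from typing import Dict, Optional


class Resource:
    """Hypothetical stand-in for a Resource entity and its parameters."""

    def __init__(self, name: str, params: Dict[str, str],
                 base: Optional["Resource"] = None) -> None:
        self.name = name
        self.own_params = params   # raw values, may contain ${Name} markers
        self.base = base           # plays the role of BaseResourceLnk

    def resolved_params(self) -> Dict[str, str]:
        """Merge inherited parameters with our own, expanding ${Name}."""
        resolved = dict(self.base.resolved_params()) if self.base else {}
        for key, raw in self.own_params.items():
            # Each value may use parameters resolved so far.
            resolved[key] = Template(raw).substitute(resolved)
        return resolved


rwc = Resource("rRWC", {
    "Lev1": "CWC",
    "File": "",
    "Folder1": "collections/${Lev1}",
})
probate = Resource("rRWC_Probate", {
    "Lev2": "1",
    "Folder2": "${Folder1}/${Lev2}",
}, base=rwc)

print(probate.resolved_params()["Folder2"])
```

Changing the value of Lev1 in the top-level entity would then be reflected in every folder name beneath it, which is the maintainability benefit described above.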

The parameter mechanism is a general-purpose tool, and may be used to add specific items of meta-data that you want to separate out of the associated archival description. One of the developers independently writing software around the STEMMA specification recently presented me with a related question. He was transferring photographic slides to a digital organisation, and wanted to know how to deal with the dates written on each slide frame. Since parameters can be defined freely, I pointed out that a Resource parameter could be defined for this purpose with a specific data-type of ‘Date’. In a Citation entity, the parameters may be used to define citation elements; those discrete values that would later be formatted into a traditional reference-note citation.

Since both Resource and Citation entities share this parameterisation mechanism then it is also possible to pass parameters from one to another. Imagine, for instance, that the citation for the aforementioned census page had parameters for the piece, folio, and page. If you had a local image copy of it then it could be located using the same parameter values, either substituted into a file name or a folder hierarchy. They could even be used to interrogate a Web site in order to summon the census image on demand (see ‘rCensusImage’ example at Resource).[7]
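A small Python sketch may help to show how the same citation values could drive both lookups. The parameter names, the folder layout, and the example.org address are all invented for illustration; no real provider is known to offer this URL form.

```python
from pathlib import PurePosixPath
from string import Template
from urllib.parse import urlencode

# Hypothetical citation parameters for the census page discussed earlier.
citation_params = {
    "Series": "RG 13",
    "Piece": "3191",
    "Folio": "125",
    "Page": "19",
}

# Assumed local layout: one folder per piece, one image per folio and page.
LOCAL_PATTERN = Template("census/1901/${Piece}/f${Folio}_p${Page}.jpg")


def local_image_path(params: dict) -> PurePosixPath:
    """Substitute the citation values into a local file name."""
    return PurePosixPath(LOCAL_PATTERN.substitute(params))


def lookup_url(params: dict,
               base: str = "https://example.org/census/1901") -> str:
    """Build a query against an imaginary site using the same values."""
    query = urlencode({
        "piece": params["Piece"],
        "folio": params["Folio"],
        "page": params["Page"],
    })
    return f"{base}?{query}"


print(local_image_path(citation_params))
print(lookup_url(citation_params))
```

The point is that the values are entered once, in the citation, and every other use of them is derived rather than retyped.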

We’ve mentioned the hierarchy of a source inherent in its archival arrangement, but are there any other examples of a source hierarchy? Well, the chain of data provenance when we cite a source (that is, the relationship between records and the individuals or organisations that have created, maintained, reproduced, transcribed, indexed, or otherwise modified them) also constitutes a hierarchy. When we cite a derivative source, such as an online edition or some database, we usually cite the source of the source in a secondary fashion.[8] Provenance also applies to specific information as well as to a source or source data. A common example is when we’re citing an author who is citing other works; ones they have consulted but which we haven’t. We may feel that the provenance of the information is important to our case, but we cannot directly cite what we haven’t consulted. This scenario is covered in some detail by Evidence Explained[9], but consider the case where an author hasn’t cited their source, but we believe we have identified an earlier version of their claim or statement. This may be very important to our case, especially if there are subtle differences, but a simple comment in a reference note may be insufficient to encompass our justification and reasoning.

An important point here is that these forms of hierarchy are facets of the real world, and not some subjective notion that software might decide to support or ignore. This issue was recently discussed on the FHISO TSC-Public mailing list starting at Filing Sources. STEMMA’s Citation entity was endowed with two types of hierarchical linkage: ParentCitationLnk, in order to model provenance (see Cite Seeing), and BaseCitationLnk, in order to model the structure of groupings and the structure within a given source (see Genealogical Inheritance).
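The provenance form of linkage can be pictured as a simple chain that is followed from a derivative back towards the original. In the Python sketch below, the Citation class is a hypothetical stand-in, and I am assuming that the parent link points from a derivative to the source it was derived from; the titles are invented examples.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Citation:
    """Hypothetical stand-in for a Citation entity."""
    title: str
    parent: Optional["Citation"] = None   # plays the role of ParentCitationLnk


def provenance_chain(citation: Optional[Citation]) -> Iterator[Citation]:
    """Yield the citation, then its source, then the source of that source."""
    seen = set()
    while citation is not None:
        if id(citation) in seen:
            raise ValueError("provenance links form a loop")
        seen.add(id(citation))
        yield citation
        citation = citation.parent


original = Citation("1901 census of England, RG 13/3191")
image = Citation("digital image of the census page", parent=original)
index = Citation("database index entry", parent=image)

for step in provenance_chain(index):
    print(step.title)
```

Walking the chain in this way is what allows a formatted reference note to cite the consulted derivative first and then its underlying sources in a secondary fashion.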

So, both local materials and consulted materials held elsewhere, including any associated digital images of them, can be represented using some combination of Resource and Citation entities. The core genealogical data is where we would analyse those materials and form our conclusions, and that will necessarily require links or citations to those materials. However, materials should never be catalogued according to such conclusions since they may change. If you’re cataloguing a photograph of ‘woman holding a baby’, or a painting of ‘a cracked vase with daisies’, then it must be independent of opinions or conclusions. Even their archival description must only record what we know about the materials rather than something we’ve determined from their contents. Our core genealogical data will also need to reference these materials from multiple points, and in different ways — something that renders a simple name-based arrangement redundant.

This article is placing great emphasis on both our sources and the local artefacts in our own micro-archives, but why? Isn’t one arrangement as good as another? Why do we need to be concerned with provenance, or with the arrangement used by some archive? The answer to this would be obvious to an archivist, or to an historian, but less so to most genealogists. The problem is that the majority of genealogy — and especially where it involves online family trees — is people-centric. The pursuit very often boils down to that of searching for a person’s name, or the vital events of a named person, and since the results will mostly come from online data — data that is deliberately keyed on personal names — then it has some consequences:

  • The source of the information is an afterthought. Although some Web sites allow a researcher to tag data with links to their relevant online content, that is merely an electronic bookmark (in the form of a URL) and not a real citation.
  • Even when a researcher references a source, it is only in the context of a citation. The belief that the data is the answer, as opposed to the source contains a clue, means that any reasoning for the making of a considered argument is being short-circuited.

Contrast this with the way someone might approach historical research, where individual sources are assimilated and relevant items analysed and correlated with information from other sources. That style of research begins with a source rather than with a name. The origin, nature, and quality of the source are then very important factors during its analysis.

This same point was recently raised by Jan Murphy on the FHISO TSC-Public mailing list at Format for Raw Source Content, and I’ll leave you with her own words, which were designed to keep the software mindset focused on the real world:

I hate to keep arguing this point over and over again, but we are looking at documents and other source material.  We are not looking at people.  We are looking at sources, most of which (but not all) contain names. 

A lot of beginning researchers, including many of the people in the Genealogy Do-Over group, struggle to learn how to cite their sources, and why? Because if you work in a people-centric system the sources are always an afterthought.

[1] CBPS - Sub-Committee on Descriptive Standards, "ISAD(G): General International Standard Archival Description - Second edition", International Council on Archives (ICA) (accessed 16 Jan 2015); attached document CBPS_2000_Guidelines_ISAD(G)_Second-edition_EN.pdf; glossary, p.10, s.v. “archival description”.
[2] “Model of the levels of arrangement of a fonds”, ISAD(G), appendix A-1, p.36.
[3] Whereas these TNA census references apply to England & Wales, they do not apply to Scotland. Scotland has its own system, and this has caused issues for sites such as findmypast that try to provide a UK-wide search form. The Ancestry equivalent only solicits criteria such as piece/folio/page when specifically searching, say, the census of England, but findmypast currently solicits them in all UK cases, whether relevant or not. See Chris Paton’s views on this at FindmyPast - Scottish censuses.
[4] "Citing documents in The National Archives", The National Archives of the UK (TNA) (accessed 16 Jan 2015).
[5] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace, 2nd ed. (Baltimore, Maryland: Genealogical Pub. Co., 2009), pp.116–119.
[6] E. S. Mills, sec.3.3 “International Differences”.
[7] The idea of a reliable, non-internal URL for summoning the image of a particular census page is a nice idea that could help when sharing data with friends and relatives, or between researcher and clients, without the paranoia associated with T&Cs or copyright. Although the recipient would need a subscription to the site, the idea could be adopted by other providers to create a sort of genealogical "Open URL" variation. It would be quite easy for them to offer because they already have form-fill functionality that achieves the same type of lookup. However, the idea is strangely ignored.
[8] E. S. Mills, p.180 under “Citing the Source of a Source”.
[9] E. S. Mills, sec.2.21 “Citing the Source of a Source”.