Saturday, 20 September 2014

The Lineage Trap

Back in May of 2014, I used the term “lineage trap” to refer to the distortion of historical research and its representation resulting from an undue focus on biological lineage. A good case in point is the GEDCOM data format which has steered the evolution of genealogy for far too long.

The reason for using this term is that the field of genealogy is more about history than mere lineage. It would be wrong to say that it’s specifically about family history since there are some side-lined activities, such as One-Name Studies and One-Place Studies, which are not fully embraced by genealogy, or by the software products that it uses. I have employed the term micro-history — effectively a fine-grained local history — as it more accurately encompasses these activities, as well as being inclusive of histories relating to places, houses, military groups, organisations, clubs, etc. I know of genealogists who have ventured into one-or-more of these activities. Although the research, analysis, and write-up are basically the same as for family history, our existing software products fall woefully short of accommodating them.

This opinion has confused some people, and scared the bejesus out of others. Why do genealogists need support for such a wide scope? Why should they aim for such a nebulous target and risk over-complicating both our products and our data standards?

Well, these are valid questions and deserving of a considered answer.

My contention is that most genealogists want family history rather than mere representations of lineage, and that micro-history can be accommodated with only a slight generalisation from family history. Furthermore, that generalisation gives a cleaner picture of history generally, and provides the flexibility necessary for any unexpected or fringe avenues of research.

Let’s start by just looking at the case of history relating to people. Family historians cannot guarantee that their person references will all be representable on a single family tree. Obviously there may be adoptions, fostering, step-families, and half-siblings, but there may also be mentions of unrelated people who played some profound part in their history. Should they be relegated to a simple note, rather than being represented by a full Person entity, simply because they’re unrelated? Most people would argue ‘no’, and that implicitly means that any lineage details must therefore be disjoint, i.e. the collection of people in the data may belong to multiple, independent trees.

By turning this statement on its head, though, we see it in a fundamentally different way: the lineage is a property or attribute applied selectively within some set of people. In other words, it is the set of people, and their relationships to historical events, which is important, irrespective of whether there’s any shared lineage. You may be thinking that this is a trivially different viewpoint but the repercussions will hopefully make you think again.

Let’s just examine how we might represent unrelated people and the shared events in their lives.

This is basically the STEMMA® approach. Each Person entity can be connected to multiple, shared Event entities. The sources are usually associated with the Event, as befits their STEMMA definition as “a representation of a date, or range of dates, for which source information exists”.

The Person and Event entities are both what might be termed “conclusion entities” because they’re made up of the most accurate and verifiable properties determined through research.

The associated information for the Person properties is attached to the Event-to-Person linkage, which in turn will be specific to one of the sources supporting that Event. Hence, there may be multiple sets of Properties: one from each of the supporting sources. These are similar to, but not exactly the same as, the concept known as ‘persona’ (see Genealogical Persona Non Grata).

The information for the Event properties (e.g. start date, end date, place) is attached to the Event-to-source linkage, and again there may be multiple sets if there are multiple sources for the Event. Such a set of event properties is sometimes informally called an “eventa” in acknowledgement of the persona concept.
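As a rough illustration of the linkage just described, the following Python sketch models a shared Event whose links to Persons and to sources each carry their own property sets. The class names, field names, keys, and sample values are all hypothetical and are not taken from the STEMMA specification.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    key: str
    # Conclusion-level properties settled on after research.
    conclusions: dict = field(default_factory=dict)


@dataclass
class Event:
    key: str
    # Event properties (e.g. date, place): one set per supporting source.
    properties_by_source: dict = field(default_factory=dict)
    # Person properties, keyed by (person key, source key).
    person_properties: dict = field(default_factory=dict)

    def link_source(self, source_key, **props):
        self.properties_by_source[source_key] = props

    def link_person(self, person, source_key, **props):
        self.person_properties[(person.key, source_key)] = props

    def persons(self):
        # Distinct person keys attached to this event, in first-seen order.
        return list(dict.fromkeys(p for p, _ in self.person_properties))


# Two unrelated people sharing one event that is described by two sources.
soldier = Person("pSoldier")
doctor = Person("pDoctor")
accident = Event("eAccident")
accident.link_source("sRegister", place="Town A")
accident.link_source("sNewspaper", place="near Town A")
accident.link_person(soldier, "sRegister", rank="Lance Corporal")
accident.link_person(soldier, "sNewspaper", rank="Corporal")
accident.link_person(doctor, "sNewspaper", role="attending physician")
```

Note that nothing in this sketch mentions lineage: the two people are connected only through the Event, and the conflicting ranks from the two sources sit side by side until a conclusion is drawn.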

This is all very symmetrical and nicely takes care of timelines, and the separation of information and conclusion. However, STEMMA has several distinct subject entities[1]: Person, Group, and Place. It treats these uniformly so the above diagrams could equally be changed to put Place or Group entities in place of the Person ones. The only difference would be an alternative set of Property names applicable to each of the entity types. This symmetry allows software to implement the subject entity relationships to both Events and sources in the same way, and similarly with tricky issues such as multiple names and name matching (see The Game of the Name). This is ideal fodder for designs based on “classes” and Object Orientated Programming (OOP).

So, here’s one important aspect of micro-history support. If you were studying the history of an organisation — say the masons and their many lodges — then any software that handled Persons in a generic, non-lineage fashion could easily be extended to do the same for those entities. Indeed, it has even been suggested to me that groups could be modelled using a pseudo-person concept, but why cheat like that? Why not do it properly?

In reality, whether you’re researching family history, or the history of the people in a given place (One-Place Studies), or the history of people with a common surname (One-Name Studies), or military history, etc., you will need a mixture of these subject entities; any single source may contain references to persons, and groups, and places. For instance, a report of a soldier travelling with his regiment, by ship, from one posting to another would reference all three.

But what about lineage? Well, lineage is just one form of hierarchical arrangement that happens to be applicable to Person entities. A hierarchy of biological lineage[2] is characterised by each Person having a fixed relationship to just one father and one mother. Places also have a hierarchical arrangement, such as a house, on a street, in a city, in a state, in a country. Place hierarchies are characterised by being time-dependent, and a Place may be split or merged (see Related Entities). Groups have similar hierarchical considerations to those of Places (see Revisiting the Family Group). These different types of hierarchical linkage can be applied to their respective entities without, in any way, changing the diagrams above; they are independent, and optional, types of linkage that do not impact the entity relationships to Events or to sources. Even STEMMA Events have hierarchical arrangements (see Eventful Genealogy).

In OOP terms, the specific classes representing these subject entities implement their own hierarchy semantics, but they share Event/source relationships and name handling from a generic subject-entity base class. It should be noted that these structural differences are not one-to-one with an on-screen representation. For instance, a family tree is just one representation of lineage; a pedigree chart being another. Similarly, there may be multiple ways of depicting a Place or Group hierarchy. The important point, here, being that any product that starts with a family tree as its core concept is artificially limiting its scope and distorting the historical picture. Whether you want to view a specific hierarchy type, or a timeline for any or all entity types, or a geographical representation of the entities, or some mixture of these, is a product visualisation feature rather than a core structural concept.
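To make the OOP point a little more concrete, here is a minimal Python sketch in which a generic subject-entity base class supplies the shared name handling and Event links, while Person and Place each add their own hierarchy semantics. Every name in it is an assumption for illustration only, and the name matching is deliberately naive.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Subject:
    """Generic subject entity: shared name handling and Event links."""
    key: str
    names: list = field(default_factory=list)
    event_keys: list = field(default_factory=list)

    def matches_name(self, text):
        # Naive case-insensitive match; real name matching is far subtler.
        return any(text.lower() == n.lower() for n in self.names)


@dataclass
class Person(Subject):
    # Biological lineage: at most one fixed father and one fixed mother.
    father: Optional["Person"] = None
    mother: Optional["Person"] = None

    def ancestors(self):
        found = []
        for parent in (self.father, self.mother):
            if parent is not None:
                found.append(parent)
                found.extend(parent.ancestors())
        return found


@dataclass
class Place(Subject):
    # Time-dependent hierarchy: (from_year, to_year, parent) tuples.
    parents: list = field(default_factory=list)

    def parent_in(self, year):
        for start, end, parent in self.parents:
            if start <= year <= end:
                return parent
        return None


grandmother = Person("pGran", names=["Ann Example"])
mother = Person("pMum", names=["Beth Example"], mother=grandmother)
child = Person("pChild", names=["Cal Example"], mother=mother)

county_a = Place("wCountyA", names=["County A"])
county_b = Place("wCountyB", names=["County B"])
village = Place("wVillage", names=["Village"],
                parents=[(1800, 1899, county_a), (1900, 1999, county_b)])
```

The hierarchy methods live only in the subclasses, so a timeline or a map view could be built purely from the base class without knowing anything about lineage.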

Another component of STEMMA that is essential for any type of historical representation is narrative. If you want to document the fruits of some research then you want narrative, not a family tree. If you want to explain how you arrived at your conclusions then you want narrative, not some stepwise recipe expressed in “computer speak”. If you want to share your family history with relatives then you want real narrative, not some bunch of fields in a database table or some computer-generated “narrative”.

This is one of the features that I’ve found hardest to explain to people, and yet it’s probably one of the simplest. The problem is that non-software people are familiar with word-processors and so as soon as you mention narrative then they think of separate documents, such as Word or PDF. Separate documents, like these, would not be integrated with your data, and references to people, places, groups, events, and dates, would not be connected-up to the relevant parts elsewhere in your data. This simply means that you need a new document format that provides the necessary semantic mark-up to achieve this, as well as more usual mark-up for presentation. STEMMA goes further by including mark-up to represent transcription anomalies too. So do the software people get it? Yes, they do, but since most products will want to squeeze your data into some relational database then there’s no easy way to include such marked-up text in an indexed fashion; the result being that you’re limited to little snippets of plain text instead.
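A small Python sketch may help to show what semantic mark-up buys you. The element and attribute names below (Narrative, PersonRef, PlaceRef, DateRef, Key, Value) are assumptions made for the illustration rather than a statement of the actual STEMMA syntax; the point is only that one marked-up text yields both readable narrative and machine-usable links.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: element and attribute names are assumptions.
NARRATIVE = (
    "<Narrative>"
    "<PersonRef Key='pMary'>Mary</PersonRef> died at the "
    "<PlaceRef Key='wHotel'>Park Hotel</PlaceRef> in "
    "<DateRef Value='1984-06-13'>June 1984</DateRef>."
    "</Narrative>"
)

root = ET.fromstring(NARRATIVE)

# Plain reading text: mark-up stripped, text kept in document order.
plain_text = "".join(root.itertext())

# Semantic links: which entity or value each marked-up phrase refers to.
links = [(el.tag, el.get("Key") or el.get("Value"), el.text) for el in root]
```

A separate Word or PDF document could give you the first of these results but not the second.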

Isn’t this the same as a wiki-type approach to stories? Absolutely not! Those approaches are both a product and a data model, although I haven’t seen one where these can be factored apart. Even if it supported multiple marked-up documents, and events, and the historical subject entities (person/place/group), and their respective hierarchies, and sources, then it would still need a separately documented data model that other applications could read. But hang on, that’s what I’ve already done!

[1] A STEMMA ‘subject’ is something that we’re likely to find references to within historical sources.
[2] More-personal relationships, or non-biological relationships, are modelled via the Relationship Property.

Tuesday, 9 September 2014

Cite Seeing

It’s about time that I presented my STEMMA® approach to sources and citations[1]. Although the initial design approach wasn’t unusual, it has since evolved by trying to match all the real, hand-generated citations in my own narrative reports, and without having to restrict things to some “standard” list of source-types, or some formatted samples published on paper or online.

The concept of a citation depends somewhat on the context. Some view it as the abstract act of citing a source of information or some scholarly work — ignoring contexts such as military awards and traffic citations. In STEMMA, a Citation entity (capitalisation deliberate for clarity) is a generalised representation of information location, sources, and repositories. For most genealogists, though, the term has come to mean the formatted reference notes appearing in a footnote or endnote; even more so than the source-list and source-label variants.

A citation has a number of purposes: intellectual honesty (not claiming prior work as your own), to allow your sources to be independently assessed by the reader, and to allow the strength of your information sources to be assessed. In order that a citation can be understood by other readers, there are conventions for the ordering, formatting, and separation of the elements that depend upon the type of source being cited. Probably the best known resource for genealogists crafting citations is Evidence Explained[2] (hereinafter EE).

Despite any overlap, we should not confuse the concept of a footnote/endnote with that of a reference-note citation. A footnote/endnote is a general mechanism that may also be used for annotation (e.g. clarifying a word or phrase) or discursive notes (commentary which digresses from the main subject). There are cases for all of these in a narrative report and so STEMMA had to accommodate each of them.

It’s reasonable to ask why computer storage needs to somehow encode a citation. Why not simply retain the carefully-crafted formatted version? Well, that version effectively sets in concrete things such as the layout of the terms (someone may want a different ordering, say for ISO 690 compliance), the punctuation characters (e.g. see International Variations in Quotation Marks), the general style (CMOS, EE, others), and the locale. The last one of these covers a number of subtle aspects that should differ for users in different locales. The formatting of a date might be an obvious example, but whether you put punctuation characters inside or outside quotation marks is a less-spoken-of one. Since computer software cannot reliably decompose a formatted citation, it also cannot tell which piece is a title, which is an author, which is a date of publication, etc. This is semantic information that would need to be attached to the relevant parts if anything other than a human was to make use of it.
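The locale point can be sketched in a few lines of Python. The citation values, the locale codes used as keys, and the two much-simplified conventions are assumptions for illustration; real style guides are considerably more involved.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Citation-elements kept as structured data rather than as a finished string.
elements = {"Author": "A. N. Author", "Title": "An Example Title",
            "Date": date(2009, 3, 1)}


def format_citation(el, locale):
    d = el["Date"]
    month = MONTHS[d.month - 1]
    if locale == "en-US":
        # Simplified US convention: comma inside the closing quotation mark.
        return f"{el['Author']}, \u201c{el['Title']},\u201d {month} {d.day}, {d.year}."
    if locale == "en-GB":
        # Simplified UK convention: comma outside the closing quotation mark.
        return f"{el['Author']}, \u2018{el['Title']}\u2019, {d.day} {month} {d.year}."
    raise ValueError(f"no template for locale {locale!r}")


us_form = format_citation(elements, "en-US")
gb_form = format_citation(elements, "en-GB")
```

Going in this direction is easy; recovering the separate elements from either finished string is the part that software cannot do reliably.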

There are several design schemes that suggest breaking apart citations into a number of separate citation-elements (e.g. an author’s name), and relying on a separate citation-template system to regenerate a formatted edition appropriate to a given reader. The main differences between them might be summarised as follows:

  • Whether there’s a fixed, master list of source-types.
  • How the source-types are named or catalogued.
  • Whether the citation-element names are limited to convey the semantics.

STEMMA also uses citation-elements but with some important differences. Each source-type is identified by a Uniform Resource Identifier (URI). A URI generally looks like a URL but it may be defined freely if you own the root domain name. Digital Freedom explained how their visible semantics, decentralised creation, hierarchical derivatives, and versioning make them a cornerstone for extensible systems like STEMMA. The result is that you can define as many custom source-types as either your research or your locale require.

The citation-elements are defined as part of each source-type. That means their names and properties (e.g. their data-type, whether they’re optional, and whether they’re multi-valued) can be chosen independently for each source-type. Any semantic information can be attached to the individual citation-elements as necessary. For instance:

<Dataset Name='Example' xmlns:DC=''>

<Citation Key='cBook' Abstract='1'>
    <Title> Generic Citation for published books </Title>
    <URI> </URI>
        <Param Name='Author' SemType='DC:creator'/>
        <Param Name='Title' SemType='DC:title'/>
        <Param Name='Publisher' SemType='DC:publisher'/>
        <Param Name='Date' Type='Date' SemType='DC:date'/>
        <Param Name='Page' Optional='1'/>
</Citation>

</Dataset>

This STEMMA Citation entity can then be used to describe any number of simple book references. This example employs the ‘Dublin Core’ semantic tags, including their tentative refinements, but STEMMA can select other systems by using a different namespace (as indicated here by the “DC:” prefix). Such a custom entity is guaranteed not to clash with any others, and citations that use it are transportable. What is required in a receiving product is a citation-template that can format it appropriately, and — if you wanted to generate new instances of it — the verbiage associated with the source-type and its citation-element names for your locale.
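As a hedged sketch of how a receiving product might use such a definition, the Python below mirrors the cBook citation-elements and checks a set of supplied values against them. The ParamDef class, the check_params function, and the example.org URI are placeholders of my own for the illustration; none of them is part of STEMMA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamDef:
    name: str
    data_type: str = "Text"
    optional: bool = False
    sem_type: str = ""


# Placeholder URI for illustration; not a real source-type identifier.
BOOK_URI = "http://example.org/source-type/book/"

SOURCE_TYPES = {
    BOOK_URI: [
        ParamDef("Author", sem_type="DC:creator"),
        ParamDef("Title", sem_type="DC:title"),
        ParamDef("Publisher", sem_type="DC:publisher"),
        ParamDef("Date", data_type="Date", sem_type="DC:date"),
        ParamDef("Page", optional=True),
    ]
}


def check_params(uri, values):
    """Return a list of problems found in a set of citation-element values."""
    defs = {d.name: d for d in SOURCE_TYPES[uri]}
    problems = [f"missing {n}" for n, d in defs.items()
                if not d.optional and n not in values]
    problems += [f"unknown {n}" for n in values if n not in defs]
    return problems


complete = {"Author": "A. N. Author", "Title": "An Example Title",
            "Publisher": "Example Press", "Date": "2009"}
incomplete = {"Title": "An Example Title", "Volume": "2"}
```

Because the element definitions travel with the source-type rather than with the product, a custom source-type defined by one researcher can be checked in exactly the same way by another.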

This diagram illustrates how the main components of this scheme operate together. The source-type URI is used to fetch the definition of the source-type, either through a discovery service (on the Internet) or from a local repository. That definition will also include the verbiage appropriate to one-or-more locales.

User input for a source reference is solicited using that locale-specific verbiage and acknowledging the citation-element data-types and other properties in the process.

When generating a formatted citation — say for a report — the software product must interface to some citation-template tool which has a relevant template for that source-type. Developer note: STEMMA currently passes objects to a primitive tool, which then calls back on well-defined methods to obtain the specific details required by the template, e.g. a contact’s formal/informal name, a contact’s address, or a formatted place-hierarchy. This is more flexible than passing fixed items of text.
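The callback idea in that developer note might look something like the following Python sketch. The Contact class, its name method, and the tuple-based template are hypothetical stand-ins and do not reflect the real interface of my tool.

```python
class Contact:
    """Hypothetical object passed to the citation-template tool."""

    def __init__(self, formal, informal):
        self._formal = formal
        self._informal = informal

    def name(self, style):
        return self._formal if style == "formal" else self._informal


def render(template, objects):
    """Fill each (object, method, argument) slot by calling back on the object."""
    parts = []
    for item in template:
        if isinstance(item, str):
            parts.append(item)  # literal text in the template
        else:
            obj_name, method, arg = item
            parts.append(getattr(objects[obj_name], method)(arg))
    return "".join(parts)


objects = {"contact": Contact("Dr Jane Example", "Jane")}
formal_template = ["Interview with ", ("contact", "name", "formal"), "."]
informal_template = ["Chat with ", ("contact", "name", "informal"), "."]

formal_text = render(formal_template, objects)
informal_text = render(informal_template, objects)
```

The same object serves both templates because it is the template, not the calling product, that decides which form of the name it needs.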

A nice feature of this scheme is that there is a lot of freedom, and it’s not expecting some standards body to define the many hundreds of samples that are published in EE. It works equally well for different preferences and different locales since it is merely a mechanism, not a standard list. Software developers sometimes think too much in terms of a formulaic approach to citations (‘you plug these values into a template and out pops your formatted reference’) whereas real-life citations need much more freedom. Those same developers may also view EE as just a list of prescribed citation forms for all conceivable sources rather than a comprehensive work on analysing evidence and crafting whatever citations we find necessary. As Elizabeth Shown Mills says herself: citations are an art rather than a science.

I now want to describe some basic STEMMA mechanisms for attaching information to a body of text, and then illustrate how they would be used in combination to replicate my hand-crafted editions. I won’t suggest that my own citations are good examples for anyone to follow, but I do strive to make them functional and relevant. That means that they sometimes get quite complicated, involving separate layers, analytical notes, and occasionally more than one source reference in the same reference note.

Case 1 – Simple reference-note citation

The following shows a simple sentence that references a certificate for a ‘death overseas’ in the UK. The associated citation is generated in a footnote.

The certificate came through and confirmed the location of her death as Park Hotel, Ingenbohl, Canton [Kanton] Schwyz, Switzerland.¹⁶


¹⁶ England, death certificate for Mary Phyllis Ashbee, died 13 Jun 1984; citing location Switzerland; Death Abroad (1966 to 1994), General Register Office (GRO), Southport.

The certificate came through and confirmed the location of her death as Park Hotel, Ingenbohl, <Alt Value='Kanton'>Canton</Alt> Schwyz, Switzerland.<CitationRef Key='cDeathsOverseasUK'>
    <Param Name='Name'> Mary Phyllis Ashbee </Param>
    <Param Name='Date'> 1984-06-13 </Param>
    <Param Name='Country' Key='wSwitzerland'/>
</CitationRef>

The CitationRef could have specified an explicit Mode=’RefFootnote’ but that’s the default and so is unnecessary. Note that it is a layered citation indicating where the originals are held. This is achieved through the Citation entity (cDeathsOverseasUK) linking to another one (cDeathsAbroadGRO) using a ParentCitationLnk; thus creating a citation chain.
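The chaining can be pictured with a short Python sketch in which each Citation holds an optional link to its parent, and the layers are joined with semicolons when formatted. The class and its fields are assumptions for illustration; the text field simply stands in for whatever a citation-template would generate for that layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    key: str
    text: str  # stand-in for this layer's formatted output
    parent: Optional["Citation"] = None  # plays the role of ParentCitationLnk


def layered(citation):
    """Walk the chain from the cited item up to its repository."""
    layers = []
    while citation is not None:
        layers.append(citation.text)
        citation = citation.parent
    return "; ".join(layers) + "."


gro = Citation(
    "cDeathsAbroadGRO",
    "Death Abroad (1966 to 1994), General Register Office (GRO), Southport")
certificate = Citation(
    "cDeathsOverseasUK",
    "England, death certificate for Mary Phyllis Ashbee, died 13 Jun 1984; "
    "citing location Switzerland",
    parent=gro)

footnote_text = layered(certificate)
```

Since the parent entity is shared, every certificate cited from the same collection picks up the same final layer without it being retyped.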

The example also uses a second mechanism to provide annotation on the text; in this case, to provide the alternative German spelling for the cantons of Switzerland. Notice that this annotation is correctly placed in editorial brackets when the final form is non-interactive, such as on a printed page.

Case 2 – Discursive notes

This example uses a different mechanism to create a footnote that simply contains discursive notes. There is no source reference in this case.

This confirmed the death occurred at the British Military Hospital, Peshawar, and the cause of death as ‘Cerebral Haemorrhage, result of motor accident’. It also gave his rank as Lance Corporal in the 14th/20th [King’s] Hussars⁴, and his service number as 551091.


⁴ British cavalry regiment created through the merger of the 14th King's Hussars and the 20th Hussars in 1922. The honorific "King's" was added back into the title in 1936.

This confirmed the death occurred at the British Military Hospital, Peshawar, and the cause of death as ‘Cerebral Haemorrhage, result of motor accident’. It also gave his rank as Lance Corporal in the
<NoteRef Mode='Footnote'>14th/20th [King’s] Hussars
British cavalry regiment created through the merger of the 14th King's Hussars and the 20th Hussars in 1922. The honorific "King's" was added back into the title in 1936.
</NoteRef>, and his service number as 551091.

This NoteRef element creates a footnote and inserts a footnote indicator into the main text. There are other options, though, such as Mode=’Inline’ which would place the text in editorial brackets at that location.

In this example, the relevant text was placed inside the NoteRef, but it could equally have been placed in a Text element elsewhere, and the FromText element (new in V4.0) used to include it.
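The difference between the two modes can be sketched in Python as follows. The segment representation and the caret-style footnote indicator are my own simplifications for the illustration, not STEMMA behaviour.

```python
def render_notes(segments):
    """Render plain strings and (mode, note_text) tuples standing in for NoteRef."""
    body, footnotes = [], []
    for seg in segments:
        if isinstance(seg, str):
            body.append(seg)
        elif seg[0] == "Footnote":
            footnotes.append(seg[1])
            body.append(f"^{len(footnotes)}")  # footnote indicator
        elif seg[0] == "Inline":
            body.append(f" [{seg[1]}]")  # editorial brackets, in place
        else:
            raise ValueError(f"unknown mode {seg[0]!r}")
    return "".join(body), footnotes


note = "British cavalry regiment."
as_footnote = render_notes(
    ["the 14th/20th Hussars", ("Footnote", note), ", service number 551091."])
as_inline = render_notes(
    ["the 14th/20th Hussars", ("Inline", note), ", service number 551091."])
```

The note text is identical in both cases; only the Mode decides where it ends up.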

Case 3 – Analytical notes

This case attaches a simple analytical note to a citation in the form of another layer (i.e. separated by a semicolon). That extra layer is achieved by appending the note in a local footnote rather than in the Citation entity itself; thus clearly separating personal opinion from the details of the citation.

Near the end of the burial register³ were the entries for all three of the soldiers who died in that road accident:


³ Burial register held at Garrison Church, Risalpur, NWFP, Pakistan (1915–1947), photocopy; Asia, Pacific and Africa Collections (APAC), The British Library, 96 Euston Road, London; source of photocopy was a typed document so unclear whether the original was typed or whether it was a transcript itself.

Near the end of the burial register
<NoteRef Mode='Footnote'>
<CitationRef Key='cAPACBurialReg' Mode='RefInline'>
    <Param Name='Church' Key='wGarrisonChurch'/>
    <Param Name='From'> 1915 </Param>
    <Param Name='To'> 1947 </Param>
    <Param Name='Media'> photocopy </Param>
</CitationRef>; source of photocopy was a typed document so unclear whether the original was typed or whether it was a transcript itself.
</NoteRef> were the entries for all three of the soldiers who died in that road accident:

This may take a couple of glances to see what is happening. The outermost NoteRef is generating a footnote, but inside the footnote is a citation generated inline and followed by a layer representing the analytical note. In this instance, the cAPACBurialReg also points to a parent entity representing APAC in a chain.

Case 4 – Multiple sources

Cases of reference notes mentioning multiple sources may be relatively rare outside of professional circles but I do have the following instance:

A check in the GRO index of births and deaths only gave one real possibility: Elsie Evelyn Emms, born 16 Feb 1913 in Wooldridge, West Ham, Essex; died 2003 in East Surrey.³


³ Transcribed GRO Index for England and Wales (1837–1983), database, FreeBMD ( : accessed 5 Aug 2014), birth entry for Elsie E. Emms; citing West Ham, 1913, Mar [Q1], vol. 4A:642. "England & Wales deaths 1837-2007", database, FindMyPast ( : accessed 5 Aug 2014), entry for Elsie Evelyn Emms; citing East Surrey, 2003, Mar [Q1], district number 7551B, register number ESB5, entry number 184, date of reg. 0303.

A check in the GRO index of births and deaths only gave one real possibility: Elsie Evelyn Emms, born 16 Feb 1913 in Wooldridge, West Ham, Essex; died 2003 in East Surrey.
<NoteRef Mode='Footnote'>
<CitationRef Key='cFreeBMDBirth' Mode='RefInline'>
    <Param Name='Name'> Elsie E. Emms </Param>
    <Param Name='RegDistrict' Key='wWestHam'/>
    <Param Name='RegDate'> 1913-01:03 </Param>
    <Param Name='Accessed'> 2014-08-05 </Param>
</CitationRef>. <CitationRef Key='cFindMyPastDeath' Mode='RefInline'>
    <Param Name='Name'> Elsie Evelyn Emms </Param>
    <Param Name='RegDistrict' Key='wEastSurrey'/>
    <Param Name='RegDate'> 2003-01:03 </Param>
</CitationRef>
</NoteRef>

The reason that these two sources are included in the same reference note is that the conclusion was derived from a correlation of the two, and the details cannot be factored into two independent references.

This case also employs the NoteRef mechanism to generate a footnote containing two inline citations. Note that the dates of registration (e.g. 1913-Q1) are provided using the STEMMA date-value string format.


The terms citation and attribution are often confused and used interchangeably. In principle, a citation references an information source, such as a prior work, whereas attribution gives appropriate credit to individuals. In a journalistic context, though, the act of citing one’s source (e.g. an interview with someone) is called attribution. For the purposes of genealogical and historical research, I usually reserve the term attribution for when someone’s material has been directly included in a report or collection (e.g. an image), which then contrasts with referencing some external source of information consulted during my research. Even then, though, there are grey areas. The point of mentioning attribution here is that the same Citation entity can be used to model attribution, too, and without any confusion or loss of functionality. In other words, the underlying mechanism is sufficiently flexible and general-purpose that the two become syntactically equivalent.

** Post updated on 22 Nov 2015 to align with the changes in STEMMA V4.0 **

[1] This is a STEMMA-specific article, and so is not directly related to recent FHISO sources & citations discussions, or to Louis Kessler’s recent post, or to Randy Seaver’s recent post.
[2] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore, Maryland: Genealogical Pub. Co., 2009),