Saturday, 2 January 2016

Anatomy of a Source

What is a source? When are sources independent and when are they not? How do citations describe related sources? These questions may seem to have obvious answers, but a detailed analysis of the relationships is essential when working electronically with sources.

Traditional genealogical work may take these relationships for granted, but software considerations are too often relegated to a mere reference-note citation added at the end of some narrative, or an electronic bookmark added at some point in a tree. By ‘working electronically with sources’, I mean source-based genealogy using a computer, and so the organisation of entities related to both sources and citations is then a fundamental consideration. The use of computer software as a tool during research, rather than simply for the maintenance of some database afterwards, is still not the norm, but it will be — one day — and not before time.

Figure 1 - Anatomy of a Source.

Background Knowledge and Observations

Let’s begin with some basic knowledge to set the scene. I will frequently refer to terms and concepts from the works of Elizabeth Shown Mills, and also Thomas W. Jones, which I hope will be obvious enough that I don’t have to cite every single occurrence.

Expressed as succinctly as I can, information is semantic data (data with meaning), and a source is the someone, something, or somewhere from which the information was obtained. Citations are statements that identify such sources, and these are most commonly recognised as the sentences used in footnotes/endnotes. In truth, a footnote/endnote may contain multiple citation sentences, each referencing a distinct source, but we’ll ignore that here for simplicity. For the curious, an example may be found at Cite Seeing.

A citation has a number of purposes: intellectual honesty (not claiming prior work as your own), allowing your sources to be independently assessed by the reader, and allowing the strength of your information sources to be assessed. To this end, there are a number of core principles to good citations: identification, description, and evaluation. The identification of the source is what existing software tends to focus on most, and it amounts to naming it and citing its location. If we include the location of the specific information within the source then these details are conveniently summarised as the five Ws: who, what, when, where-is, and where-in.

One of the most important mechanisms in a citation sentence is layers: the segments separated by semicolons that are used to describe the provenance of the source information, or of the source itself, and to provide analytical notes. When only one copy of a source exists, or when it is very rare with only a few copies existing, then citing the repository is the correct thing to do — this contrasts with widely published materials such as books and newspapers. When a source identifies the source of its own information then it is usually termed the source-of-the-source, and the corresponding layer would indicate this with a preceding “citing” or “the author cites”. However, provenance may have been determined by independent means, or there may be a mixture as in this example:

“Literary Miscellanea: Sketch of a Railway ‘Navvy’ ”, book extract, Bath Chronicle and Weekly Gazette (23 June 1859): p.6; citing: [Samuel Smiles,] The Story of the Life of George Stephenson [, Railway Engineer (London: John Murray, 1859)]; abridged by the author from the original and larger work: The Life of George Stephenson (1857).

A point to note is that a layer isn’t just another source; it could also be a repository, it could be provenance details, or it could be analytical notes. This means that layers cannot be modelled by simply linking software entities that describe sources.
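As a rough illustration of why layers need their own kind rather than being plain source links, here is a minimal Python sketch. The `LayerKind`, `Layer`, and `split_layers` names are hypothetical (they come from no existing genealogy software), and splitting on semicolons is a naive heuristic that would fail for citations with semicolons inside a layer.

```python
from dataclasses import dataclass
from enum import Enum


class LayerKind(Enum):
    SOURCE = "source"                      # the source actually consulted
    SOURCE_OF_SOURCE = "source-of-source"  # what that source says it cites
    OTHER = "other"                        # repository, provenance, or notes


@dataclass
class Layer:
    kind: LayerKind
    text: str


def split_layers(citation: str) -> list[Layer]:
    """Naively split a citation sentence into layers at semicolons."""
    parts = [p.strip() for p in citation.rstrip(".").split(";") if p.strip()]
    layers = []
    for index, part in enumerate(parts):
        if index == 0:
            kind = LayerKind.SOURCE
        elif part.lower().startswith(("citing", "the author cites")):
            kind = LayerKind.SOURCE_OF_SOURCE
        else:
            # Cannot be resolved to a source entity without further analysis.
            kind = LayerKind.OTHER
        layers.append(Layer(kind, part))
    return layers


# The example citation from the text, with straight quotes for convenience.
EXAMPLE = (
    '"Literary Miscellanea: Sketch of a Railway \'Navvy\'", book extract, '
    "Bath Chronicle and Weekly Gazette (23 June 1859): p.6; "
    "citing: [Samuel Smiles,] The Story of the Life of George Stephenson "
    "[, Railway Engineer (London: John Murray, 1859)]; "
    "abridged by the author from the original and larger work: "
    "The Life of George Stephenson (1857)."
)

for layer in split_layers(EXAMPLE):
    print(layer.kind.value, "|", layer.text)
```

The third layer in the example is provenance determined independently, so it falls into the catch-all kind rather than pointing at another source record.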

In one of the papers submitted to FHISO[i], a set of desirable properties was presented for citations. One of these (section 2.10) is described as Canonical, and it suggested that citations should be one-to-one with sources. Expressed symbolically for two sources, S1 and S2, and their respective citations, C(S1) and C(S2), this would mean that S1=S2 and C(S1)=C(S2) would be equivalent statements. However, real-life citations can never be that precise.

A case where this immediately breaks down is when the same citation is expressed in different languages for users from different locales. In other words, even for two absolutely identical sources, with identical provenance and evaluation, there could be several different citations required for the same information. Going further, there may be a need to support different citation styles (e.g. CMOS, MLA), or different variants of a reference note for the first and subsequent usage. These facts mean that any organisation of software entities must support a one-to-many relationship between sources and citations.
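The one-to-many relationship can be sketched in Python as a source holding several citation strings, keyed by language, style, and mode. The `Variant` and `Source` classes are hypothetical, and the shortened "subsequent" text is an invented illustration rather than a properly styled reference note.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    language: str  # e.g. "en"
    style: str     # e.g. "CMOS" or "MLA"
    mode: str      # "first" or "subsequent" reference note


@dataclass
class Source:
    title: str
    # One source, many citations: each variant maps to its own text.
    citations: dict[Variant, str] = field(default_factory=dict)

    def add_citation(self, variant: Variant, text: str) -> None:
        self.citations[variant] = text

    def cite(self, language: str, style: str, mode: str) -> str:
        # Raises KeyError if no citation exists for that combination.
        return self.citations[Variant(language, style, mode)]


source = Source("Bath Chronicle and Weekly Gazette, 23 June 1859")
source.add_citation(
    Variant("en", "CMOS", "first"),
    "Bath Chronicle and Weekly Gazette (23 June 1859): p.6.",
)
source.add_citation(
    Variant("en", "CMOS", "subsequent"),
    "Bath Chronicle (23 June 1859): p.6.",  # illustrative short form only
)

print(source.cite("en", "CMOS", "subsequent"))
```

Keying on the full combination keeps the source itself unique while letting each locale or style carry its own wording.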

Real sources are not all equivalent, even when the associated information has a common origin. Sources may be categorised as original, derivative, or authored works. The latter is effectively a hybrid of original opinions and conclusions, but derivative sources of information. An original is when the material exists in its first oral or recorded form, and a derivative source is one produced by copying an original or where the content has been manipulated, such as transcripts, abstracts, extracts, translations, and databases. Image copies are a special sub-category of derivatives that include digital scans and photographic facsimiles. Since they should capture the information exactly as it was, they are often treated the same as originals. However, they are still technically derivatives since the contrast may be lacking, or a film may be scratched, or the resolution too low; even the loss of colour in a monochrome image may have removed essential information. This doesn’t necessarily mean that a copy is always poorer than an original, since you could have a damaged original and a copy that was made before the damage occurred; each case has to be evaluated on its own terms.

One of the really difficult areas to deal with — and one I’ve been slowly building up to with this summary of basic knowledge — is where derivatives were formed by manipulation of the content. This is because there are so many ways that they can be formed, and the associated chain of provenance can be difficult to determine. This subject was recently covered by Sue Adams in a series of blog-posts culminating in The Original in Context. For my own example, I’m choosing the baptism of an Amelia Kirk at Nottingham St Mary in 1809. The Nottinghamshire Archive has the original parish registers, and also the bishops’ transcripts: those hand-copied derivatives that were sent to the diocesan centre each month. The archive also has image copies of the parish registers on microfiche. The Nottinghamshire Family History Society (NottsFHS) has a searchable database of transcribed details that may be purchased on CD. Findmypast has incorporated a copy of the NottsFHS transcriptions into their own online databases. Ideally, the information in these derivative forms should be identical to that in the original, but that would be a rare event indeed.

Given that the information should (in principle) be the same, and ideally will not differ too much, how best would copies of alternative derivatives be organised so that they can be worked on (electronically) together? The derivation path may not be a straight line — there could be branches — and so you may have different copies that you want to compare and contrast. Treating them as wholly independent would be both wasteful and inaccurate. In other words, the required organisation must acknowledge the common origin of the information while still allowing the sources to have their own citations, their own evaluations, and their own resources (digital images, paper-based images, textual transcriptions, etc.).
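A branching derivation path can be sketched as a simple chain of links back to an original, using the Amelia Kirk example. The `SourceCopy` class is hypothetical, and the link from the NottsFHS database directly to the parish register is an assumption made for illustration; its actual exemplar is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceCopy:
    name: str
    derived_from: Optional["SourceCopy"] = None

    def origin(self) -> "SourceCopy":
        """Follow the derivation links back to the original."""
        node = self
        while node.derived_from is not None:
            node = node.derived_from
        return node


def share_origin(a: SourceCopy, b: SourceCopy) -> bool:
    """True when two copies ultimately derive from the same original."""
    return a.origin() is b.origin()


register = SourceCopy("Parish register (original)")
bishops = SourceCopy("Bishops' transcript", derived_from=register)
fiche = SourceCopy("Microfiche image copy", derived_from=register)
# Assumption: the transcription was made from the register itself.
nottsfhs = SourceCopy("NottsFHS database", derived_from=register)
findmypast = SourceCopy("Findmypast database", derived_from=nottsfhs)

print(share_origin(findmypast, bishops))
print(findmypast.origin().name)
```

Each copy keeps its own identity, so it can still carry its own citation, evaluation, and resources, while the shared origin lets the copies be compared side by side.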

There are two main ways that source references can be related to similar source references: by containment and by derivation. We’ve just looked at the derivation case where the information has a common origin but has been copied or manipulated along the way; containment is the case where you’re looking at a part of a larger source unit. The most common example of this may be when referring to specific pages or chapters in a book, but it can apply to many sources, including separate households or schedules on a given census page. One requirement, here, is that it must be possible for the software to know that two source references are to parts of the same unit (or item, in archival terms), or that one of them is, itself, a reference to that larger unit. For example, that two source references are to pages in the same book, or that one is to the book as a whole.

When working with information from different parts of a source, each distinct where-in reference will have an associated context which must, at least, specify the where and when. For instance, if citing pages from a biography then they may mention the subject when he was in a particular city, and during a particular time frame. In a census page, information for two different households would have been taken at the same time but they would have distinct addresses. This ability to dissect a source, and to characterise references according to their context, is essential if assimilating information for later analysis.
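Containment and context can be sketched together: each reference points at its containing unit and carries its own where and when. The `SourceRef` class is hypothetical, and the census year and addresses below are invented placeholders, not real records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRef:
    label: str
    container: Optional["SourceRef"] = None  # the larger unit, if any
    where: Optional[str] = None              # place context of the information
    when: Optional[str] = None               # date context of the information

    def unit(self) -> "SourceRef":
        """Walk up the containment links to the outermost unit."""
        node = self
        while node.container is not None:
            node = node.container
        return node


def same_unit(a: SourceRef, b: SourceRef) -> bool:
    """True when both references are parts of (or are) the same unit."""
    return a.unit() is b.unit()


# Placeholder data: two households on one census page.
page = SourceRef("Census page (placeholder)", when="1851")
household_1 = SourceRef("Schedule 1", container=page,
                        where="1 Example Street", when="1851")
household_2 = SourceRef("Schedule 2", container=page,
                        where="2 Example Street", when="1851")

print(same_unit(household_1, household_2))
print(same_unit(household_1, page))
```

Because a reference to the whole unit is its own outermost container, the same check covers both cases described above: two parts of one item, and a part compared with the item itself.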


I’ll first look at STEMMA’s source-citation relationships. As of V4.0, the Source entity connects to multiple Citation entities and/or multiple Resource entities (e.g. media files). This structure has gradually evolved through trying to model my existing source-based research.

Figure 2 - STEMMA Source entity.

Looking at each of the functional issues mentioned above:

Support for derivatives. The Source entity brings together source information that has a common origin so that it can be compared and contrasted, and generally assimilated in one place. In other words, it is not representing one unique source, but sources with a common information origin. The context section embraces citations for all of those corresponding sources. Scheduled for V4.1: the <Quality> setting must be in respective <Frame> elements.

Support for containment. The SourceLet sections allow the dissection of a source into parts with a related context, and the associated citations would refer to specific where-in parts of the sources already identified outside of the SourceLets. Each SourceLet can also provide where and when contextual details for the associated information. The option to have two tiers of SourceLets (one for related derivatives and one for those specific where-in parts) was not taken, for the sake of simplicity. Scheduled for V4.1: a ‘WhereIn’ attribute on selected optional citation-elements so that a single Citation entity (with a unique URI) can model references to both the source as a whole and specific parts of it.

Support for citation language, modes, and styles. The Citation entity supports sets of preformatted citation text strings in alternative languages and citation modes (e.g. first reference note and subsequent ones). There is currently no support for alternative citation styles such as CMOS or MLA. These preformatted strings are all optional but the same Citation entity also supports a mandatory set of citation-elements, such as author, title, and publisher in the case of a book. These may be tagged with semantic types from alternative taxonomies, such as Dublin Core, and inclusively so if desired.

Support for analysis. Terms such as quality, reliability, and credibility may be used to describe a source or some information obtained from it. Analysis of the source information as a whole, including the derivatives with the same origin, is all done within a single Source entity. Although comments and observations on those individual derivatives can still be made, they are not divorced from the context of that shared provenance.

Support for layers. The Citation entity models the layers in a citation using its parent hierarchy (see below). This is possible because its extreme flexibility allows it to model any of the following: a simple citation, a repository, provenance details, and even attribution. The fact that the layers are supported by the Citation entity, and not by the Source entity, is dictated by the observation that layers are not just other sources.
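The idea of a parent hierarchy can be sketched in Python as follows. This is not STEMMA syntax: the `Citation` class, its `parent_role` field, and the `render` function are hypothetical, and the direction of the links (each citation pointing at the parent that supplies its next layer) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    text: str
    parent: Optional["Citation"] = None
    # How the parent relates to this citation, e.g. "citing".
    parent_role: Optional[str] = None


def render(citation: Citation) -> str:
    """Build a layered citation sentence by walking the parent chain."""
    parts = [citation.text]
    node = citation
    while node.parent is not None:
        prefix = f"{node.parent_role}: " if node.parent_role else ""
        parts.append(prefix + node.parent.text)
        node = node.parent
    return "; ".join(parts) + "."


original = Citation("The Life of George Stephenson (1857)")
abridged = Citation(
    "[Samuel Smiles,] The Story of the Life of George Stephenson "
    "[, Railway Engineer (London: John Murray, 1859)]",
    parent=original,
    parent_role="abridged by the author from the original and larger work",
)
article = Citation(
    "Bath Chronicle and Weekly Gazette (23 June 1859): p.6",
    parent=abridged,
    parent_role="citing",
)

print(render(article))
```

Nothing in the chain requires a parent to be a source: the same structure could hold a repository or a provenance note as the next layer, which is the property the paragraph above depends on.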

As of V4.0, the citation layers may be characterised according to the terms in the following table. Scheduled for V4.1 is an additional Reworked category.

A brief summary or a précis of --
Information cited by the source. Source-of-the-Source.
Analytical comments.
Database extract (usually cited in first layer)
Database extract with images
Extracted portion from --
Scan, photocopy, photograph, etc.
Media conversion from --
Other provenance information, differing from ‘Citing’.
Location of source.
Transcribed details from --
Translated details from --

In GEDCOM, there are Repository, Source Repository Citation, Source Record, and Source Citation records. These are designed to implement a relatively straightforward, but limited, citation→source→repository model with a scope for supporting bibliographic citations — not for source analysis or source-based genealogy.

GEDCOM-X has a wider focus than GEDCOM, although the current draft specification is known to be incomplete in this area at the time of writing.

Figure 3 - GEDCOM-X SourceDescription entity.

Although the SourceDescription relates to a unique source, there are sets of linkages to other sources related by derivation and by containment. It’s probably too early to see how this would work in practice, but the fact that ‘analysis documents’ are tied to individual SourceDescriptions would suggest that some collective analysis of derivatives would be difficult to organise. Also, the treatment of parts of a source (i.e. containment) by distinct SourceDescriptions would seem to compound the issues of collective analysis.

There is no specific support for layers; they cannot be handled by those derivation links since, as I’ve already mentioned, layers are not just other sources. Each SourceDescription may have multiple citations supporting different languages but there is no explicit consideration of alternative citation modes (e.g. first/subsequent), or styles (e.g. MLA). The SourceCitation entity is not hierarchical, and there are currently no citation-elements in either SourceCitation or SourceDescription.

The SourceReference appears to be provided solely to allow the attribution information of a SourceDescription to be overridden.

Relevance to the Reader

While software developers may understand what I’ve written here, I know there will be a significant number of other people thinking ‘I don’t get it. My software already handles sources’. There’s probably a good reason for this and I want to mention it in rounding-off this article.

A scenario that those same people might relate to more closely is adding a marriage date to their tree. They might have found a record with the date and place recorded, so they add it to their tree and then include a citation for the record (or an electronic bookmark if found in the online databases of the tree’s host). Where does my “organisation of entities related to both sources and citations” fit into that? Well, in that scenario, it has little relevance. As my second paragraph suggested, most software only thinks about sources and citations as they appear in this limited type of scenario. But if we’re doing source-based genealogy then that organisation is a fundamental requirement.

Traditional genealogists, and good researchers everywhere, would look at more than the one date and place mentioned in such a source; they would look at other information, and the associated context of that information. If anything caught their eye as potentially significant, or requiring further research, then they would note it. That may happen in their heads (yikes!), or be written with pencil & paper, or captured on their computer using a text editor (e.g. Notepad) or some rich-media editor (e.g. Evernote). That note-taking is essentially what I mean by “assimilation” of the source information, and my approach to source-based genealogy is just including the note-taking and the initial analysis into my main genealogy program. There’s no onus on me to draw conclusions, or to attach any of it to a tree; it is a working area where sources can be dissected, and information partially digested so that I can find it and use it later.

I personally find this very natural since my career has hitherto involved developing and using cutting-edge software, but I also appreciate that the majority would not feel as comfortable relying on software to this extent, especially when most of it appears to be form-fill data entry of conclusions, and any basic methodology or representation of real-life scenarios is denied.

[i] Luther Tychonievich, "Desirable Citation Properties", FHISO Call For Papers, CFPS 112 ( : accessed 15 Dec 2015); this paper was not listed on the main 'papers received' page ( but was referenced from other papers.