Thursday, 19 February 2015

Source Mining

It’s time to look at how we work with our sources, and the impact that this has on citations. I expect many people to say that we all work differently, but do we? If we fall into a small number of distinct cases then a computer representation of our research, as opposed to merely our conclusions, is an achievable goal.


Whether we like it or not, and irrespective of how we conduct our research, there are different scopes within genealogy. Some researchers are content with establishing their lineage or pedigree; some would like to look at the history of their family; some have a much more general interest in history, including the backdrop to their family’s lives, and the micro-history of places, groups, and other subjects.

It’s difficult to break these into hard categories as it is really a greyscale dependent upon our personal goals and interests. However, it is much easier to categorise the fruits of our research. The consensus seems to be that this is either conclusions or evidence-and-conclusions, with citation of sources being the differentiating factor, but this is certainly an oversimplification, and maybe even a misuse of those same terms.

Family Trees

Let’s start by analysing the simplest scope: that of a plain family tree or pedigree. This would consist of an assemblage of so-called “facts”: the names, dates, and places corresponding to the family’s vital events, and their lineage-based relationships. Without any sources, these “facts” are merely unsubstantiated claims, but what would source citations add to them? As I recently commented on one of James Tanner’s posts (Why not a ranking and review system for online family tree databases), citing sources doesn’t necessarily make the data any more accurate. I have seen many online trees that cite census entries, or vital events, and yet are entirely wrong, often with clearly impossible implications. Conversely, the absence of source citations may mean that the tree was posted as “cousin bait” rather than being a complete genealogy. At best, we might deduce that the inclusion of citations means (a) that the data wasn’t simply copied from another tree, and (b) that some effort was made to include that source information. However, the ease with which online trees can add electronic citations, more accurately described as electronic bookmarks (see Citations for Online Trees), weakens that latter deduction. Also, those electronic citations are usually constrained to online data hosted by the same provider, and so would not be a general mechanism.

A deeper issue is that these family-tree citations, whether electronic or in traditional reference-note form, only work because the underlying data is a mere assemblage of “facts”. A simple list of sources might constitute a proof summary, but that assumes that the evidence from those sources is direct and non-conflicting for each claim. Dealing with the more complex cases is often referred to as Inferential Genealogy, but such cases (for example, my establishing the parentage of Sarah Hunt in the latter part of My Ancestor Changed Their Surname) cannot be represented by a plain citation, or even by a group of plain citations. If there isn’t a direct relationship between a “fact” and its source then you need a proof argument, and that may require a little more than just narrative. Although you would write such a proof argument using narrative, it may need to make correlated references to multiple subjects, such as people, and to multiple sources of information. If the online tree allowed you to upload this narrative as plain text then it would have to be associated with a specific person, or family, and the essential structure and relevance to other subjects would be lost as a result.

On the surface of it, this appears to be saying that it’s not enough to say where your information came from, and that you must also indicate how it relates to your claims. The issue is more subtle, though, for any computer representation — including online trees — since the structure and position of that proof argument is crucially important. Returning to the case of Sarah Hunt’s parentage, an associated proof argument would be as relevant to both of her parents as to herself, and it may include non-familial persons in the general case, so where do you attach it? Also, simply identifying a Frances Hunt as her mother because such-and-such wouldn’t be enough if there were multiple people with that name in the same tree.

Ideally, we need something in between our conclusions and the underlying sources; something that not only provides a link but a structured pathway explaining how and why. This is even an issue for trees that want to cite sources that do not quite agree on someone’s date of birth. Obviously a person should not have multiple birth events recorded, but it must be possible to trace any selected date, or date-range, back to some correlation of those source differences. As explained in Hierarchical Sources, when fellow researchers examine your data, it is the combination of your proof argument and its sources that they should be interested in, rather than just your subjective conclusions. They would want the option to form different or modified conclusions, and so that full story is a fundamental issue for any type of data sharing.
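As a rough illustration of tracing a selected date-range back to the sources that disagree, here is a minimal Python sketch. It is not STEMMA syntax or any existing tool: the names BirthClaim, BirthConclusion, and correlate are hypothetical, and the rule used (take the overlap of the claimed ranges, otherwise keep the full span and say so) is only one assumed way of correlating them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirthClaim:
    """What a single source supports (all names here are hypothetical)."""
    source: str    # citation label for the source
    earliest: int  # earliest birth year the source supports
    latest: int    # latest birth year the source supports


@dataclass(frozen=True)
class BirthConclusion:
    """A selected range that still carries the claims it came from."""
    earliest: int
    latest: int
    basis: str     # short note on how the range was derived
    claims: tuple  # every claim considered, kept for traceability


def correlate(claims):
    """Combine birth-year claims without discarding any of them."""
    claims = tuple(claims)
    if not claims:
        raise ValueError("at least one claim is required")
    lo = max(c.earliest for c in claims)
    hi = min(c.latest for c in claims)
    if lo <= hi:
        # Every source is compatible with this narrower range.
        return BirthConclusion(lo, hi, "overlap of all claims", claims)
    # No common overlap: keep the full span and record the conflict.
    lo = min(c.earliest for c in claims)
    hi = max(c.latest for c in claims)
    return BirthConclusion(lo, hi, "claims conflict; full span kept", claims)
```

The point of the sketch is only that the conclusion keeps its claims attached, so another researcher could apply a different rule to the same sources and reach a different range.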

Historical Research

Moving away from the simplistic scope of a family tree, and along that greyscale to the more historical pursuits, reveals something quite profound: a change of emphasis and a different approach. Since such researchers are no longer looking for discrete “facts” to add to their existing tree, they’re much more interested in anything and everything from a given source. This change of emphasis results in the source, rather than the tree, being the main focus, and so the citation of that source is formed much earlier, and is rarely an afterthought. Having selected a relevant source, such as a diary, will, military record, letters, or an old book, there’s typically an assimilation process of deconstruction and interpretation, which is what I refer to by the title of this article: source mining.

Suppose I were to describe this source-mining process as: locating the subject references[1] in the source; identifying their documented properties (e.g. person’s age, person’s occupation, place type); identifying their documented relationships (including person-to-person, person-to-place, place-to-place, etc.); and then incorporating that information into your main historical data, a process that would require correlation with other analysed sources, and resolution of conflicts or other differences. How many people would identify with that approach? Or, putting it another way, how many people’s approach would be substantially different?
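The final step, incorporating mined information into the main data, might be sketched in Python as follows. This is a sketch under stated assumptions rather than anyone's actual method: the function incorporate and the nested-dictionary layout are hypothetical, and it assumes the researcher has already decided which subject in the main data each reference corresponds to. Each value is kept alongside the source it came from, and differences are reported rather than overwritten.

```python
def incorporate(main_data, source_id, mined):
    """Add the documented properties from one source to the main data.

    main_data: {subject: {property: [(value, source_id), ...]}}
    mined:     {subject: {property: value}} as read from a single source
    Returns the (subject, property) pairs that now hold differing values,
    so that the differences can be resolved by the researcher.
    """
    conflicts = []
    for subject, props in mined.items():
        record = main_data.setdefault(subject, {})
        for name, value in props.items():
            entries = record.setdefault(name, [])
            # Keep every value together with its source; never overwrite.
            entries.append((value, source_id))
            if len({v for v, _ in entries}) > 1:
                conflicts.append((subject, name))
    return conflicts


# Usage with placeholder labels and values (not taken from any real source):
data = {}
incorporate(data, "source-1", {"person-A": {"occupation": "weaver"}})
differences = incorporate(
    data, "source-2", {"person-A": {"occupation": "labourer", "age": "34"}}
)
```

After the second call, differences lists the occupation of person-A as needing resolution, while both documented values remain available with their sources.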

My contention is that this general approach is more common during historical research. The converse is where you are asking a specific question, or making a specific claim, and might be described as a goal-directed approach. In the aforementioned case of family trees, such goals would include finding data for a vital event, identifying a marriage partner, or finding offspring of a couple; the tree effectively sets those goals. Rather than suggesting that a goal-directed approach doesn’t exist, I’m suggesting that it depends on the scope of the research, and that it becomes less common as the research scope broadens. I believe historical researchers may still have specific interests, such as a particular person, family, village, or event, but would be more reliant on the serendipity of each source than on answering specific questions.

If this contention is true then it has implications for the digital representation of our research, and for the support of any standard of research. With no animus intended, let me select GEDCOM-X as an example. This data model made an admirable attempt at supporting a research process, but this was primarily a goal-directed process. The page at GEDCOM X and the Genealogical Research Process describes the first phase as: “Question Asking: The research process begins with a focused research question”. Irrespective of whether the model can find a way of representing the source-mining approach, its initial design was based on a more restricted concept of research.

Am I suggesting that the Genealogical Proof Standard (GPS) is not applicable to source mining? Well, that is a really good question. It is true that discussions of the GPS nearly always present it in the context of goal-directed research, such as establishing the truth or falsity of some claim, but its core principles would still apply to the incorporation of the mined data into your other data. Assessment of information based on the nature and provenance of its source, resolution of conflicts, etc., would be just as relevant.


At the time of writing, I am working towards a representation of mined data that I hope will provide a crucially missing piece in the STEMMA data model. Back in STEMMA V3.0, I introduced the concept of a References element in which the details of the subject references were assembled into prototype subject entities describing persons, places, groups, etc., and their relationships to each other, using the information from a given source. Since this was dealing with the documented properties and relationships, and it allowed me to analyse them and to correlate them with other sources, it was a good starting point for source mining, except that it was in entirely the wrong place! It was part of the Event entity, but that embraced conclusions and so was too high on the structured pathway mentioned above. Source mining is largely a bottom-up approach, and the References element was designed to facilitate the extraction and summarisation of source information in a manner from which inferences and conclusions could be built. For instance, a documented name may not have been someone’s registered name, a documented age may have been estimated, a relationship of “cousin” could have meant a lot of different things, and to say that one place was “within” another wouldn’t necessarily identify either of them; some analysis and correlation is required. From this perspective, those prototype entities were fairly similar to Personae, except that I had generalised the concept to include other subject types, and I had endeavoured to keep their shared context. A strong criticism I have of the accepted persona concept is that it extracts names, and other details, from different sources and treats them all equally, and in isolation from their original context, including the background context of the source information, the nature of the source itself, and the documented relationships between subjects (not just persons) in the same source.

This work is still in its early stages, but it will involve moving those References elements into some new entity that bridges between my sources and my conclusions. Part of the problem is that information must be associated with its relevant context, which generally equates to a where, when, and by-whom (or as much of it as can be deduced), but any single source may have multiple contexts. An obvious case would be a diary, but other cases may involve information making reference to prior events. You then have the context of the body information and the context of the reported or recollected events within that body.

The following schematic diagram illustrates the current direction of this work, noting that it’s still subject to revision. The source material would describe local material (where you may have an original or an image copy) or remote material (such as a cited work or document), or both when they’re related. The new entity would identify the different contexts within the source and assemble prototype subject entities, together with their documented properties and relationships. I say documented because these would not be conclusions at this point. Hence, if the relationship between a particular person reference and a place reference was that of “present at” then I wouldn’t assume that it was their residence. Those subject references would be connected to appropriate points in any transcriptions of the material, thus providing a connection all the way from a conclusion entity (or associated datum), through the correlation between different sources, through the analysis of the different contexts within a given source, and finally to an actual textual reference.

Building from source fragments

A source sentence that I’d used as a simplistic example on a FHISO mailing list at Entity Relationships went as follows:

"In 1963, John Smith, of 10 Front St, brother of Simon Smith of Woodstown, married Ann Jones".

The obvious context is the year 1963. It has three person references (John Smith, Simon Smith, and Ann Jones) and two place references (10 Front St and Woodstown), none of which have been specifically identified at this stage. There are relationships indicated between the person references (brother-of, and married) but also relationships between the persons and places (expressed here as “of”).
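The breakdown of that sentence can be written out as data. The following Python sketch is purely illustrative and is not STEMMA syntax: the class names (SubjectRef, Relation, Context), the local keys such as "p1", and the use of a character offset to tie each reference back to the transcription are all assumptions made for the example.

```python
from dataclasses import dataclass, field

SENTENCE = ("In 1963, John Smith, of 10 Front St, brother of Simon Smith "
            "of Woodstown, married Ann Jones")


@dataclass(frozen=True)
class SubjectRef:
    """A subject as written in the source; not yet identified with anyone."""
    key: str    # local label, meaningful only within this context
    kind: str   # "person" or "place"
    text: str   # wording exactly as it appears in the transcription
    start: int  # character offset of that wording in the transcription


@dataclass(frozen=True)
class Relation:
    """A documented relationship between two references, not a conclusion."""
    subject: str
    label: str
    target: str


@dataclass
class Context:
    """One 'when' context within a source, with its references."""
    when: str
    transcription: str
    refs: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_ref(self, key, kind, text):
        # str.index raises ValueError if the wording is not in the text.
        start = self.transcription.index(text)
        self.refs[key] = SubjectRef(key, kind, text, start)

    def relate(self, subject, label, target):
        if subject not in self.refs or target not in self.refs:
            raise KeyError("relation uses an unknown subject reference")
        self.relations.append(Relation(subject, label, target))


def build_example():
    ctx = Context(when="1963", transcription=SENTENCE)
    ctx.add_ref("p1", "person", "John Smith")
    ctx.add_ref("p2", "person", "Simon Smith")
    ctx.add_ref("p3", "person", "Ann Jones")
    ctx.add_ref("l1", "place", "10 Front St")
    ctx.add_ref("l2", "place", "Woodstown")
    ctx.relate("p1", "of", "l1")          # documented only as "of"
    ctx.relate("p1", "brother-of", "p2")
    ctx.relate("p2", "of", "l2")
    ctx.relate("p1", "married", "p3")
    return ctx
```

Note that the “of” relationships are stored with the source's own wording; nothing here claims that either place was a residence, which would be a later conclusion.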

What this representation may not deal with is the concept of multi-tier personae, and the equivalent for non-person subject types. They would be relevant to the ‘correlation’ item in the diagram but I’m yet to be convinced that a formal entity representation is any better than a narrative explanation. 


I’m suggesting that research is primarily source-based, working upwards from the information we find in each source, and that the converse case, where we have a specific question, is less frequent, except for family trees where the acquisition of details for vital events constitutes such questions. Consequences of that goal-directed case are that the details of the source (including its assessment as well as its citation) become a secondary consideration, and also that other information from the same source may be ignored. The sharing of research information, as opposed to just conclusions, must take account of this bottom-up approach, but there are currently no comprehensive mechanisms for this level of sharing. GEDCOM only shares conclusions, and those attempts that have been made to share research have become mired in the goal-directed notion. Whilst there may be a lot of variety in how readers approach their own source-based research, the digital representation would try to encapsulate the core elements in a manner that could be built upon to support correlation with other sources and the generation of conclusions.

[1] This is STEMMA terminology for references to subjects in a source, i.e. person references, place references, group references, etc. The distinction between subject references and subject entities (in the digital representation) was proposed as a clean break from the more contentious evidence person and conclusion person on the FHISO mailing list at: The Preferred Vocabulary. This suggestion sank without a trace as it was attached to some unrelated post, but it also needs rethinking as there are three distinct representations rather than just two, prototype subjects being the third.