Thursday 10 December 2015

Our Days of Future Passed — Part III

This is the final part of my trilogy on the philosophy behind STEMMA V4.0. Part I covered its application to both arboreal (tree) genealogy and event-based genealogy, while Part II covered narrative genealogy; I now want to expand on its support for source-based genealogy.

Source-based genealogy is both a research orientation and an organising principle where the source is the primary focus. The majority of software products, and especially Web sites, are focused on conclusions; users are asked to provide names, dates, and locations without having to indicate where or how their information originated. At best, they might be given the opportunity to retrospectively add some citation or electronic bookmark. When starting with the source, though, all the relevant resources (images, documents, artefacts) can be organised according to the source provenance and structure, a citation can be created as soon as information is acquired, and the information can be assimilated before you decide how and where to use it.

People like Louis Kessler have advocated source-based genealogy for several years, and the term itself has displaced the more naïve notion of evidence-based genealogy. Since evidence is only information that we believe supports or refutes some claim, a focus on evidence alone would ignore the source of the information, and any context or other information therein. For instance, imagine that you have used certain information from a source to substantiate a particular claim. What happens if you later feel that the same source might help with a different claim, made elsewhere? Do you have to assimilate the contents all over again in order to decide whether that’s true or not? How would you become aware of its possible relevance?

Link Analysis

Let’s think how source-based genealogy might work, conceptually, and especially the assimilation phase. Anyone who remembers studying textbooks in preparation for an examination may also recall annotating pieces of text: underlining phrases or circling sections that we believed would be important, and which we wanted to ensure we fully grasped. What we were doing was reinforcing our mental model, or mind-map, and creating structure and order from the text.

I’m sure you’ve all seen detective films, or TV series, where someone solves a complex puzzle using notes and images on a pin-board with string connecting the pieces together. This technique really does exist, and it’s called link analysis. It’s a type of graphic organiser used to evaluate relationships (or connections) between nodes of various types, and it has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, and medical diagnosis. Although most online sources present it in terms of software implementations, it is much older, possibly going back before WWII.[1]
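To make the idea concrete, here is a toy sketch of link analysis in Python. Everything in it — the class name, the node labels, the relationship names — is my own invention for illustration, not taken from any link-analysis product or from STEMMA itself: the essence is just nodes connected by typed edges.

```python
# Illustrative only: link analysis reduced to nodes and typed edges.
# All names here are invented for this sketch.

from collections import defaultdict

class LinkChart:
    def __init__(self):
        # node -> list of (relationship, other node)
        self.edges = defaultdict(list)

    def connect(self, a, relationship, b):
        # Record the link in both directions, like string on a pin-board.
        self.edges[a].append((relationship, b))
        self.edges[b].append((relationship, a))

    def neighbours(self, node):
        return self.edges[node]

chart = LinkChart()
chart.connect("1841 census entry", "mentions", "John Smith (persona)")
chart.connect("baptism record", "mentions", "John Smith (persona)")
chart.connect("John Smith (persona)", "conjectured-same-as", "John Smith (conclusion)")

# Everything attached to the persona node: its sources and its conclusion.
print(chart.neighbours("John Smith (persona)"))
```

The point of the exercise is that once the pieces are connected, questions such as “what supports this conclusion?” become graph traversals rather than re-reading every source.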

Figure 1 - Conceptual link diagram.

This rather imaginative depiction of the concept illustrates some of the advantages quite nicely. The information of interest in each source is somehow marked-out, and used to build any of: concepts, prototype subjects[2], conjectures, and logic steps — whatever the researcher wants from them. Because of the freedom in choosing those pieces, this method is as useful to my non-goal-directed Source Mining as it is to goal-directed research. Essentially, one approach is aimed more at collecting material to paint a history or biography of subjects, whereas the other is focused on solving a specific problem or proving one or more claims.

When those pieces are connected all the way up to the conclusion entities in your data, then it also provides a trail by which some user could drill-down[3] in order to see how a conclusion was arrived at, what information was used as evidence, and where that information originally came from.

When a marked item is a person reference, it is effectively what is often termed a persona, and the ability to connect personae from different sources provides a way of supporting multi-tier personae. I have previously been critical of personae because they separate the person reference from its original source context — including the place, date, and relationship context that might be instrumental in establishing the identity behind the reference — but this approach retains a connection to the source information, and even allows that context to follow the subsequent use of a persona. This is particularly important because it appears to reconcile certain contrasting approaches advocated by other data researchers: Tom Wetmore has long believed in personae, and Louis Kessler in source-based genealogy, but those approaches have sometimes appeared contradictory and have led to strong differences of opinion.

Source and Matrix Entities

Part I of this series introduced STEMMA’s Source entity as joining together citations and resources (such as transcriptions, images, documents, and artefacts) for a given source of information, but it encompasses much more than this. The semantic mark-up, described in Part II, allows arbitrary items of information to be labelled in a transcription. This includes subject references (person, place, animal, and group), event references, date references, and any arbitrary word or phrase. Those labelled items, called profiles, can be linked together using simple building-blocks in order to add structure, interpretation, or deduction to them. STEMMA doesn’t mandate the use of a visual link chart, or link diagram, since that would be something for software products to implement, but it does include the essential means of representation.
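The profile idea can be sketched in a few lines of Python. To be clear, this is not STEMMA’s actual XML representation — the class, the kind names, and the linking method are all my own shorthand — but it shows the essential shape: labelled fragments of a transcription, with interpretations layered on top of them via links.

```python
# A toy model (my own naming, NOT STEMMA's actual mark-up) of labelled
# "profiles" in a transcription, linked into a small network.

class Profile:
    def __init__(self, kind, text):
        self.kind = kind      # e.g. 'person-ref', 'date-ref', 'phrase'
        self.text = text
        self.links = []       # the profiles this one builds upon

    def link(self, *profiles):
        self.links.extend(profiles)
        return self

# Labelled fragments taken from a transcription.
name = Profile("person-ref", "Jno. Smith")
age  = Profile("phrase", "aged 32")

# An interpretation built on top of those fragments.
proto = Profile("prototype-person", "John Smith, b. c.1809").link(name, age)

# Drilling down from the interpretation to its underlying fragments.
print([p.text for p in proto.links])
```

The one structural idea worth noting is that the interpretation never replaces the fragments; it merely points at them, so the original source wording is always one link away.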

The Source entity normally specifies the place and date associated with the information, but these can also be linked to profiles if they have to be associated with some interpretation, or even some logic — for instance, when information relates to a place whose identity cannot be resolved beyond doubt from the place reference alone.

When a source comprises discrete or disjointed parts — such as a book’s pages, or a multi-page census household — or contains anterior (from a previous time) references — such as a diary, a chronological narrative, or recollections during a story — then smaller sets of linked profiles can be grouped within the Source entity using SourceLet elements, and these may have their own place and date context. Each of those discrete parts may have its own separate transcription and specific citation, although they are all related to the same parent source by containment — a subject for a future presentation. The network of linked profiles can bring together information and references from these different SourceLets for analysis.

The Source entity is a good tool for assimilating the information from a given source in a general and re-usable way. However, that information may need correlating with similar information from other sources, and this process may need to be repeated for different problems and with different goals. STEMMA accomplishes this with a related Matrix entity[4] that carries those networks of linked profiles outside of their source context and allows them to be worked on together.

Figure 2 - Mechanics of a link diagram.

Notice, as usual, that the building of these networks, and the association of them with corresponding conclusion entities, is independent of the relationship those entities have with their respective hierarchies and events (Part I), and narrative (Part II). In other words, the four main approaches to genealogy that I identified (arboreal, event-based, narrative, and source-based) can be inclusive of each other.

Compare and Contrast

Part I introduced the concept of Properties — items of extracted and summarised information — that could be associated directly with subject entities. The Source entity, which is part of the informational sub-model rather than the conclusional sub-model, can also achieve this but with additional flexibility. Because it works directly with source information, and is not obliged to make any final conclusions, references to incidental people, or otherwise unidentified subjects, can still be assimilated but left in the Source entity for possible future use. The power of this should be obvious where, say, a referenced person later turns out to be a relative or in-law.

Another difference is the vocabulary used to describe data and relationships. The Property mechanism uses a normalised computer vocabulary so that information can be consistently categorised as name, occupation, residence, etc., and relationships can be categorised precisely as things like spouse, mother, son, and so on. In the Source entity, though, what you record and what you call it are free choices; if you encounter a relationship provided as grandchild, nephew/niece, or cousin, where the interpretation may not be obvious, then you can keep it as-written and work on it. For the masochists, a comparison of these two mechanisms being applied to the same source may be found at: Census Roles.
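The vocabulary contrast can be shown with two small records. The field names below are mine, not STEMMA’s, and serve only to illustrate the difference between a normalised Property-style value and a relationship kept as-written with its interpretation still open:

```python
# Illustrative contrast (field names are my own, not STEMMA vocabulary).

# Property mechanism: a controlled vocabulary, ready for software to read.
normalised = {"category": "relationship", "value": "spouse"}

# Source entity: keep the term exactly as the source gives it, with the
# candidate interpretations recorded but deliberately unresolved.
as_written = {
    "text": "grandchild",                                  # verbatim from the source
    "interpretations": ["son's child", "daughter's child"],
    "resolved": False,
}

print(normalised["value"], "|", as_written["text"])
```

The first record can populate a database field immediately; the second preserves the ambiguity so that a later researcher can see exactly what the source said before any interpretation was imposed.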

It might be said that ignorance of prior work is bad during any type of research and development, but my software history is full of such cases where it has yielded a route to genuine innovation.[5] When I finally decided to look at whether other data models had addressed this approach, I was surprised to find that the old GenTech project from 1998 had documented a similar approach. GenTech and STEMMA both tried to build a network of extracted information and evidence from “source fragments” — and actually used this same term — but the similarities applied mainly to the intention rather than to the implementation.

The GenTech data model V1.1 is hard to read because it has no real examples, it presupposes a concrete database implementation — which I’m not alone in pointing out is inappropriate in a data-model specification — and it talks exclusively about evidence, and analysing evidence, rather than information. The latter point is technically incorrect when assimilating data from source fragments, since the identification of evidence — or the point at which information can be considered evidence — depends upon the researcher and the process being applied rather than some black-and-white innate distinction.

GenTech’s ASSERTION statement is the core building-block for its network. This is simply a 5-tuple[6] comprising {subject-1 type, subject-1 id, subject-2 type, subject-2 id, value} that relates two “subjects”. Those subjects are limited to its PERSONA, EVENT, GROUP, and CHARACTERISTIC entities — concepts which differ from STEMMA’s use of the same terms — and there are some seemingly arbitrary rules for which can be specified together. This restricted vocabulary means that it does not clearly indicate how its CHARACTERISTICs are associated with a particular time-and-place context (I couldn’t even work out how); it has no orthogonal treatment of other historical subjects (STEMMA terminology), such as place, group, or animal; and it cannot handle source fragments with arbitrary words and phrases. By contrast, STEMMA’s profiles can deal with source fragments containing references to persons, places, groups, animals, events, dates, or arbitrary pieces of text. Its <Link> element is the low-level building-block that connects these together, and to other profiles, but with much more freedom. For instance, the equivalent of a multi-tier persona is achieved by simply connecting two prototype-person profiles. GenTech uses its GROUP entity to achieve this, effectively overloading that entity to group PERSONAs and CHARACTERISTICs rather than using it only to model real-world groups.

Some other philosophical differences include the fact that STEMMA profiles represent snapshots of information and knowledge at a particular point in the assimilation process (or correlation process, in the case of Matrix); the actual information effectively flows between those profiles. This is hard to describe in detail, and I may save it for a later post. Another difference is that the profiles allow steps of logic and reasoning to be represented in natural language; the connections are not just a bunch of data links or database record IDs. That text would be essential if some user wanted to drill-down to understand where a claim or figure originated, and STEMMA allows multi-lingual variants to be written in parallel. Reading GenTech’s section 1.4.2 suggests (to me) that its ASSERTION may have more in common with STEMMA’s Property mechanism than with its Source and Matrix entities.

An interesting corollary is that conclusions are easily represented in software data models, and they will usually employ precise taxonomies/ontologies to characterise data (such as a date-of-birth, or a biological father), or equivalent structures (such as a tree). In effect, these conclusions are designed to be read by software in order to populate a database or to graphically depict, say, biological lineage. Source information, on the other hand, cannot be categorised to that extent — it was originally intended to be human-readable, it must be assimilated and correlated by a human, and all analysis must be understood later by other humans.

There have been a number of attempts to represent the logical analysis of source information using wholly computerised elements (see FHISO papers received and Research Process, Evidence & GPS) but these are far removed from employing real text. As a result, they lose that possibility of drilling-down from a conclusion to get back to a written human analysis, and to the underlying information fragments in the sources. While allowing analytic notes to be added directly to data items might be one simplistic provision, connecting notes and concepts together to build structure demands written human explanation, not “logic gates” and other such notions. One reason for these overtly computerised approaches could be that software designers feel an onus on them to support “proof” in the mathematical sense rather than in the genealogical sense; a result of possibly misunderstanding the terminology (see Proof of the Pudding).

Concluding Remarks

So what have I accomplished in this trilogy? I have given insights into a published data model for micro-history that has orthogonal treatment of subjects and inclusive handling of hierarchies, events, narrative, and sources. Did I expect more? Well, I did hope for more, when I first started the project, but it was not to be. There are people in the industry whom I have failed to engage, for whatever reason, and that means that the model will probably finish as another curiosity on the internet shelf, along with efforts such as GenTech and DeadEnds. Complacency and blind acceptance have left genealogy in a deep rut. In the distant future — when technophobe is simply a word in the dictionary — people of all ages will look back at this decade and laugh at what our industry achieved with its software. When paper-based genealogy (using real records) has probably gone down the same chute as paper-based libraries, and we're left with software genealogy and online records, then we'll wish that we had built bigger bridges between those worlds. If our attitudes and perceptions don’t rise above the horizon then we'll never see the setting sun until it’s too late, and we might as well say: RIP Genealogy!

[1] Nancy Roberts & Sean F. Everton, "Strategies for Combating Dark Networks", Journal of Social Structure (JoSS), volume 12 ( : accessed 24 Oct 2015), under “Introduction”; publication date uncertain but its References section suggests 2011.
[2] In STEMMA terms: persons, places, groups, or animals.
[3] A BI (business intelligence) process in which selecting a summarised datum or hierarchical field — usually with a click in a GUI tool — revealed the underlying data from which it was derived. The term was used routinely during the early 1990s when a new breed of data-driven OLAP product began to emerge.
[4] "An environment or material in which something develops; a surrounding medium or structure", from Oxford Dictionaries Online ( : accessed 28 Oct 2015), s.v. “matrix”, alternative 1.
[5] Modern software development is less about ground-up development and more about creating solutions by bolting together off-the-shelf (e.g. open-source or proprietary) components. Both have merit but the pendulum is currently in the other quadrant.
[6] An n-tuple (abbreviated to “tuple”) is a generalisation of double, triple, etc., that describes any finite ordered list. The term was popularised in mathematics, and more recently in OLAP technologies such as Holos and later Microsoft OLAP.
