Thursday, 26 November 2015

Our Days of Future Passed — Part I

In this three-part article, I want to summarise the current state of the STEMMA® research project. Changes on the Web site have been deliberately infrequent of late to enable me to find the time to finish it, but that closure is now in sight. This data-model specification was designed to represent our “days of future passed”,[1] or the future laying-down of our daily events — using less surreal syntax.

The changes in the latest version of the specification were recently summarised at STEMMA V4.0, but I want to take this extra time to put those changes, and the overall philosophy of STEMMA, into a perspective that can be recognised by genealogists and software designers alike.

The original goals of STEMMA were twofold: to develop a data model that could represent what I was already doing in genealogy, and to investigate innovations that were not constrained by legacy products or models. Too often, research for the future is limited by the legacies of the past, but I found that I was in an ideal situation to try new solutions, and to look at software genealogy from a different perspective.

The development of the specification, and of associated software, has involved several iterations, as one would expect from cycles of experimentation. However, as I became a better genealogist, my requirements also changed, and that meant unpicking some parts in order to re-knit them differently. Early incarnations were primarily conclusion-based, but as I tried to link conclusions (including so-called “facts”) to their supporting evidence, and ultimately to the information in the underlying sources, I realised that this was a huge field that needed considerable thought, not just a bunch of citations and some data links.

I tried hard to accommodate the main approaches to genealogy in the data model (e.g. family trees), in addition to my own much-broader scope, and to some of the more prevalent software schemes, albeit with much generalisation. I did wonder, several times, whether it was indeed possible to have a single model encompassing all of:

  • Family Trees and pedigrees.
  • Event-based genealogy, where we look at the events in the lives of the persons, or other subjects.
  • One-name and one-place studies.
  • Handling of non-family and non-familial relationships.
  • Generalised micro-history, including additional subjects such as places, animals, and groups.
  • Looking at places as another hierarchical type of subject rather than simply a name, and using a similar approach for groups.
  • Clear separation of conclusions from evidence, and from source material.
  • Personae, including multi-tier ones.
  • Source-based genealogy, where conclusions are built from source information, rather than simply tacking citations onto conclusions.
  • My bottom-up non-goal-directed approach to assimilating sources, described as Source mining.
  • Integration of stories, research, proof arguments and other forms of narrative.
  • Representation of diplomatic transcriptions.
  • International applicability.
  • Extensibility of type systems using namespaces.
  • Generalised approach to sources & citations that accommodates layers, analytical notes, and even attribution.

You may think that such an ambitious set of goals would yield a hugely complex Frankenstein’s monster of a model, but the more I worked on it, the more things would slot into place. At a certain point, a design — any design — reaches a level of order and elegance that compares favourably with its functionality and capabilities, and I believe it’s about there.

Previous attempts to describe STEMMA haven’t gained much traction and this is partly due to the prevailing notion that genealogical software products merely maintain a database of discrete data items. For instance, QuickLesson 20: Research Reports for Research Success, on the Evidence Explained site, relegates software considerations (other than using a word-processor) to the final step: “Step 4: Data entry?” on the basis that you will want to “… cherry-pick individual bits of data and record them in a spread sheet or other data-management software”. If you’re going to read on then you will need to exorcise all such notions and familiarities for this will be fundamentally different!

One of the foundational elements of the STEMMA design is that there are multiple independent sets of linkages within the model. What this means is that the various entities, such as Persons, Places, Events, etc., are linked in multiple ways, each according to some real-world rationale, and these cooperate to deliver a very rich structure. For instance, the lineage of a Person is a set of hierarchical linkages that is independent of any association with Events, and that means that the same model can be applied to a tree-based arboreal genealogy or an event-based history, or a combination of these. Also, the endless ways in which these linkages can be visualised are not prescribed by the data model; that’s the prerogative of the software product.
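The idea of independent linkage sets can be illustrated with a minimal Python sketch. This is not STEMMA's schema or software: the class names, field names, and sample people are all hypothetical, chosen only to show that lineage links and event links can live side by side without either depending on the other.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    key: str
    name: str
    # Lineage linkage: parent keys only; nothing here refers to events.
    father: Optional[str] = None
    mother: Optional[str] = None


@dataclass
class Event:
    key: str
    title: str
    # Event linkage: participant keys only; nothing here refers to lineage.
    participants: list = field(default_factory=list)


def ancestors(persons, key):
    """Follow only the lineage linkage, breadth-first."""
    found = []
    queue = [key]
    while queue:
        person = persons[queue.pop(0)]
        for parent_key in (person.father, person.mother):
            if parent_key is not None:
                found.append(parent_key)
                queue.append(parent_key)
    return found


def events_for(events, key):
    """Follow only the event linkage."""
    return [event.title for event in events if key in event.participants]


# Hypothetical sample data: "sam" has no lineage links at all,
# yet still takes part in an event.
persons = {
    "tom": Person("tom", "Tom"),
    "ann": Person("ann", "Ann"),
    "may": Person("may", "May", father="tom", mother="ann"),
    "sam": Person("sam", "Sam"),
}
events = [
    Event("e1", "Wedding", ["tom", "ann"]),
    Event("e2", "Voyage", ["may", "sam"]),
]

print(ancestors(persons, "may"))
print(events_for(events, "sam"))
```

Either function can be removed without affecting the other, which is the sense in which a tree view and an event view can share one set of data.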

This concept was eventually used to provide another set of linkages that connected conclusions to evidence, to information, and to sources. All the right concepts were there in the earlier incarnations, but it wasn’t until v4.0 that they were connected properly.[2]

On the surface of it, the direction in which the specification has proceeded has widened the scope of a data model far beyond what many genealogists and software vendors have considered, or would like to have considered. Indeed, it was pointed out to me during discussions within FHISO that I have the luxury of not having to worry about backwards compatibility. This is partly why I now wish to illustrate how this one data model can be applied to each of the main genealogical approaches, and implicitly to suggest that these approaches do not have to be exclusive of each other; we need to avoid the little-endian versus big-endian[3] arguments and see that they all have merit.

STEMMA has two notional sub-models: conclusional[4] and informational, and the following sections will make reference to them.

Arboreal Genealogy

Arboreal (tree) genealogy is characterised by a focus on biological lineage. This is often mirrored by an underpinning database schema designed specifically to support a tree-based view of lineage, or of pedigree.

What the diagram below illustrates is that each Person entity in a STEMMA tree can be associated with multiple Source entities, each describing a specific source of information, and encapsulating the relevant resources (such as images, documents, and artefacts) and citations.

A lineage hierarchy for Persons or Animals

Each of those sources can yield Properties — items of extracted and summarised information — for the corresponding Person. For instance, if a person was mentioned in multiple census sources then each of them might yield a different residential address, differing ages (of course), but even conflicting places of birth. Properties are one of two mechanisms for associating information with a subject entity. The other (via the Source entity) is part of the informational sub-model but Properties are part of the conclusional sub-model. That is because they represent normalised information, and any relationship or other subject identification involves a direct connection to a conclusion subject entity, such as another Person. Each Property basically consists of a name and one-or-more values (see Is That a Fact?), and may be used to represent simple information, such as a name or age, or a relationship to another subject, such as a Person or Place. Although each Property can also retain a copy of the associated source fragment, indicating how the information was originally expressed, the overall mechanism is primarily designed for database-orientated products. They are useful for presenting a synopsis of that subject, but they cannot be used for detailed analysis or correlation.

As shown here, they are an ideal mechanism for arboreal approaches where information is directly associated with the relevant Persons. Although this linkage was designed to represent static Properties (ones that do not change over time, such as a blood group), it could be used to represent dynamic ones, such as a marriage date — but more on that later.
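To make the shape of a Property more concrete, here is a small Python sketch under stated assumptions: the `Property` class, its field names, the property names, and the census values are all invented for illustration and are not taken from the STEMMA specification. It only shows a name with one or more values, tied to the source that yielded it, with an optional copy of the original wording.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    name: str           # hypothetical property name, e.g. "BirthPlace"
    values: tuple       # one or more normalised values
    source_key: str     # the source that yielded this item
    fragment: str = ""  # optional copy of how the source expressed it


def synopsis(properties):
    """Group values by property name, keeping each source's version."""
    grouped = defaultdict(list)
    for prop in properties:
        grouped[prop.name].append((prop.source_key, prop.values))
    return dict(grouped)


# Hypothetical extracts from two census sources for the same person.
census_properties = [
    Property("Age", ("34",), "census-a", "34"),
    Property("Age", ("44",), "census-b", "44"),
    Property("BirthPlace", ("Nottingham",), "census-a", "Notts, Nottingham"),
    Property("BirthPlace", ("Derby",), "census-b", "Derby"),
]

summary = synopsis(census_properties)
print(summary["BirthPlace"])
```

The synopsis keeps both birthplaces rather than choosing between them; deciding which is right is a matter for analysis of the sources, not for this summary structure.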

STEMMA V4.0 introduced the Animal entity as another subject type, in addition to the existing Person, Place, and Group. Some might ask ‘why animals’ but they are important to a great many people’s history. If anyone ever writes about me in the (far-off) future, and fails to mention my dogs, then I would haunt their hard-drive. Interestingly, it wasn’t difficult to generalise the software support for Person entities to include Animal entities; they both have biological lineage, and STEMMA’s name support already coped with their differences.

The astute reader may be asking where marriages fit into this arboreal scheme. It’s true that I mentioned handling a marriage date as a Property, but ideally marriages would be handled as Events (next section), along with every other thing that happened in the subjects’ lives. Not making marriage a fundamental part of a tree actually made the inclusion of Animals easier, since it emphasised that marriage is not a prerequisite for lineage; trying to blend the concepts together will fail, and quickly so!

One subtle but important point to note here: there is no “STEMMA tree”, per se; a tree is just a way of visualising the hierarchical linkages associated with lineage. All Person entities may or may not be linked in such a way, and that implicitly means that a STEMMA Document (i.e. a file) can describe multiple independent trees.

Just as Persons and Animals share many characteristics, and especially their lineage-based hierarchies, so too do Groups and Places; they both have a type of hierarchy that is time-based. With lineage, every subject has just two parent subjects — one male and one female — but with organisational hierarchies, each subject has just one parent that may change over time.
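The contrast with lineage can be sketched in Python. In this illustration, a place has at most one parent at any moment, but that parent can differ from one period to another. The `ParentLink` class, the year-based ranges, and the place names are assumptions made for the example, not part of the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParentLink:
    parent: str
    start: int            # first year in which the link applies
    end: Optional[int]    # last year in which it applies; None if open-ended


def parent_at(links, year):
    """Return the single parent in force during the given year, if any."""
    for link in links:
        if link.start <= year and (link.end is None or year <= link.end):
            return link.parent
    return None


def hierarchy_at(place_links, place, year):
    """Climb the organisational hierarchy as it stood in the given year."""
    chain = [place]
    while True:
        parent = parent_at(place_links.get(chain[-1], []), year)
        if parent is None:
            return chain
        chain.append(parent)


# Hypothetical places: the village moves from one county to another.
place_links = {
    "Village": [
        ParentLink("Old County", 1800, 1973),
        ParentLink("New County", 1974, None),
    ],
    "Old County": [ParentLink("Country", 1800, None)],
    "New County": [ParentLink("Country", 1974, None)],
}

print(hierarchy_at(place_links, "Village", 1900))
print(hierarchy_at(place_links, "Village", 1980))
```

The same lookup would serve for a Group whose parent organisation changes over time, which is why the two subject types can share one kind of hierarchy.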

The following diagram illustrates how a place hierarchy has a very similar relationship to sources and Properties in STEMMA.

An organisational hierarchy for Places or Groups

Event-Based Genealogy

Events are something that happened in a given place on a particular date, or range of dates. Event-based genealogy gives a more dynamic representation of information related to Persons, or other subjects, and so is more applicable to family history than to genealogy in its limited literal sense (i.e. lineage).

Organising information both by geography and by time is an essential step in the representation of history. The following diagram illustrates a single Event that is supported by two sources. As above, the associated Source entities can embrace multiple resources and citations. The diagram shows that these sources may make reference to subjects of each of the types supported by STEMMA: persons, animals, places, and groups; but they can now yield dynamic Properties rather than the static ones mentioned above. That is, each of the Property values can be traced to a particular time and place via the Event entity and its supporting sources.

Event linkages to the relevant subject entities

The Event entity is still part of the conclusional sub-model, even though the Source entities encapsulate details of the supporting sources. For instance, two mentions of a marriage date or place, say from a certificate and from a newspaper announcement, may differ slightly, and yet the Event entity would represent the conclusions about the true details.

Note, too, that each of the subject types in the above diagram indicates that it is still part of its respective hierarchy. In other words, the Event linkages are independent of the hierarchical linkages of those subjects.

Unusually, STEMMA Events are also hierarchical. This means that a complex event (one with structure that can be broken into separate phases or layers) can be represented as a whole. A simple example of this involves a voyage event whose embarkation and disembarkation occurred at different times and places, and which can be represented as child events.
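The voyage example can be sketched as a small tree of events in Python. The `Event` class, its fields, and the dates and ports below are hypothetical and serve only to show how a parent event can be described as a whole from its child events.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Event:
    title: str
    when: Optional[date] = None   # a parent may leave this to its children
    place: Optional[str] = None
    children: list = field(default_factory=list)


def all_dates(event):
    """Collect the dates of an event and all of its descendants."""
    dates = [] if event.when is None else [event.when]
    for child in event.children:
        dates.extend(all_dates(child))
    return dates


def all_places(event):
    """Collect the places of an event and all of its descendants, in order."""
    places = [] if event.place is None else [event.place]
    for child in event.children:
        places.extend(all_places(child))
    return places


# Hypothetical voyage with two phases at different times and places.
voyage = Event(
    "Voyage",
    children=[
        Event("Embarkation", date(1900, 3, 1), "Port A"),
        Event("Disembarkation", date(1900, 3, 15), "Port B"),
    ],
)

dates = all_dates(voyage)
start, end = min(dates), max(dates)
print(start, end, all_places(voyage))
```

Here the overall extent of the voyage is derived from its phases rather than stored twice, which is one possible design choice among several.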

Narrative Genealogy

Narrative genealogy involves the use of humanly-generated natural language to describe the persons, and other subjects, in our history, as well as all the events that touched them. In common with a number of other people, I strongly believe that software cannot generate anything resembling readable narrative, and that advocates demonstrate more misplaced pride than real-life use-cases.

Narrative can be used for essays, notes, reports, and many other purposes, but STEMMA also includes transcriptions. This includes their presentational aspects such as paragraphing and line-counting, original emphasis such as underlining or italics, corrections and other annotation, and marginalia or footnotes. Since transcribed extracts will often appear in essays or reports, narrative and transcription are both supported as a single feature.

I will continue discussing this genealogical approach in Part II of this series.

Source-Based Genealogy

Source-based genealogy involves a focus on the source, and the assimilation of the information therein. For instance, working with a simple birth certificate might yield the names of the child and parents, mother’s maiden name, father’s occupation, birth sex, the date and place of birth, name and residence of the informant, and the date of registration. Beginning from the source means that we can organise our copies of the information (usually images and transcripts), create a source citation, and have all of that information available before we start any detailed analysis.

Conversely, and with online genealogy especially, the norm is to cherry-pick selected names that have been extracted and entered into some index for the user’s benefit. This divorces those names from any context associated with the source, and so is insufficient for a detailed analysis. Unfortunately, the underlying source is too-often ignored leaving users working with only partial information. It also means that citations are generally an afterthought.

I will continue discussing this genealogical approach in Part III of this series.

Software Design

I want to round-off the first part of this series of blog-posts by making some observations about genealogical software.

There are two broad approaches to any software design: the first involves designing the code to provide specific product functionality, usually as dictated by some product manager. The second involves taking a step-back and designing for the bigger picture. This usually involves a software architect and results in a more adaptable design with greater potential for evolution. A case where the former has happened in genealogy is where products were designed to support trees, and hence the biological lineage of persons. Notwithstanding that lineage is not a true tree, those designs then found it hard to represent history, evidence & sources, geography, reports & essays, or anything other than persons (see The Lineage Trap).

A User Interface (UI) is a crucial part of a software product, not just because it can make a product easier or harder to use, but because a well-designed UI can give a sense of the physicality of the data being manipulated. When the computer world introduced Graphical User Interfaces (GUIs), it became possible to depict things using pictures rather than text, but also to give graphical control to the end-user. That meant the ability to do such things as drag-and-drop or manipulate parts of a picture. A simple example might be to indicate a data relationship by drawing a line between two entities on the screen, as opposed to filling in a textual field. Unfortunately, genealogical products tend to use a lot of form-fill, and present a bunch of boxes rather than a tangible UI. Part of the reason for this may be that such UIs are harder to create for the world of the Web, and harder to use on hand-held devices. A consequence, though, is that those products largely solicit conclusions. When asked to provide details of a spouse, say, an end-user is typically invited to provide name, date-of-birth, etc., without having to say where the information came from. At best, the user can tack on some citation, or electronic bookmark.

Although STEMMA was initially conceived as supporting import/export or long-term storage of data, that quickly became a secondary feature. Its deep level of representation meant that no database-orientated product could index it adequately to achieve its full potential. However, indexing it into memory, on-the-fly, meant that (a) full and efficient indexing was possible, (b) no import/export was necessary as the definitive source format could be exchanged, and (c) no special consideration was needed for long-term storage or backup of database content. The article Do Genealogists Really Need a Database? explained how reliance on a conventional database is folly, and that it introduces performance degradation, risk of corruption, and incompatibility between different database vendors or proprietary schemas, and forces the need to invent other representations for import/export, etc.
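As a rough illustration of indexing a file into memory on-the-fly, the following Python sketch parses a tiny XML document and builds two lookup tables from it. The XML shown is deliberately generic and is not the STEMMA schema; the element names, attributes, and the `build_indexes` function are assumptions made purely for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document: generic element names, not STEMMA mark-up.
DOCUMENT = """
<dataset>
  <person key="p1" name="Ann"/>
  <person key="p2" name="Tom"/>
  <event key="e1" title="Wedding">
    <participant ref="p1"/>
    <participant ref="p2"/>
  </event>
  <event key="e2" title="Voyage">
    <participant ref="p2"/>
  </event>
</dataset>
"""


def build_indexes(xml_text):
    """Parse the document once and index it entirely in memory."""
    root = ET.fromstring(xml_text)
    by_key = {}
    events_by_person = defaultdict(list)
    for element in root:
        by_key[element.get("key")] = element
        if element.tag == "event":
            for participant in element.findall("participant"):
                events_by_person[participant.get("ref")].append(element.get("key"))
    return by_key, dict(events_by_person)


by_key, events_by_person = build_indexes(DOCUMENT)
print(events_by_person["p2"])
print(by_key["e1"].get("title"))
```

The indexes are rebuilt each time the file is loaded, so the file itself remains the only thing that has to be stored, backed up, or exchanged.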

[1] No, not the X-Men film title, which uses “past” rather than “passed”; I am from a different generation. The title borrows heavily from the 1967 concept album called Days of Future Passed by the English rock group: The Moody Blues, of whom I was, and still am, a huge fan.
[2] In the words of British comedians Morecambe & Wise, I wasn’t “playing all the wrong notes”, I was “playing all the right notes but not necessarily in the right order”.
[3] This terminology comes from the satirical novel Gulliver's Travels by Irish writer and clergyman Jonathan Swift, in which two religious sects of Lilliputians are divided between those who crack open their soft-boiled eggs from the little end, and those who crack from the big end.
[4] The word conclusional is not in most English dictionaries. The usage here ("of or pertaining to a conclusion") may be found in: Bryan A. Garner, Garner on Language and Writing: Selected Essays and Speeches (American Bar Association, 2009), p.330, where it compares the use of: conclusory, conclusional, and conclusionary.

Sunday, 22 November 2015


A little later than I had expected, but I have now completed the changes necessary for STEMMA V4.0. This specification is now published on the STEMMA Web site and is anticipated to be the last major revision necessary for this micro-history data model (small refinements continuing).

The main focus of this change has been the correct separation of conclusion from information and evidence, and allowing them to support drill-down (inspecting a conclusion to see the associated how and why), and to support the alternative bottom-up approach of Source Mining. Although this has been a goal from the earliest work on this project, the associated research and experimentation hasn’t always taken the correct path, but then that’s the nature of research, and the model is better for it.

Much of the text on the Web site has been revised, often with significant re-wording, and similarly with some of my older blog-posts. Although this particular subject sits between two different worlds (genealogy and software), each with their own vocabulary that may clash or cause ambiguity, I also admit that some of my older word choices were the result of genealogical inexperience.

Changes to the data model include:

  • Introduction of a new Source entity that embraces both Citations and Resources for a particular information source. Citation and Resource entities are now connected to the Source entity rather than to each other.
  • Support for source assimilation & analysis, source mining, and the ability to drill-down on conclusions, all provided via the Source entity.
  • The <References> element, within Events, is now superseded by <SourceLnk> which links to the new Source entity. Enclosed *Ref elements (e.g. <PersonRef>) changed to *Lnk elements for consistency. Removal of the ID attribute introduced in V3.0.
  • Support for cross-source analysis and correlation via a new Matrix entity.
  • Support for a generalised approach to multi-tier personae.
  • Addition of the Animal entity, strongly modelled on the Person entity, including related mark-up and namespaces.
  • <CitationLnk>/<ResourceLnk> from Person, Place, Group, and Event entities, changed to <SourceLnk>.
  • Reviewed the goal of sticking to XHTML tags for presentation, replacement of the <Hi> element with HTML-like ones, and the addition of support for <sup>/<sub> elements, columnar text, simple tables, and indentation.
  • Removal of ‘Unreadable’ mode from the <Anom> element.
  • Support for distinguishing manuscript and typescript transcriptions in the <Text> element. Support for numbering lines and pages in transcriptions. Positional control over annotations such as marginalia.
  • <FromText> element added to <Narrative> in order to share re-usable sections of text. This has meant that the NoteKey attribute, in the semantic mark-up, was no longer required and so was deleted.
  • Categorisation of the layers in a Citation chain.
  • The optional <DisplayFormat> element of the Citation entity has been re-interpreted as a set of pre-formatted language-specific strings. This may exist in addition to the mandatory set of named parameter values, and the two together can also be used as a simple citation-template.
  • The Intrinsic Functions, mentioned at the end of Semantic Mark-up, have been changed to Intrinsic Methods in preparation for defining a run-time object model. The set is also supplemented by ones for accessing subject-entity names.
  • Small changes to subject-entity *-name-mode vocabularies to factor-out a generic name-mode (missing from previous specification).
  • Place coordinates (including bounding shapes) are now time-dependent, the same as any parent-Place link.
  • Added Canton and Colony to place-type vocabulary. The place-type of House is now replaced by Number and Apartment for flexibility.
  • <Quality>, <Reliability>, and <Credibility> elements moved from the Citation entity to the new Source entity.

Although small refinements will continue, I want to concentrate subsequent efforts on describing advantages and philosophy of the data model, and in providing more worked examples.

There will be a series of blog-posts following this one that will provide a high-level introduction in order to set the scene.

Monday, 26 October 2015

Summarised Blogger Tips

I have posted the following list before but it has become out-of-date and lost in that morass that is Google+ gems and past-posts. It consists of a summary (oldest-first) of tips that I have previously posted for using Blogger, and for creating more sophisticated blog-posts. I will update this list if I find any more tricks and tips that I think others may find useful.

The uses of blogs are manifold. Sometimes your posts may be simple "newsy" paragraphs, but sometimes you will want carefully-crafted posts that will be relevant and readable years from now, especially if you're using a blog for family-history write-ups. This is how many of these tips came about and we bloggers need to raise the bar for such work.

Using Microsoft Word with Blogger

Using Feedburner with Blogger

Using Google Maps with Blogger

Using Footnotes with Blogger

Using Bookmarks with Blogger

Putting a tiny blog icon on your browser tab

Using Documents or Script with Blogger

Saturday, 10 October 2015

How Not To Design a Database Search

In this article, I want to examine the user interface (UI) to a recently launched database, and to analyse just how much thought really went into providing it. Is this an area where database providers can make an important contribution, or is a simple set of search fields and some SQL tables enough for our needs?

Figure 1 - Design something useful.[1]

The host of this database is Findmypast — again — but my goal is not to berate them; I want to dissect this clinically, and objectively, to see how a bit of forethought could make the difference between a tick-in-the-box and a genuinely useful genealogical resource.

The resource is the “England & Wales, Electoral Registers 1832-1932”, and I’ve been waiting a long time for this to come online as there is a wealth of information in the associated records. The data is made available in conjunction with the British Library, and I am hoping that the digitisation of these records will not stop at 1932 as they would become more and more useful as they approach the present day. I am assuming that privacy will not be an obstacle as Findmypast already host “UK Electoral Registers 2002-2014”, and Ancestry host “London, England, Electoral Registers, 1832-1965” along with some incomplete regional variants.

Electoral Registers are annual lists of people who were eligible to vote and these usually included their residential address, although the right to vote was primarily linked to property ownership until the Representation of the People Act 1918. The way that regions were divided up for voting purposes in Britain was, and still is, a little complicated, but the page at Electoral Divisions of the UK may be of some help.

The eligibility to vote varied greatly between the Boroughs until the Great Reform Act of 1832. As well as streamlining the criteria, this also led to a greater number of men being able to vote, but it was still the case that only one million of the seven million adult males in England and Wales could vote. This was doubled by the Second Reform Act of 1867, but even further reform in the Third Reform Act of 1884 still left 1 in 3 adult males, and all females, without the vote in England and Wales.

Although women could vote in local elections as of 1869, they wanted equal eligibility to vote in Parliamentary elections. Several Suffragette and Suffragist groups were established throughout the country to campaign and lobby the government for equal eligibility, and these groups were eventually brought together under the name of the National Union of Women’s Suffrage Societies (NUWSS) in 1897. The 1918 reforms, where the property qualifications for all men over the age of 21 were abolished, were strongly influenced by the effects of WWI, but it wasn't until the Equal Franchise Act of 1928 that men and women over the age of 21 could vote equally.

So did the online information meet my expectations? Once the paralysis from my stunned amazement had subsided, I did find some useful details, but it was hard work! The fields in the search form comprised:

Polling district or place
Additional keywords

OK, not all of these fields are going to be useful for the majority of researchers, but the form did include the primary ones. When the search results were displayed, though, only the following information was presented:

Polling district or place
Image number

Where is the voter’s name, or their address, you might reasonably ask — and I certainly did ask. The country is a waste of space given that you have the county, and I have no idea what use the image number is. The constituency is of dubious use in the search results but it is also very long. For instance: “P[arliamentary] C[ounty] of Nottinghamshire, Bassetlaw Division” was wrapped over 5 separate lines for each row of the associated search results.

So are the personal names important? Surely, you know what names you entered into the search form. No — the given name and surname are individually optional, so you may be looking at a group of related names. For instance, you may have entered just the surname and place to find members of the same family. The normal Variants option on the name fields is documented as not working (despite being present on the form), but wildcards are allowed. Hence, you need to see the names as they are recorded in the associated Electoral Register pages.

The root of the problem is that the data is available in the form of discrete PDF files, and although this isn’t a problem by itself, there is no database search being performed; your search criteria are used to perform a direct textual search of those PDF files, and that is less than satisfactory.

The problem with a file search is that there is no context; it is searching for words anywhere in the file, just as a newspaper search works. I wanted to find people in Nottingham with the surname Kirk but several of the hits were for Kirk Street. The help claims that the search results are ordered by the proximity of the words, but that does not guarantee that the given name and surname are on the same line; if you were looking for a John Smith then you might be presented with a file that happened to have “John” at the top of the page and “Smith” at the bottom.

Ideally, the text from those PDF files should be extracted, parsed, and then keyed according to the actual names. This isn’t rocket science, but it can be a little messy. The problem is that there isn’t a single layout: sometimes the residential address is on the same line as the personal name, and sometimes it’s on a sub-heading of its own; sometimes there are other fields associated with the name; sometimes the data is in multiple columns, and sometimes just one. I have noticed before that Ancestry’s attempts to identify names in Electoral Registers and the British Phone Books have often led to the misidentification of the correct address. In the meantime, the only recourse is to examine the image in every single case of the search results, but that’s actually impractical at the moment.
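The difference between searching for words anywhere on a page and searching parsed entries can be sketched in Python. The page text, the single “Surname, Given, address” layout, and the regular expression below are all assumptions made for illustration; as noted above, real registers use several layouts, so this handles only one invented case.

```python
import re

# Hypothetical page of extracted text, in one assumed layout.
PAGE = """\
Kirk Street
  Smith, John      12 Kirk Street
  Brown, Mary      14 Kirk Street
High Road
  Kirk, Thomas     3 High Road
  Smith, Alice     5 High Road
"""


def page_mentions(page, *words):
    """Context-free search: true if every word occurs anywhere on the page."""
    tokens = set(re.findall(r"[A-Za-z]+", page.lower()))
    return all(word.lower() in tokens for word in words)


# One indented entry per line: surname, given name, then the address.
ENTRY = re.compile(
    r"^\s+(?P<surname>[A-Za-z'-]+),\s+(?P<given>[A-Za-z]+)\s{2,}(?P<address>.+)$"
)


def parse_entries(page):
    """Turn each matching line into a keyed record."""
    entries = []
    for line in page.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def find_person(entries, surname, given=None):
    """Search the parsed records by name rather than by loose words."""
    return [
        entry for entry in entries
        if entry["surname"].lower() == surname.lower()
        and (given is None or entry["given"].lower() == given.lower())
    ]


entries = parse_entries(PAGE)
print(page_mentions(PAGE, "John", "Kirk"))   # the page "matches" a John Kirk
print(find_person(entries, "Kirk", "John"))  # but no such voter is listed
print(find_person(entries, "Kirk"))
```

The word search reports a hit for “John Kirk” because a John Smith happens to live on Kirk Street, whereas the keyed search finds only Thomas Kirk. A production parser would need one such pattern per register layout.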

Clicking the ‘View document’ icon to the right of one search result gave more detailed information about the PDF file as a whole (not about the information alleged to have been found by the search), and some of the details shown in the initial search results might have been better if moved here instead. Clicking a further button downloaded the PDF file, and (in Firefox on Windows) I had to then select it from the Downloads area (3rd click). Unfortunately, the files do not have a file extension (there’s no excuse for that!) and so the browser didn’t know what type of file it was. A dialog therefore asked whether I wanted to open this “unknown file type” (4th click) and I was presented with a list of possibilities. Selecting ‘Adobe Reader’ (5th click) and clicking “OK” (6th click) caused a further dialog again asking whether I wanted to open this “unknown file type” (7th click).

I can’t heap the blame for this on Findmypast since there was obviously some collaboration with the British Library, but what happened to the project plan? The requirements to make this project useful must have been laid out before they started. The project was advertised as far back as 2011 so the issues of parsing the list text must have been considered, and solved. That problem is quite common in manipulating legacy business data, and there are tools that can help. These tools can be programmed to handle a particular layout, or data pattern, and there will only be a finite number of possibilities in the registers. Maybe this is the plan for the future. Maybe the project overran by an enormous amount and they had to make something available quickly. I have not seen anything written on these hopefully short-term limitations.

I don’t want to make a habit of these software reviews because if this UI ever gets fixed then my post will be redundant, and I would rather that they all stay as relevant as I can make them. Ironically, I was Findmypast’s biggest fan in their early days, and much favoured them over the other database providers.

[1] Image courtesy of Alan Chapman/Businessballs ( : accessed 8 Oct 2015).
[2] This asks for “First Name” and “Last Name”. Although “Given Name” and “Surname” would be more appropriate, I’m going to resist discussing that here. Suffice to say that their own help text mentions “Surname” rather than “Last Name”.