Parallax View ®: November 2013

Saturday 30 November 2013

Is That a Fact?

When you record such things as a name, age, occupation, place-of-birth, etc., do you refer to them as ‘facts’ or something else? Are they held as simple text values in your database? Have you thought about the true nature of those data items?

As usual in the digital side of genealogy, we have a plethora of alternative terms for the same thing, and ambiguous interpretations of the more common terms. Genealogists are encouraged to refer to these data items as ‘facts’, although I have already made the point in Evidence and Where to Stick It that their facticity is dependent upon the source from which they came. A number of software developers prefer the term ‘PFACT’, which stands for property, fact, attribute, characteristic, or trait. However, this is squandering five perfectly good words – each with distinct meanings in normal usage – and so reducing the possibility of any of them being given distinct genealogical uses. I will be employing the more generic STEMMA® term of ‘Properties’ in this post.

So, what is a Property? You might say that it is an item of evidence[1] taken from a given source of information. This is a fair description, but as soon as you acknowledge that a Property is “an extracted and summarised item of information” then a number of issues have to be considered and solved for their digital representation. What I’m about to present is my own approach since, to date, I’m not aware of any product that tackles all of these issues.

Foremost amongst the issues – and yet rarely discussed in the context of Properties – is the difference between what was written and your interpretation of it. Although this is a fundamental part of supporting evidence and conclusion, or E&C, I need to clarify that, here, this is purely the analysis and interpretation of each item rather than building them into any proof argument; that being a separate phase. For instance, if a place name has been misspelled, or is hard to read, then you need to record it as it was written (indicating any uncertain characters) together with your interpretation of what it should have been. In effect, each Property has two distinct values: the recorded one, including any transcription anomalies, and the interpreted one. As with any form of conclusion-making, you’ll also need a way to add any explanatory notes, and possibly add some level of confidence in your result. I will come back to this duality of Properties in a moment.
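For any code-junkies, the duality described above might be sketched as follows. This is only a minimal illustration in Python and not STEMMA syntax; the class name, the field names, the 0 to 1 confidence scale, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyValue:
    """One Property holding both the recorded and the interpreted value."""
    name: str                            # e.g. "BirthPlace" (illustrative name)
    recorded: str                        # verbatim transcription, anomalies included
    interpreted: Optional[str] = None    # our reading; None if no conclusion yet
    note: str = ""                       # explanation of how the reading was reached
    confidence: Optional[float] = None   # assumed scale: 0.0 (guess) to 1.0 (certain)

    def is_amended(self) -> bool:
        """True when an interpretation exists and differs from what was written."""
        return self.interpreted is not None and self.interpreted != self.recorded


# Hypothetical example: a misspelled place name kept exactly as written.
birth_place = PropertyValue(
    name="BirthPlace",
    recorded="Notingham",
    interpreted="Nottingham",
    note="Single 't' in the source; taken to be a misspelling.",
    confidence=0.9,
)
```

The point of the design is that nothing is overwritten: the recorded form survives alongside the interpretation, and the note explains the step between them.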

All Property values are implicitly associated with a particular time and place. For instance, someone’s name may have changed during their life, and someone’s age will certainly have changed over time. STEMMA copes with this because the Properties are associated with specific Event-to-Person connections[2] in the data, and the Event entity implicitly provides a relevant date for the interpretation and applicability of the value.

Another issue to consider is the nature of the Property. Is it the name of something (e.g. a person or place), a description (e.g. cause of death), a date, or a measure of something (e.g. age, height, weight)? This is termed its data-type. The importance of it lies with the interpreted value (rather than the written value) which should be computer-readable in order to make the most use of it. Whilst I acknowledge that there may be detractors to this statement, let me try and make a number of observations to justify it.

For the simple expedient of consistency checking, software needs to know whether a value should be textual, numeric (integer or real), or a date. More than this, though, a value such as a date can be used in a timeline, and an age can be used to derive dates and to separate events, so their values should be accessible to software. In the case of a person or place reference, these can be linked (using some type of pointer mechanism) to the corresponding Person or Place entity in the data. That linkage, which is as much a conclusion as the interpreted value of any date, is required in order to allow you to follow the reference to the entity’s details. However, the duality of the Property values doesn’t require you to change the name from how it was recorded at that time. Finally, in certain cases, a Property may have a representation that doesn’t correspond to a value in the normal sense, either because the written form was undecipherable or it had a special meaning. For instance, the use of “Full Age” for a young married couple, or “Unknown”, “N/A”, or “LNU” for an unknown name, are special non-values. There’s a golden rule that you do not record anything in a name field that isn’t actually a name[3]. Being able to distinguish the recorded form from an interpreted form avoids this issue.
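As a rough sketch of these ideas, the interpreted value could be given a data-type that also allows for special non-values and for links to other entities. The names below (NonValue, PersonRef, TypedProperty) and the sample keys are hypothetical, chosen only for illustration, and are not taken from any product or from STEMMA itself.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Union


class NonValue(Enum):
    """Interpretations that are deliberately not a value in the normal sense."""
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not applicable"
    ILLEGIBLE = "illegible"
    FULL_AGE = "of full age"


@dataclass(frozen=True)
class PersonRef:
    """A pointer to a Person entity held elsewhere in the data."""
    person_key: str


# The interpreted value is typed; the recorded value is always plain text.
Interpreted = Union[str, int, float, date, PersonRef, NonValue]


@dataclass
class TypedProperty:
    name: str
    recorded: str
    interpreted: Interpreted

    def has_real_value(self) -> bool:
        """False when the interpretation is one of the special non-values."""
        return not isinstance(self.interpreted, NonValue)


# "LNU" is kept as the recorded text but never becomes a surname.
surname = TypedProperty("Surname", recorded="LNU", interpreted=NonValue.UNKNOWN)
groom_age = TypedProperty("Age", recorded="Full Age", interpreted=NonValue.FULL_AGE)
# The name stays as written; the link to the Person is the conclusion.
father = TypedProperty("Father", recorded="Jno. Smith", interpreted=PersonRef("pJohnSmith"))
```

Software can then skip non-values when sorting or matching names, and follow a PersonRef to the entity's details, without ever altering what was written.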

If a Property is a measure of something, such as a height or weight, then the interpreted value needs to identify the units. In all but one case, it is debatable whether or not software will want to make use of these units themselves as opposed to simply distinguishing values held in different units. That exception involves the age of a person. Ages are normally recorded in years, but ages in months, weeks, or even days, are quite common for infant deaths. These may also be fractional rather than integer values, e.g. “3 ½ weeks”.
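To illustrate why the units of an age matter to software, here is a small Python sketch that holds a fractional age with its units and uses it to estimate a birth date from the date of an Event. The conversion factors for months and years are deliberate approximations, and the names and the sample burial date are invented for the example.

```python
import math
from dataclasses import dataclass
from datetime import date, timedelta
from fractions import Fraction

# Approximate day counts per unit; months and years are averages, not exact.
APPROX_DAYS = {
    "days": Fraction(1),
    "weeks": Fraction(7),
    "months": Fraction(1461, 48),   # 365.25 / 12
    "years": Fraction(1461, 4),     # 365.25
}


@dataclass(frozen=True)
class Age:
    amount: Fraction   # fractional ages such as 3 1/2 are kept exactly
    unit: str          # one of the keys of APPROX_DAYS

    def approx_days(self) -> Fraction:
        return self.amount * APPROX_DAYS[self.unit]


def estimate_birth_date(age: Age, event_date: date) -> date:
    """Subtract the age (rounded down to whole days) from the Event date."""
    return event_date - timedelta(days=math.floor(age.approx_days()))


# Hypothetical infant burial: age recorded as "3 1/2 weeks".
infant = Age(Fraction(7, 2), "weeks")
burial = date(1880, 3, 25)
estimated_birth = estimate_birth_date(infant, burial)
```

An age of 3 ½ weeks is 24.5 days, so the estimate here falls 24 days before the burial. For ages in years the result is only a guide to a range of possible dates, which is all that such evidence can support.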

Some Properties are necessarily multi-valued. The most obvious case is a Role (i.e. the part a Person plays in an Event). For instance, a witness at a wedding may also have been a relative of either the bride or the groom. A computer representation must accommodate multiple values, and support the duality for each instance.
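A multi-valued Property can be sketched as a list of values, each carrying its own recorded and interpreted form. The class names and the role names used here are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Value:
    recorded: str      # as written in the source
    interpreted: str   # normalised reading


@dataclass
class MultiValuedProperty:
    name: str
    values: List[Value] = field(default_factory=list)

    def interpreted_values(self) -> List[str]:
        return [v.interpreted for v in self.values]


# A wedding witness who was also a relative of the groom.
role = MultiValuedProperty("Role", [
    Value(recorded="witness", interpreted="Witness"),
    Value(recorded="brother of the groom", interpreted="Groom.Brother"),
])
```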

It would be folly to try and enumerate all possible Properties in advance of them being used. Different researchers, different sources, and different cultures, may all result in unanticipated Properties having to be recorded. What is required, therefore, is a scheme that allows custom Properties to be freely defined without some onerous, centralised registration process, and yet still allows those custom Properties to be loaded by any compliant product. This is certainly possible but it is such a widespread requirement – applying to many types, subtypes, and other sets of named values – that I plan to write about it separately.

If you’re still with me then you’re probably about to say ‘this is way too complicated, Tony’. Before you finish preparing your response, though, consider these points:

We cannot assume that a recipient of your data has access to the same online images, and the T&Cs that you’ve checked probably prohibit you from sharing your images. Also, if you’re one of the minority who still visit archives, etc., then the originals may be locked away, and not copiable or online at all. In other words, our transcriptions can be invaluable. Hence, if we take shortcuts with those transcriptions – even for mere Properties – and assume that we know what the author meant without recording things verbatim (or even literatim), or fail to mention crossings-out and other annotation, then we’re diluting that effort and short-changing some later recipient.
Do we want our genealogy products to simply record what we type in? If so then we might as well just use a word-processor. Providing more detail, and making it machine-readable, means that our products can work with the data to provide such things as analysis and consistency checking.

I’ll close by providing some links to a couple of worked examples in STEMMA for any code-junkies: Transcription Anomalies and Census Roles. Between them, these deal with many of the cases discussed here, including transcription anomalies, spelling errors, clarifications, and mis-recorded information.

[1] If anyone wants to comment that the evidence in any given source is more than a set of discrete values then I entirely agree. There is usually much context and information that cannot be distilled down to simple values. What we’re discussing here is just the digested pieces of information that many genealogists store in their databases, but also acknowledging that this alone is not fully representative.

[2] For historical references to places, the corresponding STEMMA Properties would be associated with Event-to-Place connections.

[3] This issue is covered in excellent detail by Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml : accessed 22 Nov 2013); also his previous works: “The Lnu Family Mystery”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/TheLnuFamilyMystery.xhtml : accessed 22 Nov 2013); “Unk is a Real Name”, Modern Software Experience, 10 Aug 2013 (http://www.tamurajones.net/UnkIsARealName.xhtml : accessed 22 Nov 2013).

Sunday 24 November 2013

Evidence and Where to Stick It

… so to speak. Do you attach census pages to people in your tree as evidence of a birth date? If so then there is a good chance that you are currently attaching items to the wrong entities in your data. Read about the pitfalls that we all face when associating evidence, why we often do this incorrectly, and what the future holds for us.

If you find one of your ancestors in a census page, do you cite that page as evidence of where they lived, or their date of birth, or their place of birth? It may yield such evidence but then what do you attach the census information to? Many people would add a citation in the details of that person, and also attach any image of the census page directly to that person, but this is demonstrably wrong. Although this sort of evidence can be gleaned from a census record, the actual record wasn’t generated as a proof of any of those items. In effect, you would be confusing the relevance of extracted information (e.g. something about a given person) with the nature of the source itself (i.e. what the record was originally intended for). This may be a subtle point but it has profound implications when modelling the real-life data relationships.

This issue is as much about data organisation as about philosophy so let’s just take a moment to look at some simple practical problems resulting from this common approach.

The chances are that the same census page includes other family members and relatives so do you duplicate this operation for every member? You cannot always attach a scan to some type of family record since the members may be more distant or loosely-connected relatives, or they may even be unrelated until some later marriage.

We’re not just talking about census pages either. Consider a marriage certificate. You might be attaching details (citation or scan) to both the bride and the groom, but what about their fathers? Both of these would be mentioned on many certificates so you might be able to glean evidence of their names and occupations too. Do you also transcribe the marriage date and record it separately in the timeline for each of these individuals? If you had initially misread the date because it was so faint then does that mean you have proliferated the error? There’s always the very real risk, too, that you may have picked the wrong marriage and need to undo those associations.

The same issue applies to a birth certificate since it probably contains the parents’ names (whether married or not) and the father’s occupation. You may be surprised but this issue even applies to photographs. For instance, I have a group photograph of my grandparents and their family that was printed in a Nottingham newspaper in the 1950s. Do I attach that image to every one of those people in my data, together with details on where and when it was taken, plus the newspaper citation, plus the newspaper caption that went with it? Your choice of software makes a big difference in how serious an issue this is to you, but there is a better way.

[…come on, Tony, get to the point…]

OK, I think the astute readers have already guessed where I’m taking this, especially since I have set the stage by writing about the importance of events in recent blog posts[1][2]. The thing is that the vast majority of our evidence – if not all of it – relates to events; things that happened in a particular place at a particular time. The people involved in all of the cases illustrated here are sharing certain events (i.e. a census, a marriage, a birth, and a family group outing). The record (or document, or artefact) details are therefore best associated with the Event entity in the data, and the relevant Person entities linked to the Event with their respective roles.

Multi-Person Events were described in more detail in my previous posts, but where does that leave the information extracted from one of these event-orientated data sources? For instance, if source details are now associated with an Event then how do you associate the items of extracted information (i.e. Properties[3]) with each of the Persons sharing that Event?

This figure illustrates a marriage event using similar symbols to those of my previous posts. We can see that bride and groom are both connected to the Event entity, as well as the pair’s fathers, and they would each be distinguished by their respective Role[4]. The source details (citation, image, etc.) are associated only with the Event, but the Properties – the items of extracted information that are relevant to each of those Persons – are associated with the individual Event-to-Person connections, not specifically the Persons or the Event. This is important since an Event may have more than one Person connected to it, and each Person will have more than one Event connected to it.

This natural factoring of the data results in less duplication and redundancy, but at no loss of information. What we’re avoiding is dumping the same source details on every associated Person simply because that source yields some evidence about them. When several Persons are sharing the same Event, different Properties may be derived for each of them but the source information as a whole describes the Event.
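The factoring just described might look something like the following in Python. It is a simplified sketch and not the STEMMA representation: the class names, keys, and Property values are made up for the example, and only the Role names follow the ones suggested in footnote [4].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:
    key: str


@dataclass
class Connection:
    """An Event-to-Person connection; the extracted Properties live here."""
    person: Person
    role: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Event:
    key: str
    sources: List[str] = field(default_factory=list)   # citation, image, etc.
    connections: List[Connection] = field(default_factory=list)

    def connect(self, person: Person, role: str, **properties: str) -> None:
        self.connections.append(Connection(person, role, dict(properties)))


def events_for(person: Person, events: List[Event]) -> List[Event]:
    """All Events that a given Person is connected to."""
    return [e for e in events if any(c.person is person for c in e.connections)]


bride, groom = Person("pBride"), Person("pGroom")
bride_father, groom_father = Person("pBrideFather"), Person("pGroomFather")

# The source is attached once, to the Event, however many people share it.
marriage = Event("eMarriage", sources=["Marriage certificate (citation and scan)"])
marriage.connect(bride, "Bride", Age="21")
marriage.connect(groom, "Groom", Age="23", Occupation="Miner")
marriage.connect(bride_father, "Bride.Father", Occupation="Framework knitter")
marriage.connect(groom_father, "Groom.Father", Occupation="Miner")
```

Undoing a wrongly chosen marriage then means removing one Event rather than hunting down copies of the same citation on four separate people.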

For any code-junkies, an example of how this is represented in STEMMA® can be found at Single Source Events, and a further example involving multiple sources for the same event at Multi-Source Events.

Of course, not all software can actually do this since it requires support for shared Events. You’re probably thinking, though, ‘what if I have conflicting Properties such as a date-of-birth?’. We all know that we may get conflicting Properties from different sources, but I haven’t changed that by describing this approach. The final set of conclusion Properties that you associate with each Person will be the result of assessing the aggregated evidence for them – the evidence Properties from each of the Events in their timeline. This has always been the case since no one has multiple dates of birth! All I’ve done is re-factor the sources and the evidence.
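As a sketch of that assessment step, the evidence Properties for one Person can be gathered across all of their Events so that any conflicts are visible side by side. The tuple layout, keys, and values below are entirely hypothetical, and the code only collects the evidence; weighing it and reaching the conclusion remains the researcher's job.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (event key, person key, Property name, interpreted value) - invented data.
Evidence = Tuple[str, str, str, int]

evidence: List[Evidence] = [
    ("eCensus1881", "pJohn", "BirthYear", 1851),
    ("eCensus1891", "pJohn", "BirthYear", 1852),
    ("eMarriage1875", "pJohn", "BirthYear", 1851),
    ("eCensus1881", "pMary", "BirthYear", 1855),
]


def gather(items: List[Evidence], person_key: str, prop_name: str) -> Dict[int, List[str]]:
    """Group the Events supporting each distinct value of one Property."""
    found: Dict[int, List[str]] = defaultdict(list)
    for event_key, p_key, name, value in items:
        if p_key == person_key and name == prop_name:
            found[value].append(event_key)
    return dict(found)


candidates = gather(evidence, "pJohn", "BirthYear")
has_conflict = len(candidates) > 1
```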

So what’s the advantage of this? Well, apart from avoiding unnecessary redundancy, the scheme is modelling the true nature of the data relationships. This is what I meant by the issue being partly philosophical above. Perhaps more important, though, is that your data is then organised according to the natural timeline. If you want to present a timeline, either in a report or on your screen, then it doesn’t have to be forced, and it can accommodate the lives of multiple people when necessary.

Unfortunately, the future looks a little bleak for this. There are many people who do not adopt this approach, either because their software cannot handle it or because it’s not the way that they were taught, and their data will never change. Even if their software improves or changes, and even if some new data standard emerges that better models real-life, then their data cannot be re-factored automatically to become better organised. It’s set in stone.

By far the biggest issue, though, is the use of so-called collaborative online family trees. When these models actually accommodate sources, and when their contributors actually enter them, then they have no choice but to associate them directly with Person entities because that’s all they have. They’re not representing event-based history and so their misplaced sources will be inherited by anyone copying from them. This does not bode well for anyone wanting to adopt an event-based approach to family history, or even micro-history. Genealogy’s preoccupation with mere family trees will continually pollute the waters.

[1] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html).

[2] See “Eventful Genealogy - Part II”, Blogger.com, Parallax View, 6 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy-part-ii.html).

[3] ‘Properties’ is the terminology adopted by STEMMA for items of extracted and summarised information such as a date-of-birth (see http://parallaxview.co/stemma/home/document-structure/person/properties). I feel strongly that the word ‘facts’ is misleading since the possibility of something being factual depends on the nature of the associated source. I also do not like the software term of ‘PFACT’, which stands for property, fact, attribute, characteristic, or trait.

[4] The Roles might be something like Bride, Groom, Bride.Father, and Groom.Father in this case. I will discuss how extensible roles can easily be accommodated in a future post.

Wednesday 20 November 2013

The Future Representation of the Past

I want to explore the way we represent the past in our computerised data, and to question whether this is good enough. What do we really want to represent? Are we constrained by technology or by convention?

Asked who I am, people would probably say ‘Oh yeah, he’s the one that writes those technical blog posts that no one understands’. Some might even associate me with STEMMA®. However, there’s a rationale and a message behind my work so I want to pull it all together for this special post. Hopefully, people will then understand me a little better, even if the technical stuff is still over the rainbow somewhere.

Family trees are a very limited form of data that I’ve sometimes described as ‘genealogy in its literal sense’. They describe the lineage of a number of related people. Although they typically include the dates of vital events, they do not try to create a history from their data. Many people new to genealogy assume that they have to create a family tree because they do not realise that there’s anything more[1]. A huge amount of marketing talks specifically about “family trees”.

Most genealogists realise that they need to capture the history of a family in order to create a picture of their lives, and maybe to understand how they, themselves, came into being. Although genealogy as a discipline is widely considered to incorporate such history, the term family history is sometimes preferred in order to emphasise the nature of that pursuit.

However, restricting ourselves to family history alone is rather artificial. The history of the places they lived in, of the occupations they worked in, of their neighbouring families, and even of world events, will have had an impact on their lives. Also, they will undoubtedly have played a part in some of those events themselves. Although you may not use the term micro-history yourself, this is what that study would be described as[2].

This is all very well and experienced genealogists may be nodding in agreement (hopefully). When a genealogist performs their research, they will try to assimilate all relevant data and write it up according to their professional standards. My short study on Bendigo’s Ring[3] was partially designed as a case in point since it is specifically – and unusually – about a place rather than a person or a family, but the research principles are exactly the same. A traditional research report would have no trouble in representing this case, but what about our software products? That’s a different story.

Unfortunately, our software is mostly preoccupied with the representation of a family tree. Even when it allows you to enter historical notes, the framework used is still that of a family tree. If you want to generate a timeline then this has to be inferred from your tree-based data because it was never entered using a timeline paradigm.

I realised, very early on, that none of the products I’d looked at would be useful to me. I wanted to be able to record micro-history, not just family history, and certainly not just a family tree. It shouldn’t matter whether I wanted to record information about a family, or a person (related or not), or a place, or a surname, or specific events – I wanted to be able to record all such data in a structured way for the computer. In other words, I wanted something much richer than a simple narrative report. As a result, I began an R&D project back in 2011, later to be named STEMMA[4], to define a culturally-neutral computer representation for micro-history (if not generic history) and implement the associated software.

That project is still ongoing but it has already demonstrated that this goal is achievable; it’s not unrealistic at all. Although its data model is still being refined, I intend to use it to represent my own data. This rather puts me out on a limb but I have no choice in the current climate. In the meantime, I have used this blog to try and raise people’s awareness of what could be possible if we took a step back. The STEMMA Web site makes freely-available the current specification, my research notes, various downloads, and a number of example case studies for anyone with a coding background.

So do I expect the STEMMA research to affect the software market? I’m rather pessimistic about this. Ideally, an organisation such as FHISO could try and build the underlying concepts into a more powerful, and standardised, data model but then software vendors are unlikely to take up the challenge of moving towards a micro-history approach[5]. There is no demonstrated market for this enhanced scope so why would they take the risk? I feel the situation is possibly chicken-and-egg since there’s no precedent upon which people’s expectations can be lifted.

So what would I consider to be the essential elements of a data model that would be more aligned with a representation of micro-history?

Events are an essential element. These must be top-level entities, be shared (i.e. allow multiple people to be associated with each Event), and have sufficient internal structure to be able to model real-life events (i.e. durations and hierarchical arrangement)[6] [7].
Structured Narrative. Rather than plain-text notes, the model must make copious use of rich-text with semantic mark-up (i.e. inline meta-data). This must cater both for new narrative and for transcriptions of evidence. It must allow references to persons, places, events, and dates to be clearly marked, and linked to other data entities when relevant. It must also support citations and general reference notes, and support transcription anomalies such as marginalia, uncertain characters, original emphasis, and interlinear/intralinear notes[8] [9].
Place Support. The model must treat Persons and Places on an equal footing[10], and support a hierarchical organisation of Place entities (not just an issue of how to name them)[11].
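As one small illustration of the last of these elements, a hierarchical organisation of Place entities could be sketched as below. The class and method names are assumptions for the example rather than anything defined by STEMMA, and a real model would also need dates on the parent links since place hierarchies change over time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Place:
    name: str
    parent: Optional["Place"] = None   # the enclosing Place, if any

    def lineage(self) -> List[str]:
        """Names from this Place outwards to the top of the hierarchy."""
        names: List[str] = []
        node: Optional[Place] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


england = Place("England")
nottinghamshire = Place("Nottinghamshire", england)
nottingham = Place("Nottingham", nottinghamshire)
```

Because each Place is an entity in its own right, it can have its own Events, sources, and narrative, just as a Person can.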

If I could only make one final blog post then this would be a close candidate as it defines me as well as my work.

[1] See “OK, I have a Family Tree. Now What?”, Blogger.com, Parallax View, 5 Oct 2013 (http://parallax-viewpoint.blogspot.com/2013/10/ok-i-have-family-tree-now-what.html).

[2] See “Micro-history for Genealogists”, Blogger.com, Parallax View, 30 Oct 2013 (http://parallax-viewpoint.blogspot.com/2013/10/micro-history-for-genealogists.html).

[3] See “Where is Bendigo's Ring?”, Blogger.com, Parallax View, 15 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/where-is-bendigos-ring.html).

[4] STEMMA (Source Text for Event and Ménage MApping) R&D Project, Family History Data (STEMMA).

[5] See “Commercial Realities of Data Standards”, Blogger.com, Parallax View, 26 Aug 2013 (http://parallax-viewpoint.blogspot.com/2013/08/are-we-modelling-data-or-commerce.html).

[6] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html).

[7] See “Eventful Genealogy - Part II”, Blogger.com, Parallax View, 6 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy-part-ii.html).

[8] See “Semantic Tagging of Historical Data”, Blogger.com, Parallax View, 5 Sep 2013 (http://parallax-viewpoint.blogspot.com/2013/09/semantic-tagging-of-historical-data.html).

[9] Tony Proctor, “A Story of Olde”, Family History Data (StructuredNarrative.pdf).

[10] This issue is discussed at http://parallaxview.co/stemma/research-notes/persons-places#Similarities, but an example treatment of a Place in STEMMA can be found at http://parallaxview.co/stemma/data-model/more-case-studies#CSPlaces.

[11] See “A Place for Everything”, Blogger.com, Parallax View, 19 Aug 2013 (http://parallax-viewpoint.blogspot.com/2013/08/a-place-for-everything.html).