Thursday, 5 September 2013

Semantic Tagging of Historical Data

You may have heard about semantic tagging of data, especially in connection with something called the Semantic Web. Example schemes include RDF and microdata, among others. Have you thought deeply, though, about what is being tagged and how?

Semantic tagging is the adding of meta-data, i.e. data about data, in order to attach some meaning to words and phrases in digitised text. This, in turn, is necessary so that computers have a clue what they’re processing in that text. A typical example might be when searching through a text document. Without any of this meta-data, the document is just a bunch of words. The computer search doesn’t know whether any of those words refer to a person, place, date, event, or anything that has its own name. If you’re looking for someone with the surname ‘Butcher’ then the computer would otherwise have no indication of whether it has found a reference to a person or a profession. Even capitalisation doesn’t help because the word may be at the start of a sentence, or the reference being searched for may not constitute a proper noun.
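To make the ‘Butcher’ example concrete, here is a minimal Python sketch. The Token structure and the tag names ("person", "occupation") are assumptions invented for this illustration; they are not taken from RDF, microdata, or any other real scheme. It simply contrasts a plain word search with one that can consult a tag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    tag: Optional[str] = None  # hypothetical semantic tag, e.g. "person"


def plain_search(tokens: List[Token], word: str) -> List[int]:
    """Return the position of every token matching the word, ignoring tags."""
    return [i for i, t in enumerate(tokens) if t.text.lower() == word.lower()]


def tagged_search(tokens: List[Token], word: str, tag: str) -> List[int]:
    """Return the positions of matching tokens that also carry the given tag."""
    return [
        i for i, t in enumerate(tokens)
        if t.text.lower() == word.lower() and t.tag == tag
    ]


# Invented sentence: "Butcher Smith sold meat to Mr Butcher"
tokens = [
    Token("Butcher", "occupation"),
    Token("Smith", "person"),
    Token("sold"),
    Token("meat"),
    Token("to"),
    Token("Mr"),
    Token("Butcher", "person"),
]

print(plain_search(tokens, "Butcher"))             # [0, 6]
print(tagged_search(tokens, "Butcher", "person"))  # [6]
```

The plain search cannot tell the two occurrences apart, and capitalisation gives it no help; only the tag separates the surname from the trade.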

Dates are even more problematic. We all know that there are as many ways of writing a date as there are dates that can be written. If you want to search for a specific date, or even an approximate date, it is virtually impossible with a simple textual search. There are people with software backgrounds who believe that you can write a general parser that can recognise all the most common date formats. This may be true in isolation but when that date is embedded within narrative text then it’s rather optimistic. The impracticality of it becomes even more obvious when you consider the extra work you’d be expecting all searches to perform, and the fact that a date may not be written in English, or it may not even be a reference to the Gregorian calendar. There are many other calendar systems used for recording the passing of days – some ancient but some used to this day in other parts of the world.

Although more commonly associated with the content of Web pages, tagging can also have applications in archives and digital libraries. An obvious example might be a newspaper archive, although in practice these are mostly digitised as plain text. The problem is that OCR (Optical Character Recognition) can only generate plain text from a scan of the print – it cannot attach any of those semantics for you. That work is going to be a labour-intensive extra step that would dramatically increase the cost.

So, in principle, this all makes perfect sense. You should be able to see how it would greatly increase the accessibility of such text, and the power of our searches. Ah, but what happens when the text isn’t clearly readable, or the interpretation is ambiguous? An example of the first of these situations is where a sequence of one or more characters is uncertain. This is obviously important to your search. If you’re looking for the surname ‘Jesson’ but the printed word might have been ‘Jesson’ or ‘Jessen’ then you’d still want to see it. The second of the situations is slightly different though. The transcription may be accurate but the printed word may be ambiguous, or may have a spelling mistake. An occupation of “charrer”, for instance, might be someone who burns wood in the making of barrels, or it could be a misspelled version of “charer”, as in charwoman. Which did the original author mean? There’s a difference here in that the interpretation is becoming more subjective.
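The ‘Jesson’ or ‘Jessen’ case can be sketched in a few lines of Python. The notation used here is purely hypothetical and invented for this illustration: square brackets list the characters a doubtful letter might be, and an underscore stands for a single unreadable character. The sketch converts such a transcription into a regular expression so that a search term can be tested against every possible reading.

```python
import re


def transcription_to_regex(transcribed: str) -> re.Pattern:
    """Convert a transcription in the hypothetical notation to a regex.

    '[eo]' means "one of these characters" and '_' means "one unreadable
    character". Everything else is matched literally.
    """
    parts = []
    i = 0
    while i < len(transcribed):
        ch = transcribed[i]
        if ch == "[":
            # Raises ValueError if the closing bracket is missing.
            end = transcribed.index("]", i)
            options = transcribed[i + 1:end]
            parts.append("[" + re.escape(options) + "]")
            i = end + 1
        elif ch == "_":
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def could_match(transcribed: str, search_term: str) -> bool:
    """True if the search term is one possible reading of the transcription."""
    return transcription_to_regex(transcribed).fullmatch(search_term) is not None


print(could_match("Jess[eo]n", "Jesson"))  # True
print(could_match("Jess[eo]n", "Jessen"))  # True
print(could_match("Jess[eo]n", "Jessan"))  # False
```

Note that this only handles the first situation, uncertain characters. The “charrer” case is an uncertain interpretation of clearly readable text, and no amount of pattern matching will settle it.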

There are schemes that define mark-up (i.e. inline meta-data) to identify both of these situations. The FreeBMD project uses an Uncertain Character Format (UCF) notation of its own devising in order to identify unclear text: UCF Notation. Schemes such as TEI (Text Encoding Initiative) define mark-up for uncertain characters, uncertain interpretations, and other transcription anomalies.

There’s a hidden risk with this goal when it comes to user content on the Web, especially for genealogical and family history data; a risk that seems to have gone largely unnoticed. It is one thing to indicate a reference to a person or a place, but it is entirely another to indicate which person or which place. We describe this differentiation as “evidence and conclusion”, or E&C, and genealogists (if not historians in general) are very fussy about it.

And rightly so! A 19th century newspaper reference that simply read ‘William Elliott was fined 5s and costs for drunken behaviour’ obviously references a person, and could be used as evidence, but an identification with an actual person constitutes a conclusion. There are good reasons why a link from the reference to the details of an actual person would be useful, including the ability to search on their alternative names and other details, but the evidence and the conclusions must be clearly distinct. Different people may reach different conclusions when identifying a person reference, or a place reference. The original reference may be less precise, as in ‘Mr. Elliott’, or even personalised as in ‘Papa’ or ‘My grandmother’. These are still person references but the identification is subjective and must use context from outside that piece of text.

A similar case can easily be made for dates too. ‘Last Friday’ and ‘Next Christmas’ are references to dates but their interpretation requires context from elsewhere, such as a newspaper publication date or a letterhead. In the situation with dates, the printed or written form is the evidence whilst the computer-readable version (e.g. in ISO 8601 format) is the conclusion. In principle, the interpretation of even a fairly clear date reference such as ‘10 March 1923’ is still a conclusion. A less obvious instance concerns non-Gregorian calendars. Not all calendars have an agreed synchronisation throughout their range, and that means algorithmic conversions to the Gregorian calendar may be uncertain. Although a reference to an historical date in one of the older Hindu calendars may have been precise when written, the conversion to a Gregorian date may involve the addition of an error range (an upper and lower possibility) which then makes the conclusion very different to the evidence.
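The ‘Last Friday’ case can be illustrated with a short Python sketch. Everything in it is an assumption made for the illustration: the DateReference structure, its field names, the invented publication date, and the reading of ‘Last Friday’ as the most recent Friday strictly before publication. The point is only that the written form and its ISO 8601 interpretation are stored side by side, together with the context the interpretation relied on.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DateReference:
    evidence: str    # the text exactly as written
    conclusion: str  # an ISO 8601 interpretation of that text
    basis: str       # the outside context the interpretation relied on


def resolve_last_friday(published: date) -> date:
    """Return the most recent Friday strictly before the publication date."""
    # date.weekday() numbers Monday as 0, so Friday is 4.
    days_back = (published.weekday() - 4) % 7
    if days_back == 0:
        days_back = 7  # published on a Friday: step back a full week
    return published - timedelta(days=days_back)


published = date(1923, 3, 14)  # invented publication date (a Wednesday)
ref = DateReference(
    evidence="Last Friday",
    conclusion=resolve_last_friday(published).isoformat(),
    basis=f"newspaper published {published.isoformat()}",
)
print(ref.evidence, "->", ref.conclusion)  # Last Friday -> 1923-03-09
```

A different researcher, or a different publication date, would produce a different conclusion from exactly the same evidence, which is why the two should never be merged into a single value.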

Unfortunately, online family trees are virtually all conclusion with no supporting evidence. As someone who takes them all with a pinch of salt, it concerns me deeply that semantic tagging may give them the same weight as the non-associative references in a newspaper archive.

When the STEMMA® Data Model was first conceived, one of its primary goals was comprehensive support for narrative content, both for transcribed evidence and for written conclusions and rationale. Its support for transcription anomalies may be found at Recording Evidence. When it came to the support for semantic mark-up, though, it set out to clearly separate objective evidence from subjective conclusion. It introduced the terms shallow semantics and deep semantics to refer, respectively, to the nature of a datum (whether it’s a reference to a person, place, event, or date) and the association with a conclusion (which person, place, event, or date). A brief summary can be found at: Semantic Mark-up.
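The distinction between shallow and deep semantics might be pictured with the following Python sketch. This is not STEMMA syntax, and none of the class names, field names, identifiers, or researchers come from the STEMMA Data Model; they are all hypothetical. It shows one possible way of keeping the objective part (this text refers to a person) apart from any number of subjective identifications (which person).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identification:
    """Deep semantics: a subjective link to one specific entity."""
    entity_id: str   # hypothetical identifier for a person, place, etc.
    researcher: str  # who reached this conclusion
    rationale: str   # why they reached it


@dataclass
class Reference:
    """Shallow semantics: the kind of thing a piece of text refers to."""
    text: str  # the evidence, exactly as transcribed
    kind: str  # "person", "place", "event" or "date"
    identifications: List[Identification] = field(default_factory=list)

    def identify(self, entity_id: str, researcher: str, rationale: str) -> None:
        """Attach a conclusion without altering the evidence."""
        self.identifications.append(
            Identification(entity_id, researcher, rationale)
        )


ref = Reference(text="William Elliott", kind="person")

# Two invented researchers reach different conclusions from the same evidence.
ref.identify("person-0017", "Researcher A", "Lived in the same parish")
ref.identify("person-0342", "Researcher B", "Matching age in a later census")

for ident in ref.identifications:
    print(ref.text, "->", ident.entity_id, "according to", ident.researcher)
```

Because the identifications hang off the reference rather than replacing it, the transcribed text stays untouched however many competing conclusions are attached to it.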

The relevance of this to online family trees is fairly clear but comprehensive support for narrative text is currently outside of the capabilities of most genealogy software products. This is unfortunate and another of the driving forces behind the STEMMA R&D project. The net effect is that it’s difficult to produce a convincing argument about the categorisation of mark-up, and the use of mark-up in general, that people can relate to. In order to illustrate this better to genealogists, an end-to-end example was published at: Structured Narrative. This uses a transcribed 19th century letter from a child to her father and takes the reader from a mere scan to a marked-up transcription, and finally to an example depiction on the computer screen.