You may have heard about semantic tagging of data,
especially in connection with something called the Semantic Web. Example
schemes include: schema.org, RDF, historical-data.org, microformats.org, and microdata. Have
you thought deeply, though, about what is being tagged and how?
Semantic tagging is the addition of meta-data, i.e. data about
data, in order to attach some meaning to words and phrases in digitised text.
This, in turn, is necessary so that computers have a clue what they’re
processing in that text. A typical example might be when searching through a
text document. Without any of this meta-data, the document is just a bunch of
words. The computer search doesn’t know whether any of those words refer to a
person, place, date, event, or anything that has its own name. If you’re
looking for someone with the surname ‘Butcher’ then the computer would
otherwise have no indication of whether it has found a reference to a person or
a profession. Even capitalisation doesn’t help because the word may be at the
start of a sentence, or the reference being searched for may not constitute a
proper noun.
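To make the ‘Butcher’ case concrete, here is a minimal Python sketch. The `Tag` structure, the tag kinds, and the sample sentence are all invented for illustration and don't follow any of the schemes named above; the point is only that a plain search finds both words while a tag-aware search can tell them apart.

```python
import re
from dataclasses import dataclass

@dataclass
class Tag:
    start: int  # offset of the first tagged character
    end: int    # offset just past the last tagged character
    kind: str   # hypothetical categories: "person", "occupation", "place"

text = "Butcher was fined 5s. His brother was a butcher in Leeds."

def tag_for(phrase: str, kind: str) -> Tag:
    """Build a tag for the first case-sensitive occurrence of phrase."""
    start = text.find(phrase)
    return Tag(start, start + len(phrase), kind)

# Hypothetical stand-off meta-data; a real scheme would embed or link it.
tags = [
    tag_for("Butcher", "person"),
    tag_for("butcher", "occupation"),
    tag_for("Leeds", "place"),
]

def plain_search(text: str, word: str) -> list:
    """Offsets of every case-insensitive match, whatever the word means."""
    return [m.start() for m in re.finditer(re.escape(word), text, re.IGNORECASE)]

def tagged_search(text: str, tags: list, word: str, kind: str) -> list:
    """Offsets of matches that were also tagged with the requested kind."""
    return [t.start for t in tags
            if t.kind == kind and text[t.start:t.end].lower() == word.lower()]

print(plain_search(text, "Butcher"))                   # surname and trade alike
print(tagged_search(text, tags, "Butcher", "person"))  # the surname only
```

Note that the tags here only say what sort of thing each word is, not which person or which place is meant.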
Dates are even more problematic. We all know that there are
as many ways of writing a date as there are dates that can be written. If you
want to search for a specific date, or even an approximate date, it is
virtually impossible with a simple textual search. There are people with
software backgrounds who believe that you can write a general parser that can
recognise all the most common date formats. This may be true in isolation but
when that date is embedded within narrative text then it’s rather optimistic.
The impracticality of it becomes even more obvious when you consider the extra
work you’d be expecting all searches to perform, and the fact that a date may
not be written in English, or it may not even be a reference to the Gregorian
calendar. There are many other calendar systems
used for recording the passing of days – some ancient but some used to this day
in other parts of the world.
Although more commonly associated with the content of Web
pages, tagging can also have applications in archives and digital libraries. An
obvious example might be a newspaper archive, although in practice these are
mostly digitised as plain text. The problem is that OCR
(Optical Character Recognition) can only generate plain text from a scan of the
print – it cannot attach any of those semantics for you. That work is going to
be a labour-intensive extra step that would dramatically increase the cost.
So, in principle, this all makes perfect sense. You should
be able to see how it would greatly increase the accessibility of such text,
and the power of our searches. Ah, but what happens when the text isn’t clearly
readable, or the interpretation is ambiguous? An example of the first of these
situations is where a sequence of one-or-more characters is uncertain. This is
obviously important to your search. If you’re looking for the surname ‘Jesson’
but the printed word might have been ‘Jesson’ or ‘Jessen’ then you’d still want
to see it. The second of the situations is slightly different though. The
transcription may be accurate but the printed word may be ambiguous, or may
have a spelling mistake. An occupation of “charrer”, for instance, might be
someone who burns wood in the making of barrels, or it could be a misspelled
version of “charer”, as in charwoman. Which did the original author mean?
There’s a difference here in that the interpretation is becoming more
subjective.
There are schemes that define mark-up (i.e. inline meta-data) to identify both
of these situations. The FreeBMD project uses an Uncertain Character Format
(UCF) notation of its own devising in order to identify unclear text: UCF
Notation. The schemes TEI (Text Encoding Initiative) and igenie.org both
define mark-up for uncertain characters, uncertain interpretations, and other
transcription anomalies.
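As a rough illustration of why marking uncertain characters matters to a search, the Python sketch below uses a deliberately simplified notation of my own: `[eo]` means "one of these characters" and `_` means "one unreadable character". It is not the actual UCF or TEI syntax, and the function names are hypothetical.

```python
import re

def pattern_from_transcription(marked: str) -> re.Pattern:
    """Turn a transcription in the simplified notation into a regex."""
    parts = []
    i = 0
    while i < len(marked):
        ch = marked[i]
        if ch == "[":
            # A bracketed group lists the candidate readings of one character.
            close = marked.index("]", i)
            candidates = marked[i + 1:close]
            parts.append("[" + "".join(re.escape(c) for c in candidates) + "]")
            i = close + 1
        elif ch == "_":
            # An underscore stands for exactly one unreadable character.
            parts.append(".")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def could_match(marked: str, query: str) -> bool:
    """True if the query is one possible reading of the transcription."""
    return pattern_from_transcription(marked).fullmatch(query) is not None

print(could_match("Jess[eo]n", "Jesson"))  # True: 'o' is a candidate reading
print(could_match("Jess[eo]n", "Jessen"))  # True: so is 'e'
print(could_match("Jesson", "Jessen"))     # False: nothing marked as uncertain
```

A transcription that records the doubt lets a search for either spelling find the entry; one that silently picks a reading does not.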
There’s a hidden risk with semantic tagging when it comes to user
content on the Web, especially for genealogical and family history data; a risk
that seems to have gone largely unnoticed. It is one thing to indicate a
reference to a person or a place, but it is entirely another to indicate which
person or which place. We describe this differentiation as “evidence and
conclusion”, or E&C, and genealogists (if not historians in general) are
very fussy about it.
And rightly so! A 19th century newspaper
reference that simply read ‘William Elliott was fined 5s and costs for drunken
behaviour’ obviously references a person, and could be used as evidence, but an
identification with an actual person constitutes a conclusion. There are good
reasons why a link from the reference to the details of an actual person would
be useful, including the ability to search on their alternative names and other
details, but the evidence and the conclusions must be clearly distinct.
Different people may reach different conclusions when identifying a person
reference, or a place reference. The original reference may be less precise, as
in ‘Mr. Elliott’, or even personalised as in ‘Papa’ or ‘My grandmother’. These
are still person references but the identification is subjective and must use
context from outside that piece of text.
A similar case can easily be made for dates too. Phrases such as
‘Last Friday’ or ‘Next Christmas’ are references to dates but their
interpretation requires context from elsewhere, such as a newspaper publication
date or a letterhead. In the situation with dates, the printed or written form
is the evidence whilst the computer-readable version (e.g. in ISO 8601 format) is the
conclusion. In principle, the interpretation of even a fairly clear date
reference such as ‘10 March 1923’ is still a conclusion. A less obvious
instance concerns non-Gregorian calendars. Not all calendars have an agreed
synchronisation throughout their range, and that means algorithmic conversions
to the Gregorian calendar may be uncertain. Although a reference to an historical
date in one of the older Hindu calendars may have been precise when written,
the conversion to a Gregorian date may involve the addition of an error range
(an upper and lower possibility) which then makes the conclusion very different
to the evidence.
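One way to picture the split for dates is sketched below in Python. The `DateReference` class, its field names, and the issue date are my own assumptions for illustration; the reading of ‘Last Friday’ as the most recent Friday strictly before publication is itself a conclusion that another reader might dispute. The second example uses invented bounds purely to show the shape of an uncertain conversion.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class DateReference:
    written: str    # evidence: the words exactly as printed or written
    earliest: date  # conclusion: earliest day the words could mean
    latest: date    # conclusion: latest day the words could mean
    basis: str      # the outside context or reasoning behind the conclusion

    def iso(self) -> str:
        """ISO 8601 form: one date, or a start/end interval if uncertain."""
        if self.earliest == self.latest:
            return self.earliest.isoformat()
        return f"{self.earliest.isoformat()}/{self.latest.isoformat()}"

FRIDAY = 4  # date.weekday() counts Monday as 0

def resolve_last_friday(published: date) -> date:
    """Most recent Friday strictly before the publication date (an assumption)."""
    days_back = (published.weekday() - FRIDAY) % 7 or 7
    return published - timedelta(days=days_back)

published = date(1923, 3, 14)  # hypothetical newspaper issue date
friday = resolve_last_friday(published)
ref = DateReference("Last Friday", friday, friday,
                    basis=f"relative to issue dated {published.isoformat()}")
print(ref.written, "->", ref.iso())

# Invented bounds only: a conversion that cannot be pinned to a single day.
uncertain = DateReference("(date written in another calendar)",
                          date(1650, 4, 1), date(1650, 4, 3),
                          basis="illustrative error range, not a real conversion")
print(uncertain.iso())
```

The written form is never overwritten; the computed value sits beside it together with the reasoning that produced it.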
Unfortunately, online family trees are virtually all
conclusion with no supporting evidence. I take them all with a
pinch of salt, and it concerns me deeply that semantic tagging may give them the
same weight as the non-associative references in a newspaper archive.
When the STEMMA® Data Model was first conceived, one of its primary goals was
comprehensive support for narrative content, both for transcribed evidence and
for written conclusions and rationale. Its support for transcription anomalies
may be found at Recording Evidence. When it came to the support for semantic
mark-up, though, it set
out to clearly separate objective evidence from subjective conclusion. It
introduced the terms shallow semantics and deep semantics to refer,
respectively, to the nature of a datum (whether it’s a reference to a person,
place, event, or date) and the association with a conclusion (which person,
place, event, or date). A brief summary can be found at: Semantic Mark-up.
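The distinction can be sketched as a data structure. This Python fragment is not STEMMA syntax; the class names, fields, and record keys are hypothetical. The kind of reference (shallow) is stored with the transcribed text, while each identification (deep) is a separate, attributable conclusion, so two researchers can disagree without touching the evidence.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Identification:
    subject_id: str   # hypothetical key of a person, place, or event record
    asserted_by: str  # who reached this conclusion
    rationale: str    # why they reached it

@dataclass
class Reference:
    text: str  # the words as transcribed (evidence)
    kind: str  # shallow semantics: "person", "place", "event", or "date"
    identifications: List[Identification] = field(default_factory=list)  # deep

    def is_identified(self) -> bool:
        return bool(self.identifications)

    def is_contested(self) -> bool:
        """True when the conclusions point at more than one subject."""
        return len({i.subject_id for i in self.identifications}) > 1

ref = Reference("William Elliott", kind="person")
print(ref.is_identified())  # False: still a usable person reference

# Two invented researchers reach different conclusions from the same words.
ref.identifications.append(
    Identification("person:elliott-1", "Researcher A", "lived in the same parish"))
ref.identifications.append(
    Identification("person:elliott-2", "Researcher B", "age fits a later census"))
print(ref.is_contested())   # True: same evidence, different conclusions
```

Removing every identification leaves the shallow tagging, and the transcription itself, fully intact.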
The relevance of this to online family trees is fairly clear
but comprehensive support for narrative text is currently outside of the
capabilities of most genealogy software products. This is unfortunate and
another of the driving forces behind the STEMMA R&D project. The net effect
is that it’s difficult to produce a convincing argument about the
categorisation of mark-up, and the use of mark-up in general, that people can
relate to. In order to illustrate this better to genealogists, an end-to-end
example was published at: Structured Narrative. This uses a transcribed 19th
century letter from a
child to her father and takes the reader from a mere scan to a marked-up
transcription, and finally to an example depiction on the computer screen.