Friday, 18 April 2014


Just a brief note for anyone who might be keeping an eye on the development of STEMMA®: V2.2 of the specification was finally published yesterday.

It always takes a while to update the Web site as the main documents are in Microsoft Word format, and heavily linked using cross-references and bookmarks. Converting these into equivalent linked HTML pages is painstaking.

This release contains the following main changes:

  • Person – The Birth and Death Events have been restructured to allow local Eventlets as well as full Event entities.
  • Group – Now revised as discussed in Revisiting the Family Group. They can now represent real-world group entities such as organisations, regiments, classes, clubs, etc. They also support alternative names, events, sources, hierarchies, attachments, and Properties in the same way as Places and Persons.
  • Place and Group – Now have elements analogous to a Person’s Birth and Death elements, called Creation and Demise. In the case of a Place, this supersedes the previous Void element.
  • Place and Group – Now accommodate non-hierarchical relationships to related entities, as discussed in Related Entities.
  • Resource – Now distinguishes physical artefacts (e.g. photographs) from images thereof. The range of predefined physical artefacts has been extended.
  • Resource – Uses MIME data-types for improved representation of digital media.
  • Resource – Control over sensitivity now possible for photographs, documents, etc.
  • Header – Persisted Counters in Dataset header for assisted key generation.
  • General – Person, Place, Group, and Event entities can now be given IDs applicable to arbitrary external systems (e.g. databases, online resources). This allows STEMMA to link into those external systems.

Monday, 14 April 2014

Using Footnotes with Blogger

Anyone who has been following my advice in Using Microsoft Word with Blogger in order to generate footnotes/endnotes in your blog may be wondering why they don’t quite work. Well, it’s a Blogger fault but it’s also easy to fix.

If you’ve generated a footnote in Word then you may be looking at something like the following with superscripted numeric footnote indicators:

Here is a footnote reference: 4


4 Here is the target footnote.

When you paste your article into the Blogger compose window, the first thing you’ll notice is that the superscripted numbers are now normal digits enclosed in brackets, but that’s an accepted alternative to superscripts. The second thing you’ll notice is that both the reference indicator and the target indicator have been made into hyperlinks:

Here is a footnote reference: [4]


[4] Here is the target footnote.

The idea is that if you click on a reference indicator then it will take you directly to the target footnote. Also, if you click on the target indicator then it will take you directly back to the reference point. This sounds great as there’s no scrolling involved, but unfortunately it doesn’t work out-of-the-box.

Don’t be too scared but we’re going to have a peek at the HTML code that Blogger will have generated.

<a href="https://www.blogger.com/blogger.g?blogID=<your-blogID>#_ftn4" name="_ftnref4" …>

<a href="https://www.blogger.com/blogger.g?blogID=<your-blogID>#_ftnref4" name="_ftn4" …>

There’s one of these <a> elements generated for each of the reference and target points. The ‘name’ attribute simply gives each point a label, and the ‘href’ attribute makes a hyperlink to go to a named label. All they’re doing is creating mutually referencing hyperlinks. The reference point is labelled ‘_ftnref<n>’ and the target is labelled ‘_ftn<n>’.

The reason these don’t work is that the URL that Blogger has inserted is referencing the design-time compose window — remember that when you pasted your article into Blogger, it hadn’t yet been published, and so it didn’t have a proper URL. The sad thing is that it shouldn’t have put any URL in there at all. What the HTML should have looked like is the following:

<a href="#_ftn4" name="_ftnref4" …>

<a href="#_ftnref4" name="_ftn4" …>

This is much simpler, right? All I’ve done is to remove the explicit URL before the hash (‘#’) character. That hash part is technically called a Fragment Identifier, and that’s the only important bit when creating an intra-page link.

<a href="https://www.blogger.com/blogger.g?blogID=<your-blogID>#_ftn4" name="_ftnref4" …>

The part I removed (the explicit URL before the hash) will be the same for all posts on your own blog, as the number in it is just a global ID for your blog. There’s no real difference between footnotes and endnotes for a Web page, including single-page blogs, so I now use endnotes as this retains some consistency between my Word edition and my blog edition. The only difference you’ll see in the HTML is that the labels are then ‘_ednref<n>’ and ‘_edn<n>’, respectively.

The Solution

After pasting your Word article into the Blogger compose window, switch to the HTML window and search for either “#_edn” or “#_ftn”, as appropriate. Delete the part preceding the hash (but not the hash itself) on each match. Then save or publish as normal. If you have a lot of footnotes/endnotes (I have 30+ in some of my own posts) then you can copy the complete HTML to your favourite text editor (e.g. Notepad), do a global replace to remove all those URLs containing the blog ID, and then copy the complete result back to the Blogger HTML window (making sure you replace all the old content). Using familiar shortcut keys such as Ctrl-A (select all), Ctrl-C (copy) or Ctrl-X (cut), and Ctrl-V (paste), this doesn’t take long at all.
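If you’d rather script the global replace than edit by hand, it can be sketched in a few lines of Python. This is a minimal sketch; the blogger.com URL in the example is purely illustrative, as the exact URL in your own posts will contain your blog’s own ID:

```python
import re

def fix_footnote_links(html: str) -> str:
    """Strip the explicit URL that Blogger inserts before the
    '#_ftn...' / '#_edn...' fragment identifiers, leaving pure
    intra-page links such as href="#_ftn4"."""
    # Match href="...#_ftn4" (or "#_edn4", "#_ftnref4", etc.)
    # and keep only the fragment identifier.
    return re.sub(r'href="[^"#]*(#_(?:ftn|edn)(?:ref)?\d+)"',
                  r'href="\1"',
                  html)

# Example: an illustrative broken reference link of the kind Blogger generates.
broken = '<a href="https://www.blogger.com/blogger.g?blogID=123#_ftn4" name="_ftnref4">[4]</a>'
print(fix_footnote_links(broken))
# -> <a href="#_ftn4" name="_ftnref4">[4]</a>
```

Paste the whole HTML into the function’s input, and the result back into the Blogger HTML window, just as with the manual method.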

If you try out some of my own blog posts then you’ll see how it should work. In fact, if you’d like to visit every single post then I would be dead chuffed[1].

[1] Dead chuffed, and similar variants, are part of British slang. Roughly translated, it means 'exceedingly pleased'. However, there are no whoops and hollers with it; it is usually delivered in a subdued, almost dead-pan, tone that's particularly associated with Yorkshire. In March 2012, President Barack Obama welcomed British Prime Minister David Cameron to the White House by saying he was "chuffed to bits", and this still makes me smile.

Saturday, 12 April 2014

Handling Transcriptions

Making transcriptions of records is not as common amongst genealogists as you might expect, but why is that? What do we need in order to create useful transcriptions? If we’re part of the minority who do make them then where should we attach them?

Because of the availability of online data sources, and the ease with which digital copies can be created (owner permitting of course), many people believe they do not need full transcriptions of records. They might claim that since they can visit an online image, or they have a digital scan in their own data collection, then they can read it perfectly well without having it typed out. Whether it’s a baptism entry, a newspaper report, or a census page, many genealogists therefore find they have a growing collection of equivalent JPEG files sitting on the periphery of their data.

What I mean by this is that such a file can be pointed to, or referenced, by other data, but it cannot reference anything itself[1] or be textually searched. This means the information is not truly integrated into your data. The arguments for adding mark-up to a transcription in order to achieve this are almost exactly the same ones that I made for using mark-up in authored narrative at Semantic Tagging of Historical Data. This allows, for instance, references to people, places, events, dates, etc., in that transcription to be connected to the relevant entities in your data.

A transcription requires more though. It also requires a way of indicating transcription anomalies — parts that deviate from the normal flow — such as marginalia, footnotes, interlinear/intralinear notes, struck-out text, and uncertain characters or words. Both the uncertain characters and the uncertain words may require annotation to provide suggestions and possibilities, both of which must be honoured during searches. A transcription also requires an indication of any original emphasis, such as italics or underlining. NB: the original use of italics, underlining, footnotes, etc., in something being transcribed is different to their deliberate use in a written report, and so must use a distinct form of mark-up.

Traditional editorial notations for transcriptions are not well-suited to digital text as they do not facilitate efficient and accurate searching. TEI has comprehensive sets of mark-up for handling transcription issues but falls short when applied to genealogical data, and probably to historical data in general. Some specialised mark-up is certainly required, but how you visualise a transcription on-screen is a separate consideration. The same mark-up could alternatively be shown as multi-coloured and hyper-linked text, or as the plain editorial notation. That sort of flexibility only comes from using computerised annotation rather than human annotation.
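That flexibility can be illustrated with a minimal Python sketch. The {unc:…} notation for uncertain readings below is invented purely for illustration (it is not STEMMA’s actual mark-up): the point is that one marked-up source can be rendered either as traditional editorial notation or as styled HTML.

```python
import re

# Hypothetical mark-up (not actual STEMMA syntax): uncertain readings
# are wrapped as {unc:best-guess}.
SOURCE = "Baptised {unc:Wm.} Elliott, son of {unc:Thos.} Elliott"

def render_editorial(text: str) -> str:
    # Traditional editorial notation: square brackets with a query mark.
    return re.sub(r'\{unc:([^}]*)\}', r'[\1?]', text)

def render_html(text: str) -> str:
    # The very same source rendered as a highlighted, tooltipped span.
    return re.sub(r'\{unc:([^}]*)\}',
                  r'<span class="uncertain" title="uncertain reading">\1</span>',
                  text)

print(render_editorial(SOURCE))
# -> Baptised [Wm.?] Elliott, son of [Thos.?] Elliott
```

Because the annotation is machine-readable, a search can still match the best-guess text regardless of which rendering the reader sees.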

The fact that both transcription and authored narrative may co-exist in the same written report led to STEMMA® unifying them in its own mark-up. Those distinct usages — for transcriptions and for generating new narrative (including reasoning and notes) — have both similarities and markedly different characteristics, as follows:

  • Transcriptions – require support for anomalies, indications of original emphasis (e.g. italics), indications of alternative spellings/meanings, and semantic mark-up for references to persons, places, events, groups, and dates. The latter semantic mark-up also needs to clearly distinguish objective information (e.g. that a reference is to a person) from subjective information (e.g. a conclusion as to whom that person is).
  • Authored narrative – requires support for layout and presentational mark-up. It needs to be able to generate references to known persons, places, events, groups, and dates that result in a similar mark-up to that for transcriptions. The difference here is that a textual reference is being generated from the ID of a Person entity, say, as opposed to marking an existing textual reference and possibly linking it to a Person with a given ID. Also needs to be capable of generating citations and general reference notes.

Actually, transcription isn’t just an action associated with a manuscript or typescript document; it could be associated with speech too. In those circumstances it must reflect speech levels and emotional emphasis, but I haven’t even thought about that field yet.

As you can imagine, generating a quality transcription, and incorporating semantic links and annotation, needs a very good software tool: something like a specialised word-processor. Most of us are left using general-purpose word-processors that have none of the required facilities, and this is probably a secondary reason why so few transcriptions are made.

So where do I attach transcriptions in my own data? In order to explain, I first need to convey something of the structure of my data.

This simplified view of the rich connections in the STEMMA tapestry doesn’t show its places, or groups, or lineage links between people, or hierarchical/protracted events. That would be too complex! What it does show is a network of multi-person events and the relationship of sources to those events. Notice that the sources are attached to the events, and not to the people. As already explained in Evidence and where to Stick It, the vast majority of our evidence – if not all of it – relates to events; things that happened in a particular place at a particular time. In other words, our entire view of history rests on discrete and disjointed pockets of evidence describing a finite set of events. Everything else is inference and interpolation creating as smooth a picture as we can.

So what is the general form of these underlying source entities in the data? Our real-life sources may be remote, such as a document in an archive or a book in a library, or local, such as a family letter or a photograph. In both cases, we may have a digital scan of the items. STEMMA[2] has two important concepts that it employs for sources:

  • Resource – This describes some item in your local data collection, including not just files on your disk, but also physical artefacts or ephemera.
  • Citation – Despite the name, this is merely a link to some source of information. A traditional printed citation may be generated from it, but this software entity also incorporates collections, repositories, and even attribution; possibly chaining them together.

Either of these may apply, therefore, and both entities may accommodate a full transcription as appropriate. In the case where you may have transcribed a document in an archive, or even from one of the online content providers, the transcription may be placed in the Citation entity that references it. If you’re lucky enough to have captured your own digital image of the document then the Citation would point to an associated Resource entity which would hold the transcription. Alternatively, if you have an original letter, with or without a digital scan of it, then the associated Resource entity would hold the transcription.
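These attachment rules can be sketched as a simple data structure. The class and field names below are my own invention for illustration, not STEMMA’s actual schema, but they mirror the rules just described:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Resource:
    """An item in the local data collection (a file or a physical artefact)."""
    description: str
    file_path: Optional[str] = None      # e.g. a digital scan on disk
    transcription: Optional[str] = None  # marked-up transcription text, if any

@dataclass
class Citation:
    """A link to some source of information, possibly chained."""
    source: str
    resource: Optional[Resource] = None  # local image/artefact, if we have one
    transcription: Optional[str] = None  # used when there is no local Resource

def attach_transcription(cit: Citation, text: str) -> None:
    # If we hold our own digital copy, the transcription belongs with the
    # Resource; otherwise it lives on the Citation that references the source.
    if cit.resource is not None:
        cit.resource.transcription = text
    else:
        cit.transcription = text
```

So a transcription made from a document seen only at an archive ends up on the Citation, while one made from your own scan or original letter ends up on the Resource.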

Genealogist Janice Sellers, in her blog-post at Transcription Mentioned on Television, explains how transcriptions of documents are valuable for sharing the details with family and friends. She recounts how she tried to convince a well-known British TV programme to advise their guests to make transcriptions of their historical documents and heirlooms.

STEMMA’s mark-up is primarily about semantics. Shallow semantics would mark an item as, say, a person reference but without forming a conclusion about who the person was. Deep semantics involve cross-linking references to persons, places, groups, events, and dates, to the relevant entities in your data. I have previously tried to convey this using the worked example of an old family letter at Structured Narrative.

Genealogist Sue Adams has taken the concept of semantic mark-up in transcriptions to a deeper level on her Family Folklore Blog. Her worked examples clearly demonstrate the temporal nature of historical semantics. Anyone with a passing interest in the Semantic Web and RDF is encouraged to read about “temporal RDF” and consider why it doesn’t yet exist. You may find a lot of theoretical work that considers things like temporal graphs but very few real examples like hers. In an ideal world, the developers of such technology would be working closely with the people who need to utilise it.

[1] I’m ignoring the issue of meta-data held within an image until a future post. The issue here is one of the text in an image making discrete references to its subjects rather than anything to do with image cataloguing.
[2] STEMMA V2.2 — which includes important refinements here — has just been defined but, at the time of writing, I am still preparing to painstakingly update the Web site. The landing page will indicate when this is complete.

Thursday, 3 April 2014

What to Share, and How – Part II

In the first part of this blog post, What to Share, and How, I suggested that if collaborative Web sites were designed to accommodate creative works, rather than mere trees, then they would better accommodate family history, and would encourage more sharing by ensuring accreditation and integrity. I now want to suggest how this might work in practice. I also want to conclude with a potential sting-in-the-tail for those, like me, who believe that simple trees cannot be copyrighted.

So what do I mean by a creative work here? Many of us have had to write up some form of narrative, whether for a client or for publication in an article, book, or blog. Such work is no different from an original work of research or fiction in that it is automatically protected by copyright by virtue of the Berne Convention. Hence, this would be a prime component of the improved sharing.

STEMMA could take this further since it has the ability to package an integrated set of data that includes both narrative and transcriptions, and the entities that they reference such as people, places, events, and groups. The whole bundle could be indexed by the people (which includes their lineage), or a timeline, or their locality.

All of these components are cross-linked, thus making it an integrated bundle. The lineage section connects the people in the normal way, according to their biological lineage, but there may be multiple, disjoint trees. In other words, the bundle may represent distinct sets of people.

If such a bundle were uploaded to a collaborative site then none of it need be undone. Instead, each of the lineage sections would be anchored to a corresponding person entity in a lineage-based framework. The overall framework would be constructed based on the lineage of all the uploaded contributions, thus making it dynamic.

In this ideal world, therefore, there would be no need to edit or copy other people’s contributions. Multiple contributions could be associated with a single person entity (and their family) in the overarching framework. The accreditation and integrity of individual works would be preserved, and citations (or attribution) used when necessary. This is a collaborative model far-removed from what we have now, and I’ve glossed over issues of voting up/down contributions that may disagree, but let me know if you like the concept.

I want to conclude this post, though, with something a little unexpected. In the earlier piece of this two-part blog, I explained that mere collections of facts available in the public domain cannot be copyrighted. This is true, but does that description include the family trees that we currently see online? Dick Eastman recently blogged on this subject at Genealogical Privacy, and he explains the “legal and practical” fallacy of many researchers’ views that they own the data they’ve collected, and that publishing it allows others to freely steal it. The premise for this is that the data is “freely available to everyone in the public domain”. Although he does add the caveat that this is the case in the US, and so recognises that someone in the UK, say, may have had to pay for the information they’re publishing, there are a couple of other issues. Not all data may have been in the public domain, although this is usually associated more with data related to recent generations. Also, as we all know, the published details may not be clearly visible in the public-domain data; meaning that some effort may have gone into determining someone’s true lineage.

The reason I am picking on these points, and in doing so questioning my earlier statements, is that legal precedents may exist in similar, but non-genealogical contexts. One that I am aware of is a case of copyright that was tried in England in 1868, and due to its unusual nature is still referenced in many academic books on copyright law. The reason I am aware of this is that one of my ancestors was on the receiving end, and the judgement went against him, breaking him in the process.

The case is that of Morris v. Ashbee. William Ashbee, in order to create a new trade directory of London, took an existing trade directory, compiled by John Morris, and gave the alphabetical list of names to his canvassers to check. Although Ashbee didn’t pass off the earlier directory as his own, the judgement went against him because Morris had incurred the labour and expense of getting the information and of making the compilation. Ashbee had therefore benefitted from Morris’s work, and that was considered an infringement of copyright. Ashbee later went bankrupt and died not long after.

Although this case was never envisaged in the context of genealogy, the general concept of benefitting from someone else’s labour and expense being an infringement of copyright must have potential implications for people who advance their genealogy by “stealing” data from others — even though it may have been publicly available.