Wednesday, 2 March 2016

The Power of Annotation

Most of us believe that we know what annotation is. However, the basic concept has been applied in several different fields, for quite different purposes, and in many different ways. I found a review of the landscape for textual annotation very useful, and I hope that others may find it useful too.

A term that goes hand-in-hand with annotation is mark-up (or “markup” in the US), to the extent that they have become virtually synonymous in certain areas. One of the first things to consider is the origin of the two terms, and how their meanings may have shifted over time.

I wanted to call this article “marking up the wrong tree”, but obscure titles are not always the best policy, no matter how side-splittingly hilarious they may seem to you. [Pull yourself together Tony]

According to the dictionary, to annotate is “to add notes to (a text or diagram) giving explanation or comment”, and an etymology is given of “Late 16th century: from Latin annotat- 'marked', from the verb annotare, from ad- 'to' + nota 'a mark'”.[1] This is probably the first usage that most of us would think of.

Figure 1 – Annotated page of text.[2]

As an aside, the analysis of an annotated document is interesting because it often involves a mixture of primary and secondary information whose layers must be considered individually, although not separately.

The term mark-up originates from the annotation of manuscript (and manual typescript) documents with symbols providing printer’s instructions, including corrections, layout, and typesetting. Similar systems of symbolic annotation are used in the field of textual scholarship, which is a collective term for textual studies that encompass analysis, description, transcription, editing, or annotation of texts.

The branch of textual scholarship known as diplomatics (not to be confused with diplomacy) involves the scholarly analysis of documents and texts. In particular, a diplomatic transcription (or diplomatic edition) reproduces an historic manuscript in typography as accurately as possible, including significant features such as: original spelling and punctuation; contractions, suspensions, and other abbreviations; insertions, deletions, and other alterations; obsolete characters such as thorn and eth; superscript and subscript characters; and brevigraphs (e.g. the ampersand). Such transcriptions usually employ a system of mark-up in order to capture the essence of these features in a modern typeface. A semi-diplomatic transcription relaxes the requirement for accuracy, usually for readability or practicality. For instance, some original forms are difficult to reproduce in simple typescript, particularly if the original was already marked up by hand, but more on that later.

Mark-up may also be used during peer review of a document, or by an author themselves. One more field that I have to mention is corpus linguistics: the analysis of language using collections (corpora) of natural text compiled from transcribed writings or recordings. This uses annotation for such things as tagging parts of speech (POS tagging), e.g. “corpus_NN1 annotation_NN1 is_VBZ hard_AJ0”, where the suffixes categorise the words (e.g. noun, adjective).
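As a rough illustration of how such word_TAG annotation can be processed, the following Python sketch splits each token into its word and its tag. The function name parse_tagged is hypothetical, and the adjective tag is written AJ0 (with a zero), which is an assumption about the tagset being used.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so that a word containing '_' survives.
        word, separator, tag = token.rpartition("_")
        if not separator:
            # No underscore at all: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# Assumption: AJ0 (zero, not the letter O) is the adjective tag.
tagged = "corpus_NN1 annotation_NN1 is_VBZ hard_AJ0"
pairs = parse_tagged(tagged)

# Once the tags are separated, selecting a word class is a simple filter.
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs)
print(nouns)
```

The point is only that the annotation sits in a predictable place relative to each word, which is what makes it recoverable by software.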

We’ve seen that annotation may actually be symbolic or textual, and that mark-up often includes text as well as symbols or editorial marks. So what is the difference? In his work on corpus linguistics, Martin Weisser comes to the following conclusion:[3]

While the term markup is sometimes used to indicate the physical act of marking specific parts of a text using specific symbols, and, in contrast, annotation may often refer to the interpretive information added, the two may also be used synonymously.

It would seem that the modern usage of these terms employs annotation for the addition of meta-data (related textual or other information) to the text, and mark-up for the scheme by which such annotation is represented or encoded.

This is borne out by the concept of mark-up languages, which are systems for annotating a document in a way that is syntactically distinguishable from the text, and hence more structured than mere symbols or marginal notes. A very important distinction has to be made, therefore, between the following types of mark-up:

  1. Handwritten mark-up, as applied to a manuscript or typescript document.
  2. Typed mark-up, as applied to a typescript or digital document.
  3. Mark-up language, as typically applied to a digital document.

The first two of these are designed to be human-readable, whereas the third is designed to be computer-readable and so must follow grammatical rules that allow it to be parsed by software. To illustrate the difference, consider the following corrected sentence:

My favourite colour is blue red. [“blue” struck through, with “red” as its replacement]

A representation of this using a simple typed mark-up (type 2, above) might be:

My favourite colour is <blue> ^red^.

whereas a mark-up language (type 3) might encode it as follows:

My favourite colour is <del>blue</del> <ins>red</ins>.

These may appear equivalent from a visual perspective, but consider the consequences if the altered text contained either angle brackets or carets, or if the replacement word required some clarification — a mark-up language would be able to represent these cases unambiguously so that software could process it. Also, since a mark-up language is designed purely to communicate the information to software, it means that the representation of that same information to the end-user is not fixed, and the choices would be dependent upon the capabilities of the display medium and the sophistication of the display software.
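Both points can be sketched in a few lines of Python using the standard library’s XML parser. This is only an illustration: the render function and its two output styles are hypothetical, and the sentence has been wrapped in a p element (not part of the example above) so that it forms a well-formed fragment.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render(xml_text: str, show_changes: bool) -> str:
    """Render one marked-up sentence, with or without its editing history."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        content = child.text or ""
        if child.tag == "del":
            # Deleted text is only shown when the history is requested.
            if show_changes:
                parts.append("[deleted: " + content + "]")
        elif child.tag == "ins":
            parts.append("[inserted: " + content + "]" if show_changes else content)
        else:
            parts.append(content)
        parts.append(child.tail or "")
    # Collapse the double space left behind when a deletion is hidden.
    return " ".join("".join(parts).split())


encoded = "<p>My favourite colour is <del>blue</del> <ins>red</ins>.</p>"
print(render(encoded, show_changes=True))
print(render(encoded, show_changes=False))

# Text that itself contains angle brackets or carets is escaped, not confused
# with the mark-up: escape() turns "<blue>" into "&lt;blue&gt;".
awkward = "<p>I typed <del>" + escape("<blue>") + "</del> <ins>^red^</ins>.</p>"
print(render(awkward, show_changes=True))
```

The same encoded sentence yields two different presentations, and the awkward case survives a round trip because the grammar says exactly how a literal angle bracket must be written.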

This leads us nicely to perhaps the two best-known mark-up languages: HTML (HyperText Markup Language) and XML (Extensible Markup Language). HTML was created with predefined semantics (for creating Web pages) but XML was created as a general-purpose syntax with no predefined semantics. Interestingly, the semantics associated with HTML have been refined since its initial development; the last example, above, shows a modern Semantic HTML form, but an older form might have been:

My favourite colour is <s>blue</s> red.

This is literally encoding the visual representation, as in the first example sentence, above. The modern shift in emphasis is from presentation to content structure, such that the mark-up would now show what was deleted and what was inserted rather than simply that a line was drawn through a word.

So why might someone use XML rather than HTML when transcribing an historical document? Well, despite that shift in emphasis, HTML is not a good tool for transcribing text. Consider a document that has original emphasis, such as underlines added by the author, or which has already been marked up by an editor. This information has to be preserved and yet be distinguishable from anything employed during the transcription process, and these are not just presentational matters: there are semantics associated with the original formatting. A mark-up language such as XML gives you the flexibility to represent all the different types and levels of information without any conflict or ambiguity. Both TEI (Text Encoding Initiative) and STEMMA employ mark-up languages with support for transcription, and both have XML representations.
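To make the idea of separate levels of information concrete, here is a small Python sketch. The XML vocabulary in it (the emph, del, ins, and note elements, and the by and form attributes) is invented purely for illustration and is not the TEI or STEMMA scheme; the sample line is likewise made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical vocabulary: every feature records WHO was responsible for it.
SOURCE = (
    "<line>"
    'My <emph by="author" form="underline">favourite</emph> colour is '
    '<del by="editor">blue</del> <ins by="editor">red</ins>'
    '<note by="transcriber">correction in a different ink</note>.'
    "</line>"
)


def features_by_layer(xml_text: str) -> dict:
    """Group the marked-up features by the party responsible for them."""
    layers = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        responsible = element.get("by")
        if responsible is not None:
            layers[responsible].append((element.tag, element.text or ""))
    return dict(layers)


for layer, features in features_by_layer(SOURCE).items():
    print(layer, features)
```

Because each feature carries its own attribution, the author’s underline, the editor’s correction, and the transcriber’s remark can coexist in one document and still be told apart.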

Using the terminology from Markup_language, there are several forms of mark-up that are required for micro-history narrative:

  • Descriptive: Marking the text in order to capture its structure and content, rather than specific visualisations of it. Ultimate control over explicit physical rendition such as colour, bold, italic, underline, font name, and font size are best left to the tool presenting the text (e.g. HTML+CSS).
  • Presentational: This mark-up would be essential for a faithful transcription of something. Although modern systems (such as HTML5) frown on explicit presentational information, it may provide important information necessary for the analysis and correct interpretation of transcribed material. STEMMA’s approach to transcription separates structure and content from presentation: see Descriptive Mark-up.
  • Semantic: Although the aforementioned link suggests that this is an alternative name for Descriptive mark-up, the usage here is more distinct. This mark-up provides information about the meaning or interpretation of textual references, and so differs from mark-up describing structure and layout in a purely textual context. It is precisely what is needed to identify entities such as Persons and Places.

Semantic mark-up is especially important for narrative essays and narrative reports stored in a genealogical context. Although both TEI and STEMMA have their own schemes, there is a divergence that will become more important once the genealogical industry acknowledges a narrative requirement: the semantics are not independent of the data model. This may be hard to explain, but simply flagging a name as that of a person or place — irrespective of whether it makes a conclusional identification — is an isolated semantic that is addressed in a roughly similar fashion by the two schemes. However, linking such a reference into a chain of conclusion-evidence-information-source would not make any sense outside of a genealogical data model. In effect, TEI is a very comprehensive text-encoding scheme but it cannot deal with semantics associated with an all-embracing data model.

A familiar form of mark-up that we might encounter in wikis or blogs is a lightweight markup language. These have a simple syntax that can be entered directly by the editing user, as opposed to being generated in response to some graphical operation or option selection. Although still designed to be computer-readable, they are easier for a human to read — and, hence, to write. For instance:

**bold text** __underline text__ //italic text//
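Although simple, such a syntax is still regular enough for software to process. The following Python sketch converts the three delimiters shown above into HTML elements; the choice of the b, u, and i elements and the to_html name are assumptions made for illustration, and a real wiki engine would handle many more cases (a URL containing //, for example, would trip up this version).

```python
import html
import re

# Each rule pairs a delimiter pattern with its HTML replacement.
# The non-greedy (.+?) stops at the first closing delimiter.
RULES = [
    (re.compile(r"\*\*(.+?)\*\*"), r"<b>\1</b>"),
    (re.compile(r"__(.+?)__"), r"<u>\1</u>"),
    (re.compile(r"//(.+?)//"), r"<i>\1</i>"),
]


def to_html(line: str) -> str:
    """Convert one line of the lightweight syntax into an HTML fragment."""
    # Escape first, so angle brackets typed by the user stay literal text.
    result = html.escape(line, quote=False)
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


print(to_html("**bold text** __underline text__ //italic text//"))
# prints: <b>bold text</b> <u>underline text</u> <i>italic text</i>
```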

When looking at the mechanics of adding mark-up to an electronic document, there are two very different approaches. The most common is inline, or embedded, mark-up, where the mark-up language is interwoven with the text in such a manner that the two can still be distinguished. For example:

Here is a link: <a href="">STEMMA</a>

The alternative is known as stand-off, or remote, mark-up and involves holding the mark-up in a separate file (or other location) to the underlying text, usually linking them by character coordinates. The concept of stand-off mark-up is attributed to Henry Thompson and David McKelvie in 1997,[4] and the advantages include:

  • The ability to mark up read-only (protected) or very large files.
  • The ability to support mark-up from independent editors, held as separate layers, and without them having to form a single code hierarchy.
  • The ability to combine disjoint segments into a single annotation.

Others are stated but I’m less convinced of their value. In contrast, the advantages of inline mark-up include:

  • Simplicity. One file to maintain or distribute.
  • The text and mark-up are edited together, with less chance of them getting out-of-step.

Which is best really depends on the application requirements.
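For comparison with the inline example, here is a minimal Python sketch of stand-off mark-up. The Annotation class, the layer names, and the labels are all hypothetical; the only idea taken from the discussion above is that annotations live apart from the text and point into it by character coordinates (here, zero-based offsets with an exclusive end).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    layer: str                          # which editor's layer this belongs to
    label: str                          # what is being asserted about the text
    spans: Tuple[Tuple[int, int], ...]  # one or more (start, end) offsets


# The underlying text is never modified; it could equally be a read-only file.
TEXT = "My favourite colour is blue red."

ANNOTATIONS = [
    Annotation("editor-a", "deleted", ((23, 27),)),
    Annotation("editor-a", "inserted", ((28, 31),)),
    # These two overlap without nesting, which a single inline hierarchy
    # could not express directly.
    Annotation("editor-a", "correction", ((23, 31),)),
    Annotation("editor-b", "verb-phrase", ((20, 27),)),
    # One annotation combining two disjoint segments.
    Annotation("editor-b", "colour-terms", ((23, 27), (28, 31))),
]


def covered(text: str, annotation: Annotation) -> List[str]:
    """Return the stretches of text that an annotation points at."""
    return [text[start:end] for start, end in annotation.spans]


def in_layer(annotations: List[Annotation], name: str) -> List[Annotation]:
    """Select one editor's annotations, ignoring everyone else's."""
    return [a for a in annotations if a.layer == name]


for annotation in ANNOTATIONS:
    print(annotation.layer, annotation.label, covered(TEXT, annotation))
```

The sketch also shows the main weakness noted for this approach: if TEXT were edited, every offset after the change would silently point at the wrong characters.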

A common example of stand-off mark-up, which isn’t always viewed as such, is CSS (Cascading Style Sheets). It was mentioned above that Semantic HTML favours content structure in place of presentation. This works because modern HTML now goes hand-in-hand with CSS, which can describe the presentational aspects in a separate file. Rather than being linked by character coordinates, they are linked by such things as element type and class, collectively described as selectors, which may explain why CSS is rarely described as stand-off mark-up. In effect, HTML then becomes an inline mark-up describing content that relies on a stand-off mark-up for presentation. The advantages of being able to change the overall presentation style of a Web page in a consistent way, or share the style between multiple pages, should be clear.
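The linkage by selector rather than by character coordinate can be sketched in Python as follows. This is a toy model and not how a browser implements CSS: it understands only a bare element name or a single .class selector, and the style rules, the class name correction, and the wrapping p element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The content: inline mark-up describing WHAT each piece of text is.
DOCUMENT = (
    "<p>My favourite colour is <del>blue</del> "
    '<ins class="correction">red</ins>.</p>'
)

# The presentation: held separately and linked by selector, not by offset.
STYLES = {
    "del": {"text-decoration": "line-through"},
    "ins": {"text-decoration": "underline"},
    ".correction": {"color": "red"},
}


def matches(selector: str, element: ET.Element) -> bool:
    """Toy matcher: '.name' tests the class list, anything else the tag."""
    if selector.startswith("."):
        return selector[1:] in element.get("class", "").split()
    return selector == element.tag


def computed_style(element: ET.Element, styles: dict) -> dict:
    """Merge the declarations of every rule whose selector matches."""
    result = {}
    for selector, declarations in styles.items():
        if matches(selector, element):
            result.update(declarations)
    return result


root = ET.fromstring(DOCUMENT)
for element in root.iter():
    print(element.tag, computed_style(element, STYLES))
```

Swapping in a different STYLES table changes the presentation of every matching element without touching DOCUMENT, which is the consistency benefit described above.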

I want to round off this review of annotation with a quick mention of the humble word-processor. So familiar and useful is this tool that we give little consideration to how it works, or what goes on inside — oh, how I wish genealogy would catch up there. It allows the end-user to add presentational mark-up (e.g. bold, or a specific font-face) and semantic mark-up (e.g. a hyperlink, or a review comment), but you don’t see the associated mark-up. The associated mark-up language is complicated and so made deliberately invisible to the end-user. The net effect of that is to reinforce the user-interface model and give the impression that the end-user is somehow annotating the visible text directly. This is an important distinction — that a hidden nuts-and-bolts mark-up supports the notion, and the physicality, of annotation in the user interface — and it should be an important consideration for future genealogy tools. There is no excuse for expecting the end-user to edit the raw mark-up rather than using a WYSIWYG (“What You See Is What You Get”) interface.

[1] Oxford Dictionaries Online ( : accessed 1 Mar 2016), s.v. “annotate”.
[2] John Keats, “Ode to a Nightingale” (1819); image credit: Ryan Johnson ( : accessed 1 Mar 2016); Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0).
[3] Martin Weisser, Practical Corpus Linguistics: An Introduction to Corpus-Based Language Analysis (John Wiley & Sons, 16 Feb 2016), ch.11.
[4] Henry S. Thompson and David McKelvie, “Hyperlink semantics for standoff markup of read-only documents”, May 1997, technical report, Language Technology Group, HCRC, University of Edinburgh ( : accessed 2 Mar 2016).