Most of us believe that we know what annotation is. However, the basic concept has been applied to
several different fields, for quite different purposes, and in many different
ways. A review of the landscape for textual annotation was very useful to me,
and I hope that others may find this useful too.
A term that goes hand-in-hand with annotation is mark-up (or
“markup” in the US), to the extent that they have become virtually synonymous
in certain areas. One of the first things to consider is the origin of the two
terms, and how their meanings may have shifted over time.
I wanted to call this
article “marking up the wrong tree”, but obscure titles are not always the best
policy, no matter how side-splittingly hilarious they may seem to you. [Pull
yourself together, Tony]
According to the dictionary, to annotate is “to add notes to (a text or diagram) giving explanation or comment”,
and an etymology is given of “Late 16th century:
from Latin annotat-
'marked', from the verb annotare,
from ad- 'to' + nota 'a mark'”.[1]
This is probably the first usage that most of us would think of.
Figure 1 – Annotated page of text.[2]
As an aside, the analysis of an annotated document is
interesting because it often involves a mixture of primary and secondary
information whose layers must be considered individually, although not
separately.
The term mark-up originates from the annotation of
manuscript (and manual typescript) documents with symbols providing printer’s instructions,
including corrections, layout, and typesetting. Similar systems of symbolic
annotation are used in the field of textual scholarship,
which is a collective term for textual studies that encompass analysis, description,
transcription, editing, or annotation of texts. The branch of textual
scholarship known as diplomatics
(not to be confused with diplomacy) involves the scholarly analysis of
documents and texts. In particular, a diplomatic
transcription reproduces an historic manuscript as accurately as possible
in typography (a diplomatic edition),
including significant features such as original spelling and punctuation;
contractions, suspensions, and other abbreviations; insertions, deletions, and other
alterations; obsolete characters such as thorn and eth; superscript and
subscript characters; and brevigraphs (e.g. the ampersand). Such transcriptions usually employ
a system of mark-up in order to capture the essence of these features in a modern typeface. A semi-diplomatic transcription relaxes
the requirement for accuracy, usually for readability or practicality. For
instance, some original forms are difficult to reproduce in simple typescript,
particularly if the original was already marked up by hand, but more on that
later. Mark-up may also be used during peer review of a document, or by an
author themselves. One more field that I have to mention is corpus linguistics,
or the analysis of language using selections of natural text compiled from
transcribed writings or recordings (corpora). This uses annotation for
such things as tagging parts of speech (POS tagging),
e.g. “corpus_NN1 annotation_NN1 is_VBZ hard_AJO” where the suffixes categorise
the words (e.g. noun, adjective).
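As a rough illustration of how software might consume this kind of annotation, here is a minimal Python sketch that splits the word_TAG tokens from the example into (word, tag) pairs. The function name `parse_tagged` is hypothetical, and treating any tag beginning with "NN" as a noun is an assumption made for the sake of the example rather than a statement about a particular tag set.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so that a word which itself
        # contains an underscore keeps it; only the suffix is the tag.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore at all: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


tagged = "corpus_NN1 annotation_NN1 is_VBZ hard_AJO"
pairs = parse_tagged(tagged)
print(pairs)

# Assumption: tags starting with "NN" denote nouns.
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```

The point is simply that the annotation travels with the text in a form regular enough for a program to separate the two again.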
We’ve seen that annotation may actually be symbolic or
textual, and that mark-up often includes text as well as symbols or editorial
marks. So what is the difference? In his work on corpus linguistics, Martin
Weisser comes to the following conclusion:[3]
While the term markup is
sometimes used to indicate the physical act of marking specific parts of a text
using specific symbols, and, in contrast, annotation may often refer to the
interpretive information added, the two may also be used synonymously.
It would seem that the modern usage of these terms employs annotation for the addition of meta-data
(related textual or other information) to the text, and mark-up for the scheme by which such annotation is represented or
encoded.
This is borne out by the concept of mark-up
languages, which are systems for annotating a document that are
syntactically distinguishable from the text, and hence more structured than
mere symbols or marginal notes. A very important distinction has to be made,
therefore, between the following types of mark-up:
- Handwritten mark-up, as applied to a manuscript or typescript document.
- Typed mark-up, as applied to a typescript or digital document.
- Mark-up language, as typically applied to a digital document.
The first two of these are designed to be human-readable
whereas the third type is designed to be computer-readable and so must involve
grammatical rules that allow it to be parsed by software. To illustrate the
difference, consider the following corrected sentence, in which “blue” has been struck through and replaced by “red”:
My favourite colour is blue red.
A representation of this using a simple typed
mark-up (type 2, above) might be:
My favourite colour is <blue> ^red^.
whereas a mark-up language (type 3)
might encode it as follows:
My favourite colour is <del>blue</del> <ins>red</ins>.
These may appear equivalent from a visual perspective, but
consider the consequences if the altered text contained either angle brackets
or carets, or if the replacement word required some clarification — a mark-up
language would be able to represent these cases unambiguously so that software
could process it. Also, because a mark-up language is designed purely to
communicate the information to software, the representation of
that same information to the end-user is not fixed; the choices depend
upon the capabilities of the display medium and the sophistication of the
display software.
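To make the ambiguity point concrete, the following Python sketch uses the standard library's xml.etree.ElementTree to encode a correction in which the deleted text itself contains angle brackets and the inserted text contains carets. The wrapping `<s>` element is a hypothetical root, added only so that the fragment is a well-formed XML document.

```python
import xml.etree.ElementTree as ET

# Build the corrected sentence as a tree. The <s> root is hypothetical;
# <del> and <ins> follow the example in the text.
sentence = ET.Element("s")
sentence.text = "My favourite colour is "

deleted = ET.SubElement(sentence, "del")
deleted.text = "<blue>"      # altered text that contains angle brackets
deleted.tail = " "

inserted = ET.SubElement(sentence, "ins")
inserted.text = "^red^"      # replacement text that contains carets
inserted.tail = "."

# Serialising escapes the angle brackets inside the text, so they
# cannot be mistaken for mark-up.
encoded = ET.tostring(sentence, encoding="unicode")
print(encoded)

# Parsing the result recovers the original strings exactly.
root = ET.fromstring(encoded)
print([el.text for el in root.iter("del")])
print([el.text for el in root.iter("ins")])
```

With the simple typed scheme, the same content would leave a reader (or a program) guessing which brackets and carets were mark-up and which were part of the text.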
This leads us nicely to perhaps the two best-known mark-up
languages: HTML (HyperText Markup Language) and XML (Extensible Markup
Language). HTML was created with predefined semantics (for creating Web pages)
but XML was created as a general-purpose syntax with no predefined semantics.
Interestingly, the semantics associated with HTML have been refined since its
initial development; the last example, above, shows a modern Semantic HTML form,
but an older form might have been:
My favourite colour is <s>blue</s> red.
This is literally encoding the visual representation, as in
the first example sentence, above. The modern shift in emphasis is from presentation
to content structure, such that the mark-up would now show what was deleted and
what was inserted rather than simply that a line was drawn through a word.
So why might someone use XML rather than HTML when
transcribing an historical document? Well, despite that shift in emphasis, HTML
is not a good tool for transcribing text. Consider a document that has original
emphasis, such as underlines added by the author, or which has already been
marked up by an editor; this information has to be preserved and yet be distinguishable
from anything employed during the transcription process, and these are not just
presentational matters — there are semantics associated with the original
formatting. With a mark-up language such as XML, you have the flexibility
to represent all the different types and levels of information without any
conflict or ambiguity. Both TEI (Text
Encoding Initiative) and STEMMA
employ mark-up languages with support for transcription, and both have XML
representations.
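A small Python sketch may help to show what that flexibility buys. The element and attribute names below (`line`, `emph`, `by`) and the sample sentence are invented purely for illustration; they are not taken from TEI or STEMMA, both of which define their own vocabularies.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: <emph> marks underlined text, and the "by"
# attribute records who applied the underlining to the original page.
transcript = (
    '<line>Sailed from <emph by="author">Hull</emph> on the '
    '<emph by="editor">4th</emph> of May.</line>'
)

root = ET.fromstring(transcript)

# Group the emphasised spans by the hand that made them, keeping the
# author's emphasis distinct from a later editor's marks.
layers = {}
for el in root.iter("emph"):
    layers.setdefault(el.get("by"), []).append(el.text)
print(layers)

# The running text is still recoverable with all mark-up removed.
plain = "".join(root.itertext())
print(plain)
```

Because the vocabulary is open-ended, further layers (for instance, the transcriber's own notes) could be added without colliding with the ones already present.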
Using the terminology from the Wikipedia article on mark-up languages, there
are several forms of mark-up that are required for micro-history narrative:
- Descriptive: Marking the text in order to capture its structure and content, rather than specific visualisations of it. Ultimate control over explicit physical rendition such as colour, bold, italic, underline, font name, and font size is best left to the tool presenting the text (e.g. HTML+CSS).
- Presentational: This mark-up would be essential for a faithful transcription of something. Although modern systems (such as HTML5) frown on explicit presentational information, it may provide important information necessary for the analysis and correct interpretation of transcribed material. STEMMA’s approach to transcription separates structure and content from presentational or stylistic matters: see Descriptive Mark-up.
- Semantic: Although the aforementioned Wikipedia link suggests that this is an alternative name for Descriptive mark-up, the usage here is more distinct. This mark-up provides information about the meaning or interpretation of textual references. It is therefore different from the structure and layout in a purely textual context, and is precisely what is needed to identify entities such as Persons and Places.
Semantic mark-up is especially important for narrative
essays and narrative reports stored in a genealogical context. Although both TEI and STEMMA have their own
schemes, there is a divergence that will become more important once the
genealogical industry acknowledges a narrative requirement: the semantics are
not independent of the data model. This may be hard to explain, but simply
flagging a name as that of a person or place — irrespective of whether it makes
a conclusional identification — is an isolated semantic that is addressed in a
roughly similar fashion by the two schemes. However, linking such a reference
into a chain of conclusion-evidence-information-source would not make any sense
outside of a genealogical data model. In effect, TEI is a very comprehensive
text-encoding scheme but it cannot deal with semantics associated with an
all-embracing data model.
A familiar form of mark-up that we might encounter in wikis
or blogs is a lightweight
markup language. These have a simple syntax that can be entered directly by
the editing user, as opposed to being generated in response to some graphical
operation or option selection. Although still designed to be computer-readable,
they are easier for a human to read — and, hence, to write. For instance:
**bold text** __underline text__ //italic text//
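Being computer-readable means that a program can translate these forms mechanically. The Python sketch below converts the three delimiters shown above into HTML tags using regular expressions. The choice of `<b>`, `<u>`, and `<i>` as targets and the function name `to_html` are assumptions, and real lightweight mark-up languages have many more rules (a URL containing "//" would confuse this one, for instance).

```python
import html
import re

# Each rule pairs a delimiter pattern with its HTML replacement.
# The non-greedy (.+?) stops at the first closing delimiter.
RULES = [
    (r"\*\*(.+?)\*\*", r"<b>\1</b>"),   # **bold text**
    (r"__(.+?)__", r"<u>\1</u>"),       # __underline text__
    (r"//(.+?)//", r"<i>\1</i>"),       # //italic text//
]


def to_html(text):
    """Convert the three lightweight forms into HTML tags."""
    # Escape &, < and > first so that literal characters in the text
    # cannot be mistaken for the tags generated below.
    out = html.escape(text, quote=False)
    for pattern, replacement in RULES:
        out = re.sub(pattern, replacement, out)
    return out


print(to_html("**bold text** __underline text__ //italic text//"))
```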
When looking at the mechanics of adding mark-up to an
electronic document, there are two very different approaches. The most
common is inline, or embedded, mark-up, where the mark-up language is
interwoven with the text in a manner such that it can still be distinguished.
For example:
Here is a link: <a href="http://parallaxview.co/stemma/">STEMMA</a>
The alternative is known as stand-off, or remote, mark-up
and involves holding the mark-up in a separate file (or other location) from the
underlying text, usually linking them by character coordinates. The concept of
stand-off mark-up is attributed to Henry Thompson and David McKelvie in 1997,[4]
and the advantages include:
- The ability to mark up read-only (protected) or very large files.
- The ability to support mark-up from independent editors, held as separate layers, and without them having to form a single code hierarchy.
- The ability to combine disjoint segments into a single annotation.
Others are stated but I’m less convinced of their value. In
contrast, the advantages of inline mark-up include:
- Simplicity. One file to maintain or distribute.
- The text and mark-up are edited together, with less chance of them getting out-of-step.
Which is best really depends on the application
requirements.
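For comparison with the inline examples, here is a minimal Python sketch of stand-off mark-up linked by character coordinates. The `Annotation` class, its field names, and the two editor layers are all hypothetical; the text itself is never modified, and each editor's annotations sit in their own layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character (inclusive)
    end: int     # offset one past the last character (exclusive)
    label: str   # what the annotation says about the span
    layer: str   # which editor's layer it belongs to


# The underlying text is held separately and is treated as read-only.
text = "My favourite colour is blue red."

# Annotations from two independent (hypothetical) editors.
annotations = [
    Annotation(23, 27, "deleted", "editor-1"),
    Annotation(28, 31, "inserted", "editor-1"),
    Annotation(3, 19, "British spelling", "editor-2"),
]


def spans(text, annotations, layer):
    """Return (label, annotated text) pairs for one layer."""
    return [
        (a.label, text[a.start:a.end])
        for a in annotations
        if a.layer == layer
    ]


print(spans(text, annotations, "editor-1"))
print(spans(text, annotations, "editor-2"))
```

The sketch also hints at the main risk: if the text were edited, every offset after the change would need adjusting, which is the out-of-step problem that inline mark-up avoids.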
A common example of stand-off mark-up, which isn’t always
viewed as such, is CSS (Cascading
Style Sheets). It was mentioned above that Semantic HTML favours content
structure in place of presentation. This works because modern HTML now goes
hand-in-hand with CSS, which can describe the presentational aspects in a
separate file. Rather than being linked by character coordinates, they are linked
by such things as element type and class, collectively described as selectors, which may explain why CSS is
rarely described as stand-off mark-up. In effect, HTML then becomes an inline
mark-up describing content that relies on a stand-off mark-up for presentation.
The advantages of being able to change the overall presentation style of a Web page
in a consistent way, or share the style between multiple pages, should be
clear.
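The selector idea can be imitated in a few lines of Python. This is a deliberately simplified sketch: the `styles` dictionary stands in for a style sheet, the class name "fix" is invented, and only two selector forms are handled (a bare element type, and a class with a leading dot). Real CSS selectors and the cascade are considerably richer.

```python
import xml.etree.ElementTree as ET

# The content, marked up inline for structure only.
page = (
    '<p>My favourite colour is <del>blue</del> '
    '<ins class="fix">red</ins>.</p>'
)

# A hypothetical style sheet, held apart from the content.
styles = {
    "del": {"text-decoration": "line-through"},  # element-type selector
    ".fix": {"color": "green"},                  # class selector
}


def style_for(element, styles):
    """Merge the declarations of every selector matching the element."""
    merged = {}
    for selector, declarations in styles.items():
        if selector.startswith("."):
            matched = selector[1:] in element.get("class", "").split()
        else:
            matched = selector == element.tag
        if matched:
            merged.update(declarations)
    return merged


root = ET.fromstring(page)
for el in root.iter():
    print(el.tag, style_for(el, styles))
```

Swapping in a different `styles` dictionary changes the presentation of every matching element without touching the content, which is the advantage described above.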
I want to round off this review of annotation with a quick
mention of the humble word-processor. So familiar and useful is this tool that
we give little consideration to how it works, or what goes on inside — oh, how
I wish genealogy would catch up there. It allows the end-user to add
presentational mark-up (e.g. bold, or a specific font-face) and semantic
mark-up (e.g. a hyperlink, or a review comment), but you don’t see the
underlying mark-up. That mark-up language is complicated and so is
deliberately hidden from the end-user. The net effect of that is to reinforce
the user-interface model and give the impression that the end-user is somehow annotating
the visible text directly. This is an important distinction — that a hidden
nuts-and-bolts mark-up supports the notion, and the physicality, of annotation in
the user interface — and it should be an important consideration for future
genealogy tools. There is no excuse for expecting the end-user to edit the raw
mark-up rather than using a WYSIWYG
(“What You See Is What You Get”) interface.
[1] Oxford Dictionaries Online (http://www.oxforddictionaries.com/us/definition/english/annotate : accessed 1 Mar 2016), s.v. “annotate”.
[2] John Keats, “Ode to a Nightingale” (1819); image credit: Ryan Johnson (https://www.flickr.com/photos/kmonojo/4288773728 : accessed 1 Mar 2016); Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0).
[3] Martin Weisser, Practical Corpus Linguistics: An Introduction to Corpus-Based Language Analysis (John Wiley & Sons, 16 Feb 2016), ch.11.
[4] Henry S. Thompson and David McKelvie, “Hyperlink semantics for standoff markup of read-only documents”, May 1997, technical report, Language Technology Group, HCRC, University of Edinburgh (http://www.ltg.ed.ac.uk/~ht/sgmleu97.html : accessed 2 Mar 2016).