We all know about transcription, right … or do we? What are
the ultimate goals? What are the limits, and are they inherent ones or
self-imposed ones? I’m taking this opportunity to expand on some important
transcription breakthroughs in the recent STEMMA
V4.1 release.
Most people would begin by transcribing textual sources paragraph-by-paragraph,
or sometimes line-by-line, depending upon the actual source. It would quickly
become apparent, though, that various scenarios cannot be transcribed directly
as literatim text, such as uncertain
characters or words, crossed-out text, text inserted or changed, and marginal
annotation. What those people then have to do is decide on some form of mark-up
to represent those scenarios (see Power
of Annotation), but which one?
There are many schemes, ranging from old-style manuscript
mark-up[1],
through simple ASCII-character mark-up, to full-blown mark-up languages such as
TEI (Text-Encoding Initiative).
TEI, for instance, can represent a semi-diplomatic or fully diplomatic
transcription of a textual source in digital form. Diplomatic transcription
might be valuable for preservation, but is that what we need for analysis?
Typescript sources should be the easiest of the cases: given a page
of typed text, we might employ OCR to
automate the conversion to a digital form. This is all very well if the page is
perfectly readable, but barely readable sections, or additional handwritten
annotation, would require a mark-up scheme.
And yet there are some subtle, but profoundly important,
situations that rarely get mentioned. The presence of different fonts or
typefaces in a printed electronic document would be taken for granted as
indicating some semantic difference (e.g. a heading, abstract, or a footnote),
but what about documents produced on an old-style typewriter? The presence of
different typefaces might then indicate that a document was written on
different machines at different times. Similarly with the alignment of the
lines, or the marginal indent. But how do we indicate that in the digital form?
Suppose that there was a difference in the sophistication of
the grammar in different sections, one that might provide a vital clue to different
authors. How would that be represented?
A more important question is: who would be the beneficiary of
those indications? Schemes concerned with preservation will employ software
taxonomies to categorise every eventuality, but those subtleties, which could
be crucial to the analysis and interpretation of a document by a researcher,
would almost certainly be excluded as unimportant in the digital representation.
When transcribing manuscript documents, the points I’ve
just raised become much more prominent. Contributions from different authors
are generally more obvious because of their handwriting styles, and these
need to be distinguished in order to support any analysis, but what about
stylistic variations?
Suppose that someone had underlined a word. That would
clearly be an indication of emphasis, and the transcriber might represent it
using some mark-up language (e.g. <u>word</u>) or some lightweight
mark-up language (e.g. __word__), but what if a different word was underlined
twice, or more times? This question also applies to text that has been
struck-out. My point is that this is an important piece of information to
capture, but how much more is required for analysis than for preservation?
As another example, consider if the author had used
different coloured inks. James Joyce and Virginia Woolf both used different
coloured pens or crayons in their work. Should a mark-up scheme have taxonomies
for the basic colours, or all possible shades and hues? Character size and
intensity (e.g. from a firm hand) can also be indicative of something. Who
would benefit, though, from knowing that one paragraph was in dark green and
another in light green: the software or the researcher? Is there a practical
limit to the number of important variations that software taxonomies can
distinguish, and if so then why do we insist on that route?
Schemes that deal with audio transcription are generally
specialist, and distinct from those related to textual transcription. The main
reason is that those stylistic variations multiply rapidly. Not only do
the transcriptions have to distinguish between contributions from different
speakers, but they also need to indicate such things as speaking
quickly/slowly, loudly/softly/whispered, singing, false accents, mimicry, and
even different intonation. Schemes for audio transcription try to define
taxonomies for these cases — although there will always be cases that aren’t
covered — and the area of intonation is
treated in a very formal way by linguistic analysis.
There may be cases of unknown words, slang, or strange
pronunciations, each of which may need clarifying annotation.
While it is clear that the field is complex, I want to make
an argument that there is a broad categorisation of the scenarios that has
parallels in textual transcription, and that a single approach can deal with
all three transcription source types. First, let’s look at some further complexities
for audio.
There may be utterances or sounds from a given contributor
that cannot be transcribed directly as text. For instance, a sneeze, cough,
sniff, yawn, whistle, laugh, or swallow.
There may be a significant pause in someone’s speech that is
important in the context of their words.
There may be any number of gestures or items of non-verbal
communication that are equally important to capture within the transcript. For
instance, a nod, smile, head-shake, squint, frown, or applause.
There may be instances where different voices — each of
which is being transcribed — are overlapping each other, or where there is some
untranscribed background contribution.
We can group all the above scenarios into the following
broad categories:
1. Language from different contributors. Distinguishing different hands, voices, etc.
2. Stylistic differences from any particular contributor. Different emphasis, emotional delivery, typeface, handwriting, etc.
3. Annotation where explanation or clarification is needed. Examples are unusual words, unknown words, slang, or local pronunciations.
4. Contributions that cannot be transcribed directly or wholly as text. This includes changes, marginal notes, noises, gestures, and pauses.
5. Parallel contributions. This category is specifically related to audio.
STEMMA’s transcription support is designed to make material
searchable, but also to support deep analysis. Some of these categories were
already catered for in the cases of textual transcription, but supplementing
them to cater for the remaining categories implicitly addressed audio
transcription too. For instance, the <Alt>
and <NoteRef>
elements already catered for category #3 and needed no changes. The <Anom>
element already represented textual anomalies, and so was extended to address
the other anomalies in category #4.
The way that <Anom> was extended set the scene for the
other extensions I will describe in a moment. Its existing taxonomy (see the http://stemma.parallaxview.co/anomaly-mode/
namespace) was given extra items of Gesture, Noise, and Pause. Within these,
though, the specific gestures and noises are described using text, by and for
the researcher, and not by using some limitless software taxonomy.
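As a rough illustration of that split between a small fixed taxonomy and free text, here is a minimal Python sketch. It is not STEMMA syntax: the class names, field names, and sample descriptions are my own assumptions, and the enumeration lists only the three new items named above rather than the whole anomaly-mode taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class AnomalyMode(Enum):
    """Only the three items named above; the real taxonomy has more."""
    GESTURE = "Gesture"
    NOISE = "Noise"
    PAUSE = "Pause"


@dataclass(frozen=True)
class Anomaly:
    mode: AnomalyMode   # coarse category drawn from the small, fixed taxonomy
    description: str    # free text written by and for the researcher


# Invented sample data for illustration only.
events = [
    Anomaly(AnomalyMode.NOISE, "short cough before answering"),
    Anomaly(AnomalyMode.GESTURE, "nods towards the photograph"),
    Anomaly(AnomalyMode.PAUSE, "about three seconds"),
]

# Software can filter on the coarse mode without understanding the description.
gestures = [e.description for e in events if e.mode is AnomalyMode.GESTURE]
print(gestures)
```

The software only ever needs to know the coarse mode; what kind of gesture or noise it was stays in the researcher's own words.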
The STEMMA transcription elements <ts> (typescript
sources) and <ms> (manuscript sources) were supplemented by <voice>
(audio sources), and each was enhanced to cope with categories #1 and #2. They
were extended with new attributes of ‘id’ and ‘scheme’. For instance:
<ms id='id' scheme='scheme'>An example sentence</ms>
What these attributes do is attach a key representing the
contributor (e.g. a hand, or a voice) and a key representing a specific
stylistic variation of that contributor. No taxonomies are used here, for three
reasons: the differentiation and description may be subjective; the
differentiation is designed to support analysis, not simply rendition; and
there should be no constraints on what can be distinguished.
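To show how such keys might support analysis, here is a minimal Python sketch that collects the transcribed text belonging to each contributor and style. It assumes only the element names and the two attributes shown above; the <transcript> wrapper, the key values, and the sample sentences are invented for the illustration and are not taken from the STEMMA specification.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: the <transcript> wrapper and the key values
# ("h1", "h2", "ink", "pencil") are invented for this illustration.
FRAGMENT = """
<transcript>
  <ms id='h1' scheme='ink'>Arrived at the farm on Tuesday.</ms>
  <ms id='h2' scheme='pencil'>Not true, it was Wednesday.</ms>
  <ms id='h1' scheme='ink'>The weather was fine.</ms>
</transcript>
"""


def group_by_contributor(xml_text):
    """Map each (id, scheme) pair to the list of text runs carrying it."""
    groups = defaultdict(list)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag in ("ts", "ms", "voice") and element.text:
            key = (element.get("id"), element.get("scheme"))
            groups[key].append(element.text.strip())
    return dict(groups)


groups = group_by_contributor(FRAGMENT)
for key, runs in groups.items():
    print(key, runs)
```

The keys are opaque to the code: it never needs to know what 'h1' or 'pencil' mean, only that runs sharing a key belong together.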
The last category (#5) is addressed by specific variations
of the <voice> element that allow it to be used as a container for
multiple contributions.
A small example of an audio transcription employing these
features may be found at Dialogue
Transcription. The <ts>, <ms>, and <voice> elements are
documented at Descriptive
Mark-up.
The rationale behind this approach is actually quite a
well-known one, although not in this field. In the area of Web mark-up, HTML5 tries to separate
structure and content from presentation, the latter being left to something
like CSS.
For the formatting of Web pages, this avoids cluttering the mark-up describing
the structure and content of page information, and ensures a consistent
presentational style is applied across the pages. For transcription, it avoids
cluttering the mark-up that records the structure and content from the various
contributors, while leaving the researcher complete freedom to describe those
contributors and their styles in narrative as part of the analysis process.
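The analogy can be sketched in a few lines of Python, with the caveat that this is my own illustration and not part of STEMMA. The transcript carries only keys, and the researcher's descriptions live in a separate lookup, much as a stylesheet lives apart from a page. All keys, names, and wording below are assumptions made for the example.

```python
# Researcher's narrative, kept apart from the transcript in the way a
# stylesheet is kept apart from an HTML page. All keys and wording here
# are invented for this illustration.
NOTES = {
    ("h1", "ink"): "Main diarist, firm hand, dark green ink.",
    ("h2", "pencil"): "Later reader correcting dates in the margin.",
}

# Transcribed runs as (id, scheme, text); only the keys travel with the text.
RUNS = [
    ("h1", "ink", "Arrived at the farm on Tuesday."),
    ("h2", "pencil", "Not true, it was Wednesday."),
    ("h3", "ink", "Agreed."),
]


def annotate(runs, notes):
    """Pair each run with its description, if the researcher has written one."""
    lines = []
    for contributor, scheme, text in runs:
        note = notes.get((contributor, scheme), "no description yet")
        lines.append(f"{text} [{contributor}/{scheme}: {note}]")
    return lines


report = annotate(RUNS, NOTES)
print("\n".join(report))
```

A description can be revised, or added later for a contributor not yet identified, without touching the transcribed text itself.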
[1] Rarely usable in a
computer-based transcription because the old symbol set does not correspond
with available symbols in an electronic document.