GeneaBloggers

Sunday, 23 April 2017

Transcription Boundaries



We all know about transcription, right … or do we? What are the ultimate goals? What are the limits, and are they inherent ones or self-imposed ones? I’m taking this opportunity to expand on some important transcription breakthroughs in the recent STEMMA V4.1 release.

Transcription Sources
Most people would begin by transcribing textual sources paragraph-by-paragraph, or sometimes line-by-line, depending upon the actual source. It would quickly become apparent, though, that various scenarios cannot be transcribed directly as literatim text, such as uncertain characters or words, crossed-out text, text inserted or changed, and marginal annotation. They would then have to decide on some form of mark-up to represent those scenarios (see Power of Annotation), but which one?

There are many schemes, ranging from old-style manuscript mark-up[1], through simple ASCII-character mark-up, to full-blown mark-up languages such as TEI (Text Encoding Initiative). This latter technology, for instance, can represent semi-diplomatic or full diplomatic transcription of textual sources in digital form. Diplomatic transcription might be valuable for preservation, but is that what we need for analysis?

Typescript Sources

This should be the easiest of the cases: given a page of typed text, we might employ OCR to automate the conversion to a digital form. This is all very well if it is perfectly readable, but barely-readable sections, or additional hand-written annotation, would require a mark-up scheme.

And yet there are some subtle, but profoundly important, situations that rarely get mentioned. The presence of different fonts or typefaces in a printed electronic document would be taken for granted as indicating some semantic difference (e.g. a heading, abstract, or a footnote), but what about documents produced on an old-style typewriter? The presence of different typefaces might then indicate that a document was written on different machines at different times. Similarly with the alignment of the lines, or the marginal indent. But how do we indicate that in the digital form?

Suppose that there was a difference in the sophistication of the grammar in different sections, one that might provide a vital clue to different authors. How would that be represented?

A more important question is: who would be the beneficiary of those indications? Schemes concerned with preservation will employ software taxonomies to categorise every eventuality, but those subtleties — which could be crucial to the analysis and interpretation of a document by a researcher — would almost certainly be excluded as unimportant in the digital representation.

Manuscript Sources

When transcribing manuscript documents, the points I’ve just raised become much more prominent. Contributions from different authors are generally more obvious because of their handwriting styles, and these obviously need to be distinguished in order to support any analysis, but what about stylistic variations?

Suppose that someone had underlined a word. That would clearly be an indication of emphasis, and the transcriber might represent it using some mark-up language (e.g. <u>word</u>) or some lightweight mark-up language (e.g. __word__), but what if a different word was underlined twice, or more times? This question also applies to text that has been struck-out. My point is that this is an important piece of information to capture, but how much more is required for analysis than for preservation?

As another example, consider if the author had used different coloured inks. James Joyce and Virginia Woolf both used different coloured pens or crayons in their work. Should a mark-up scheme have taxonomies for the basic colours, or all possible shades and hues? Character size and intensity (e.g. from a firm hand) can also be indicative of something. Who would benefit, though, from knowing that one paragraph was in dark green and another in light green: the software or the researcher? Is there a practical limit to the number of important variations that software taxonomies can distinguish, and if so then why do we insist on that route?

Audio Sources

Schemes that deal with audio transcription are generally specialist, and distinct from those related to textual transcription. The main reason is that those stylistic variations multiply exponentially. Not only do the transcriptions have to distinguish between contributions from different speakers, but they also need to indicate such things as speaking quickly/slowly, loudly/softly/whispered, singing, false accents, mimicry, and even different intonation. Schemes for audio transcription try to define taxonomies for these cases — although there will always be cases that aren’t covered — and the area of intonation is treated in a very formal way by linguistic analysis.

There may be cases of unknown words, slang, or strange pronunciations, each of which may need clarifying annotation.

While it is clear that the field is complex, I want to make an argument that there is a broad categorisation of the scenarios that has parallels in textual transcription, and that a single approach can deal with all three transcription source types. First, let’s look at some further complexities for audio.

There may be utterances or sounds from a given contributor that cannot be transcribed directly as text. For instance, a sneeze, cough, sniff, yawn, whistle, laugh, or swallow.

There may be a significant pause in someone’s speech that is important in the context of their words.

There may be any number of gestures or items of non-verbal communication that are equally important to capture within the transcript. For instance, a nod, smile, head-shake, squint, frown, or applause.

There may be instances where different voices — each of which is being transcribed — are overlapping each other, or where there is some untranscribed background contribution.

Conclusion

We can group all the above scenarios into the following broad categories:

  1. Language from different contributors. Distinguishing different hands, voices, etc.
  2. Stylistic differences from any particular contributor. Different emphasis, emotional delivery, typeface, handwriting, etc.
  3. Annotation where explanation or clarification is needed. Examples are unusual words, unknown words, slang, or local pronunciations.
  4. Contributions that cannot be transcribed directly or wholly as text. This includes changes, marginal notes, noises, gestures, and pauses.
  5. Parallel Contributions. This category is specifically related to audio.

STEMMA’s transcription support is designed to make material searchable, but also to support deep analysis. Some of these categories were already catered for in the case of textual transcription, but supplementing them to cater for the remaining categories implicitly addressed audio transcription too. For instance, the <Alt> and <NoteRef> elements already catered for category #3 and needed no changes. The <Anom> element already represented textual anomalies, and so was extended to address the other anomalies in category #4.

The way that <Anom> was extended set the scene for the other extensions I will describe in a moment. Its existing taxonomy (see the http://stemma.parallaxview.co/anomaly-mode/ namespace) was given extra items of Gesture, Noise, and Pause. Within these, though, the specific gestures and noises are described using text, by and for the researcher, and not by using some limitless software taxonomy.
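
As a purely illustrative sketch — only the Gesture, Noise, and Pause items and the use of free-text descriptions are confirmed above, so the attribute name and the placement of the description are my assumptions — such anomalies might be marked roughly as follows:

<!-- Illustrative only: ‘Mode’ and the element content are assumed, not taken from the STEMMA documentation -->
<Anom Mode=’Pause’>long hesitation before answering</Anom>
<Anom Mode=’Noise’>dog barking in the background</Anom>
<Anom Mode=’Gesture’>nods slowly while speaking</Anom>

The descriptions here are ordinary text written by, and for, the researcher rather than values drawn from some fixed taxonomy.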

The STEMMA transcription elements <ts> (typescript sources) and <ms> (manuscript sources) were supplemented by <voice> (audio sources), and each was enhanced to cope with categories #1 and #2. They were extended with new ‘id’ and ‘scheme’ attributes. For instance:

<ms id=’id’ scheme=’scheme’>An example sentence</ms>

What these attributes do is attach a key representing the contributor (e.g. a hand, or a voice) and a specific stylistic variation of that contributor. No taxonomies are used here because the differentiation and description may be subjective; the differentiation is designed to support analysis rather than mere rendition; and there need be no constraints.

The last category (#5) is addressed by specific variations of the <voice> element that allow it to be used as a container for multiple contributions.
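
Purely to illustrate the idea — the nesting and attribute values shown here are my own sketch rather than the documented syntax, which is at Descriptive Mark-up (linked below) — two overlapping speakers might be represented by a <voice> container holding the individual <voice> contributions:

<!-- Illustrative only: the container use of <voice> is described in the documentation; this exact nesting is assumed -->
<voice scheme=’overlap’>
<voice id=’host’>and that, I thought, was the end of it</voice>
<voice id=’guest’>well, not quite the end</voice>
</voice>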

A small example of an audio transcription employing these features may be found at Dialogue Transcription. The <ts>, <ms>, and <voice> elements are documented at Descriptive Mark-up.

The rationale behind this approach is actually quite a well-known one, although not in this field. In the area of Web mark-up, HTML5 tries to separate structure and content from presentation, the latter being left to something like CSS. For the formatting of Web pages, this avoids cluttering the mark-up describing the structure and content of page information, and ensures a consistent presentational style is applied across the pages. For transcription, it avoids cluttering the mark-up describing the structure and content from the various contributors, while leaving the researcher complete freedom to describe those stylistic details in narrative as part of their analysis.




[1] Rarely usable in a computer-based transcription because the old symbol set does not correspond with available symbols in an electronic document.

Wednesday, 19 April 2017

STEMMA V4.1



An original goal of STEMMA was to be able to represent rich-text narrative that could be used for authored works, including essays, memories, and reports. In addition, it aimed to support transcription, including transcribed extracts, which has quite specific requirements of its own.

STEMMA V4.1 has concentrated on its mark-up in these areas and has solved a number of long-standing issues with some novel approaches. Such was the success of the approach to textual transcription that this version also addresses audio transcription as a companion to it. I know of no other system that addresses both of these in a consistent manner, and certainly not when including rich-text authored work and semantic mark-up.

Overview

A goal of HTML5 was to separate structure and content from presentation in Web pages. STEMMA has applied a similar principle to its descriptive mark-up for both authored work and transcription.

STEMMA is not a presentation format. It therefore concerns itself with narrative structure, content, and semantics, but not the finer details of the presentation such as colours and fonts. STEMMA narrative may be transformed into any number of presentational formats for visualisation (e.g. HTML+CSS), and it is in these formats that such things would be configured, including page size, style galleries, choice of footnote/endnote/tablenote indicators, heading and cell formatting in tables, caption position, paragraph separation, styles for semantic elements, and so on.

Unusually, it has also applied this principle to both textual and audio transcription. Identifying the structure and content is more important than the finer details of their style and presentation, and the interpretation of any stylistic differences requires analysis rather than simply being a display matter. For instance, marking where a manuscript used different colours in different places is more important than the specific colours and shades — that level of detail can be written in narrative for the reviewer rather than trying to use some limitless taxonomy for the software. Similarly with different written styles, which may or may not have been evidence of multiple authors. In a typescript document, it would equally apply to different fonts, font-sizes, ink intensity, marginal alignment, or even usage of grammar; all of these could have a bearing on the analysis of that document.

In audio transcription, this approach simplifies a complex area by giving freedom to the transcriber to detail the different voices, intonations, noises, and gestures.

Authored Work

The functionality of STEMMA’s descriptive mark-up has now evolved to the level where I can automatically generate blog-posts for research articles directly from my internal representation.

In order to demonstrate the new version, I have used the recent 5000-word article entitled Jesson Lesson to generate a fully-worked STEMMA example, available at www.parallaxview.co/familyhistorydata/downloads/JessonLesson.xml. This genuine research article included precise layout, transcribed extracts, tabulation, endnotes and tablenotes, and hyperlinked images. Its 47 endnotes included examples of reference-note citations, discursive notes, analytical commentary, and multi-source references — the handling of which was outlined previously at Cite Seeing — but also included examples of conflated citations where details of multiple people are placed in a single note for readability.

It was always a personal goal to produce better quality research articles, and so force STEMMA to address real-world scenarios rather than “desktop scenarios”. As a result of this, STEMMA’s general approach to citations has shifted slightly. Although support of citation-elements — implemented using its Parameter mechanism — has been enhanced, the focus of the computer-readable form is now on correlation and interrogation rather than mere formatting. The number of real-world cases (see list under Citations) is just too great for authored works to delegate formatting entirely to software that acts blindly from mere values. This version, therefore, finds a bridge between preferred hand-crafted forms and computer-readable citation-elements.

Another area that has been enhanced greatly is tables, which now support control over table width, column widths and alignment, captions, and tablenotes (i.e. citations deposited at the foot of a table).

Textual Transcription

The existing <ts> element, used to mark text transcribed from a typescript document, and the <ms> element, used to mark text transcribed from a manuscript document, both have new ‘id’ and ‘scheme’ attributes. These label the respective contributions with user-defined tags — ‘id’ for distinct contributions, such as different authors, and ‘scheme’ for stylistic variations — that can be described separately for the benefit of the reviewer.

For instance:

<NoteRef><Text Class=’Legend’>
bold-blue – text was written with a broad-tipped turquoise felt marker.
</Text></NoteRef>

<ms scheme=’bold-blue’>This section is now out-of-date and is being reworked</ms>

The elements <page>, <col>, <p>, <line>, and <posn> now take SVG-like image coordinates (percentage displacement from top-left image corner) for linking transcription elements to a copy of the original document. One use of this is to support parallel scrolling of image and transcription for the end-user.

The associated image is specified by a preceding <ResourceRef> element identifying a Resource entity using the mode ‘SynchImage’.
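
To give a feel for how these pieces fit together — the ‘x’/‘y’ attribute names and the content of the <ResourceRef> element are my assumptions; only the element names, the percentage-based coordinates, and the ‘SynchImage’ mode are stated above — a fragment might look roughly like this:

<!-- Illustrative only: attribute names and ResourceRef content are assumed -->
<ResourceRef Mode=’SynchImage’>[reference to the Resource entity for the page image]</ResourceRef>
<page x=’0’ y=’0’>
<p x=’6’ y=’12’>
<line x=’6’ y=’12’>First transcribed line of the opening paragraph</line>
<line x=’6’ y=’15’>Second transcribed line</line>
</p>
</page>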

Audio Transcription

For audio transcription, the <voice> element provides the analogy to <ts>/<ms>, and it similarly takes ‘id’ and ‘scheme’ attributes. This allows different vocal (or other audio) contributions to be distinguished, and also their intonation, emotional delivery, artificial accents, etc.

Additional features are supported in a way analogous to textual transcription:

  • Anomalous contributions from an individual that cannot be represented as text, including noises, pauses, and gestures – see <Anom>
  • Alternative word meanings, clarifications, or other notes – <Alt> and <NoteRef>, exactly as with textual transcription
  • Time synchronisation – time-stamping with <time>. This is analogous to the <posn> element, and other x/y coordinates, used for textual transcription.

For time-stamping, the associated recording is specified by a preceding <ResourceRef> element identifying a Resource entity using the mode ‘SynchAudio’, analogous to ‘SynchImage’ for images.
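
As a rough sketch — how <time> carries its value, and the content of the <ResourceRef> element, are my assumptions; only the element names and the ‘SynchAudio’ mode are stated above — a time-stamped exchange might look something like this:

<!-- Illustrative only: the time format and ResourceRef content are assumed -->
<ResourceRef Mode=’SynchAudio’>[reference to the Resource entity for the recording]</ResourceRef>
<voice id=’interviewer’><time>00:04:32</time>When did the family move to Stockport?</voice>
<voice id=’subject’ scheme=’hesitant’><time>00:04:40</time>It must have been the spring of 1861.</voice>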

As well as marking distinct voices, these features include the ability to mark overlapping contributions and background contributions. An example demonstrating many of these features may be found at Dialogue Transcription.

Change Details

Specific changes to the data model include the following:

  • ‘WhereIn’ attribute added to Citation Parameter definitions. This finally provides the missing criteria necessary for the automatic generation of shortened subsequent reference-note citations. ‘Subst’ attribute added to Citation parameter values in order to override formatting, or to provide a substitution where a value is unavailable.
  • <ParentCitationLnk> now allowed in both <CitationLnk> and <CitationRef> elements in order to create transient chained citations.
  • Quality element, within Source entity, moved inside the Frame element.
  • Review of entries in citation-layer-type namespace.
  • DataControl element of Resource entity supports attribution text.
  • Control of table widths, and individual column widths and alignments.
  • Ability to align images when embedded within narrative.
  • Ability to hyperlink images embedded in narrative.
  • Requirement for enclosing Narrative element dropped for Text elements, except for top-level Narrative entities. Text elements can now be nested.
  • <cb> replaced with <col>, and relationship between paragraphs and columns now reversed (paragraphs now within columns).
  • ResourceRef Mode=SynchImage allows synchronisation between images and transcriptions.
  • Corresponding SVG-x/y coordinates added to elements <page>, <col>, <p>, and <line>. Additional <posn> element defined to associate coordinates with arbitrary text locations.
  • <Page>/<Line> renamed to <page>/<line> and moved alongside <p>/<col> as related to structure and content rather than semantics.
  • Mode=Tablenote attribute supplementing Footnote and Endnote in various places.
  • Text-element Header=boolean attribute replaced with Class=Header | H1 | H2 | H3 | Caption | Footnote | Endnote | Legend | Tablenote.
  • Text-element Class=Caption attribute used in Resource/ResourceRef and tables for generating captions.
  • Text-element Class=Footnote | Endnote | Tablenote attribute used in CitationRef to allow pre-formed (preferred) citations.
  • Deprecated the <Text> attributes Abstract=boolean, Extract=boolean, Manuscript=boolean, and Transcript=boolean.
  • <voice> mark-up added to supplement existing <ts>/<ms> mark-up. <ts>/<ms>/<voice> all enhanced to cope with different hands, voices, fonts, colours, etc.
  • In transcripts of audio recordings, support for multiple voices, overlapping dialogue, intonation, gestures, noises, pauses, timestamps, etc.
  • ResourceRef Mode=SynchAudio allows synchronisation between audio recordings and transcriptions, analogous to SynchImage for textual transcription (above).
  • Complete revision of Mode values for CitationRef element.
  • Relaxation of Date Parameters in order to cover the full range of calendars. One requirement was to represent the date-of-issue for newspaper sources that predated the Julian-to-Gregorian changeover.

Further refinements to this data model are uncertain as it has now achieved the level of stability and functionality that was required for its serious usage.

Wednesday, 15 February 2017

Feeding the Trees



Having just attended RootsTech 2017, I feel compelled to compare the state of genealogy with my previous observations and viewpoints, as reported last year in Evolution and Genealogy. What has changed, and in which direction? I will also make some concrete suggestions to the industry that could go a long way to averting the headlong demise of online genealogy.


Figure 1 – Compost frenzy.

RootsTech 2017

This year’s Innovator Showdown semi-finalists presented products with the following functionality: photograph/image tagging and organisation, indexing, DNA triangulation, transcription, stories and memories, celebrity/friend tree matching, and newspaper research. That’s quite a broad range, and by itself doesn’t give away much in terms of trending. Some of the products were specialised, but others offered insular functionality, divorced from complementary functionality elsewhere — a point that I also mentioned last year. You would be forgiven for asking: why can’t I have that, together with that, and inside this?

The overall message of RootsTech was still about stories and memories, and I’m totally on-board with this, but it is just the tip of a bigger requirement involving narrative. I applaud any change of focus away from raw data on trees to descriptive and audiovisual media that real people can relate to — allegedly allowing us to become heart specialists — but narrative (as favoured by humans but not by software designers) has many critical uses that were not addressed at the conference. More on this in a moment.

On the Wednesday (Feb. 8th), there was a session entitled “Industry Trends and Outlooks” with a panel that included Ben Bennett, Executive Vice President of International Business at Findmypast, and Craig Bott, co-founder, President and CEO of Grow Utah. Their particular comments were enlightening about current thinking in the commercial sector.

Ben acknowledged that not everyone wants to build a tree (or at least not just a tree), and that companies needed to understand their “customer context”. He was making the point that there is a mass market — apparently 83M people in the US interested and willing to pay — that involves a broad range of skills and interests, so how do you engage it? He suggested that products needed differentiation, with functionality aimed at the requirements of their particular customer group. I’m sceptical of this suggestion since it could be interpreted as different skills and depth of work translating into functional differences rather than user-interface (UI) ones; does the fact that some people write or research better than others necessarily mean that they’re the only ones wanting to do it?

Ben also acknowledged that good ideas don’t just come from within companies, and that they [Findmypast] are looking externally and willing to talk about new innovation. I believe this meant demonstrable products rather than written ideas, but it’s probably as close as we can expect to outreach so I wholly welcome his comments.

Craig talked about new technology in the areas of OCR and handwriting recognition — functionality that we all want — but also went on to describe neural networks being applied to the identification of named entities and semantic links. What this means is being able to pick out personal names, places, dates, events, etc., from digitised text, and also the relationships between them: biological or social relationships between people, origin or residence of someone, and dates of vital and non-vital events. Well, I have to repeat something that I’ve said elsewhere: it’s people that perform genealogical research, not software. Highlighting named entities could be an aid to newspaper research, but the researcher would be analysing the text, and across multiple documents rather than just one at a time.

My take on all this is that the large companies feel obligated to throw technology at genealogical (and historical) research, but the more fundamental issues of real research are not being addressed, or even acknowledged.

Fundamental Failings

I make no secret of the fact that I dislike online family trees as they’re currently implemented. They do not capture history, they make it far too easy to connect the wrong dots, and they’re an inappropriate organisational structure (i.e. they should be simply a visualisation of lineage). I’ve justified these points in previous posts, but let me summarise some of their basic failings that really need tackling.

a)    They are person-centric when it is time to enter data. For instance, in order to enter all the people in a given census household, it is nearly always necessary to start with each person in the tree, and then add each so-called “fact” and associated source to them. This is quite laborious as you really want to work from the census household rather than from the tree, and you have to frequently re-consult and re-describe the same document. If you want to attach an image of some document, say because you have a paper copy that’s not online at the current host site, then you’ll also be forced to attach it multiple times (hopefully not independent copies).
b)    When a source is added to a “fact” then it is a direct connection with nothing in between: no analytical commentary; no transcription; no justification for why it’s appropriate to the selected person; and no explanation as to why the name might be slightly different, or the date-of-birth implied by an age slightly different, from your conclusions. A consequence of this is that there’s no way to determine how a given conclusion was reached by someone.
c)    There’s no obvious way to add material that relates to multiple people. Photographs and document images are obvious examples, but the same problem relates to stories/memories, transcriptions, and any researched histories of your ancestors.
d)    There’s no obvious concept of ownership in a unified family tree. While still controversial in some quarters, most users do want this. As I mentioned last year, certain contributions should be immutable, but which? While a mere collection of “facts” can have no ownership (and cannot be copyrighted either), authored works such as research articles and personal memories must have.
e)    There will always be multiple possible conclusions in unified trees; anyone disputing that needs to understand the concept of evidence better. If there are no controls then there will be edit wars, and potentially loss of valuable contributions, but what form should they take? Throwing complicated technology at this in order to support multiple versions of the “truth” isn’t necessarily the right solution, and we need to take a step back and look at the dynamics of real research. Consider: what we’re doing isn’t always what we think we’re doing.
f)     Copying is made too easy in online trees, either from someone else’s tree or from material found elsewhere. In an ideal world, it should not be necessary, but these trees offer no alternatives. Their lack of functionality may even force users to put certain material elsewhere, thus leading to other users feeling they have to copy rather than cite or link to it. This all means that errors, or even tentative conclusions when a researcher hasn’t yet finished, will replicate like a virus. It also means that the provenance of a contribution is lost, and there can be no attribution to the original author, contributor, or owner.

While I dislike trees,[1] I do acknowledge the investment that sites may have in that paradigm. So what can be done to address these failings, and help trees evolve to meet more of the requirements of that mass market?

Layer Cake

The scheme I want to suggest to companies that host online family trees involves using separate layers. Back in Our Days of Future Passed — Part III, I explained how the STEMMA data model has two notional sub-models: conclusional and informational. The old GenTech data model also had separate sub-models, although its equivalent to informational was termed evidence. STEMMA purposely uses the term informational as its sub-model includes the information sources and the possible analysis of that information, irrespective of whether it contributes evidence relevant to some conclusion.

When information is cleanly separated from conclusions, it provides a natural distinction for controlling changes to the corresponding contributions. Conclusions — which include names, dates, and relationships in the online tree — would be editable by anyone, whereas information — which includes personal stories and memories, photographs and images of documents, source analysis, research, and proof arguments — would be editable only by the respective contributor (or possibly some registered agent, such as another family member).


Figure 2 – Conclusional and informational layers.

If someone had uploaded a photograph then a person in the tree could be linked to it, and although the link might be changed by anyone, the photograph could not. Similarly, if someone had uploaded their written research then conclusions on the tree could link to its relevant parts, and although those links could be changed by anyone, the original article could not.

I’ll expand on how this would work later, but first I want to point out an important subtlety: the arrows in this diagram are shown as down-pointing, from the conclusions to the associated information (including evidence). This would not be visible to the end-user since a connection is simply that (with no direction), but it is important for the purposes of change-control. If the source of the link was in the conclusional layer then it could be edited by all, but if it was in the informational layer (i.e. up-pointing) then it would be classified as part of the information source, just as we treat opinions in an authored work.

This may sound as though it offers redundancy rather than flexibility, but the distinction will become clearer as I progress.

Source-based Input

The following example is from the 1861 census of England and Wales (Piece: 2560, Folio: 23, Page: 6), and represents the household of 8 Homleys Court, Heaton Norris, Stockport, Cheshire. It was used as an example on the STEMMA site because it contained a number of errors, errors that had to be explained before identification of the persons could be made. The family name was incorrect, relationships were ambiguous, ages were wrong, and place names were wrong. Simply connecting “facts” on a tree to this census page would be silly as there would be so many discrepancies.

Name | Relation | Condition | Sex | Age | Birth Year | Occupation | Birth Place
Samuel Bradley | Head | Married | M | 30 | 1831 | Nail Maker | Belper, Derbyshire
Mary Bradley | Wife | Married | F | 24 | 1837 | Cotton Weaver | Lougborough, Leicestershire
John Bradley | Boarder | Married | M | 26 | 1835 | Slater | Belper, Derbyshire
Selina Bradley | Boarder’s Wife | Married | F | 22 | 1839 | Doubler (Cotton) | Belper, Derbyshire
George Bradley | Boarder’s Son | - | M | 3 | 1858 | - | Heaton Norris, Lancashire
Table 1 – 1861: Household of Samuel Bradley. Extracted and corrected details.


Figure 3 – 1861: Household of Samuel Bradley. Cropped image.

For a user-owned tree, using the informational layer provides the currently missing place to extract the details and to explain why they might be incorrect. This alone would prevent users trying to create multiple birth events when sources disagree, but it would also provide them with a chain of explanation that they could follow at a later time.

It would also allow the user to work with, and from, a document in a source-based manner, thus making their data entry more efficient. Any analytical commentary and citation (should one be needed) would be in one place that could be linked to all the relevant tree entries.

In a unified tree, adding a copy of an image (or a hyperlink to an online version) only need be done once, but the extraction of details and the associated analysis might be done by different people. In other words, there could be multiple contributions that don’t exactly agree. This is in the nature of research and it must be accommodated.

The case of a document transcription is analogous since one version may be more precise than another, or may have interpreted hard-to-read text differently, or may have added annotation clarifying some aspects.

Authored Works

Authored works, including personal stories/memories and research articles, are crucial for capturing history. The mere inclusion of these would provide additional source material that could make the overall experience in online trees much richer. Research material willingly shared by those who make that effort would also serve to help those who can’t or won’t. Currently, anyone wanting to share such material has to use a separate blog (as I do) or some personal Web site; simply dumping your work in a plain-text area, with no formatting, no tables, no pictures, and probably attached to a specific point in some tree, just doesn’t cut it in the real world.

This scheme would make it much easier to accommodate material that relates to multiple persons since it is not hung directly from any one tree branch.

A point I hinted at earlier is that the author of such works is making connections — opinions — that identify the persons referenced in various sources. Taking one of my articles as an example (Jesson Lesson), this makes a case for various family relationships and their vital events. So how would this get connected to a unified tree; how would my up-pointing opinions relate to the down-pointing conclusions on the tree?

Well, remember that what we’re doing isn’t always what we think we’re doing. The researcher will have put together the details and relationships of a small group of people, but they haven’t slotted them into any global tree; that’s manifested in the conclusional layer. Also, their opinions may differ from those of another researcher and so the final conclusions must arbitrate based on their narrative explanations.

STEMMA would rely on semantic tagging (i.e. mark-up) embedded within the text to identify individual references, but that would be too complicated for most online trees. Imagine, instead, that each work was annotated with a piece of structured meta-data[2] that enumerated the (possibly multiple) names of the referenced people, their relationships, and their vital events. This would represent the opinions of the author and so would be an immutable part of each work — effectively up-pointing connections, although we won’t use them like that.

The meta-data would be cataloguing the works as complete units rather than their individual references but there are some advantages to that. In fact, this is the same meta-data concept that I described in Blogs as Genealogical Sources, and so it would also cope when the authored work is published elsewhere, including blogs and even traditional books.
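
Purely as a hypothetical shape — every element and attribute name below is invented for illustration, since all that is specified here is that the meta-data enumerates the referenced people’s (possibly multiple) names, their relationships, and their vital events — such a summary might look something like this:

<!-- Hypothetical structure, not part of any existing schema -->
<ArticleMetaData>
<Person>
<Name>Samuel Bradley</Name>
<Event Type=’Birth’ Year=’1831’ Place=’Belper, Derbyshire’/>
<Relationship Type=’Wife’ Person=’Mary Bradley’/>
</Person>
<Person>
<Name>Mary Bradley</Name>
<Event Type=’Birth’ Year=’1837’ Place=’Loughborough, Leicestershire’/>
</Person>
</ArticleMetaData>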


Figure 4 – Meta-data for local and remote articles.

That article about using blogs as sources made the point that this meta-data should be created by the respective author — not by some neural net software trying to second-guess them — and that it could support even the most complex of genealogical searches that these sites have. In this scheme, it summarises the details that the article has found or derived in its narrative.

Maybe surprisingly, when a humble photograph is added to the informational layer, the situation is analogous to that of these authored works: the contributor may have identified the people present in the shot, but we all know that old photographs often get mislabelled. How nice, though, to be aware of who made the identifications, and how. If two people have differing information for the same image then we can arbitrate using their explanations.

Collaboration

Having source information, source analysis, and even authored works, in the informational layer would provide a rich substrate to feed the tree-based conclusions. Edit wars and accidental loss of data are avoided because the main user contributions are in the informational layer. But there will still be differences of opinion since nothing is certain when looking at past events. In this scheme those alternatives could co-exist with virtually no effort, but which do the conclusions point to?

What the scheme affords is the ability to arbitrate on the quality of some research, or other contribution, and not simply on the preponderance of conclusional instances.[3]

I now want to extrapolate to see how far it might be possible to take this scheme. Back in What to Share, and How - Part II, I presented a diagram explaining how STEMMA contributions can be joined together to automatically form a tree. Well, the same principle could be achieved using the contributions in this informational layer when they have the appropriate meta-data attached, as described above. In other words, if all the contributions were in unanimous agreement then construction of the tree could be automated.

But what about when they disagree, as is the normal case? This was a concern that I had in the aforementioned article, but when compared with the current situation of disagreeing contributions, these would be backed up by material whose quality could be used for arbitration. Not only that, this arbitration could be achieved using the ubiquitous Like button, stressing again that those differing opinions would all still be available, and nothing would be lost or discarded.

I hasten to add, here, that any implementation should avoid the temptation to use the researcher’s reputation, whether based on their ‘likes’ or their external persona. When a name is recognised then it might be tempting to ‘like’ their research without actually reading it. I know through experience[4] that an amateur who is driven to solve a mystery that’s very close to them, without the constraints of time and money, can make a better job than a qualified professional.

This may be a step too far in evolving shared trees since it would mean a quite different way of working for users. But the use of a Like button can still be employed to rate contributions in the informational layer.

Conclusion

This categorisation and separation of data contributions is something I already do in STEMMA; however, as I’ve presented here, it does not mandate the STEMMA data model. In fact STEMMA’s very broad micro-history scope would be (currently) inappropriate for those sites hosting family trees. What I’ve done, here, is to explain the principles in terms that apply to online trees. FamilySearch are quite close to this already since they have a separate memories area with different change-controls. The connections between this area and their unified tree would need work, and their narrative contributions would need some form of mark-up (not just plain ASCII text), but these are doable. Rather more effort would be required to handle the analysis and extraction facilities for source-based input.

So what’s different here? Isn’t this an obvious approach? Maybe it is in retrospect, after reading this article. Fundamentally, it breaks with the traditional notion of a tree as the organising structure within the software. If the industry can move beyond that then it would help with engaging that mass market and its many requirements — ones that I believe are actually common to us all, but maybe to different depths.

For me, personally, it’s not just about a revenue stream; it’s about giving users what they really need, it’s about the reputation of genealogy as a pursuit, and it’s about leaving a valuable legacy for future generations.



[1] I am interested in lineage, but also family history and micro-history; a tree merely visualises that lineage, and is inappropriate for organising any type of history.
[2] Structured usually means XML these days, and that’s good for handling user maintenance operations. If they are going to be searched or manipulated in bulk then a database derivative will probably be required.
[3] I use this term deliberately since there are generally few independent conclusions, but many replicated instances of those same conclusions.
[4] My entry into genealogy involved solving a family mystery, and hence fulfilling a promise that I made to my mother. I was told, by a professional, that it was impossible; it took me several years but I succeeded and so changed a number of lives forever.