Saturday, 26 March 2016

Dynamic Genealogical Data



Yes, I will be discussing data models and software issues in this post — sigh! — but hopefully in a manner that may be instructive to those who want to know more. As well as introducing certain important concepts, I want to illustrate some typically tricky decisions that have to be made, before rounding off with a novel way of presenting custom data to the end-user. Know-it-alls who already have a software background can just fast-forward to the “Reporting” section.

Some time ago, I submitted a paper to FHISO on the subject of object models[1] in which I explained their relevance to scripting languages, query languages, and general dynamic data access. Admittedly, this paper was aimed at a software audience, but let’s try and pull it apart to explain what these languages are, and the difference between a data format, a data model, and an object model.

Dynamic Data

Data Model

A data model is a formalised description of the relevant data entities (e.g. person, or place), their properties (e.g. names, sex, coordinates, etc.), and their relationships (e.g. biological lineage, or place of birth). Issues such as indexes and database schemas are not applicable to data models, which are a more abstract definition of the data’s structure, or rather its pattern. But issues such as cardinality (how many items of one type can be related to another), ordinality (the ordering of items related to one instance of another), and optionality (whether an item is mandatory or optional) are relevant.

Let’s consider one aspect of a genealogical data model to help illustrate this point: biological lineage. Every person has exactly one biological mother and one biological father, even if they are unidentified; there cannot be more than one of each (ignoring the possibility of donor DNA), but some representations of this constraint will be better than others.

If the parent person entity (e.g. a mother) points to each individual child of hers then this accommodates the erroneous situation where multiple mothers might point to the same child. However, the converse scheme, where each person entity points to their respective mother and father, enforces the cardinal integrity of the relationship without having to perform constant error checking.

Entity relationship schemes

It should be noted, here, that the direction of the link makes no difference to the ability to find children-of-a-parent or parents-of-a-child; both can be indexed from either one of the schemes.
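
As a rough JavaScript sketch of the second scheme (the property names and sample data are illustrative, not taken from any particular product):

// Each person has exactly one slot for each biological parent, so the
// cardinality is enforced by the shape of the data itself.
var mother = { name: "Mary Jesson",  mother: null, father: null };
var child  = { name: "Tony Proctor", mother: null, father: null };
child.mother = mother;   // assigning again would replace, never add, so
                         // two simultaneous mothers are impossible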

So what if you don’t know the father? Well, a missing father or mother link could easily be recognised as an indication that they are unidentified, but suppose that you have some incomplete details for them? Suppose, for instance, that you don’t know the name of the mother but you have her date of birth. This is where we get into controversial territory, and where I’m going to make a very bold statement. There are many online threads that advocate the substitution of underscores, question marks, or some special text such as “Unknown”, “LNU” (Last Name Unknown), etc., for a missing name. In a software context, all of these are absolutely the wrong way to handle the situation. This is not a personal belief — it is best practice in a profession in which I have spent decades. It doesn’t matter who is making those recommendations, and there are no special cases for genealogists.

OK, rant over; now let me clarify this. Ignoring the fact that any alphabetic text may not translate well when sharing your data with someone from a different locale, the choice over which substitution to use for a missing name, or any missing datum, should not be given to the end-user, or even to a specific software product. More than that, good software would represent non-value conditions of a datum, such as unknown, not applicable, or erroneous, in a different domain to real values so that there is zero chance of a clash. For instance, consider the difference in a SQL database between NULL (a special condition in a column) and “NULL” (a normal textual value in a column). Also, the display value by which those special data conditions are represented to the end-user is a choice by the product, and not dictated by the software representation in the corresponding data format or data model.
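
A minimal JavaScript sketch of the distinction, assuming nothing beyond the language itself:

// The non-value condition lives in a different domain from real text values,
// just as SQL's NULL differs from the string "NULL".
var motherName = null;        // unidentified: cannot clash with any real name
var oddName    = "Unknown";   // a real, if unusual, textual value
// The display form is then a product choice, not part of the stored data:
console.log(motherName === null ? "(unidentified)" : motherName);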

I appreciate that some software may not follow these best-practices, but it is important to understand why this is bad for everyone. Tamura Jones produced an excellent article related to this subject in 2013, discussing the impact of using acronyms and other invented values.[2] As I recently commented on one of Randy Seaver’s blog posts, “Fake, Fudged, Dummy, and other such ‘special’ values were bad choices even in the 1970s”.

We’ve just looked at some choices in the relationships between two entities, and in the representation of non-value conditions. These cases may provide a basic insight into typical design issues affecting a data model, but what is a data model good for? It allows two products — a producer and a consumer — to agree on the structure of real data being exchanged between them.  When real data is stored in a file (or serialised as bytes for some other purpose, such as transmission over a communications link) then the data format employs a given syntax. That data format is largely irrelevant in comparison to the data model; the GEDCOM data model could be expressed in its own proprietary data format, or in XML, or some other format, and it would be a straightforward mechanical operation to convert from one format to another if they all conform to the same model.
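
For instance, the same name datum might be serialised in GEDCOM’s own syntax or in a hand-rolled XML form (the XML element names here are invented purely for illustration):

0 @I1@ INDI
1 NAME Tony /Proctor/

<Person id="I1">
    <Name>Tony /Proctor/</Name>
</Person>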

However, such files are a very static way to exchange data, and they have to be loaded into some organised indexed form before they can be interrogated, navigated, or manipulated. One example of such a form is a database, but it is not the only one, and genealogical products have other choices (see Do Genealogists Really Need a Database?). The following diagram provides a simplistic depiction of how a program may access indexed data in a disk-resident database or in memory, but other variations are possible: for instance, a memory-resident index pointing to files on disk, as might be the case with a collection of image files.

Data access

STEMMA deliberately describes its data format as a source format in order to attach extra semantics; this draws from the term source code, as used in programming, to emphasise that the data is a definitive source for other forms, whether generated by transformation or by indexing, and not simply an exchange format.

Object Model

A program that accesses a database is constrained by the data-types allowed in its columns, the nature of its indexes, normalisation of its table entities, and the associated query language used to access those tables (mostly but not always SQL). Software that uses a query language directly, rather than having some abstraction layer between it and the target database, may be reducing both its longevity and its portability.

An in-memory index is more efficient and more flexible, and the ever-increasing memory capacity of modern machines means that it is totally practical to have both the data and the index in memory together. But what would the in-memory data look like?

The modern answer to this question is objects. In object-oriented programming (OOP), an object is a software entity representing one instance of some real-life entity. For instance, a Person object[3] would represent one named person, and a Place object one named place. Objects of the same type are instantiated (i.e. created) from a template called a class, which defines not only the allowable properties (e.g. names and sex for a person) but also small segments of code, called methods, that may be invoked on the associated objects. For instance, there may be a method to test whether any of the names stored in the current object matches a particular name provided as a parameter. All products utilising that class in their programming would therefore use a consistent algorithm, as implemented by the designer of the associated object model: the set of related classes intended to cooperate in order to deliver access to that data.

An object model is not uniquely defined by a given data model, but the two go hand-in-hand; every object model has an underpinning data model. However, whereas a data model talks about entity relationships, it is the object model that talks about actual data linkages, indexes, and issues of efficient data access.
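
A minimal sketch of such a class in JavaScript; Person and nameContains() echo the query examples further down, but the implementation is purely illustrative:

class Person {
    constructor(primaryName, birthYear) {
        this.names = [primaryName];   // a person may accumulate further names
        this.birthYear = birthYear;
    }
    // Every product using this class applies the same matching algorithm.
    nameContains(token) {
        return this.names.some(function (n) { return n.indexOf(token) !== -1; });
    }
}

new Person("Tony Proctor", 1956).nameContains("Proctor");   // true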

A very important aspect of OOP is software inheritance. This is where one class is derived from another, sharing unchanged portions of that class while overriding others; the intention is to create a new class for a more specialised type of entity. The following is an illustration of how it might be applied to the various subjects of historical research, each level providing specialised classes derived from more generic ones in the previous level.


Software inheritance
In this illustration, the handling of names could be shared for all the subject types, and inherited from the historical-subject base class, whereas the hierarchy for animate subjects (i.e. lineage) would be different from that for inanimate subjects.
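
A JavaScript sketch of such a hierarchy might look as follows; the class names are assumptions for illustration, not definitions from STEMMA or any other model:

class HistoricalSubject {                      // shared name handling lives here
    constructor(name) { this.names = [name]; }
    nameContains(token) {
        return this.names.some(function (n) { return n.indexOf(token) !== -1; });
    }
}
class AnimateSubject extends HistoricalSubject { }     // lineage methods would go here
class InanimateSubject extends HistoricalSubject { }   // containment hierarchies here
class Person extends AnimateSubject { }                // inherits nameContains() unchanged
class Place extends InanimateSubject { }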

We might then redraw the earlier data-access diagram as follows in order to show that the object layer forms an effective abstraction:


Abstracted data access
The program sees the same application programming interface (API), irrespective of whether the data is in local memory, in some database, or even across some network connection. NB: The index associated with the objects would be implicit in their class definitions, and not really a separate entity as implied by this diagram.

I’ve just mentioned access across a network, where your data may be on a separate machine (the server) to that of your program (the client), e.g. in the “cloud”. This is a case for which query languages are well suited. You see, if the data server has millions of records available but the program on your client machine only wants to see a handful that satisfy some specific criteria, it would be extremely inefficient to transport all the records up to the client machine, across a typically slow network — much easier to push the criteria down to the server and let software there sort it out. This is effectively what happens when a SQL query is sent to a server hosting a database; the query may be as small as a single line and the returned records would be just the ones of interest. There are other forms of query language, such as MDX for multi-dimensional data queries, and the form of the returned data is highly dependent on the nature of the query.
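
As a hypothetical sketch, pushing the criteria down might look like this on the client, where sendQuery() is an assumed helper standing in for whatever transport a real product would use:

function sendQuery(sql) {
    // hypothetical: transmit sql to the server; return matching records only
}
var matches = sendQuery("SELECT name, birth FROM person WHERE surname = 'Jesson'");
// Only the matching records travel back, rather than millions of rows being
// fetched and then filtered locally on the client.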

A scripting language is usually considered to be an interpreted language, meaning one executed directly from its source-code representation — as entered by the programmer — rather than being first compiled into a set of instructions that the machine can execute directly. Because the source has to be parsed before it can be interpreted, there is a performance penalty, but their speed of deployment and ease of maintenance make scripting languages very useful for small-scale applications. Languages such as JavaScript and VBScript are common examples, but there is little difference from the aforementioned query languages other than in their intended function. Many database systems even support a mechanism whereby segments of script can be held as stored procedures that can be invoked later by a special type of query statement. The relevance of this is that when a scripting language has access to an object model, you have a very powerful and flexible means of data access.

Suppose we want to express a custom genealogical query — one not supported intrinsically by our product — to look at all the events in our timeline, then look at all the persons connected to those same events, and then to select just the ones whose name(s) have the token "Jesson" in them. If a standard object model is defined then it doesn’t matter whether we express our query using a standard scripting language or some proprietary one associated with the product.

This example uses a Java-like syntax.

Person me = new Person("Tony Proctor", 1956);
for (Event e : me.allEvents()) {
    for (Person other : e.allPersons()) {
        if (other.nameContains("Jesson")) {
            // ...do something with this other person...
        }
    }
}

This example uses a VB6-like syntax.

Dim self As New Person
self.setPersonName ("Tony Proctor")
self.setDateOfBirth (1956)
Dim e As Event
Dim other As Person
For Each e In self.allEvents()
    For Each other In e.allPersons()
        If (other.nameContains("Jesson")) Then
            ' ...do something with this other person...
        End If
    Next other
Next e

The syntax is very different but the essential elements of the object model, such as class names and method names, are the same in the two examples.

So, in summary, irrespective of where the data is coming from (memory, database, or afar), if an object model is available then any processing or queries can be expressed through scripting languages. Furthermore, those segments of code can be pushed down to the data server in order to achieve efficient retrieval in the case of remote data stores.

Citations

One instance where I have found it beneficial to expose an object model was in the processing of citation-elements in order to generate a formatted citation. A citation-element is a discrete datum that can be identified in a citation, such as an author, title, publication date, etc. When a system generates a formatted citation using a citation-template then it takes various citation-element values and inserts them at selected places in a textual template. The onus is on the hosting software to provide the citation-template with all the relevant values, and this can be a heavy burden if it doesn’t have intimate knowledge of the actual template.

Suppose that the citation is required to name the Genealogical Publishing Co. of Baltimore. How much information should the hosting program pass to the citation-template to let the reader know where that company is? Is a prefix of “Baltimore:” enough, or does it require “Baltimore, Maryland:”, or does it require “Baltimore, Maryland, US:”? Remember that the reader may not be American, and if you think they should know all the US states then would you be aware of all the regions in their country?

The alternative approach is to pass a Place object to the citation-template and allow it to invoke the associated methods to obtain the specific information that it requires for the current template and the current user. The same approach can be applied to other data such as a Person object (allowing access to name-handling methods), or a Date object (allowing the formatting of dates from alternative calendars in short/medium/long/full forms).
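
A hypothetical fragment of such a citation-template in JavaScript, where displayName() and its readerLocale parameter are illustrative rather than a real API:

// Rather than being handed a fixed prefix, the template asks the Place
// object for an appropriate level of detail for the current reader:
// "Baltimore:" for one locale, "Baltimore, Maryland, US:" for another.
function publisherPrefix(place, readerLocale) {
    return place.displayName(readerLocale) + ":";
}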

Reporting

A more recent application of an object model occurred in STEMMA’s narrative support. The inclusion of a segment of script in a STEMMA table entity allowed it to fetch specific data to populate the cells at the time the associated Narrative entity was rendered. This basically meant that a Narrative entity could be used as a report writer (in software terminology) to fetch up-to-date data matching a custom query and to present it along with narrative or other data.

To explain the significance of this, let’s look at the HTML equivalent since it is also possible to dynamically populate an HTML table. Once upon a time, programmers used to do this by injecting new HTML source into the table using the JavaScript document.write() function. For instance:


document.write("<tr>");

This allowed, say, the rows and cells of a table to be generated from some data such as the results of a SQL query or the contents of a JavaScript array. This is a very old method, and has acknowledged security risks, as well as performance implications in certain cases. An equivalent way of injecting HTML source is to modify the innerHTML property of a given node. For instance:

node.innerHTML = "<b>Hello World</b>";

A better approach, though, would be to use methods provided by the HTML Document Object Model (DOM): the tree of node objects into which an HTML document is compiled. For instance, document.createElement() and node.appendChild(). In the specific case of tables, the W3C defined DOM methods such as tbody.insertRow() and tr.insertCell() to help build the table body more easily.
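
A small sketch using those DOM methods to build a table body, rather than injecting HTML source as text (the sample data is invented):

var table = document.createElement("table");
var tbody = table.appendChild(document.createElement("tbody"));
var data  = [["Tony Proctor", "1956"], ["Mary Jesson", "1902"]];
for (var i = 0; i < data.length; i++) {
    var row = tbody.insertRow();                    // tbody.insertRow()
    for (var j = 0; j < data[i].length; j++) {
        row.insertCell().textContent = data[i][j];  // tr.insertCell()
    }
}
document.body.appendChild(table);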

The STEMMA case is slightly different from that of HTML since it does not have an inherent DOM — the contents of a marked-up Narrative entity have to be transformed into some other system for presentation, which might be an HTML page, a blog page, or a word-processor document — but it does expose a tentative genealogical object model. This means that the STEMMA table entity can specify a segment of script code to call upon that object model and retrieve results for the table contents. It doesn’t do this by dynamically injecting STEMMA source, or by calling on something like the DOM methods to build up a table; the script populates an object that simply describes the contents of a table — the headings and data cells — and this can be transformed for display in the same manner as a static table would.
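
A hypothetical sketch of that approach, where the property names are illustrative and not actual STEMMA definitions:

// The script fills a plain object describing the table; the rendering layer
// later transforms it exactly as it would a static table.
var tableSpec = { headings: ["Name", "Event"], rows: [] };
// ...the script's query loop would push one row per matching result...
tableSpec.rows.push(["Mary Jesson", "Census 1911"]);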

What this all means is that the existing support for representing narrative essays, narrative reports, transcriptions, etc., has also become a tool for dynamically reporting on genealogical data. I don’t need to devise some completely separate tool to achieve this, and I also get the benefit of mix-and-match where dynamically generated tables can be embedded in my narrative reports. And just in case anyone hasn’t realised yet, if the underlying data is subsequently modified then I will see the very latest data the next time I view the associated Narrative entity.



[1] Tony Proctor, “Proposal to Create a Standard Run-time Object Model”, FHISO Call For Papers, CFPS 19 (http://tech.fhiso.org/cfps/files/cfps19.pdf : accessed 9 Mar 2016).
[2] Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml : accessed 9 Mar 2016).
[3] Using capitalisation, here, merely to distinguish software entities from the real-life entities that they are representing.

Wednesday, 2 March 2016

The Power of Annotation


Most of us believe that we know what annotation is. However, the basic concept has been applied to several different fields, for quite different purposes, and in many different ways. A review of the landscape for textual annotation was very useful to me, and I hope that others may find it helpful too.

A term that goes hand-in-hand with annotation is mark-up (or “markup” in the US), to the extent that they have become virtually synonymous in certain areas. One of the first things to consider is the origin of the two terms, and how their meanings may have shifted over time.

I wanted to call this article “marking up the wrong tree”, but obscure titles are not always the best policy, no matter how side-splittingly hilarious they may seem to you. [Pull yourself together Tony]

According to the dictionary, to annotate is “to add notes to (a text or diagram) giving explanation or comment”, and an etymology is given of “Late 16th century: from Latin annotat- 'marked', from the verb annotare, from ad- 'to' + nota 'a mark'”.[1] This is probably the first usage that most of us would think of.

Annotated page of text
Figure 1 – Annotated page of text.[2]

As an aside, the analysis of an annotated document is interesting because it often involves a mixture of primary and secondary information whose layers must be considered individually, although not separately.

The term mark-up originates from the annotation of manuscript (and manual typescript) documents with symbols providing printer’s instructions, including corrections, layout, and typesetting. Similar systems of symbolic annotation are used in the field of textual scholarship, a collective term for textual studies that encompass the analysis, description, transcription, editing, or annotation of texts.

The branch of textual scholarship known as diplomatics (not to be confused with diplomacy) involves the scholarly analysis of documents and texts. In particular, a diplomatic transcription reproduces an historic manuscript as accurately as possible in typography (a diplomatic edition), including significant features such as original spelling and punctuation; contractions, suspensions, and other abbreviations; insertions, deletions, and other alterations; obsolete characters such as thorn and eth; superscript and subscript characters; and brevigraphs (e.g. the ampersand). These usually employ a system of mark-up in order to capture their essence in a modern typeface. A semi-diplomatic transcription relaxes the requirement for accuracy, usually for readability or practicality; for instance, some original forms are difficult to reproduce in simple typescript, particularly if the original was already marked up by hand (but more on that later).

Mark-up may also be used during peer review of a document, or by an author themselves. One more field that I have to mention is corpus linguistics: the analysis of language using selections of natural text compiled from transcribed writings or recordings (corpora). This uses annotation for such things as tagging parts of speech (POS tagging), e.g. “corpus_NN1 annotation_NN1 is_VBZ hard_AJ0”, where the suffixes categorise the words (e.g. noun, adjective).

We’ve seen that annotation may actually be symbolic or textual, and that mark-up often includes text as well as symbols or editorial marks. So what is the difference? In his work on corpus linguistics, Martin Weisser comes to the following conclusion:[3]

While the term markup is sometimes used to indicate the physical act of marking specific parts of a text using specific symbols, and, in contrast, annotation may often refer to the interpretive information added, the two may also be used synonymously.

It would seem that the modern usage of these terms employs annotation for the addition of meta-data (related textual or other information) to the text, and mark-up for the scheme by which such annotation is represented or encoded.

This is borne out by the concept of mark-up languages, which are systems for annotating a document that are syntactically distinguishable from the text, and hence more structured than mere symbols or marginal notes. A very important distinction has to be made, therefore, between the following types of mark-up:

  1. Handwritten mark-up, as applied to a manuscript or typescript document.
  2. Typed mark-up, as applied to a typescript or digital document.
  3. Mark-up language, as typically applied to a digital document.

The first two of these are designed to be humanly readable, whereas the third type is designed to be computer-readable and so must involve grammatical rules that allow it to be parsed by software. To illustrate the difference, consider the following corrected sentence, where a line has been drawn through “blue” and “red” added as a correction:

My favourite colour is blue red.

A representation of this using a simple typed mark-up (type 2, above) might be:

My favourite colour is <blue> ^red^.

whereas a mark-up language (type 3) might encode it as follows:

My favourite colour is <del>blue</del> <ins>red</ins>.

These may appear equivalent from a visual perspective, but consider the consequences if the altered text contained either angle brackets or carets, or if the replacement word required some clarification — a mark-up language would be able to represent these cases unambiguously (a literal angle bracket can be escaped as &lt;, for instance) so that software could process them. Also, since a mark-up language is designed purely to communicate the information to software, the representation of that same information to the end-user is not fixed, and the choices would be dependent upon the capabilities of the display medium and the sophistication of the display software.

This leads us nicely to perhaps the two best-known mark-up languages: HTML (HyperText Markup Language) and XML (Extensible Markup Language). HTML was created with predefined semantics (for creating Web pages) but XML was created as a general-purpose syntax with no predefined semantics. Interestingly, the semantics associated with HTML have been refined since its initial development; the last example, above, shows a modern Semantic HTML form, but an older form might have been:

My favourite colour is <s>blue</s> red.

This is literally encoding the visual representation, as in the first example sentence, above. The modern shift in emphasis is from presentation to content structure, such that the mark-up would now show what was deleted and what was inserted rather than simply that a line was drawn through a word.

So why might someone use XML rather than HTML when transcribing an historical document? Well, despite that shift in emphasis, HTML is not a good tool for transcribing text. Consider a document that has original emphasis, such as underlines added by the author, or which has already been marked up by an editor; this information has to be preserved and yet be distinguishable from anything employed during the transcription process, and these are not just presentational matters — there are semantics associated with the original formatting. With a mark-up language such as XML then you have the flexibility to represent all the different types and levels of information without any conflict or ambiguity. Both TEI (Text Encoding Initiative) and STEMMA employ mark-up languages with support for transcription, and both have XML representations.
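
As a hypothetical XML fragment, keeping the layers distinct might look as follows; the element names are invented for illustration and are not TEI or STEMMA vocabulary:

<line>
    My <orig-emph type="underline">favourite</orig-emph> colour is
    <ed-ins hand="pencil">red</ed-ins><trans-note>ink smudged</trans-note>.
</line>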


Using the terminology from Wikipedia’s “Markup language” article, there are several forms of mark-up that are required for micro-history narrative:
  • Descriptive: Marking the text in order to capture its structure and content, rather than specific visualisations of it. Ultimate control over explicit physical rendition, such as colour, bold, italic, underline, font name, and font size, is best left to the tool presenting the text (e.g. HTML+CSS).
  • Presentational: This mark-up would be essential for a faithful transcription of something. Although modern systems (such as HTML5) frown on explicit presentational information, it may provide important information necessary for the analysis and correct interpretation of transcribed material. STEMMA’s approach to transcription separates structure and content from presentational or stylistic matters: see Descriptive Mark-up.
  • Semantic: Although the aforementioned Wikipedia article suggests that this is an alternative name for Descriptive mark-up, the usage here is more distinct. This mark-up provides information about the meaning or interpretation of textual references. It is therefore different from the structure and layout in a purely textual context, and is precisely what is needed to identify entities such as Persons and Places.
Semantic mark-up is especially important for narrative essays and narrative reports stored in a genealogical context. Although both TEI and STEMMA have their own schemes, there is a divergence that will become more important once the genealogical industry acknowledges a narrative requirement: the semantics are not independent of the data model. This may be hard to explain, but simply flagging a name as that of a person or place — irrespective of whether it makes a conclusional identification — is an isolated semantic that is addressed in a roughly similar fashion by the two schemes. However, linking such a reference into a chain of conclusion-evidence-information-source would not make any sense outside of a genealogical data model. In effect, TEI is a very comprehensive text-encoding scheme but it cannot deal with semantics associated with an all-embracing data model.

A familiar form of mark-up that we might encounter in wikis or blogs is a lightweight markup language. These have a simple syntax that can be entered directly by the editing user, as opposed to being generated in response to some graphical operation or option selection. Although still designed to be computer-readable, they are easier for a human to read — and, hence, to write. For instance:

**bold text** __underline text__   //italic text//


When looking at the mechanics of adding mark-up to an electronic document then there are two very different approaches. The most common is inline, or embedded, mark-up, where the mark-up language is interwoven with the text in a manner such that it can still be distinguished. For example:

Here is a link: <a href="http://parallaxview.co/stemma/">STEMMA</a>

The alternative is known as stand-off, or remote, mark-up, and involves holding the mark-up in a separate file (or other location) from the underlying text, usually linking the two by character coordinates (see the sketch after these lists). The concept of stand-off mark-up is attributed to Henry Thompson and David McKelvie in 1997,[4] and the advantages include:

  • The ability to mark up read-only (protected) or very large files.
  • The ability to support mark-up from independent editors, held as separate layers, without requiring them to form a single unified hierarchy.
  • The ability to combine disjoint segments into a single annotation.

Others are stated but I’m less convinced of their value. In contrast, the advantages of inline mark-up include:

  • Simplicity. One file to maintain or distribute.
  • The text and mark-up are edited together, with less chance of them getting out-of-step.

Which is best really depends on the application requirements.
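
As a small sketch of stand-off mark-up, the corrected sentence from earlier could be held as an untouched text file:

My favourite colour is blue red.

with the mark-up in a separate file, linked by character coordinates (the annotation elements here are invented for illustration):

<del start="23" end="27"/>   <!-- "blue": characters 23 to 26 -->
<ins start="28" end="31"/>   <!-- "red":  characters 28 to 30 -->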

A common example of stand-off mark-up, which isn’t always viewed as such, is CSS (Cascading Style Sheets). It was mentioned above that Semantic HTML favours content structure in place of presentation. This works because modern HTML now goes hand-in-hand with CSS, which can describe the presentational aspects in a separate file. Rather than being linked by character coordinates, they are linked by such things as element type and class, collectively described as selectors, which may explain why CSS is rarely described as stand-off mark-up. In effect, HTML then becomes an inline mark-up describing content that relies on a stand-off mark-up for presentation. The advantages of being able to change the overall presentation style of a Web page in a consistent way, or share the style between multiple pages, should be clear.
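
For instance:

<!-- Inline mark-up describing content: -->
<p class="note">An editorial note.</p>

/* Stand-off presentational mark-up in a separate CSS file,
   linked by the selector p.note: */
p.note { font-style: italic; }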

I want to round off this review of annotation with a quick mention of the humble word-processor. So familiar and useful is this tool that we give little consideration to how it works, or what goes on inside — oh, how I wish genealogy would catch up there. It allows the end-user to add presentational mark-up (e.g. bold, or a specific font-face) and semantic mark-up (e.g. a hyperlink, or a review comment), but you don’t see the associated mark-up. The associated mark-up language is complicated and so made deliberately invisible to the end-user. The net effect of that is to reinforce the user-interface model and give the impression that the end-user is somehow annotating the visible text directly. This is an important distinction — that a hidden nuts-and-bolts mark-up supports the notion, and the physicality, of annotation in the user interface — and it should be an important consideration for future genealogy tools. There is no excuse for expecting the end-user to edit the raw mark-up rather than using a WYSIWYG (“What You See Is What You Get”) interface.



[1] Oxford Dictionaries Online (http://www.oxforddictionaries.com/us/definition/english/annotate : accessed 1 Mar 2016), s.v. “annotate”.
[2] John Keats, “Ode to a Nightingale” (1819); image credit: Ryan Johnson (https://www.flickr.com/photos/kmonojo/4288773728 : accessed 1 Mar 2016); Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0).
[3] Martin Weisser, Practical Corpus Linguistics: An Introduction to Corpus-Based Language Analysis (John Wiley & Sons, 16 Feb 2016), ch.11.
[4] Henry S. Thompson and David McKelvie, “Hyperlink semantics for standoff markup of read-only documents”, May 1997, technical report, Language Technology Group, HCRC, University of Edinburgh (http://www.ltg.ed.ac.uk/~ht/sgmleu97.html : accessed 2 Mar 2016).