Saturday, 26 March 2016

Dynamic Genealogical Data



Yes, I will be discussing data models and software issues in this post — sigh! — but hopefully in a manner that may be instructive to those who want to know more. As well as introducing certain important concepts, I want to illustrate some typically tricky decisions that have to be made, before rounding off with a novel way of presenting custom data to the end-user. Know-it-alls who already have a software background can just fast-forward to the “Reporting” section.

Some time ago, I submitted a paper to FHISO on the subject of object models[1] in which I explained their relevance to scripting languages, query languages, and general dynamic data access. Admittedly, this paper was aimed at a software audience, but let’s try and pull it apart to explain what these languages are, and the difference between a data format, a data model, and an object model.

Dynamic Data

Data Model

A data model is a formalised description of the relevant data entities (e.g. person, or place), their properties (e.g. names, sex, coordinates, etc.), and their relationships (e.g. biological lineage, or place of birth). Issues such as indexes and database schemas are not applicable to data models as they are a more abstract definition of the data’s structure, or rather its pattern. But issues such as cardinality (how many items of one type can be related to another), ordinality (the ordering of items related to one instance of another), and optionality (whether an item is mandatory or optional), are relevant.

Let’s consider one aspect of a genealogical data model to help illustrate this point: biological lineage. Every person had one mother and one father, even if they are unidentified; there cannot be more than one of each (ignoring the possibility of donor DNA) but some representations of this will be better than others.

If the parent person entity (e.g. a mother) points to each individual child of hers then it accommodates the erroneous situation where multiple mothers might point to the same child. However, the converse of having each person entity point to their respective mother and father enforces the cardinal integrity of the relationship without having to perform constant error checking.

[Figure: Entity relationship schemes]

It should be noted, here, that the direction of the link makes no difference to the ability to find children-of-a-parent or parents-of-a-child; both can be indexed from either one of the schemes.
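For the software-minded, here is a minimal Java sketch of the child-to-parent scheme. The names Person, ChildIndex, setParents, and childrenOf are my own inventions for illustration and are not taken from any product or standard. Each person holds at most one mother link and one father link, so a second mother simply cannot be represented, and the children-of-a-parent lookup is derived as an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demo: each child points to at most one mother and one father.
class LineageDemo {
    public static void main(String[] args) {
        Person mother = new Person("Mother");
        Person father = new Person("Father");
        Person child1 = new Person("First child");
        Person child2 = new Person("Second child");
        child1.setParents(mother, father);
        child2.setParents(mother, null); // father not identified

        ChildIndex index = new ChildIndex();
        index.add(child1);
        index.add(child2);
        System.out.println(index.childrenOf(mother).size()); // prints 2
    }
}

// Hypothetical entity: one slot per parent, so the cardinality is built in.
class Person {
    private final String label;
    private Person mother; // null means "not identified", never a dummy person
    private Person father;

    Person(String label) { this.label = label; }

    void setParents(Person mother, Person father) {
        this.mother = mother;
        this.father = father;
    }

    Person getMother() { return mother; }
    Person getFather() { return father; }
    String getLabel() { return label; }
}

// Reverse index (parent to children), derived from the child-to-parent links.
class ChildIndex {
    private final Map<Person, List<Person>> children = new HashMap<Person, List<Person>>();

    void add(Person child) {
        link(child.getMother(), child);
        link(child.getFather(), child);
    }

    private void link(Person parent, Person child) {
        if (parent == null) return; // nothing to index for an unidentified parent
        List<Person> list = children.get(parent);
        if (list == null) {
            list = new ArrayList<Person>();
            children.put(parent, list);
        }
        list.add(child);
    }

    List<Person> childrenOf(Person parent) {
        List<Person> list = children.get(parent);
        return list == null ? new ArrayList<Person>() : list;
    }
}
```

The opposite scheme would give each parent a list of children, and then nothing in the structure itself stops two mothers from listing the same child.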

So what if you don’t know the father? Well, having a missing father or mother link could easily be recognised as an indication that they are unidentified, but suppose that you have some incomplete details for them? Suppose, for instance, that you don’t know the name of the mother but you have her date of birth. This is where we get into controversial territory, and where I’m going to make a very bold statement. There are many threads that advocate the substitution of underscores, question marks, or some special text such as “Unknown”, “LNU” (Last Name Unknown), etc., for a missing name. In a software context, all of these are absolutely wrong, and not the right way to handle the situation. This is not a personal belief — it is a best-practice in a profession that I have spent decades in. It doesn’t matter who is making those recommendations, and there are no special cases for genealogists.

OK, rant over; now let me clarify this. Ignoring the fact that any alphabetic text may not translate well when sharing your data with someone from a different locale, the choice over which substitution to use for a missing name, or any missing datum, should not be given to the end-user, or even to a specific software product. More than that, good software would represent non-value conditions of a datum, such as unknown, not applicable, or erroneous, in a different domain to real values so that there is zero chance of a clash. For instance, consider the difference in a SQL database between NULL (a special condition in a column) and “NULL” (a normal textual value in a column). Also, the display value by which those special data conditions are represented to the end-user is a choice by the product, and not dictated by the software representation in the corresponding data format or data model.
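As a rough illustration of keeping non-value conditions in a different domain to real values, consider the following Java sketch. The Datum, Status, and Display names, and the bracketed display strings, are assumptions of mine rather than part of any genealogical standard. The point is that a stored "unknown" can never be confused with the text "Unknown", and that the display text is decided separately by the product.

```java
// Demo: the stored datum never contains placeholder text such as "Unknown".
class DatumDemo {
    public static void main(String[] args) {
        Datum known = Datum.of("Jesson");
        Datum unknown = Datum.missing(Status.UNKNOWN);
        Datum literal = Datum.of("Unknown"); // a real (if unlikely) textual value

        System.out.println(Display.render(known));   // prints Jesson
        System.out.println(Display.render(unknown)); // prints [unknown]
        System.out.println(unknown.equals(literal)); // prints false
    }
}

// Hypothetical set of conditions; KNOWN is the only one that carries a value.
enum Status { KNOWN, UNKNOWN, NOT_APPLICABLE, ERRONEOUS }

final class Datum {
    private final Status status;
    private final String value; // only meaningful when status is KNOWN

    private Datum(Status status, String value) {
        this.status = status;
        this.value = value;
    }

    static Datum of(String value) {
        if (value == null) throw new IllegalArgumentException("use missing() for non-values");
        return new Datum(Status.KNOWN, value);
    }

    static Datum missing(Status reason) {
        if (reason == Status.KNOWN) throw new IllegalArgumentException("KNOWN needs a value");
        return new Datum(reason, null);
    }

    Status getStatus() { return status; }

    String getValue() {
        if (status != Status.KNOWN) throw new IllegalStateException("no value: " + status);
        return value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Datum)) return false;
        Datum d = (Datum) o;
        return status == d.status && (value == null ? d.value == null : value.equals(d.value));
    }

    @Override
    public int hashCode() {
        return status.hashCode() * 31 + (value == null ? 0 : value.hashCode());
    }
}

// Product-specific display choice, kept apart from the stored representation.
final class Display {
    static String render(Datum d) {
        switch (d.getStatus()) {
            case KNOWN: return d.getValue();
            case UNKNOWN: return "[unknown]";
            case NOT_APPLICABLE: return "[n/a]";
            default: return "[error]";
        }
    }
}
```

A different product, or a different locale, could swap the Display class without touching a single stored datum.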

I appreciate that some software may not follow these best-practices, but it is important to understand why this is bad for everyone. Tamura Jones produced an excellent article in 2013 related to this subject that discussed the impact of using acronyms and other invented values.[2] As I recently commented on one of Randy Seaver’s blogs, “Fake, Fudged, Dummy, and other such ‘special’ values were bad choices even in the 1970s”.

We’ve just looked at some choices in the relationships between two entities, and in the representation of non-value conditions. These cases may provide a basic insight into typical design issues affecting a data model, but what is a data model good for? It allows two products (a producer and a consumer) to agree on the structure of real data being exchanged between them. When real data is stored in a file (or serialised as bytes for some other purpose, such as transmission over a communications link) then the data format employs a given syntax. That data format is largely irrelevant in comparison to the data model; the GEDCOM data model could be expressed in its own proprietary data format, or in XML, or some other format, and it would be a straightforward mechanical operation to convert from one format to another if they all conform to the same model.

However, such files are a very static way to exchange data, and they have to be loaded into some organised indexed form before they can be interrogated, navigated, or manipulated. One example of such a form is a database, but this is not the only form and genealogical products have other choices (see Do Genealogists Really Need a Database?). The following diagram provides a simplistic depiction of how a program may access indexed data in a disk-resident database or in memory, but other variations are possible; for instance, a memory-resident index pointing to files on disk, as might be the case with a collection of image files.

[Figure: Data access]

STEMMA deliberately describes its data format as a source format in order to attach extra semantics; this draws from the term source code, as used in programming, in order to emphasise that the data is a definitive source for other forms, whether generated by transformation or by indexing, and not simply an exchange format.

Object Model

A program that accesses a database is constrained by the data-types allowed in its columns, the nature of its indexes, normalisation of its table entities, and the associated query language used to access those tables (mostly but not always SQL). Software that uses a query language directly, rather than having some abstraction layer between it and the target database, may be reducing both its longevity and its portability.

An in-memory index is more efficient and more flexible, and the ever-increasing memory capacity of modern machines means that it is totally practical to have both the data and the index in memory together. But what would the in-memory data look like?

The modern answer to this question is objects. In object-orientated programming (OOP), an object is a software entity representing one instance of some real-life entity. For instance, a Person object[3] would represent one named person, and a Place object one named place. Objects of the same type are instantiated (i.e. created) from a template called a class, which defines not only the allowable properties (e.g. names and sex for a person) but also small segments of code, called methods, that may be invoked on the associated objects. For instance, there may be a method to test whether any of the names stored in the current object matches a particular name provided as a parameter. All products utilising that class in their programming would therefore use a consistent algorithm, as implemented by the designer of the associated object model: the set of related classes intended to cooperate in order to deliver access to that data. An object model is not uniquely defined by a given data model, but they do go hand-in-hand; every object model has an underpinning data model. However, whereas a data model talks about entity relationships, it is the object model that talks about actual data linkages, indexes, and issues of efficient data access.
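To make the idea of a class with properties and methods a little more concrete, here is a small Java sketch. The matching rule (a whole-word, case-insensitive comparison) is purely my assumption for the example; a real object model would define its own rule, and that is exactly the point: every product using the class would get the same one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NameDemo {
    public static void main(String[] args) {
        Person p = new Person();
        p.addName("Mary Jesson");
        p.addName("Mary Smith"); // an invented alternative name
        System.out.println(p.nameContains("jesson")); // prints true
        System.out.println(p.nameContains("Jess"));   // prints false
    }
}

// Hypothetical class: properties (the names) plus a method that works on them.
class Person {
    private final List<String> names = new ArrayList<String>();

    void addName(String name) { names.add(name); }

    // True if any stored name contains the given whole token, ignoring case.
    boolean nameContains(String token) {
        String wanted = token.toLowerCase(Locale.ROOT);
        for (String name : names) {
            for (String part : name.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (part.equals(wanted)) return true;
            }
        }
        return false;
    }
}
```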

A very important aspect of OOP is software inheritance. This is where one class is derived from another class and allows it to share unchanged portions while overriding others; the intention being to create a new class for a more specialised type of entity. The following is an illustration of how it might be applied to the various subjects of historical research, each level providing specialised classes derived from more generic ones in the previous level.


[Figure: Software inheritance]

In this illustration, the handling of names could be shared for all the subject types, and inherited from the historical-subject base class, whereas the hierarchy for animate subjects (i.e. lineage) would be different from that for inanimate subjects.
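In Java, that arrangement might look something like the sketch below. The class names (HistoricalSubject, AnimateSubject, InanimateSubject) and the describeHierarchy method are guesses of mine made for the sake of the example, as are the personal names; they are not a definitive rendering of any published object model. Name handling lives once in the base class, while each branch supplies its own notion of hierarchy.

```java
import java.util.ArrayList;
import java.util.List;

class InheritanceDemo {
    public static void main(String[] args) {
        Place state = new Place("Maryland");
        Place city = new Place("Baltimore");
        city.setParent(state);
        Person mother = new Person("Mary Jesson"); // invented names
        Person child = new Person("Ann Jesson");
        child.setMother(mother);

        System.out.println(city.describeHierarchy());  // prints Baltimore < Maryland
        System.out.println(child.describeHierarchy()); // prints Ann Jesson, child of Mary Jesson
    }
}

// Base class: name handling is written once and inherited by every subject type.
abstract class HistoricalSubject {
    private final List<String> names = new ArrayList<String>();

    HistoricalSubject(String name) { names.add(name); }

    void addName(String name) { names.add(name); }
    String primaryName() { return names.get(0); }

    abstract String describeHierarchy();
}

// Animate subjects: the hierarchy is biological lineage.
abstract class AnimateSubject extends HistoricalSubject {
    private AnimateSubject mother;
    private AnimateSubject father;

    AnimateSubject(String name) { super(name); }

    void setMother(AnimateSubject m) { mother = m; }
    void setFather(AnimateSubject f) { father = f; }

    @Override
    String describeHierarchy() {
        StringBuilder sb = new StringBuilder(primaryName());
        if (mother != null) sb.append(", child of ").append(mother.primaryName());
        if (father != null) sb.append(mother != null ? " and " : ", child of ").append(father.primaryName());
        return sb.toString();
    }
}

// Inanimate subjects: the hierarchy is containment (a place within a place).
abstract class InanimateSubject extends HistoricalSubject {
    private InanimateSubject parent;

    InanimateSubject(String name) { super(name); }

    void setParent(InanimateSubject p) { parent = p; }

    @Override
    String describeHierarchy() {
        StringBuilder sb = new StringBuilder(primaryName());
        for (InanimateSubject p = parent; p != null; p = p.parent) {
            sb.append(" < ").append(p.primaryName());
        }
        return sb.toString();
    }
}

class Person extends AnimateSubject {
    Person(String name) { super(name); }
}

class Place extends InanimateSubject {
    Place(String name) { super(name); }
}
```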

We might then redraw the earlier data-access diagram as follows in order to show that the object layer forms an effective abstraction:


[Figure: Abstracted data access]

The program sees the same application programming interface (API), irrespective of whether the data is in local memory, in some database, or even across some network connection. NB: The index associated with the objects would be implicit in their class definitions, and not really a separate entity as implied by this diagram.
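A hedged Java sketch of that abstraction follows. PersonStore, InMemoryStore, and CachingStore are hypothetical names of mine, and the caching wrapper merely stands in for a slower database or network back end, which I cannot sensibly include in a few lines. What matters is that countMatches, playing the part of the program, is written once against the interface and neither knows nor cares where the data lives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AbstractionDemo {
    // The "program": it only knows the PersonStore interface.
    static int countMatches(PersonStore store, String token) {
        return store.findByNameToken(token).size();
    }

    public static void main(String[] args) {
        InMemoryStore memory = new InMemoryStore();
        memory.add("Mary Jesson"); // invented names
        memory.add("Ann Jesson");
        memory.add("Tony Proctor");
        PersonStore cached = new CachingStore(memory);

        System.out.println(countMatches(memory, "Jesson")); // prints 2
        System.out.println(countMatches(cached, "Jesson")); // prints 2
    }
}

// The API seen by the program, whatever sits behind it.
interface PersonStore {
    List<String> findByNameToken(String token);
}

// Data held entirely in local memory.
class InMemoryStore implements PersonStore {
    private final List<String> names = new ArrayList<String>();

    void add(String name) { names.add(name); }

    public List<String> findByNameToken(String token) {
        List<String> hits = new ArrayList<String>();
        for (String name : names) {
            for (String part : name.split("\\s+")) {
                if (part.equalsIgnoreCase(token)) {
                    hits.add(name);
                    break;
                }
            }
        }
        return hits;
    }
}

// Stand-in for a slower back end: remembers answers so each query is sent once.
class CachingStore implements PersonStore {
    private final PersonStore backend;
    private final Map<String, List<String>> cache = new HashMap<String, List<String>>();
    int backendCalls = 0;

    CachingStore(PersonStore backend) { this.backend = backend; }

    public List<String> findByNameToken(String token) {
        List<String> hits = cache.get(token);
        if (hits == null) {
            backendCalls++;
            hits = backend.findByNameToken(token);
            cache.put(token, hits);
        }
        return hits;
    }
}
```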

I’ve just mentioned access across a network, where your data may be on a separate machine (the server) to that of your program (the client), e.g. in the “cloud”. This is a case for which query languages are well suited. You see, if a data server has millions of records available but the program on your client machine only wants to see a handful that satisfy some specific criteria, it would be extremely inefficient to transport all the records up to the client machine across a typically slow network; it is much easier to push the criteria down to the server and let software there sort it out. This is effectively what happens when a SQL query is sent to a server hosting a database; the query may be as small as a single line and the returned records would be just the ones of interest. There are other forms of query language, such as MDX for multi-dimensional data queries, and the form of the returned data is highly dependent on the nature of the query.
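The difference can be simulated in a few lines of Java. This is only an in-process sketch: DataServer and its recordsSent counter are made up for the illustration, and a real system would send a textual query (such as SQL) over the wire rather than a Java lambda. It does, though, show how many records would cross the "network" in each case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PushdownDemo {
    public static void main(String[] args) {
        DataServer server = new DataServer();
        for (int i = 0; i < 1000; i++) server.store("Person " + i);
        server.store("Mary Jesson"); // invented name

        // Client-side filtering: every record is transported, then most are discarded.
        List<String> hits = new ArrayList<String>();
        for (String rec : server.fetchAll()) {
            if (rec.contains("Jesson")) hits.add(rec);
        }
        System.out.println(server.recordsSent); // prints 1001

        // Server-side filtering: only the criteria go down, only the matches come back.
        server.recordsSent = 0;
        hits = server.query(rec -> rec.contains("Jesson"));
        System.out.println(server.recordsSent); // prints 1
    }
}

// Hypothetical server that counts how many records it has to send to the client.
class DataServer {
    private final List<String> records = new ArrayList<String>();
    int recordsSent = 0;

    void store(String rec) { records.add(rec); }

    List<String> fetchAll() {
        recordsSent += records.size();
        return new ArrayList<String>(records);
    }

    List<String> query(Predicate<String> criteria) {
        List<String> out = new ArrayList<String>();
        for (String rec : records) {
            if (criteria.test(rec)) out.add(rec);
        }
        recordsSent += out.size();
        return out;
    }
}
```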

A scripting language is usually considered to be an interpreted language, meaning one executed directly from its source-code representation (as entered by the programmer) rather than being first compiled into a set of instructions that the machine can execute directly. Because the source has to be parsed before it can be interpreted, such languages carry a performance penalty, but their speed of deployment and ease of maintenance make them very useful for small-scale applications. Languages such as JavaScript and VBScript are common examples, but there is little difference from the aforementioned query languages, other than in their intended function. Many database systems even support a mechanism where segments of script can be held as stored procedures that can be invoked later by a special type of query statement. The relevance of this is that when a scripting language has access to an object model then you have a very powerful and flexible means of data access.

Suppose we want to express a custom genealogical query — one not supported intrinsically by our product — to look at all the events in our timeline, then look at all the persons connected to those same events, and then to select just the ones whose name(s) have the token "Jesson" in them. If a standard object model is defined then it doesn’t matter whether we express our query using a standard scripting language or some proprietary one associated with the product.

This example uses a Java-like syntax.

Person me = new Person("Tony Proctor", 1956);
for (Event e : me.allEvents()) {
    for (Person other : e.allPersons()) {
        if (other.nameContains("Jesson")) {
            // ...do something with this other person...
        }
    }
}

This example uses a VB6-like syntax.

Dim me As New Person
me.setPersonName ("Tony Proctor")
me.setDateOfBirth (1956)
Dim e As Event
Dim other As Person
For Each e In me.allEvents()
    For Each other In e.allPersons()
        If (other.nameContains("Jesson")) Then
            ' ...do something with this other person...
        End If
    Next other
Next e

The syntax is very different but the essential elements of the object model, such as class names and method names, are the same in the two examples.
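For anyone who wants something that actually compiles, the Java-like example could be fleshed out along the following lines. The bodies of Person and Event here are minimal stand-ins that I have assumed purely so that the loop has something to run against (including the addPerson method, the invented names, and the substring matching rule); no standard object model defines them this way.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryDemo {
    // Persons sharing at least one event with 'subject' whose name contains 'token'.
    static Set<Person> connected(Person subject, String token) {
        Set<Person> found = new LinkedHashSet<Person>(); // a set, so nobody is listed twice
        for (Event e : subject.allEvents()) {
            for (Person other : e.allPersons()) {
                if (other.nameContains(token)) found.add(other);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Person me = new Person("Tony Proctor", 1956);
        Person a = new Person("Mary Jesson", 1958); // invented person
        Person b = new Person("John Smith", 1960);  // invented person
        Event reunion = new Event("Family reunion");
        reunion.addPerson(me);
        reunion.addPerson(a);
        reunion.addPerson(b);

        for (Person p : connected(me, "Jesson")) {
            System.out.println(p.getName()); // prints Mary Jesson
        }
    }
}

class Person {
    private final String name;
    private final int birthYear;
    private final List<Event> events = new ArrayList<Event>();

    Person(String name, int birthYear) {
        this.name = name;
        this.birthYear = birthYear;
    }

    String getName() { return name; }
    int getBirthYear() { return birthYear; }
    List<Event> allEvents() { return events; }

    // Case-insensitive substring test; the real matching rule is an open design choice.
    boolean nameContains(String token) {
        return name.toLowerCase(Locale.ROOT).contains(token.toLowerCase(Locale.ROOT));
    }
}

class Event {
    private final String label;
    private final List<Person> persons = new ArrayList<Person>();

    Event(String label) { this.label = label; }

    String getLabel() { return label; }

    // Links both ways so that either side can be navigated from the other.
    void addPerson(Person p) {
        persons.add(p);
        p.allEvents().add(this);
    }

    List<Person> allPersons() { return persons; }
}
```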

So, in summary, irrespective of where the data is coming from (memory, database, or afar), if an object model is available then any processing or queries can be expressed through scripting languages. Furthermore, those segments of code can be pushed down to the data server in order to achieve efficient retrieval in the case of remote data stores.

Citations

One instance where I have found it beneficial to expose an object model was in the processing of citation-elements in order to generate a formatted citation. A citation-element is a discrete datum that can be identified in a citation, such as an author, title, publication date, etc. When a system generates a formatted citation using a citation-template then it takes various citation-element values and inserts them at selected places in a textual template. The onus is on the hosting software to provide the citation-template with all the relevant values, and this can be a heavy burden if it doesn’t have intimate knowledge of the actual template.

Suppose that the citation is required to name the Genealogical Publishing Co. of Baltimore. How much information should the hosting program pass to the citation-template to let the reader know where that company is? Is a prefix of “Baltimore:” enough, or does it require “Baltimore, Maryland:”, or does it require “Baltimore, Maryland, US:”? Remember that the reader may not be American, and if you think they should know all the US states then would you be aware of all the regions in their country?

The alternative approach is to pass a Place object to the citation-template and allow it to invoke the associated methods to obtain the specific information that it requires for the current template and the current user. The same approach can be applied to other data such as a Person object (allowing access to name-handling methods), or a Date object (allowing the formatting of dates from alternative calendars in short/medium/long/full forms).
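Here is how that might look in a Java sketch. The Place methods (shortName, withRegion, fullName) and the rule used by the template (add the country only when the reader is in a different one) are assumptions of mine for illustration; an actual citation-template could apply whatever rule suits its style guide.

```java
class CitationDemo {
    public static void main(String[] args) {
        Place baltimore = new Place("Baltimore", "Maryland", "US");
        String publisher = "Genealogical Publishing Co.";

        System.out.println(CitationTemplate.publisherClause(baltimore, publisher, "US"));
        // prints Baltimore, Maryland: Genealogical Publishing Co.
        System.out.println(CitationTemplate.publisherClause(baltimore, publisher, "GB"));
        // prints Baltimore, Maryland, US: Genealogical Publishing Co.
    }
}

// Hypothetical Place object: the template asks it for whichever form it needs.
class Place {
    private final String locality;
    private final String region;
    private final String countryCode;

    Place(String locality, String region, String countryCode) {
        this.locality = locality;
        this.region = region;
        this.countryCode = countryCode;
    }

    String shortName() { return locality; }
    String withRegion() { return locality + ", " + region; }
    String fullName() { return withRegion() + ", " + countryCode; }
    String getCountryCode() { return countryCode; }
}

// The template, not the hosting program, decides how much detail the reader needs.
class CitationTemplate {
    static String publisherClause(Place place, String publisher, String readerCountry) {
        String where = place.getCountryCode().equals(readerCountry)
                ? place.withRegion()
                : place.fullName();
        return where + ": " + publisher;
    }
}
```

The hosting program passes one object and makes no decision about granularity at all.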

Reporting

A more recent application of an object model occurred in STEMMA’s narrative support. The inclusion of a segment of script in a STEMMA table entity allowed it to fetch specific data to populate the cells at the time the associated Narrative entity was rendered. This basically meant that a Narrative entity could be used as a report writer (in software terminology) to fetch up-to-date data matching a custom query and to present it along with narrative or other data.

To explain the significance of this, let’s look at the HTML equivalent since it is also possible to dynamically populate an HTML table. Once upon a time, programmers used to do this by injecting new HTML source into the table using the JavaScript document.write() function. For instance:


document.write("<tr>");

This allowed, say, the rows and cells of a table to be generated from some data such as the results of a SQL query or the contents of a JavaScript array. This is a very old method, and has acknowledged security risks, as well as performance implications in certain cases. An equivalent way of injecting HTML source is to modify the innerHTML property of a given node. For instance:

node.innerHTML = "<b>Hello World</b>";

A better approach, though, would be to use methods provided by the HTML Document Object Model (DOM): the tree of node objects into which an HTML document is parsed. Examples are document.createElement() and node.appendChild(). In the specific case of tables, W3C defined DOM methods such as tbody.insertRow() and tr.insertCell() to help build the table body more easily.

The STEMMA case is slightly different from that of HTML since it does not have an inherent DOM (the contents of a marked-up Narrative entity have to be transformed into some other system for presentation, which might be an HTML page, a blog page, or a word-processor document), but it does expose a tentative genealogical object model. This means that the STEMMA table entity can specify a segment of script code to call upon that object model and retrieve results for the table contents. It doesn’t do this by dynamically injecting STEMMA source, or by calling on something like the DOM methods to build up a table; the script populates an object that simply describes the contents of a table (the headings and data cells), and this can be transformed for display in the same manner as a static table would.
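The general shape of that idea, though not STEMMA's actual implementation, can be sketched in Java as follows. TableData and HtmlRenderer are hypothetical names of mine: the "script" only fills in a neutral description of headings and cells, and a separate step transforms that description into one particular presentation (HTML here, but it could equally be something else).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableDemo {
    public static void main(String[] args) {
        // The "script" part: populate a description of the table, nothing more.
        TableData table = new TableData(Arrays.asList("Name", "Born"));
        table.addRow(Arrays.asList("Tony Proctor", "1956"));

        // The transformation part: one of several possible renderings.
        System.out.println(HtmlRenderer.render(table));
        // prints <table><tr><th>Name</th><th>Born</th></tr><tr><td>Tony Proctor</td><td>1956</td></tr></table>
    }
}

// Presentation-neutral description of a table: headings and rows of cells.
class TableData {
    private final List<String> headings;
    private final List<List<String>> rows = new ArrayList<List<String>>();

    TableData(List<String> headings) {
        this.headings = new ArrayList<String>(headings);
    }

    void addRow(List<String> cells) {
        if (cells.size() != headings.size()) {
            throw new IllegalArgumentException("row width must match headings");
        }
        rows.add(new ArrayList<String>(cells));
    }

    List<String> getHeadings() { return headings; }
    List<List<String>> getRows() { return rows; }
}

// Transforms the description for display, just as a static table would be.
class HtmlRenderer {
    static String render(TableData t) {
        StringBuilder sb = new StringBuilder("<table><tr>");
        for (String h : t.getHeadings()) {
            sb.append("<th>").append(escape(h)).append("</th>");
        }
        sb.append("</tr>");
        for (List<String> row : t.getRows()) {
            sb.append("<tr>");
            for (String c : row) {
                sb.append("<td>").append(escape(c)).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    // Escape the characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because the script never emits markup itself, the injection worries associated with document.write() and innerHTML do not arise in the same way.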

What this all means is that the existing support for representing narrative essays, narrative reports, transcriptions, etc., has also become a tool for dynamically reporting on genealogical data. I don’t need to devise some completely separate tool to achieve this, and I also get the benefit of mix-and-match where dynamically generated tables can be embedded in my narrative reports. And just in case anyone hasn’t realised yet, if the underlying data is subsequently modified then I will see the very latest data the next time I view the associated Narrative entity.



[1] Tony Proctor, "Proposal to Create a Standard Run-time Object Model", FHISO Call For Papers, CFPS 9 (http://tech.fhiso.org/cfps/files/cfps19.pdf : accessed 9 Mar 2016).
[2] Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml : accessed 9 Mar 2016).
[3] Using capitalisation, here, merely to distinguish software entities from the real-life entities that they are representing.
