Yes, I will be discussing data models and software issues in
this post — sigh! — but hopefully in a manner that may be instructive to those
who want to know more. As well as introducing certain important concepts, I
want to illustrate some typically tricky decisions that have to be made, before
rounding off with a novel way of presenting custom data to the end-user. Know-it-alls
who already have a software background can just fast-forward to the “Reporting”
section.
Some time ago, I submitted a paper to FHISO on the subject
of object models[1] in
which I explained their relevance to scripting languages, query languages, and
general dynamic data access. Admittedly, this paper was aimed at a software
audience, but let’s try and pull it apart to explain what these languages are,
and the difference between a data format, a data model, and an object model.
A data model is a
formalised description of the relevant data entities (e.g. person, or place),
their properties (e.g. names, sex, coordinates, etc.), and their relationships
(e.g. biological lineage, or place of birth). Issues such as indexes and
database schemas do not apply to a data model, because the model is a more
abstract definition of the data’s structure, or rather its pattern. Issues such
as cardinality (how many items of one type can be related to another),
ordinality (the ordering of items related to one instance of another), and
optionality (whether an item is mandatory or optional) are relevant, though.
Let’s consider one aspect of a genealogical data model to
help illustrate this point: biological lineage. Every person had one mother and
one father, even if they are unidentified; there cannot be more than one of
each (ignoring the possibility of donor DNA) but some representations of this
will be better than others.
If the parent person entity (e.g. a mother) points to each
individual child of hers then it accommodates the erroneous situation where
multiple mothers might point to the same child. However, the converse of having
each person entity point to their respective mother and father enforces the
cardinal integrity of the relationship without having to perform constant error
checking.
It should be noted, here, that the direction of the link
makes no difference to the ability to find children-of-a-parent or
parents-of-a-child; both can be indexed from either one of the schemes.
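As a minimal sketch of the child-to-parent scheme just described, each person below holds at most one mother link and one father link, and the children-of-a-parent view is derived as an index from those same links. The class and method names (LineagePerson, LineageIndex, childrenOf) are assumptions made up for this illustration, not part of any published model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical person entity: the child points to its parents, so a
// second mother or father simply cannot be represented.
class LineagePerson {
    final String id;
    LineagePerson mother;   // null means no link has been recorded
    LineagePerson father;   // null means no link has been recorded

    LineagePerson(String id) { this.id = id; }
}

// Hypothetical index that derives children-of-a-parent from the
// child-to-parent links, so both directions can be searched.
class LineageIndex {
    private final Map<String, List<LineagePerson>> childrenByParent = new HashMap<>();

    LineageIndex(List<LineagePerson> people) {
        for (LineagePerson p : people) {
            if (p.mother != null) add(p.mother, p);
            if (p.father != null) add(p.father, p);
        }
    }

    private void add(LineagePerson parent, LineagePerson child) {
        childrenByParent.computeIfAbsent(parent.id, k -> new ArrayList<>()).add(child);
    }

    // Returns an empty list for a person with no recorded children.
    List<LineagePerson> childrenOf(LineagePerson parent) {
        return childrenByParent.getOrDefault(parent.id, new ArrayList<>());
    }
}
```

The integrity comes from the shape of the data rather than from error checking: there is only one slot for each parent, so no validation pass is needed to reject a second mother.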
So what if you don’t know the father? Well, having a missing
father or mother link could easily be recognised as an indication that they are
unidentified, but suppose that you have some incomplete details for them?
Suppose, for instance, that you don’t know the name of the mother but you have
her date of birth. This is where we get into controversial territory, and where
I’m going to make a very bold statement. There are many threads that advocate
the substitution of underscores, question marks, or some special text such as
“Unknown”, “LNU” (Last Name Unknown), etc., for a missing name. In a software
context, all of these are absolutely the wrong way to handle the situation.
This is not a personal belief; it is a best practice in a
profession that I have spent decades in. It doesn’t matter who is making those
recommendations, and there are no special cases for genealogists.
OK, rant over; now let me clarify this. Ignoring the fact
that any alphabetic text may not translate well when sharing your data with
someone from a different locale, the choice over which substitution to use for
a missing name, or any missing datum, should not be given to the end-user, or
even to a specific software product. More than that, good software would
represent non-value conditions of a datum, such as unknown, not applicable,
or erroneous, in a different domain
to real values so that there is zero chance of a clash. For instance, consider
the difference in a SQL database between NULL (a special condition in a column)
and “NULL” (a normal textual value in a column). Also, the display value by
which those special data conditions are represented to the end-user is a
choice by the product, and not dictated by the software representation in the
corresponding data format or data model.
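One way to picture a separate domain for non-value conditions is sketched below. The Datum class, its Condition values, and the display() method are hypothetical names chosen for this illustration; the point is only that the condition is stored apart from the text, and that the label shown to the end-user is supplied by the product at display time.

```java
import java.util.function.Function;

// Hypothetical datum that is either a real value or one of several
// non-value conditions held in a separate domain from the text.
final class Datum {
    enum Condition { KNOWN, UNKNOWN, NOT_APPLICABLE, ERRONEOUS }

    private final Condition condition;
    private final String value;   // only meaningful when condition is KNOWN

    private Datum(Condition condition, String value) {
        this.condition = condition;
        this.value = value;
    }

    static Datum of(String value) { return new Datum(Condition.KNOWN, value); }
    static Datum unknown() { return new Datum(Condition.UNKNOWN, null); }
    static Datum notApplicable() { return new Datum(Condition.NOT_APPLICABLE, null); }
    static Datum erroneous() { return new Datum(Condition.ERRONEOUS, null); }

    boolean isKnown() { return condition == Condition.KNOWN; }
    Condition condition() { return condition; }

    // The label for a non-value condition is chosen by the product when
    // displaying the datum; it is never stored in the data itself.
    String display(Function<Condition, String> labelFor) {
        return isKnown() ? value : labelFor.apply(condition);
    }
}
```

With this shape, Datum.of("Unknown") and Datum.unknown() can never be confused, in the same way that NULL and “NULL” are distinct in a SQL column, and two products (or two locales) can present the unknown condition with different text without touching the data.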
I appreciate that some software may not follow these
best-practices, but it is important to understand why this is bad for everyone.
Tamura Jones produced an excellent article in 2013 related to this subject that
discussed the impact of using acronyms and other invented values.[2] As
I recently commented on one of Randy
Seaver’s blogs, “Fake, Fudged, Dummy, and other such ‘special’ values were
bad choices even in the 1970s”.
We’ve just looked at some choices in the relationships
between two entities, and in the representation of non-value conditions. These
cases may provide a basic insight into typical design issues affecting a data
model, but what is a data model good for? It allows two products — a producer and a consumer — to agree on the structure of real data being exchanged
between them. When real data is stored
in a file (or serialised as bytes for some other purpose, such as transmission
over a communications link) then the data format employs a given syntax. That
data format is largely irrelevant in comparison to the data model; the GEDCOM
data model could be expressed in its own proprietary data format, or in XML, or
some other format, and it would be a straightforward mechanical operation to
convert from one format to another if they all conform to the same model.
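The distinction can be sketched with a toy example: one tiny model (a person with a name and a birth year) written out in two invented formats. Neither format below is real GEDCOM or any published XML schema, and PersonRecord and its methods are hypothetical; the sketch only shows that conversion is mechanical when both formats carry the same model.

```java
// Hypothetical fragment of a data model: a person with a name and birth year.
final class PersonRecord {
    final String name;
    final int birthYear;

    PersonRecord(String name, int birthYear) {
        this.name = name;
        this.birthYear = birthYear;
    }

    // One invented data format: an XML-style element (no escaping is done).
    String toXml() {
        return "<person><name>" + name + "</name><birth>" + birthYear + "</birth></person>";
    }

    // Another invented data format: one tag and value per line.
    String toTagged() {
        return "NAME " + name + "\nBIRTH " + birthYear;
    }

    // Reading the line-based format back recovers the same model content.
    static PersonRecord fromTagged(String text) {
        String name = null;
        int birthYear = 0;
        for (String line : text.split("\n")) {
            if (line.startsWith("NAME ")) {
                name = line.substring(5);
            } else if (line.startsWith("BIRTH ")) {
                birthYear = Integer.parseInt(line.substring(6));
            }
        }
        return new PersonRecord(name, birthYear);
    }
}
```

Converting from the tagged format to the XML-style one is then just PersonRecord.fromTagged(text).toXml(); nothing about the person has to be reinterpreted along the way.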
However, such files are a very static way to exchange data,
and they have to be loaded into some organised indexed form before they can be
interrogated, navigated, or manipulated. One example of such a form is a
database, but this is not the only form and genealogical products have other
choices (see “Do Genealogists Really Need a Database?”). The following diagram provides a
simplistic depiction of how a program may access indexed data in a
disk-resident database or in memory, but other variations are possible: for
instance, a memory-resident index pointing to files on disk, as might be the
case with a collection of image files.
STEMMA
deliberately describes its data format as a source
format in order to attach extra semantics; this draws from the term source code, as used in programming, in
order to emphasise that the data is a definitive source for other forms, whether
generated by transformation or by indexing, and not simply an exchange format.
A program that accesses a database is constrained by the
data-types allowed in its columns, the nature of its indexes, normalisation
of its table entities, and the associated query language used to
access those tables (mostly but not always SQL). Software that uses a query
language directly, rather than having some abstraction layer between it and the
target database, may be reducing both its longevity and its portability.
An in-memory index is more efficient and more flexible, and
the ever-increasing memory capacity of modern machines means that it is totally
practical to have both the data and the index in memory together. But what
would the in-memory data look like?
The modern answer to this question is objects. In object orientated
programming (OOP), an
object is a software entity representing one instance of some real-life entity.
For instance, a Person object[3]
would represent one named person, and a Place object one named place. Objects
of the same type are instantiated (i.e. created) from a template called a class, which defines not only the
allowable properties (e.g. names and sex for a person) but also small segments
of code, called methods, that may be
invoked on the associated objects. For instance, there may be a method to test
whether any of the names stored in the current object matches a particular name
provided as a parameter. All products utilising that class in their programming
would therefore use a consistent algorithm, as implemented by the designer of
the associated object model: the set
of related classes intended to cooperate in order to deliver access to
that data. An object model is not uniquely defined by a given data model,
but they do go hand-in-hand; every object model has an underpinning data model.
However, whereas a data model talks about entity relationships, it is the
object model that talks about actual data linkages, indexes, and issues of
efficient data access.
A very important aspect of OOP is software inheritance. This is where one class is
derived from another class and allows it to share unchanged portions while
overriding others; the intention being to create a new class for a more
specialised type of entity. The following is an illustration of how it might be
applied to the various subjects of historical research, each level providing
specialised classes derived from more generic ones in the previous level.
In this illustration, the handling of names could be shared
for all the subject types, and inherited from the historical-subject base class, whereas the hierarchy for
animate subjects (i.e. lineage) would be different from that for inanimate
subjects.
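A minimal sketch of that arrangement follows. All of the class names are hypothetical, and the word used for the inanimate hierarchy ("containment", as in one place lying within another) is purely an assumption for the sake of having something to return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical base class: name handling is written once and shared
// by every type of historical subject.
abstract class HistoricalSubject {
    private final List<String> names = new ArrayList<>();

    void addName(String name) { names.add(name); }

    // True if any stored name contains the token, ignoring case.
    boolean nameContains(String token) {
        String needle = token.toLowerCase(Locale.ROOT);
        for (String n : names) {
            if (n.toLowerCase(Locale.ROOT).contains(needle)) return true;
        }
        return false;
    }

    // Each branch of the class tree supplies its own kind of hierarchy.
    abstract String hierarchyKind();
}

class AnimateSubject extends HistoricalSubject {
    @Override
    String hierarchyKind() { return "lineage"; }
}

class InanimateSubject extends HistoricalSubject {
    @Override
    String hierarchyKind() { return "containment"; }   // assumed for illustration
}

// More specialised classes inherit everything and override nothing here.
class PersonSubject extends AnimateSubject { }
class PlaceSubject extends InanimateSubject { }
```

A PersonSubject and a PlaceSubject both answer nameContains() using the same inherited code, while hierarchyKind() gives a different answer depending on which branch they descend from.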
We might then redraw the earlier data-access diagram as
follows in order to show that the object layer forms an effective abstraction:
The program sees the same application programming interface (API),
irrespective of whether the data is in local memory, in some database, or even across
some network connection. NB: The index associated with the objects would be
implicit in their class definitions, and not really a separate entity as
implied by this diagram.
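The abstraction can be sketched as an interface with interchangeable backings. Everything named here (PersonSource, InMemoryPersonSource, KeyedPersonSource, MatchCounter) is hypothetical, and the second backing is only a stand-in for a database: a real one would issue a query rather than scan a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical API seen by the program. Nothing here reveals where the data lives.
interface PersonSource {
    List<String> namesContaining(String token);
}

// One possible backing: names held directly in local memory.
class InMemoryPersonSource implements PersonSource {
    private final List<String> names;

    InMemoryPersonSource(List<String> names) { this.names = new ArrayList<>(names); }

    @Override
    public List<String> namesContaining(String token) {
        List<String> hits = new ArrayList<>();
        for (String n : names) {
            if (n.contains(token)) hits.add(n);
        }
        return hits;
    }
}

// Another possible backing: a stand-in for keyed database rows.
class KeyedPersonSource implements PersonSource {
    private final Map<Integer, String> rows = new LinkedHashMap<>();

    void insert(int id, String name) { rows.put(id, name); }

    @Override
    public List<String> namesContaining(String token) {
        List<String> hits = new ArrayList<>();
        for (String n : rows.values()) {
            if (n.contains(token)) hits.add(n);
        }
        return hits;
    }
}

// Program code depends only on the interface, so either backing will do.
class MatchCounter {
    static int count(PersonSource source, String token) {
        return source.namesContaining(token).size();
    }
}
```

MatchCounter never needs to change when the data moves from memory to a database or across a network; only the class behind the interface is swapped.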
I’ve just mentioned access across a network, where your data
may be on a separate machine (the server)
to that of your program (the client),
e.g. in the “cloud”. This is a case for which query languages are well suited.
You see, if a data server has millions of records available but the program on
your client machine only wants to see a handful that satisfy some specific criteria,
it would be extremely inefficient to transport all the records up to the client
machine, across a typically slow network — much easier to push the criteria down
to the server and let software there sort it out. This is effectively what
happens when a SQL query is sent to a server hosting a database; the query may
be as small as a single line and the returned records would be just the ones of
interest. There are other forms of query language, such as MDX for
multi-dimensional data queries, and the form of the returned data is highly
dependent on the nature of the query.
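The saving can be illustrated with a toy server that counts how many records it ships. RecordServer and its methods are hypothetical, and a Java Predicate stands in for the criteria only because it keeps the sketch short; over a real network the criteria would travel as a query string or a piece of script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical data server that counts how many records cross the network.
class RecordServer {
    private final List<String> records;
    private int transferred = 0;

    RecordServer(List<String> records) { this.records = new ArrayList<>(records); }

    // Ships every record to the client, which must then filter locally.
    List<String> fetchAll() {
        transferred += records.size();
        return new ArrayList<>(records);
    }

    // Applies the criteria on the server and ships only the matches.
    List<String> query(Predicate<String> criteria) {
        List<String> hits = new ArrayList<>();
        for (String r : records) {
            if (criteria.test(r)) hits.add(r);
        }
        transferred += hits.size();
        return hits;
    }

    // Running total of records sent to the client so far.
    int transferred() { return transferred; }
}
```

Calling query() with a test for "Jesson" adds only the matching records to the transfer count, whereas fetchAll() adds every record held by the server, whatever the client then does with them.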
A scripting language
is usually considered to be an interpreted language, meaning one executed
directly from its source-code representation — as entered by the programmer —
rather than being first compiled into a set of instructions that the machine
can execute directly. Because the source has to be parsed before it can be
interpreted, such languages carry a performance penalty, but their speed of
deployment and ease of maintenance make them very useful for small-scale
applications. Languages such as JavaScript and VBScript are common examples,
but there is little difference from the aforementioned query languages, other
than in their intended function. Many database systems even support a mechanism
where segments of script can be held as stored
procedures that can be invoked later by a special type of query statement.
The relevance of this is that when a scripting language has access to an object
model then you have a very powerful and flexible means of data access.
Suppose we want to express a custom genealogical query — one
not supported intrinsically by our product — to look at all the events in our
timeline, then look at all the persons connected to those same events, and then
to select just the ones whose name(s) have the token "Jesson" in them.
If a standard object model is defined then it doesn’t matter whether we express
our query using a standard scripting language or some proprietary one
associated with the product.
This example uses a java-like syntax.
    Person me = new Person("Tony Proctor", 1956);
    for (Event e : me.allEvents()) {
        for (Person other : e.allPersons()) {
            if (other.nameContains("Jesson")) {
                // ...do something with this other person...
            }
        }
    }
This example uses a VB6-like syntax.
    Dim me As New Person
    me.setPersonName("Tony Proctor")
    me.setDateOfBirth(1956)
    Dim e As Event
    Dim other As Person
    For Each e In me.allEvents()
        For Each other In e.allPersons()
            If (other.nameContains("Jesson")) Then
                ' ...do something with this other person...
            End If
        Next other
    Next e
The syntax is very different but the essential elements of
the object model, such as class names and method names, are the same in the two
examples.
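For anyone wanting to run the query, here is one minimal way the classes behind it might look in real Java. The method names allEvents(), allPersons(), and nameContains() come from the examples; everything else (the attend() helper, the JessonQuery class, and the decision to skip the starting person and any duplicates) is an assumption added for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal event: it just knows who took part.
class Event {
    private final List<Person> persons = new ArrayList<>();

    List<Person> allPersons() { return persons; }
}

// Hypothetical minimal person with a single name and a birth year.
class Person {
    private final String name;
    private final int birthYear;
    private final List<Event> events = new ArrayList<>();

    Person(String name, int birthYear) {
        this.name = name;
        this.birthYear = birthYear;
    }

    String name() { return name; }
    int birthYear() { return birthYear; }
    List<Event> allEvents() { return events; }
    boolean nameContains(String token) { return name.contains(token); }

    // Records participation on both sides of the person-event link.
    void attend(Event e) {
        events.add(e);
        e.allPersons().add(this);
    }
}

class JessonQuery {
    // Persons sharing an event with 'me' whose name contains the token.
    // Skipping 'me' and repeated hits is a choice made for this sketch.
    static List<Person> connected(Person me, String token) {
        List<Person> hits = new ArrayList<>();
        for (Event e : me.allEvents()) {
            for (Person other : e.allPersons()) {
                if (other != me && other.nameContains(token) && !hits.contains(other)) {
                    hits.add(other);
                }
            }
        }
        return hits;
    }
}
```

The nested loops are exactly those of the Java-like example; the only additions are the plumbing needed to make them compile and a list to collect the results.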
So, in summary, irrespective of where the data is coming
from (memory, database, or afar), if an object model is available then any processing
or queries can be expressed through scripting languages. Furthermore, those
segments of code can be pushed down to the data server in order to achieve
efficient retrieval in the case of remote data stores.
One instance where I have found it beneficial to expose an
object model was in the processing of citation-elements in order to generate a
formatted citation. A citation-element is a discrete datum that can be
identified in a citation, such as an author, title, publication date, etc. When
a system generates a formatted citation using a citation-template then it takes
various citation-element values and inserts them at selected places in a textual
template. The onus is on the hosting software to provide the citation-template
with all the relevant values, and this can be a heavy burden if it doesn’t have
intimate knowledge of the actual template.
Suppose that the citation is required to name the
Genealogical Publishing Co. of Baltimore. How much information should the
hosting program pass to the citation-template to let the reader know where that
company is? Is a prefix of “Baltimore:” enough, or does it require “Baltimore,
Maryland:”, or does it require “Baltimore, Maryland, US:”? Remember that the
reader may not be American, and if you think they should know all the US states
then would you be aware of all the regions in their country?
The alternative approach is to pass a Place object to the
citation-template and allow it to invoke the associated methods to obtain the
specific information that it requires for the current template and the current
user. The same approach can be applied to other data such as a Person object
(allowing access to name-handling methods), or a Date object (allowing the formatting
of dates from alternative calendars in short/medium/long/full forms).
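A minimal sketch of the Place case follows. The Place and CitationTemplate classes, the forReader() method, and the rule it applies (short form for readers in the same country, full form otherwise) are all assumptions invented for this illustration; a real template might make a much finer decision.

```java
// Hypothetical Place object offering the citation-template a choice of detail.
class Place {
    private final String city;
    private final String region;
    private final String countryCode;

    Place(String city, String region, String countryCode) {
        this.city = city;
        this.region = region;
        this.countryCode = countryCode;
    }

    // Returns as much of the place hierarchy as this reader is assumed to
    // need: the short form for readers in the same country, else the full form.
    String forReader(String readerCountryCode) {
        if (countryCode.equals(readerCountryCode)) {
            return city;
        }
        return city + ", " + region + ", " + countryCode;
    }
}

// Hypothetical template fragment: it asks the Place for what it needs
// rather than being handed a pre-formatted string by the hosting program.
class CitationTemplate {
    static String publisher(Place place, String publisherName, String readerCountryCode) {
        return place.forReader(readerCountryCode) + ": " + publisherName;
    }
}
```

The hosting program passes the same Place object regardless of the template or the reader; the decision about “Baltimore:” versus “Baltimore, Maryland, US:” is made inside the template, where the knowledge of the citation style lives.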
A more recent application of an object model occurred in
STEMMA’s narrative support. The inclusion of a segment of script in a STEMMA
table entity allowed it to fetch specific data to populate the cells at the
time the associated Narrative entity was rendered. This basically meant that a
Narrative entity could be used as a report writer (in software terminology) to
fetch up-to-date data matching a custom query and to present it along with
narrative or other data.
To explain the significance of this, let’s look at the HTML
equivalent since it is also possible to dynamically populate an HTML table. Once
upon a time, programmers used to do this by injecting new HTML source into the
table using the JavaScript document.write()
function. For instance:
    document.write("<tr>");
This allowed, say, the rows and cells of a table to be
generated from some data such as the results of a SQL query or the contents of a
JavaScript array. This is a very old method, and has acknowledged security risks,
as well as performance implications in certain cases. An equivalent way of
injecting HTML source is to modify the innerHTML
property of a given node. For instance:
    node.innerHTML = "<b>Hello World</b>";
A better approach, though, would be to use the methods provided by
the HTML Document Object Model (DOM): the tree of
node objects into which an HTML document is parsed. Examples are document.createElement() and node.appendChild(). In the specific case of
tables, W3C
defined DOM methods such as tbody.insertRow()
and tr.insertCell() to help build the table
body more easily.
The STEMMA case is slightly different from that of HTML
since it does not have an inherent DOM — the contents of a marked-up Narrative
entity have to be transformed into some other system for presentation, which
might be in an HTML page, a blog page, or a word-processor document — but it
does expose a tentative genealogical object model. This means that the STEMMA
table entity can specify a segment of script code to call upon that object
model and retrieve results for the table contents. It doesn’t do this by
dynamically injecting STEMMA source, or by calling on something like the DOM
methods to build-up a table; the script populates an object that simply
describes the contents of a table — the headings and data cells — and this can
be transformed for display in the same manner as a static table would.
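The idea of a script filling in a description of the table, which is only later transformed for display, might be sketched as follows. This is not STEMMA’s actual mechanism or syntax; TableContent and the two renderer classes are hypothetical names, and no escaping of cell text is attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical description of a table's contents, independent of any
// presentation format. A script would populate this from its query.
class TableContent {
    final List<String> headings;
    final List<List<String>> rows = new ArrayList<>();

    TableContent(String... headings) { this.headings = Arrays.asList(headings); }

    void addRow(String... cells) { rows.add(Arrays.asList(cells)); }
}

// One possible transformation for display: HTML markup.
class HtmlTableRenderer {
    static String render(TableContent t) {
        StringBuilder sb = new StringBuilder("<table><tr>");
        for (String h : t.headings) sb.append("<th>").append(h).append("</th>");
        sb.append("</tr>");
        for (List<String> row : t.rows) {
            sb.append("<tr>");
            for (String c : row) sb.append("<td>").append(c).append("</td>");
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }
}

// Another possible transformation: tab-separated plain text.
class TextTableRenderer {
    static String render(TableContent t) {
        StringBuilder sb = new StringBuilder(String.join("\t", t.headings)).append("\n");
        for (List<String> row : t.rows) {
            sb.append(String.join("\t", row)).append("\n");
        }
        return sb.toString();
    }
}
```

Because the script only ever produces a TableContent, it neither knows nor cares whether the result ends up in an HTML page or somewhere else, and a dynamically populated table passes through the same renderers as a static one.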
What this all means is that the existing support for
representing narrative essays, narrative reports, transcriptions, etc., has
also become a tool for dynamically reporting on genealogical data. I don’t need
to devise some completely separate tool to achieve this, and I also get the
benefit of mix-and-match where dynamically generated tables can be embedded in
my narrative reports. And just in case anyone hasn’t realised yet, if the
underlying data is subsequently modified then I will see the very latest data
the next time I view the associated Narrative entity.
[1] Tony Proctor, “Proposal to
Create a Standard Run-time Object Model”, FHISO Call For Papers, CFPS 19 (http://tech.fhiso.org/cfps/files/cfps19.pdf
: accessed 9 Mar 2016).
[2] Tamura Jones, “FNU
LNU MNU UNK”, Modern Software Experience,
11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml
: accessed 9 Mar 2016).
[3] Using capitalisation,
here, merely to distinguish software entities from the real-life entities that
they are representing.