Saturday, 14 December 2013

Digital Freedom



Our digital data has many sets of named types, such as event types. These sets can become a straitjacket if they are rigidly predefined, but are extensible sets at odds with the concept of a data standard? The answer is a resounding ‘No!’.



If you wanted to create a custom event type of, say, ‘Military Service’ then would your software let you? If it did then would that custom type be accepted by someone else using the same product, or by someone else using an entirely different product? The answer will be ‘No’ to at least one of these questions, but there is no good reason for it. It makes sense to predefine useful and common event types such as Birth, Death, Baptism, etc., but a finite list will ultimately be inadequate. There will always be some less-common event type that doesn’t fit, or you may require special event types for a different culture, or you may simply want the freedom to define your own event types in order to represent your personal history.

I want to explain how easily sets of extensible types, and other tag-names or tag-values[1], could be implemented in software. This is primarily for people who aren’t software professionals, although software professionals might find it interesting too.

When a system defines a closed set of predefined types, options, or terms, then it is referred to as a controlled vocabulary. When that predefined set can be extended or enhanced then it is referred to as a partially controlled vocabulary.

As a simple example of a controlled vocabulary, let’s look at date formatting. Many systems now differentiate four basic date styles: {Short, Medium, Long, Full}. The software has a default recipe for how to format a date for your locale in each style. Although you can probably tailor any one of those default recipes to your own personal preferences, there are no other style names available for selection.
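
The difference between the two kinds of vocabulary can be sketched in a few lines of Python. This is only an illustration: the Vocabulary class and the variable names are my own inventions, not part of any product or standard, and the event type names are just the examples used in this post.

```python
class Vocabulary:
    """A set of named types; `extensible` marks it as partially controlled."""

    def __init__(self, predefined, extensible=False):
        self.predefined = frozenset(predefined)
        self.extensible = extensible
        self.custom = set()

    def add(self, term):
        """Register a custom term; only allowed when the set is extensible."""
        if not self.extensible:
            raise ValueError(f"closed vocabulary: cannot add {term!r}")
        self.custom.add(term)

    def __contains__(self, term):
        return term in self.predefined or term in self.custom


# Controlled: the four date styles, with no way of adding a fifth.
date_styles = Vocabulary({"Short", "Medium", "Long", "Full"})

# Partially controlled: predefined event types plus room for custom ones.
event_types = Vocabulary({"Birth", "Death", "Baptism"}, extensible=True)
event_types.add("MilitaryService")
```

Calling date_styles.add() with any new name raises an error, whereas event_types accepts the custom entry alongside the predefined ones.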

For genealogical data, there may be many applications of such vocabularies, both controlled and partially controlled: event types, properties (aka “facts” to everyone else), place types, role names, status values, name types, name parts, qualitative assessments (e.g. primary/secondary, original/derivative, etc.), family types, and sex.

Sex is quite interesting – no, seriously! If someone defined a controlled vocabulary of just {Male, Female, Unknown} then you might wonder about other variations of sex, gender, and lifestyle. However, sex and gender are different concepts, and the birth sex is different again to some variant adopted as part of a later event. See Sex and Gender for more details.

So how can we have both a predefined set of types and retain the ability to create custom ones, whilst also avoiding clashes with anyone else’s custom types or future predefined types? This is really the crux of the problem, and it splits the practical applications into two categories. If the types are part of a passive set, such as event types, then extensibility is not only simple but custom types could be loaded by any other compliant application. However, if the types have structural or procedural connotations then they cannot be loaded by another application without it having knowledge of the associated structure or procedure. An example of the latter category is the record types used to store the data.

GEDCOM allows custom record types (aka “tags”) but it merely recommends that their names have a leading underscore character. The specification document for the GEDCOM 5.5 release[2] contains the following explanatory paragraph:

To ensure all transmitted information in the Lineage-Linked GEDCOM is uniformly identified the standardized tags cannot be placed in any other context than shown in Chapter 2. It is legal to extend the context of the form, but only by using user-defined tags which must begin with an underscore. This will not violate the lineage-linked GEDCOM standard unless the context for the grammar of the Lineage-Linked GEDCOM Form is violated. The use of the underscore in the user tag name is to signal a nonstandard construct is being used. This notifies the reading system of a discrepancy and will avoid future conflicts with tags that may be standardized in subsequent GEDCOM releases.

This may have prevented custom tags from clashing with GEDCOM ones reserved in later releases, but it never prevented clashes between alternative customisations. Simply using an underscore prefix is clearly not a workable solution. Also, any program designed around the official GEDCOM tags could do nothing more than ignore custom tags. An example list of predefined and custom tags may be found at http://www.gencom.org.nz/GEDCOM_tags.html.
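
To make the clash concrete, here is a small Python sketch. The two products and the _SRV tag are entirely hypothetical (I am not quoting any real program's custom tags), and find_clashes is just a helper written for this illustration.

```python
# Hypothetical custom tags from two imaginary products (not real GEDCOM tags).
product_a = {"_SRV": "Military service"}
product_b = {"_SRV": "Church service attended"}


def find_clashes(*tag_sets):
    """Return tags that appear in more than one set with different meanings."""
    seen, clashes = {}, {}
    for tags in tag_sets:
        for tag, meaning in tags.items():
            if tag in seen and seen[tag] != meaning:
                # Same tag, different meaning: record every meaning found.
                clashes.setdefault(tag, {seen[tag]}).add(meaning)
            seen.setdefault(tag, meaning)
    return clashes


clashes = find_clashes(product_a, product_b)
```

Both products obey the underscore rule, yet a program reading data from both has no way of knowing which meaning of _SRV was intended.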

The solution to this comes from the world of XML in the form of XML namespaces. Although I will talk about XML a little here, this general approach could be applied to any data representation. A namespace is simply a named container for a set of tag-names (i.e. element or attribute names). By attributing each set of tag-names to its embracing namespace, no two names will ever clash and so the overall vocabulary can be extended through the inclusion of new namespaces.
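
Since the idea is independent of XML, it can be sketched in plain Python by treating every name as a pair: the namespace that owns it, plus the local name. The QName class and both namespace names below are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class QName(NamedTuple):
    """A tag-name qualified by the namespace that owns it."""
    namespace: str
    local: str


# Hypothetical namespace names, for illustration only.
NS_A = "http://product-a.example/tags"
NS_B = "http://product-b.example/tags"

# The same local name can be used twice because the namespaces differ.
meanings = {
    QName(NS_A, "Service"): "Military service",
    QName(NS_B, "Service"): "Church service attended",
}
```

The two entries coexist happily in the same dictionary because the qualified names are different, even though the local names are identical.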




Let’s briefly look at how XML represents namespaces internally. It firstly defines a short prefix for each namespace name, and then applies that prefix to all associated tag-names to distinguish them from each other, and from names in the default namespace which has no prefix. For example:

<root xmlns:my="http://veg.mydomain.com"
     xmlns:your="http://furniture.yourdomain.com">

     <my:table>
          <my:tr>
               <my:td> Apples </my:td>
               <my:td> Bananas </my:td>
          </my:tr>
     </my:table>

     <your:table>
          <your:item> Coffee Table </your:item>
          <your:width> 80 </your:width>
          <your:length> 120 </your:length>
     </your:table>

</root>

Here, my:table is distinct from your:table as they belong to separate namespaces. The xmlns attributes associate each prefix with its respective namespace name.
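
For anyone curious how a program sees this, the following sketch feeds the same document to Python's standard xml.etree.ElementTree module. The variable names are mine, and the prefix "f" used in the lookup is deliberately different from the one in the document to show that only the namespace name matters.

```python
import xml.etree.ElementTree as ET

DOC = """\
<root xmlns:my="http://veg.mydomain.com"
      xmlns:your="http://furniture.yourdomain.com">
  <my:table>
    <my:tr><my:td>Apples</my:td><my:td>Bananas</my:td></my:tr>
  </my:table>
  <your:table>
    <your:item>Coffee Table</your:item>
    <your:width>80</your:width>
    <your:length>120</your:length>
  </your:table>
</root>
"""

root = ET.fromstring(DOC)

# ElementTree reports each tag in expanded form: {namespace-name}local-name.
tags = [child.tag for child in root]
print(tags)
# ['{http://veg.mydomain.com}table', '{http://furniture.yourdomain.com}table']

# The prefix is only a local shorthand, so a reader may choose its own.
ns = {"f": "http://furniture.yourdomain.com"}
item = root.find("f:table/f:item", ns).text
print(item)  # Coffee Table
```

The parser has replaced each prefix with the full namespace name, which is why the two table elements can never be mistaken for each other.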

The XML namespace name is technically a URI but not a URL[3]. This basically means that it is not designed to be dereferenced or to access any associated resource. It is simply a unique identifier which distinguishes one namespace from another. The http: prefix, which confuses many people, simply indicates that the namespace name is derived from a network domain name that you, or your organisation, own. In other words, no separate registration scheme is required here.

The syntax of a URI actually allows namespace names to be derived from unique roots other than domain names, such as email addresses, but they are rarely seen in practice. Another advantage of a URI over, say, a UUID (which is simply a string of letters and digits with no visible semantics) is that several can be created from the same root, such as “mydomain.com”. This allows you to create namespaces for distinct sets of identifiers, and support versioning of those sets.

In the XML case, namespaces also support new structural information being added to a data schema using something called XML Schema Definition (XSD). This allows each namespace to define a grammar for its contributions to the underlying XML syntax. For instance, in the above example, it could specify what elements can exist below your:table, how many of each there can be, and what ordering is required. Although I give an outline example on the STEMMA® site at Extended Schemas, I’m not particularly in favour of this level of extensibility.

So, coming back to the original topic, how does this help with genealogical types? Strictly speaking, XML’s namespaces only apply to its tag-names, although the principle has been extended since XML’s conception to include tag-values too. For instance:

<Dataset Name='Example'
     xmlns:MyEv='http://mydomain.com/myevents'>
     ...etc...

     <Event Key='eExample'>
          <Type> MyEv:FamilyOuting </Type>
          ... etc ...
     </Event>

One of the earliest examples of this approach that I am aware of is the Simple Object Access Protocol (SOAP). STEMMA also follows this route and its page at Extended Vocabularies enumerates all of its own controlled and partially controlled vocabularies.
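
As a rough sketch of how a reading program might resolve a prefixed value such as MyEv:FamilyOuting, the Python below collects the prefix declarations while parsing and then splits each Type value into a namespace name and a local name. It is not taken from STEMMA or SOAP: the function name is my own, the sample document is a completed version of the fragment above, and treating an unprefixed value as a predefined type is an assumption on my part. It also ignores prefix scoping, which a real implementation would need to handle.

```python
import io
import xml.etree.ElementTree as ET

DOC = """\
<Dataset Name='Example' xmlns:MyEv='http://mydomain.com/myevents'>
  <Event Key='eExample'>
    <Type> MyEv:FamilyOuting </Type>
  </Event>
</Dataset>
"""


def event_types(xml_text):
    """Yield (namespace-name, local-name) for every <Type> value."""
    prefixes = {}  # prefix -> namespace name, as declared so far
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag == "Type":
            prefix, sep, local = payload.text.strip().rpartition(":")
            # Assumption: no prefix means a predefined type (namespace None).
            yield (prefixes[prefix] if sep else None, local)


resolved = list(event_types(DOC))
print(resolved)  # [('http://mydomain.com/myevents', 'FamilyOuting')]
```

Once resolved in this way, the custom type is identified by the full namespace name rather than by whatever prefix the writing program happened to choose.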

I’ve deliberately picked on event types as an illustration because I’ve already advocated much greater use of events, both protracted and hierarchical, in order to model the real-life events in our personal histories[4]. Hence, if we wanted to define an event for a “family outing” that we had evidence for, or distinguish the civil registration of a birth from the birth itself by using a separate event type[5], or create a new type for some culturally-dependent event, then we should not be constrained by a predefined list.

An equally good illustration could have involved Properties (aka “facts”) since the items of extracted evidence that we may want to record will depend strongly on the nature of the information source, and on the relevant culture. The STEMMA example at Multi-Role Events includes both custom Roles and custom Properties.

What I’ve described here is simply a mechanism. The internals of a data representation would be hidden by a good product, and you wouldn’t be creating these files by hand. Someone is going to ask, though, about foreign-language versions, and it’s worth emphasising that what you enter and what you see are not merely copies of what’s stored in your data. Having a simple mapping of the programmatic term (e.g. MyEv:FamilyOuting) to a readable string for the locale of the current end-user (e.g. “Family Outing”) is one of the few pieces of configuration necessary in a compliant product.
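
That mapping really can be simple. The following Python sketch shows one possible shape for it; the LABELS table, the French string, and the display_name function are hypothetical, and a real product would more likely key the table on the full namespace name than on the prefix.

```python
# Hypothetical display strings; keys are (locale, programmatic term).
LABELS = {
    ("en", "MyEv:FamilyOuting"): "Family Outing",
    ("fr", "MyEv:FamilyOuting"): "Sortie en famille",
}


def display_name(term, locale, default_locale="en"):
    """Return the label for `term` in `locale`, falling back sensibly."""
    for loc in (locale, default_locale):
        if (loc, term) in LABELS:
            return LABELS[(loc, term)]
    return term  # last resort: show the stored term itself


print(display_name("MyEv:FamilyOuting", "fr"))  # Sortie en famille
print(display_name("MyEv:FamilyOuting", "de"))  # Family Outing
```

The stored data never changes; only the string shown to the end-user depends on their locale.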

By way of contrast, the schema.org mark-up supports “external enumerations” (http://www.w3.org/wiki/WebSchemas/ExternalEnumerations). These allow its core vocabulary to be supplemented by external ones which must be accessible via real URLs. The aforementioned document describes these external vocabularies as “controlled” (i.e. closed) and specifies criteria for their viability, essentially removing any freedom from their creation.



[1] I’m using the generic terms tag-name and tag-value here to represent the name and value of a datum, respectively. I am aware that the term ‘tag’ has specific meaning elsewhere. For instance, in GEDCOM it’s synonymous with its record names. In XML, it refers to the name of an element in angle brackets, with or without a ‘/’ character, e.g. <x>, </x>, and <x/>.
[2] “Appendix A: Lineage-Linked GEDCOM Tag Definition” in The GEDCOM Standard: Release 5.5 (Family History Department of The Church of Jesus Christ of Latter-day Saints, 2 Jan 1996).
[3] Uniform Resource Identifiers (URI), Uniform Resource Locators (URL), and Uniform Resource Names (URN) are often confused. The URI represents a general class of resource identifier which includes both URL and URN. The URN always begins with a urn: scheme prefix and has a restricted syntax designed for the hierarchical naming of resources. Its NID term, which follows the scheme prefix, has to be registered with the IANA for it to be official.
[4] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html).
[5] Many people in Britain are guilty of confusing these by taking the year and quarter from the GRO civil registration index and recording them as the date of the vital event itself.
