Our digital data has many sets of named types, such as event
types. These sets can become a straitjacket if they are rigidly predefined,
but are extensible sets at odds with the concept of a data standard? The answer
is a resounding ‘No!’.
If you wanted to create a custom event type of, say,
‘Military Service’ then would your software let you? If it did then would that
custom type be accepted by someone else using the same product, or by someone
else using an entirely different product? The answer will be ‘No’ to at least
one of these questions, but there is no good reason for it. It makes sense to
predefine useful and common event types such as Birth, Death, Baptism, etc.,
but a finite list will ultimately be inadequate. There will always be some
less-common event type that doesn’t fit, or you may require special event types
for a different culture, or you may simply want the freedom to define your own
event types in order to represent your personal history.
I want to explain how easily sets of extensible types, and
other tag-names or tag-values[1],
could be implemented in software. This is primarily for people who aren’t
software professionals, although they might find it interesting too.
When a system defines a closed set of predefined types,
options, or terms, that set is referred to as a controlled vocabulary. When the predefined set can be extended or
enhanced, it is referred to as a partially
controlled vocabulary.
As a simple example of a controlled vocabulary, let’s look at
date formatting. Many systems now differentiate four basic date styles: {Short,
Medium, Long, Full}. The software has a default recipe for how to format a date
for your locale in each style. Although you can probably tailor any one of
those default recipes to your own personal preferences, there are no other
style names available for selection.
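As a rough sketch of how a controlled vocabulary looks inside software, the four style names can be modelled as a closed enumeration. The Python below is illustrative only: the DateStyle name, the format_date function, and the format recipes are assumptions made for this example, not those of any particular product.

```python
from datetime import date
from enum import Enum


class DateStyle(Enum):
    """The closed set of style names: nothing else can be selected."""
    SHORT = "Short"
    MEDIUM = "Medium"
    LONG = "Long"
    FULL = "Full"


# Hypothetical default recipes for one locale. A user could tailor a
# recipe, but could not add a fifth style name.
RECIPES = {
    DateStyle.SHORT: "%d/%m/%y",
    DateStyle.MEDIUM: "%d %b %Y",
    DateStyle.LONG: "%d %B %Y",
    DateStyle.FULL: "%A, %d %B %Y",
}


def format_date(value: date, style_name: str) -> str:
    """Format a date using a named style; unknown names raise ValueError."""
    style = DateStyle(style_name)  # lookup by value rejects names outside the set
    return value.strftime(RECIPES[style])
```

Asking for any style outside the four, say "Huge", fails immediately, which is exactly what makes the vocabulary controlled.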
For genealogical data, there may be many applications of such
vocabularies, both controlled and partially controlled: event types, properties (aka “facts” to
everyone else), place types, role names, status values, name types, name parts,
qualitative assessments (e.g. primary/secondary, original/derivative, etc.),
family types, and sex.
Sex is quite
interesting – no, seriously! If someone defined a controlled vocabulary of just
{Male, Female, Unknown} then you might wonder about other variations of sex,
gender, and lifestyle. However, sex and gender are different concepts, and the
birth sex is different again to some variant adopted as part of a later event.
See Sex
and Gender for more details.
So how can we have both a predefined set of types and retain
the ability to create custom ones, whilst also avoiding clashes with anyone
else’s custom types or future predefined types? This is really the crux of the
problem, and it splits the practical applications into two categories. If the
types are part of a passive set, such as event types, then extensibility is not
only simple but custom types could be loaded by any other compliant
application. However, if the types have structural or procedural connotations then
they cannot be loaded by another application without it having knowledge of the
associated structure or procedure. An example of the latter category is the
record types used to store the data.
GEDCOM allows custom record types (aka “tags”) but it merely
recommends that their names have a leading underscore character. The
specification document for the GEDCOM 5.5 release[2]
contains the following explanatory paragraph:
To ensure all transmitted information in the Lineage-Linked GEDCOM is
uniformly identified the standardized tags cannot be placed in any other
context than shown in Chapter 2. It is legal to extend the context of the form,
but only by using user-defined tags which must begin with an underscore. This
will not violate the lineage-linked GEDCOM standard unless the context for the
grammar of the Lineage-Linked GEDCOM Form is violated. The use of the underscore
in the user tag name is to signal a nonstandard construct is being used. This
notifies the reading system of a discrepancy and will avoid future conflicts
with tags that may be standardized in subsequent GEDCOM releases.
This may have prevented
custom tags from clashing with GEDCOM ones reserved in later releases, but it never
prevented clashes between alternative customisations. Simply using an
underscore prefix is clearly not a workable solution. Also, any program
designed around the official GEDCOM tags could do nothing more than ignore
custom tags. An example list of predefined and custom tags may be found
at http://www.gencom.org.nz/GEDCOM_tags.html.
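The weakness of the underscore convention can be shown with a small sketch. The Python below is only an illustration: the two products, the _MILT tag, and its meanings are invented for the example and are not taken from any real GEDCOM export.

```python
# Hypothetical custom tags from two different products. The tag name and
# both meanings are invented purely for illustration.
PRODUCT_A_TAGS = {"_MILT": "Military service event"}
PRODUCT_B_TAGS = {"_MILT": "Militia membership fact"}


def is_custom(tag: str) -> bool:
    """GEDCOM 5.5 convention: user-defined tags begin with an underscore."""
    return tag.startswith("_")


def clashes(a: dict, b: dict) -> set:
    """Custom tags that both products define, but with different meanings."""
    return {t for t in a.keys() & b.keys() if is_custom(t) and a[t] != b[t]}
```

Here clashes(PRODUCT_A_TAGS, PRODUCT_B_TAGS) reports _MILT. The underscore tells a reader that the tag is non-standard, but nothing in the tag says whose customisation it is.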
The solution to
this comes from the world of XML
in the form of XML namespaces.
Although I will talk about XML a little here, this general approach could be
applied to any data representation. A namespace
is simply a named container for a set of tag-names (i.e. element or attribute
names). By attributing each set of tag-names to its embracing namespace, no two
names will ever clash and so the overall vocabulary can be extended through
the inclusion of new namespaces.
Let’s briefly
look at how XML represents namespaces internally. It first defines a short prefix
for each namespace name, and then applies
that prefix to all associated tag-names to distinguish them from each other,
and from names in the default namespace which has no prefix. For example:
<root xmlns:my="http://veg.mydomain.com"
      xmlns:your="http://furniture.yourdomain.com">
  <my:table>
    <my:tr>
      <my:td> Apples </my:td>
      <my:td> Bananas </my:td>
    </my:tr>
  </my:table>
  <your:table>
    <your:item> Coffee Table </your:item>
    <your:width> 80 </your:width>
    <your:length> 120 </your:length>
  </your:table>
</root>
Here, my:table
is distinct from your:table as they belong to separate namespaces. The xmlns
attributes associate each prefix with its respective namespace name.
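To see that the two table elements really are distinct to software, the same document can be loaded with an ordinary XML parser. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are my own, and the short prefix f used in the lookup is deliberately different from the one in the document to show that a prefix is only shorthand for the namespace name.

```python
import xml.etree.ElementTree as ET

DOC = """
<root xmlns:my="http://veg.mydomain.com"
      xmlns:your="http://furniture.yourdomain.com">
  <my:table>
    <my:tr>
      <my:td> Apples </my:td>
      <my:td> Bananas </my:td>
    </my:tr>
  </my:table>
  <your:table>
    <your:item> Coffee Table </your:item>
    <your:width> 80 </your:width>
    <your:length> 120 </your:length>
  </your:table>
</root>
"""

root = ET.fromstring(DOC)

# ElementTree replaces each prefix with its namespace name in braces,
# so the two "table" elements end up with different qualified names.
tables = [child.tag for child in root]
print(tables)
# ['{http://veg.mydomain.com}table', '{http://furniture.yourdomain.com}table']

# A lookup may use any prefix it likes, provided it maps to the same
# namespace name as the one declared in the document.
lookup = {"f": "http://furniture.yourdomain.com"}
item = root.find("f:table/f:item", lookup)
print(item.text.strip())
# Coffee Table
```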
The XML
namespace name is technically a URI but not a URL[3].
This basically means that it is not designed to be dereferenced or to access
any associated resource. It is simply a unique identifier which distinguishes one
namespace from another. The http: prefix, which confuses many people, is simply
indicating that the namespace name is derived from a network domain name that
you, or your organisation, owns. In other words, no separate registration
scheme is required here.
The syntax of
a URI actually allows namespace names to be derived from unique roots other
than domain names, such as email addresses, but they are rarely seen in
practice. Another advantage of a URI over, say, a UUID (which is simply a string of
letters and digits with no visible semantics) is that several can be created
from the same root, such as “mydomain.com”. This allows you to create namespaces
for distinct sets of identifiers, and support versioning of those sets.
In the XML
case, namespaces also support new structural information being added to a
data schema using something called XML Schema Definition (XSD). This allows each
namespace to define a grammar for its contributions to the underlying XML
syntax. For instance, in the above example, it could specify which elements can exist
below your:table, how many of each there can be, and what ordering
is required. Although I give an outline example on the STEMMA® site at Extended
Schemas, I’m not particularly in favour of this level of extensibility.
So, coming
back to the original topic, how does this help with genealogical types? Strictly
speaking, XML’s namespaces only apply to its tag-names, although the principle
has been extended since XML’s conception to include tag-values too. For
instance:
<Dataset Name='Example'
...etc...
<Event Key='eExample'>
<Type> MyEv:FamilyOuting </Type>
... etc ...
</Event>
One of the
earliest examples of this approach that I am aware of is the Simple Object
Access Protocol (SOAP).
STEMMA also follows this route and its page at Extended
Vocabularies enumerates all of its own controlled and partially controlled
vocabularies.
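A program reading such a tag-value only has to split off the prefix and look up the namespace it stands for. The following Python is a hedged sketch of that step, not STEMMA's actual implementation: both namespace names, the PREFIXES table, the PREDEFINED set, and the resolve function are assumptions invented for the illustration.

```python
from typing import Tuple

# Hypothetical namespace names; neither is taken from STEMMA or any real product.
STANDARD_NS = "http://standard.example.org/event-types"
PREFIXES = {"MyEv": "http://mydomain.com/event-types"}

# A small predefined set, standing in for the standard's own event types.
PREDEFINED = {"Birth", "Death", "Baptism"}


def resolve(type_value: str) -> Tuple[str, str]:
    """Expand a tag-value such as 'MyEv:FamilyOuting' into (namespace, local name)."""
    value = type_value.strip()
    prefix, sep, local = value.partition(":")
    if not sep:  # no prefix, so this must be one of the predefined types
        if value not in PREDEFINED:
            raise ValueError(f"unknown predefined type: {value}")
        return (STANDARD_NS, value)
    if prefix not in PREFIXES:
        raise ValueError(f"undeclared prefix: {prefix}")
    return (PREFIXES[prefix], local)
```

With this in place, resolve("MyEv:FamilyOuting") and resolve("Birth") land in different namespaces, and even a custom type named MyEv:Birth could never be mistaken for the predefined Birth.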
I’ve
deliberately picked on event types as an illustration because I’ve already
advocated much greater use of events, both protracted and hierarchical, in
order to model the real-life events in our personal histories[4].
Hence, if we wanted to define an event for a “family outing” that we had
evidence for, or distinguish the civil registration of a birth from the birth
itself by using a separate event type[5],
or create a new type for some culturally-dependent event, then we should not be
constrained by a predefined list.
An equally
good illustration could have involved Properties (aka “facts”) since the items
of extracted evidence that we may want to record will depend strongly on the
nature of the information source, and on the relevant culture. The STEMMA
example at Multi-Role
Events includes both custom Roles and custom Properties.
What I’ve
described here is simply a mechanism. The internals of a data representation
would be hidden by a good product, and you wouldn’t be creating these files by
hand. Someone is going to ask, though, about foreign-language versions, and
it’s worth emphasising that what you enter and what you see are not merely
copies of what’s stored in your data. Having a simple mapping of the
programmatic term (e.g. MyEv:FamilyOuting) to a readable string for the locale
of the current end-user (e.g. “Family Outing”) is one of the few pieces of
configuration necessary in a compliant product.
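That mapping can be as small as a lookup table. The Python below is a minimal sketch under stated assumptions: the LABELS table, the display_name function, the fallback order, and the French label are all invented for illustration and do not describe any existing product.

```python
# Hypothetical display strings, keyed by programmatic term and then by locale.
LABELS = {
    "MyEv:FamilyOuting": {
        "en": "Family Outing",
        "fr": "Sortie en famille",
    },
}


def display_name(term: str, locale: str, default_locale: str = "en") -> str:
    """Return a readable label for a programmatic term in the given locale."""
    by_locale = LABELS.get(term, {})
    language = locale.split("-")[0].split("_")[0]  # 'fr-CA' or 'fr_CA' -> 'fr'
    # Try the exact locale, then its base language, then the default locale.
    for candidate in (locale, language, default_locale):
        if candidate in by_locale:
            return by_locale[candidate]
    return term  # last resort: show the stored term itself
```

The stored data never changes: a French-speaking user and an English-speaking user would both be looking at the same MyEv:FamilyOuting value, each through their own label.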
By way of
contrast, the schema.org mark-up supports “external enumerations”
(http://www.w3.org/wiki/WebSchemas/ExternalEnumerations).
These allow its core vocabulary to be supplemented by external ones which must
be accessible via real URLs. The aforementioned document describes these
external vocabularies as “controlled” (i.e. closed) and specifies criteria for
their viability, essentially removing any freedom from their creation.
[1] I’m using the generic
terms tag-name and tag-value here to represent the name and value of a datum,
respectively. I am aware that the term ‘tag’ has specific meaning elsewhere.
For instance, in GEDCOM it’s synonymous with its record names. In XML, it
refers to the name of an element in angle brackets, with or without a ‘/’
character, e.g. <x>, </x>, and <x/>.
[2] “Appendix A: Lineage-Linked GEDCOM Tag Definition” in The GEDCOM Standard: Release 5.5 (Family
History Department of The Church of Jesus Christ of Latter-day Saints, 2 Jan
1996).
[3] Uniform Resource
Identifiers (URI),
Uniform Resource Locators (URL), and
Uniform Resource Names (URN), are often
confused. The URI represents a general class of resource identifier which
includes both URL and URN. The URN always begins with a urn: scheme prefix and
has a restricted syntax designed for the hierarchical naming of resources. Its
NID term, which follows the scheme prefix, has to be registered with the IANA
for it to be official.
[4] See “Eventful Genealogy”, Blogger.com, Parallax View,
3 Nov 2013
(http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html).
[5] Many people in
Britain are guilty of confusing these by taking the year and quarter from the
GRO civil registration index and recording them as the date of the vital event
itself.