Are we Modelling Data or Commerce?
The focus of certain Internet posts and related discussions
has recently shifted from genealogical data
formats to genealogical data models.
What does this mean, though, and where will it all end? What commercial factors
might be at play here?
It is no coincidence that James Tanner has recently
discussed this same topic on his blog in 'What
happened to GEDCOM?', 'Are
data-sharing standards possible?', and 'What
is a genealogical Data Model?', but I want to give a very different
perspective here.
A data format
(more correctly called a ‘serialisation format’) is a specific physical
representation of your data in a file using a particular data syntax. A data model, on the other hand, is a
description of the shape and structure of the data without using a specific
syntax. In order to illustrate this, consider how a person is related to their
biological parents. To state that each person is linked to just one father and to
one mother might be part of a data model specification. However, the nature of
that linkage, and what it might look like, is only relevant when discussing a
specific data format.
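To make that distinction concrete, here is a minimal sketch in Python. The Person class stands in for a fragment of a data model, and the two functions render the very same data in two different serialisation formats: one using GEDCOM-like tagged lines (simplified and invented here, not real GEDCOM, which links parents via family records) and one using XML-like markup. All the names and tags are my own inventions for illustration only.

    from dataclasses import dataclass
    from typing import Optional

    # A sketch of a data model: each person links to at most one father and one
    # mother. The class and field names are invented for illustration.
    @dataclass
    class Person:
        ident: str
        name: str
        father: Optional["Person"] = None
        mother: Optional["Person"] = None

    dad = Person("I1", "John Smith")
    mum = Person("I2", "Mary Jones")
    child = Person("I3", "Ann Smith", father=dad, mother=mum)

    # One possible data (serialisation) format: GEDCOM-like tagged lines.
    # (Simplified; real GEDCOM expresses parentage via family records.)
    def to_tagged_lines(p: Person) -> str:
        lines = [f"0 @{p.ident}@ INDI", f"1 NAME {p.name}"]
        if p.father:
            lines.append(f"1 FATH @{p.father.ident}@")
        if p.mother:
            lines.append(f"1 MOTH @{p.mother.ident}@")
        return "\n".join(lines)

    # Another possible format: the same model rendered as XML-like markup.
    def to_markup(p: Person) -> str:
        father = f' father="{p.father.ident}"' if p.father else ""
        mother = f' mother="{p.mother.ident}"' if p.mother else ""
        return f'<person id="{p.ident}"{father}{mother}><name>{p.name}</name></person>'

    print(to_tagged_lines(child))
    print(to_markup(child))

Both outputs carry exactly the same information; only the physical syntax differs, and that boundary is precisely the line between a data model and a data format.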
The most widely accepted function of these data formats and data
models is for the exchange of genealogical data between people, and hence
between different software products, possibly running on different types of
machine and in different locales. Sharing our data is a fundamental tenet of
genealogy — if it wasn’t possible then it wouldn’t work.
STEMMA® describes these data formats,
and itself, as source formats in
order to emphasise their fundamental nature. They are essentially an
unambiguous textual representation from which any number of indexed forms can
be generated. This, in turn, is analogous to the source code for a programming language from which a compiled form
can be generated unambiguously for different machine architectures. Note
that no data format or data model makes any specification about indexes or
database schemas. They’re the prerogative of the designers of the software
products that process such data.
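As a small illustration of that division of responsibility, the following Python sketch derives a surname index from a tiny, invented tagged-line serialisation (again, not real GEDCOM). The index is entirely the product's own affair; a different product could just as legitimately build a date index, or load the same records into a relational schema of its own design, without the format saying anything about either.

    from collections import defaultdict

    # A tiny, invented tagged-line serialisation of two individuals.
    # (Not real GEDCOM; the syntax is simplified for illustration.)
    SOURCE_TEXT = """\
    0 @I1@ INDI
    1 NAME John /Smith/
    0 @I2@ INDI
    1 NAME Mary /Jones/
    """

    def build_surname_index(text: str) -> dict[str, list[str]]:
        """Derive a surname -> record-id index from the serialised text.
        The index is a product-side artefact; the format says nothing about it."""
        index: dict[str, list[str]] = defaultdict(list)
        current_id = None
        for line in text.splitlines():
            parts = line.split(maxsplit=2)
            if len(parts) >= 3 and parts[0] == "0" and parts[2] == "INDI":
                current_id = parts[1].strip("@")
            elif len(parts) >= 3 and parts[1] == "NAME" and current_id:
                # The surname is conventionally enclosed in slashes.
                if "/" in parts[2]:
                    surname = parts[2].split("/")[1]
                    index[surname].append(current_id)
        return dict(index)

    print(build_surname_index(SOURCE_TEXT))
    # {'Smith': ['I1'], 'Jones': ['I2']}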
OK, so much for the principle, but what do we have at the
moment? The representation that we're all most familiar with is GEDCOM (GEnealogical Data
COMmunication), a data format developed in the mid-1980s by The Church of
Jesus Christ of Latter-day Saints (aka the LDS Church or Mormon Church) as an aid
to their research. This format attracts regular criticism, but it has been in
constant use ever since it was first developed, despite not having been updated
since about 1995. A later XML-based GEDCOM 6.0 was proposed but never released.
GEDCOM is termed a de facto standard
because it is not ratified by any standards body; it achieved that status
simply by being the only player in town. In fact, it
is still pretty much the only player in town. There are other data formats and
data models, but they're usually self-serving (meaning that they were conceived
in order to support a specific proprietary product) or they're niche R&D
projects. A number of projects have started, and failed, to define a more modern
standard, including GenTech, BetterGEDCOM, and OpenGen.
Why is a proper data standard so important? Well, there are
several reasons:
- Unambiguous specification. Standards are written in a clear and precise manner, and without using marketing-speak or peacock terms.
- International applicability. Standards have to address a global market rather than just the domestic market of the author.
- Longevity. Data in a standard format can always be resurrected (or converted) because the specification is open — not locked away on some proprietary development machine.
So what is wrong with the GEDCOM that we currently have? Unfortunately,
GEDCOM is quite limited in its scope, and is acknowledged to be focused on
biological lineage rather than generic family history. Some of the better documented
issues with GEDCOM include:
- No support for multi-person events, resulting in duplication and redundancy (illustrated in the sketch after this list).
- Loosely-defined support for source citations resulting in incompatible implementations.
- No top-level support for Places.
- No ordering for Events other than by date (which may not be known).
- No support for interpersonal relations outside of traditional marriages.
- Use of the ANSEL character standard, which has recently been administratively withdrawn (14th February 2013). Although Unicode support was added in v5.3, and UTF-8 was proposed in a later v5.5.1 draft, some vendors have still not implemented these newer encodings.
- No support for narrative text, as in transcribed evidence or reasoning.
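To illustrate the first of those points, here is another small Python sketch, using invented record structures rather than real GEDCOM or STEMMA syntax. It contrasts the duplication forced by per-person events with a single shared event that simply references its participants: a census appearance involving three people becomes three near-identical copies in the first scheme but one record in the second.

    from dataclasses import dataclass, field

    # Scheme 1: per-person events, roughly how GEDCOM forces it. The same census
    # appearance is copied into each participant's record, duplicating the details.
    person_events = {
        "I1": [{"type": "Census", "date": "1881", "place": "Nottingham"}],
        "I2": [{"type": "Census", "date": "1881", "place": "Nottingham"}],
        "I3": [{"type": "Census", "date": "1881", "place": "Nottingham"}],
    }

    # Scheme 2: a single multi-person event that merely references its participants.
    @dataclass
    class Event:
        type: str
        date: str
        place: str
        participants: list[str] = field(default_factory=list)

    shared_event = Event("Census", "1881", "Nottingham", ["I1", "I2", "I3"])

    # The duplication in scheme 1 also invites inconsistency: correcting the place
    # would mean editing three records instead of one.
    print(sum(len(evts) for evts in person_events.values()))  # prints 3 (duplicated copies)
    print(len(shared_event.participants))                     # prints 3 (references to one event)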
The specification document is not of international-standards
quality, which has resulted in some ambiguous interpretations. Perhaps
more importantly, though, vendors have provided selective implementations. They have tended to implement only the
parts relevant to their own product, and this obviously impacts the ability to
import data from other products.
So, if a new standard is so obviously essential for
users to share their data, then what are the obstacles preventing it? Why did
this not happen years ago, or even a decade ago? Are there some commercial
realities that might be at play here?
Another potential function of a data standard is for the
long-term storage and preservation of our data. It would be great for users to
be able to create a safe backup copy of all their data (with nothing lost) in a
standard format, and even to have the option of bequeathing that to an archive
when they can no longer continue with it. By contrast, bequeathing a
proprietary database may be useless if the associated product becomes extinct.
The first obstacle, though, is that no standard format currently exists with
sufficient power and scope. The second problem is that vendors might get uneasy
about it since it would then provide a slick and reliable method of moving from
their product to a different product.
Software development will continue to evolve, though, and
products must become more powerful and more usable. This results in a wider separation
between the data that the products use internally and the data that can be
exported to other products. In effect, GEDCOM has become a throttled exchange
mechanism — almost an excuse that means vendors don’t have to share
everything. Whether intentionally or
not, all products are becoming islands because they cannot share accurately and
fully with every other product.
Another potential factor is the cost of assimilating some
newer and more powerful data model. Every product has its own internal data
model that supports its operational capabilities. If a standard model differs
widely from it in terms of concepts or scope, then it could result in
unjustifiable development costs for a small vendor. Although we’re talking
about external data models — those used for exchange or preservation — there is
bound to be some impact on the internal data models of existing products,
especially during data import. Software development is an expensive process and
a true open-source development might be more agile in this respect.
In June 2013, Ryan
Heaton of FamilySearch published a post on GitHub that disputed the possibility
of an all-encompassing data model, describing it as a myth: GEDCOM
X Media Types. His arguments are a little vague because he tries to
distinguish the GEDCOM X Conceptual Model as narrowly focused on data exchange,
but this is precisely the area under discussion. Representation of data outside
of any product, whether for exchange or preservation, is exactly what we’re
talking about. In some respects genealogy needs to evolve and generally grow
up. We continue to get mired in the distinction between genealogical data and
family history data, and religiously refer to ‘genealogical data models’ to the
exclusion of other forms of micro-history.
One goal of the STEMMA Data Model was to cater for other types of micro-history
data, including that for One-Name Studies, One-Place Studies,
personal historians (as in APH), house histories, etc. It is
therefore evident that I completely believe in the possibility of a single
representational model.
You may be
wondering why I didn’t mention FHISO (Family
History Information Standards Organisation) above. FHISO are relatively new in
the field of data standards, and grew out of the older BetterGEDCOM wiki
project. Their remit includes all data standards connected with genealogy, and
this even includes the possibility of “fixing” GEDCOM so that it at least works
as it was intended to. When FHISO looks at a more powerful replacement, though
— something that will still be a necessity — then how smoothly will that go?
FHISO has significant industry support but when it comes to adoption of a new
standard, what will be the incentive to vendors? The advantages to users are
clear and obvious but will commercial realities leave them with less than they
want?
A scenario that
has played out in several other industry sectors is that a large, generic
software organisation comes out with a killer product: one designed to kill
off all the smaller products currently fighting for market share. Sometimes
this happens by acquisition-and-enhancement of an existing product and
sometimes through completely new development. This is a very real possibility
and I guarantee that our industry is being examined already, especially as
genealogy and other forms of micro-history become ever more popular. If there’s
money to be made then some large company will develop a product to satisfy as
many users as possible (geographically and functionally), and employ their
powerful marketing to sell it deeper and further than anything else. In some
ways this might be good since we finally get a data standard that isn’t as
constraining as GEDCOM, and without any of the silly arguing. On the other
hand, we won’t have had any say in its functional requirements. As in those
other industry cases, it will begin as a de facto standard and then be
submitted for something like an ISO designation because the company knows the
ropes, and has done that several times before.
Is that something
we — users and vendors — really want to risk? It’s time to look at the bigger
picture.