GeneaBloggers

Monday, 26 August 2013

The Commercial Realities of Data Standards


Are we Modelling Data or Commerce?


The tone of certain Internet posts and related discussions has recently moved from genealogical data formats to genealogical data models. What does this mean, though, and where will it all end? What commercial factors might be at play here?

It is no coincidence that James Tanner has recently discussed this same topic on his blog in ‘What happened to GEDCOM?’, ‘Are data-sharing standards possible?’, and ‘What is a genealogical Data Model?’, but I want to give a very different perspective here.

A data format (more correctly called a ‘serialisation format’) is a specific physical representation of your data in a file using a particular data syntax. A data model, on the other hand, is a description of the shape and structure of the data without using a specific syntax. In order to illustrate this, consider how a person is related to their biological parents. To state that each person is linked to just one father and to one mother might be part of a data model specification. However, the nature of that linkage, and what it might look like, is only relevant when discussing a specific data format.
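As a rough illustration of that distinction, the sketch below defines one small data model and then writes it out in two different serialisation formats. Everything here is hypothetical: the `Person` class, its field names, and the tagged-line syntax are invented for this example and are not taken from GEDCOM, STEMMA, or any other specification.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """Data model: each person links to at most one father and one mother.

    Nothing here says what the link looks like in a file.
    """
    id: str
    name: str
    father_id: Optional[str] = None
    mother_id: Optional[str] = None


def to_json(person: Person) -> str:
    """One possible data format: parent links as JSON properties."""
    return json.dumps({
        "id": person.id,
        "name": person.name,
        "father": person.father_id,
        "mother": person.mother_id,
    })


def to_tagged_lines(person: Person) -> str:
    """Another (invented) data format: one 'TAG value' pair per line."""
    lines = [f"PERSON {person.id}", f"NAME {person.name}"]
    if person.father_id:
        lines.append(f"FATHER {person.father_id}")
    if person.mother_id:
        lines.append(f"MOTHER {person.mother_id}")
    return "\n".join(lines)


child = Person(id="I3", name="Ann Smith", father_id="I1", mother_id="I2")
print(to_json(child))
print(to_tagged_lines(child))
```

Both outputs carry the same model-level facts; only the physical representation of the parent links differs.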

The most widely accepted function of these data formats and data models is for the exchange of genealogical data between people, and hence between different software products, possibly running on different types of machine and in different locales. Sharing our data is a fundamental tenet of genealogy – if it weren’t possible then genealogy wouldn’t work.

STEMMA® describes these data formats, and itself, as source formats in order to emphasise their fundamental nature. They are essentially an unambiguous textual representation from which any number of indexed forms can be generated. This, in turn, is analogous to the source code for a programming language from which a compiled form can be generated unambiguously for different machine architectures. Note that no data format or data model makes any specification about indexes or database schemas. They’re the prerogative of the designers of the software products that process such data.
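The source-code analogy can be sketched in a few lines: a textual source is parsed once, and an index is then derived from it. The tagged-line syntax, the `parse` and `index_by_surname` helpers, and the choice of a surname index are all assumptions made for this illustration; a real product would choose its own indexes and database schema, which is exactly the point.

```python
from collections import defaultdict

# Hypothetical textual source: the authoritative, index-free representation.
SOURCE_TEXT = """\
PERSON I1
NAME John Smith
PERSON I2
NAME Mary Jones
PERSON I3
NAME Ann Smith
"""


def parse(text: str) -> dict:
    """Read the textual source into a plain id -> record mapping."""
    people = {}
    current = None
    for line in text.splitlines():
        tag, _, value = line.partition(" ")
        if tag == "PERSON":
            current = value
            people[current] = {}
        elif tag == "NAME" and current is not None:
            people[current]["name"] = value
    return people


def index_by_surname(people: dict) -> dict:
    """Derive one of many possible indexes; the source does not prescribe it."""
    index = defaultdict(list)
    for person_id, record in people.items():
        surname = record["name"].split()[-1]  # naive: last word is the surname
        index[surname].append(person_id)
    return dict(index)


people = parse(SOURCE_TEXT)
print(index_by_surname(people))
```

The index can be thrown away and regenerated at any time, or replaced by a different one, without touching the source text.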

OK, so much for the principle but what do we have at the moment? The representation that we’re all most familiar with is GEDCOM (GEnealogical Data COMmunication), which is a data format developed in the mid-1980s by The Church of Jesus Christ of Latter-day Saints (aka LDS Church or Mormon Church) as an aid to their research. This format gets regular criticism but it has been in constant use ever since it was first developed, despite not having been updated since about 1995. A later XML-based GEDCOM 6.0 was proposed but never released. GEDCOM is termed a de facto standard because it is not recognised by any standards body, and it came about by being the only player in town. In fact, it is still pretty much the only player in town. There are other data formats and data models but they’re usually self-serving – meaning that they were conceived in order to support a specific proprietary product – or they’re niche R&D projects. A number of projects have set out to define a more modern standard and failed, including GenTech, BetterGEDCOM, and OpenGen.

Why is a proper data standard so important? Well, there are several reasons:

  • Unambiguous specification. Standards are written in a clear and precise manner, and without using marketing-speak or peacock terms.
  • International applicability. Standards have to address a global market rather than just the domestic market of the author.
  • Longevity. Data in a standard format can always be resurrected (or converted) because the specification is open – not locked away on some proprietary development machine.

So what is wrong with the GEDCOM that we currently have? Unfortunately, GEDCOM is quite limited in its scope, and is acknowledged to be focused on biological lineage rather than generic family history. Some of the better documented issues with GEDCOM include:

  • No support for multi-person events resulting in duplication and redundancy.
  • Loosely-defined support for source citations resulting in incompatible implementations.
  • No top-level support for Places.
  • No ordering for Events other than by date (which may not be known).
  • No support for interpersonal relations outside of traditional marriages.
  • Use of the ANSEL character standard which has recently been administratively withdrawn (14th February 2013). Although Unicode support was added in v5.3, and UTF-8 proposed in a later v5.5.1 draft, this has still not been implemented by some vendors.
  • No support for narrative text, as in transcribed evidence or reasoning.
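
To make the first of these issues concrete, here is a minimal sketch contrasting per-person copies of an event with a single shared event record. The dictionary layout, the `Event` class, and the sample census data are invented for illustration and do not reproduce GEDCOM syntax or any product's behaviour.

```python
from dataclasses import dataclass, field
from typing import List

# Per-person style: the same census is stored once for every participant.
per_person_events = {
    "I1": [{"type": "census", "date": "1851-03-30", "place": "Nottingham"}],
    "I2": [{"type": "census", "date": "1851-03-30", "place": "Nottingham"}],
}


@dataclass
class Event:
    """Shared style: one event record that lists all of its participants."""
    type: str
    date: str
    place: str
    participants: List[str] = field(default_factory=list)


def merge_shared_events(per_person: dict) -> List[Event]:
    """Guess which per-person copies describe the same event.

    Matching on identical type/date/place is only a heuristic: the
    duplicated form does not record that the copies are one event.
    """
    shared = {}
    for person_id, events in per_person.items():
        for e in events:
            key = (e["type"], e["date"], e["place"])
            if key not in shared:
                shared[key] = Event(*key)
            shared[key].participants.append(person_id)
    return list(shared.values())


events = merge_shared_events(per_person_events)
print(events)
```

An importer receiving the duplicated form has to rely on a guess like this one, and a small difference in how the date or place was typed is enough to defeat it.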

The specification document is not of international-standard quality, which has resulted in ambiguous interpretations. Perhaps more importantly, though, vendors have provided selective implementations. They have tended to implement only the parts relevant to their own product, and this obviously impacts the ability to import data from other products.

So, if a new standard is so obviously essential for users to share their data then what are the obstacles preventing it? Why did this not happen years ago, or even a decade ago? Are there some commercial realities that might be at play here?

Another potential function of a data standard is for the long-term storage and preservation of our data. It would be great for users to be able to create a safe backup copy of all their data (with nothing lost) in a standard format, and even to have the option of bequeathing that to an archive when they can no longer continue with it. By contrast, bequeathing a proprietary database may be useless if the associated product becomes extinct. The first obstacle, though, is that no standard format currently exists with sufficient power and scope. The second problem is that vendors might get uneasy about it since it would then provide a slick and reliable method of moving from their product to a different product.

Software development will continue to evolve, though, and products must become more powerful and more usable. This results in a wider separation between the data that the products use internally and the data that can be exported to other products. In effect, GEDCOM has become a throttled exchange mechanism – almost an excuse that means vendors don’t have to share everything. Whether intentionally or not, all products are becoming islands because they cannot share accurately and fully with every other product.
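
The widening gap between internal and exportable data can be pictured with a small sketch. The field names and the `EXCHANGE_FIELDS` set are hypothetical; they stand in for whatever a product stores internally and whatever a limited exchange format is able to carry.

```python
# Hypothetical set of fields that the exchange format is able to carry.
EXCHANGE_FIELDS = {"name", "birth_date", "birth_place"}


def export_record(internal: dict) -> dict:
    """Keep only the fields the exchange format can represent."""
    return {k: v for k, v in internal.items() if k in EXCHANGE_FIELDS}


def lost_on_export(internal: dict) -> set:
    """Report which internal fields have no place in the exchange format."""
    return set(internal) - EXCHANGE_FIELDS


record = {
    "name": "John Smith",
    "birth_date": "1820",
    "birth_place": "Nottingham",
    "proof_argument": "Matched to the 1851 census household by age and trade.",
    "media_annotations": ["photo-1: back row, second from left"],
}

print(export_record(record))
print(sorted(lost_on_export(record)))
```

As a product's internal record grows richer, the set reported by `lost_on_export` grows with it, while the exchange format stays the same size.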

Another potential factor is the cost of assimilating some newer and more powerful data model. Every product has its own internal data model that supports its operational capabilities. If a standard model is widely different to that in terms of concepts or scope then it could result in unjustifiable development costs for a small vendor. Although we’re talking about external data models — those used for exchange or preservation — there is bound to be some impact on the internal data models of existing products, especially during data import. Software development is an expensive process and a true open-source development might be more agile in this respect.

In June 2013, Ryan Heaton of FamilySearch published a post on GitHub that disputed the possibility of an all-encompassing data model, describing it as a myth: ‘GEDCOM X Media Types’. His arguments are a little vague because he tries to distinguish the GEDCOM X Conceptual Model as narrowly focused on data exchange, but this is precisely the area under discussion. Representation of data outside of any product, whether for exchange or preservation, is exactly what we’re talking about. In some respects genealogy needs to evolve and generally grow up. We continue to get mired in the distinction between genealogical data and family history data, and religiously refer to ‘genealogical data models’ to the exclusion of other forms of micro-history. One goal of the STEMMA Data Model was to cater for other types of micro-history data, including that for One-Name Studies, One-Place Studies, personal historians (as in APH), house histories, etc. It is therefore evident that I completely believe in the possibility of a single representational model.


You may be wondering why I didn’t mention FHISO (Family History Information Standards Organisation) above. FHISO is relatively new in the field of data standards, and grew out of the older BetterGEDCOM wiki project. Its remit includes all data standards connected with genealogy, and this even includes the possibility of “fixing” GEDCOM so that it at least works as it was intended to. When FHISO looks at a more powerful replacement, though – something that will still be a necessity – then how smoothly will that go? FHISO has significant industry support but when it comes to adoption of a new standard, what will be the incentive to vendors? The advantages to users are clear and obvious but will commercial realities leave them with less than they want?

A possibility that has occurred in several other industry sectors is where a large, generic software organisation comes out with a killer product — one designed to kill off all the smaller products currently fighting for market share. Sometimes this happens by acquisition-and-enhancement of an existing product and sometimes through completely new development. This is a very real possibility and I guarantee that our industry is being examined already, especially as genealogy and other forms of micro-history become ever more popular. If there’s money to be made then some large company will develop a product to satisfy as many users as possible (geographically and functionally), and employ their powerful marketing to sell it deeper and further than anything else. In some ways this might be good since we finally get a data standard that isn’t as constraining as GEDCOM, and without any of the silly arguing. On the other hand, we won’t have had any say in its functional requirements. As in those other industry cases, it will begin as a de facto standard and then be submitted for something like an ISO designation because the company knows the ropes, and has done that several times before.

Is that something we — users and vendors — really want to risk? It’s time to look at the bigger picture.