Wednesday 11 June 2014

Bootstrapping a Data Standard

There haven’t been many discussions about data standards for a while now. What is the practicality of their development? Are there conflicting requirements? What would be a good way to proceed?



In 2012, FHISO (Family History Information Standards Organisation) were created to develop the first industry standard for the representation of genealogical data. This was to be a collaborative project, incorporating interested parties from across the wider community, and result in a freely available, open standard with international applicability. Although they elected their first Board in 2013, they have recently gone “radio silent” until they have sorted out some serious logistical problems with key Board members.[1] What lies ahead for them, though? In The Commercial Realities of Data Standards, I discussed some of the conflicting requirements of a data standard, and also differentiated the terms data model and file format.

To many people, the issue is simply one of developing an agreed file format, and so there shouldn’t be much difficulty. In reality, it is the data model — the abstract representation of the data — that requires the effort to define. Once defined, any number of file formats can be used to represent it physically.
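To make the distinction concrete, here is a minimal Python sketch of one abstract model written to two different physical formats. The `Person` class, its fields, and the JSON and XML layouts are all hypothetical, invented purely for illustration; none of them comes from GEDCOM or from any FHISO draft.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class Person:
    """Hypothetical data-model entity; the fields are illustrative only."""
    id: str
    name: str
    birth_year: int


def to_json(person: Person) -> str:
    # One physical representation of the model.
    return json.dumps(asdict(person), sort_keys=True)


def from_json(text: str) -> Person:
    return Person(**json.loads(text))


def to_xml(person: Person) -> str:
    # A second physical representation of the same model.
    elem = ET.Element("person", id=person.id)
    ET.SubElement(elem, "name").text = person.name
    ET.SubElement(elem, "birthYear").text = str(person.birth_year)
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> Person:
    elem = ET.fromstring(text)
    return Person(
        id=elem.get("id"),
        name=elem.findtext("name"),
        birth_year=int(elem.findtext("birthYear")),
    )


# Both formats round-trip to the same abstract entity.
ann = Person(id="I1", name="Ann Example", birth_year=1850)
assert from_json(to_json(ann)) == from_xml(to_xml(ann)) == ann
```

The two serialisations look nothing alike, yet both carry exactly the same information. That is why the effort belongs in the model rather than in the syntax.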

So why don’t we run with GEDCOM? It was never developed as a standard but it has achieved the status of a de facto standard. Well, we all know that it has serious faults and limitations, and some example analyses can be found on the BetterGEDCOM wiki at Shortcomings of GEDCOM and on Louis Kessler’s blog at Nine Necessities in a GEDCOM Replacement[2]. It also has a weak specification resulting in ambiguous implementation, and a proprietary syntax that would be at odds with the modern world of the Semantic Web. What it has in its favour is widespread acceptance, although this is weakened slightly by selective implementation amongst the various products.

There is some mileage in this possibility but I will come to it later. Almost without exception, the criticisms of GEDCOM are “detail changes”, such as changing the representation of a place, or a person, or a date, or a source citation. By far the biggest issues, though, are in terms of “structural changes”. You see, GEDCOM is a lineage-linked format designed specifically to represent biological lineage. No matter how you interpret the term genealogy[3], this obviously limits the scope of that representation. Any representation of historical data must support multi-person events as a core entity type[4], but GEDCOM is sadly lumbered with mere single-person events. Support for multi-person events is an absolute necessity for the representation of family history.

However, with the addition of some other features, such as:

  • Places as top-level entities, and a possible focus for history.
  • Structured narrative (i.e. incorporating mark-up).
  • Group entities (see military example).

then the representation can avoid what I’ve previously called the “lineage trap” and become applicable to those other types of micro-history that are currently struggling for any type of comprehensive representation. This includes One-Name Studies, One-Place Studies, personal historians, house histories, place histories (as opposed to the history of a place in terms of its people), and organisational histories. That lineage trap occurs when we artificially confine the historical representation to the events of a family. In reality, genealogical research needs that freedom, and yet there is little cost involved in addressing the greater generality. It is mainly a matter of adopting the right perspective when defining the data model.
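As a rough illustration of that perspective, the following Python sketch treats events, places, and groups as top-level entities alongside people. Every class, field, and role name here is an assumption made for the example, not part of any existing or proposed standard. The point is only that an event can link any number of people through named roles, and that a place can carry history with no people attached at all.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Person:
    id: str
    name: str


@dataclass
class Place:
    # A top-level entity, so it can be the focus of a history in its own right.
    id: str
    name: str


@dataclass
class Group:
    # For example a regiment or a household; membership is by person ID.
    id: str
    name: str
    member_ids: List[str] = field(default_factory=list)


@dataclass
class Event:
    id: str
    kind: str
    place_id: Optional[str] = None
    # Maps a role name to the IDs of the people holding that role.
    participants: Dict[str, List[str]] = field(default_factory=dict)


def events_for_person(events: List[Event], person_id: str) -> List[Event]:
    """Return every event in which the person holds any role."""
    return [
        e for e in events
        if any(person_id in ids for ids in e.participants.values())
    ]


def events_at_place(events: List[Event], place_id: str) -> List[Event]:
    """Return every event at the place, whether or not people are attached."""
    return [e for e in events if e.place_id == place_id]


events = [
    Event("E1", "marriage", "P1",
          {"bride": ["I1"], "groom": ["I2"], "witness": ["I3"]}),
    Event("E2", "flood", "P1"),  # place history with no participants
]
witnessed = events_for_person(events, "I3")   # the marriage only
place_history = events_at_place(events, "P1")  # both events
```

A lineage-only view falls out of this as a special case: it is simply the subset of events whose roles imply a biological relationship.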

Although this sounds excellent, and it could unify the disparate parts of the historical-research community, there are a couple of obstacles to the approach. The first is the effect on existing products (both desktop and online) and whether their creators would want to increase the scope to this extent. Imagine, for instance, a very simple tree-building product. Let’s suppose that it merely holds the details of people, including the dates of their vital events (birth, marriage, and death), and links to their offspring. Exporting to a representation with an historical focus should not be a problem since the necessary parts for depicting lineage would be a subset within it. However, what should it do if it tried to import a contribution generated by a more powerful product, one that included the elements mentioned above? Indeed, the lineage part might be empty if it represented, say, the history of a place. Unless that data representation were accepted by a very significant part of the genealogical software world, the extra effort involved might be dismissed as unjustifiable.
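One possible import policy for such a simple product, sketched here in Python under purely hypothetical assumptions (records are plain dictionaries with a `type` key, and the product understands only two record types), is to keep what it recognises and report what it had to ignore rather than fail outright:

```python
from typing import Dict, List, Tuple

# Record types the hypothetical tree-building product understands.
SUPPORTED_TYPES = {"person", "parent_child_link"}


def import_lineage_subset(
    records: List[dict],
) -> Tuple[List[dict], Dict[str, int]]:
    """Keep supported records; count the skipped ones by type."""
    kept: List[dict] = []
    skipped: Dict[str, int] = {}
    for record in records:
        record_type = record.get("type", "unknown")
        if record_type in SUPPORTED_TYPES:
            kept.append(record)
        else:
            skipped[record_type] = skipped.get(record_type, 0) + 1
    return kept, skipped


# A place history: nothing in it maps onto the lineage subset.
place_history = [
    {"type": "place", "id": "P1", "name": "Example Village"},
    {"type": "event", "id": "E1", "kind": "flood", "place": "P1"},
    {"type": "narrative", "id": "N1", "about": "P1"},
]
kept, skipped = import_lineage_subset(place_history)
```

For the place history above, nothing is kept and three record types are reported as skipped. This is exactly the empty-lineage case just described, and it shows why a vendor might question the value of supporting the import at all.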

Another obstacle is that GEDCOM, in conjunction with tree-orientated products, has effectively compromised the structure of existing data.[5] Evidence will have been associated directly with people, possibly with duplication being necessary, rather than with the relevant events. This compromise may be beyond refactoring, meaning that the data would have to be exported as-is rather than restructured to take advantage of a better representation.

So what about a two-tier standard? Many standards have different levels of scope to which conformity can be targeted, and I have recently criticised the ISO 8601 date standard for not doing this. One basic idea would be to revamp GEDCOM in order to fix the known issues, and to invest in some structural changes in order to make it conceptually closer to a true historical standard. If an historical standard were defined now then GEDCOM would be incompatible with it since it does not have an event-based structure (in addition to its lineage). Creating a “GEDCOM+” would effectively build a bridge between the existing, very old, GEDCOM and that comprehensive standard for historical data since the new data models would be aligned. It would only be the scope of the data models that would differ.

Being able to factor-out enough of the data model of an historical standard, and then apply it to the lineage-linked GEDCOM in order to give it proper representations for its vital events, would be an interesting challenge. Event-linked variations of GEDCOM have been proposed before, but the important goal here would be ensuring a smooth path from the GEDCOM+ to the full historical standard, and ensuring that the latter is a superset of the former.

This idea was actually kicked around within FHISO since there are some interesting advantages. Although FHISO had very significant support within the industry, it still needed to establish its credibility. Working on a GEDCOM+ could have provided that opportunity, and allowed it to deliver a proper, documented standard in a realistic timeframe while the bigger standard could be developed to a longer timescale. Providing a newer GEDCOM would also have made more friends in the industry, and in the general community, since the incompatibility between products is still a major gripe of genealogists and family historians at all levels.

Things never run that smoothly, though. GEDCOM is a proprietary format and the name is owned by The Church of Jesus Christ of Latter-day Saints (LDS Church). While several developers have created variations of it, having their development work condoned, and being given permission to use the same name, are very unlikely scenarios. FamilySearch, the genealogical arm of the Church, were one of the most visible absences from the FHISO membership. Although the developers of GEDCOM-X were supportive, and understood that what they were doing and what FHISO were doing were fundamentally different, the Church as a whole is very complex and that same level of understanding was not pervasive. Had FamilySearch been a member then that work could have been done with their involvement. There might still have been a political issue with it appearing that FHISO was working for FamilySearch but that would have been surmountable. It would have been possible to proceed anyway, and simply use a name other than GEDCOM, but then that might have introduced further fragmentation into the industry, and we certainly don’t need that.

Development of any successful new standard has to consider the upgrade path: the path by which existing software can embrace the new standard, and by which existing data can be migrated. If you believe, as I do, that the future needs more than a representation of lineage, and that there’s no fundamental reason to exclude other forms of micro-history, then that automatically introduces a gulf between that new representation and GEDCOM. Bootstrapping a new standard by initially working on an enhanced GEDCOM model is the best way to proceed. The data syntax of that representation is irrelevant (it could use the old proprietary syntax if there were a benefit to existing vendors) but its data model would help to bridge the old and new worlds.




[1] The issues are health-related, and I have been assured that FHISO has not gone away — as some have supposed — and that it will come back as a more organised and productive organisation.
[2] While Louis’s suggestions make mostly good sense, I am not citing the list as one that I entirely agree with. In particular, suggestion 7 (“No Extensions”) needs a note. I agree that schema extensions should be avoided, but certain types of extension are essential for generality and would not limit conformity. See both Extensibility and Digital Freedom.
[3] See “What is Genealogy?”, Blogger.com, Parallax View, 1 May 2014 (http://parallax-viewpoint.blogspot.com/2014/05/what-is-genealogy.html).
[4] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html); also, the follow-up “Eventful Genealogy - Part II”, 6 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy-part-ii.html).
[5] See “Evidence and Where to Stick It”, Blogger.com, Parallax View, 24 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/evidence-and-where-to-stick-it.html).
