There haven’t been many discussions about data standards for
a while now. What is the practicality of their development? Are there
conflicting requirements? What would be a good way to proceed?
In 2012, FHISO (the Family
History Information Standards Organisation) was created to develop the first
industry standard for the representation of genealogical data. This was to be a
collaborative project, incorporating interested parties from across the wider
community, and was to result in a freely available, open standard with international
applicability. Although it elected its first Board in 2013, it has
recently gone “radio silent” until it has sorted out some serious logistical
problems with key Board members.[1]
What lies ahead for them, though? In The
Commercial Realities of Data Standards, I discussed some of the conflicting
requirements of a data standard, and also differentiated the terms data model and file format.
To many people, the issue is simply one of developing an
agreed file format, and so there shouldn’t be much difficulty. In reality, it
is the data model — the abstract representation of the data — that requires the
effort to define. Once defined, any number of file formats can be used to
represent it physically.
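To make that distinction concrete, here is a minimal sketch (the entity and function names are my own, purely illustrative): one abstract record from a data model, rendered in two different physical file formats. The model is the structure; the formats are interchangeable serialisations of it.

```python
# One abstract "data model" record, serialised two ways. The field names
# here are hypothetical, not taken from any published standard.
import json

# The abstract model: a person with one vital event (structure, not syntax).
person = {
    "id": "I1",
    "name": "Mary Smith",
    "events": [{"type": "BIRT", "date": "12 Mar 1840", "place": "Dublin, Ireland"}],
}

def to_gedcom_like(p):
    """Render the model in a GEDCOM-style hierarchical line syntax."""
    lines = [f"0 @{p['id']}@ INDI", f"1 NAME {p['name']}"]
    for ev in p["events"]:
        lines += [f"1 {ev['type']}", f"2 DATE {ev['date']}", f"2 PLAC {ev['place']}"]
    return "\n".join(lines)

def to_json(p):
    """Render the same model as JSON; the model itself is unchanged."""
    return json.dumps(p, indent=2)

print(to_gedcom_like(person))
print(to_json(person))
```

Both renderings carry identical information; agreeing on the model is the hard part, after which either syntax (or any other) will do.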
So why don’t we run with GEDCOM? It was never developed
as a standard but it has achieved the status of a de facto standard. Well,
we all know that it has serious faults and limitations, and some example
analyses can be found on the BetterGEDCOM wiki at Shortcomings
of GEDCOM and on Louis Kessler’s blog at Nine Necessities in a GEDCOM
Replacement[2]. It also
has a weak specification, resulting in ambiguous implementations, and a
proprietary syntax that would be at odds with the modern world of the Semantic
Web. What it has in its favour is widespread acceptance, although this is
weakened slightly by selective
implementation amongst the various products.
There is some mileage in this possibility but I will come to
it later. Almost without exception, the criticisms of GEDCOM are “detail
changes”, such as changing the representation of a place, or a person, or a
date, or a source citation. By far the biggest issues, though, are in terms of
“structural changes”. You see, GEDCOM is a lineage-linked format designed
specifically to represent biological lineage. No matter how you interpret the
term genealogy[3],
this obviously limits the scope of that representation. Any representation of
historical data must support multi-person events as a core entity type[4];
this is an absolute necessity for the representation of family history, and
yet GEDCOM is sadly lumbered with mere single-person events.
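A brief sketch of the contrast (the class and role names are mine, not from any standard): an event modelled as a top-level entity that any number of people participate in via roles, rather than as a property hanging off a single person.

```python
# Hypothetical illustration of a multi-person event entity.
from dataclasses import dataclass, field

@dataclass
class Person:
    id: str
    name: str

@dataclass
class Event:
    id: str
    type: str        # e.g. "Marriage", "Census"
    date: str
    place: str
    # The event is a top-level entity; people participate via roles,
    # so one event can connect any number of persons.
    roles: list = field(default_factory=list)   # (person_id, role) pairs

groom = Person("I1", "John Doe")
bride = Person("I2", "Mary Smith")
witness = Person("I3", "Robert Roe")

wedding = Event("E1", "Marriage", "4 Jun 1862", "Cork, Ireland",
                roles=[("I1", "groom"), ("I2", "bride"), ("I3", "witness")])

# In lineage-linked GEDCOM the witness has no natural home: the marriage
# hangs off a FAM record, and other participants are duplicated or lost.
participants = [pid for pid, _ in wedding.roles]
print(participants)
```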
However, if we were to add some other features, such as:
- Places as top-level entities, and a possible focus for history.
- Structured narrative (i.e. incorporating mark-up).
- Group entities (see military example).
then the representation can avoid what I’ve previously
called the “lineage trap” and become applicable to those other types of
micro-history that are currently struggling for any type of comprehensive
representation. This includes One-Name Studies, One-Place Studies, personal
historians, house histories, place histories (as opposed to the history of a
place in terms of its people), and organisational histories. That lineage trap
occurs when we artificially confine the
historical representation to the events of a family. In reality, genealogical
research needs that freedom, and yet there is little cost involved in
addressing the greater generality. It is mainly a matter of adopting the right
perspective when defining the data model.
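As a sketch of that perspective (again with illustrative names of my own), making Place and Group top-level entities, referenced by id, means a dataset can describe a place history or a regimental history with no lineage in it at all:

```python
# Hypothetical sketch: Place and Group as top-level entities that events
# reference, so lineage becomes optional rather than mandatory.
from dataclasses import dataclass, field

@dataclass
class Place:
    id: str
    name: str

@dataclass
class Group:                       # e.g. a regiment, crew, or household
    id: str
    name: str
    member_ids: list = field(default_factory=list)

@dataclass
class Event:
    id: str
    type: str
    date: str
    place_id: str                  # a reference, not embedded text
    participant_ids: list = field(default_factory=list)

village = Place("P1", "Adare, Co. Limerick")
regiment = Group("G1", "2nd Battalion", member_ids=["I1", "I2"])

# A place-history event with a group participant and no family links at all:
fire = Event("E1", "Fire", "1891", place_id="P1", participant_ids=["G1"])
print(fire.place_id, fire.participant_ids)
```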
Although this sounds excellent, and it could unify the
disparate parts of the historical-research community, there are a couple of
obstacles to the approach. The first is the effect on existing products — both
desktop and online — and whether their creators would want to increase the
scope to this extent. Imagine, for instance, a very simple tree-building
product. Let’s suppose that it merely holds the details of people, including
the dates of their vital events (birth, marriage, and death), and links to their
offspring. Exporting to a representation with an historical focus should not be
a problem since the necessary parts for depicting lineage would be a subset
within it. However, what should it do if it tried to import a contribution
generated by a more powerful product; one that included the elements mentioned
above? Indeed, the lineage part might be empty if it represented, say, the
history of a place. Unless that data representation was accepted by a very
significant part of the genealogical software world then the extra effort
involved might be dismissed as unjustifiable.
Another obstacle is that GEDCOM, in conjunction with tree-orientated
products, has effectively compromised the structure of existing
data.[5]
Evidence will have been associated directly with people, possibly with duplication
being necessary, rather than with the relevant events. This compromise may be
beyond refactoring, meaning that the data would have to be exported as-is rather
than restructured to take advantage of a better representation.
So what about a two-tier standard? Many standards have
different levels of scope to which conformity can be targeted, and I have recently
criticised the ISO 8601 date standard for not doing this. One basic idea would
be to revamp GEDCOM in order to fix the known issues, and to invest some
structural changes in order to make it conceptually closer to a true historical
standard. If an historical standard were defined now then GEDCOM would be
incompatible with it since it does not have an event-based structure (in
addition to its lineage). Creating a “GEDCOM+” would effectively build a bridge
between the existing, very old, GEDCOM and that comprehensive standard for
historical data since the new data models would be aligned. It would only be
the scope of the data models that would differ.
Being able to factor-out enough of the data model of an
historical standard, and then apply it to the lineage-linked GEDCOM in order to
give it proper representations for its vital events, would be an interesting
challenge. Event-linked variations of GEDCOM have been proposed before, but the
important goal here would be ensuring a smooth path from the GEDCOM+ to the
full historical standard, and ensuring that the latter is a superset of the
former.
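A speculative sketch of what that factoring-out might look like (the function and field names are my own, not a proposal): re-expressing a single-person GEDCOM vital event as a standalone event with a participant role, the form a superset historical standard could retain unchanged.

```python
# Hypothetical upgrade-path sketch: fold a per-person GEDCOM tag into a
# shared event entity so that the "GEDCOM+" model remains a strict subset
# of a fuller historical model.
def to_shared_event(person_id, tag, date, place, event_id):
    """Re-express a single-person GEDCOM event (e.g. BIRT) as a standalone
    event with a participant role."""
    role = {"BIRT": "child", "DEAT": "deceased"}.get(tag, "principal")
    return {
        "id": event_id,
        "type": tag,
        "date": date,
        "place": place,
        # Further roles (mother, informant, witness, ...) can be added
        # later without restructuring the record.
        "roles": [(person_id, role)],
    }

ev = to_shared_event("I1", "BIRT", "12 Mar 1840", "Dublin", "E1")
print(ev["roles"])   # [('I1', 'child')]
```

The key property is that the lineage-linked data maps losslessly into the richer structure, while the richer structure admits participants the old form could not hold.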
This idea was actually kicked around within FHISO since
there are some interesting advantages. Although FHISO had very significant
support within the industry, it still needed to establish its credibility.
Working on a GEDCOM+ could have provided that opportunity, and allowed it to
deliver a proper, documented standard in a realistic timeframe while the bigger
standard could be developed to a longer timescale. Providing a newer GEDCOM
would have also made more friends in the industry, and in the general
community, since the incompatibilities between products are still a major gripe
of genealogists and family historians at all levels.
Things never run that smoothly, though. GEDCOM is a
proprietary format and the name is owned by The Church of Jesus Christ of
Latter-day Saints (LDS Church). While several developers have created
variations of it, having their development work condoned, and being given
permission to use the same name, are very unlikely scenarios. FamilySearch, the
genealogical arm of the Church, was one of the most visible absentees from the
FHISO membership. Although the developers of GEDCOM-X were supportive, and
understood that what they were doing and what FHISO was doing were
fundamentally different, the Church as a whole is very complex and that same level
of understanding was not pervasive. Had FamilySearch been a member then
that work could have been done with their involvement. There might still have
been a political issue with it appearing that FHISO was working for
FamilySearch but that would have been surmountable. It would have been possible
to proceed anyway, and simply use a name other than GEDCOM, but then that might have introduced further fragmentation into
the industry, and we certainly don’t need that.
Development of any successful new standard has to consider
the upgrade path: the path by which
existing software can embrace the new standard, and by which existing data can
be migrated. If you believe, as I do, that the future needs more than a
representation of lineage, and that there’s no fundamental reason to exclude
other forms of micro-history, then that automatically introduces a gulf between
that new representation and GEDCOM. Bootstrapping a new standard by initially working
on an enhanced GEDCOM model is the best way to proceed. The data syntax of that
representation is irrelevant — it could use the old proprietary syntax if there
was benefit to existing vendors — but its data model would help to bridge the
old and new worlds.
[1] The issues are
health-related, and I have been assured that FHISO has not gone away — as some have
supposed — and that it will come back as a more organised and productive
organisation.
[2] While Louis’s
suggestions make mostly good sense, I am not citing the list as one that I
entirely agree with. In particular, suggestion 7 (“No Extensions”) needs a
note. I agree that schema extensions should be avoided, but certain types of
extension are essential for generality and would not limit conformity. See both
Extensibility
and Digital
Freedom.
[3] See “What is Genealogy?”, Blogger.com, Parallax
View, 1 May 2014 (http://parallax-viewpoint.blogspot.com/2014/05/what-is-genealogy.html).
[4] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html);
Also, the follow-up "Eventful Genealogy - Part II", 6 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy-part-ii.html).
[5] See “Evidence and Where to Stick It”, Blogger.com, Parallax View, 24 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/evidence-and-where-to-stick-it.html).