Parallax View ®: June 2014

Monday 30 June 2014

Citations for Online Trees

I want to use this post to expand on some comments that I recently made on two of James Tanner’s blog-posts: The Issue of Source Citations and The Challenge of Genealogical Complexity. The gist of my comments was that there are distinct endeavours in genealogy and that the requirements of source references are not the same for each of them.

There may be a spectrum of endeavours, as James himself suggested, but let’s begin with the premise that there are people who create online family trees for the sole purpose of their own enjoyment, and of sharing with friends and family. I know this to be true through the class I used to give, although I can’t reliably comment on whether they constitute a majority or a minority. Distinct from these people are those who want to conduct rigorous research and generate sound, written conclusions.

On the face of it, this sounds like a gross generalisation. I’m sure there are thorough researchers who also create online family trees, although I would argue that the medium of an online family tree is, by its very nature, a restricted format, so how could all the details of that research be adequately accommodated there?

The issue of source citations is a differentiator that has been picked on repeatedly here. There are people in the latter group who bemoan the endeavours of some people in the former group because they have no source references; no citations of any kind. This may be because their Web site offers limited functionality, or because those people are simply copy-and-pasting from other trees, or because their trees exist purely for “cousin bait”[1], or because they don’t really appreciate what citations are and what they can offer.

If you’re from a literary or academic background then you will be aware of citations, and of their essentiality, but there are many people who are unfamiliar with them. Indeed, some may immediately think of traffic citations, and consider them to be some sort of punishment for not doing their family tree correctly.

But a citation is a citation and we should all be including them, right? Well, not necessarily. The goals may be different for those different endeavours, and there are certainly more ways of representing them than with our traditional printed forms.

In the early days of the Graphical User Interface (GUI) — the computer display that we all now take for granted — one new software product, which I wasn’t involved with, wanted to help executives to get more involved with their PCs. You see, in those days, only the typing pool used keyboards. On the basis that an executive could at least master the use of a mouse, they painted a graphical display on the screen from which he/she could select characters with their mouse. Yes, you guessed it, … they painted a picture of a QWERTY keyboard and it did nothing to help those executives; they were just as bamboozled as before. They might have been better painting a simple A-Z and 0-9 sequential list, but a keyboard is a keyboard, no matter what the requirements are.

Let’s look at the goals normally associated with citing our sources:

Intellectual honesty (not claiming prior work as your own).
Allowing your sources to be independently assessed by the reader.
Allowing the strength of your information sources to be assessed.

These are fine if you’re writing an article for a journal, or a book, or a research report, but are they all relevant to a hobbyist who just wants to share with their family? Well, if that was the extent of their sharing then I would say that none of them are particularly important to that hobbyist. Much more important is simply being able to recall where they got an item of evidence from. We’ve all been there, even the most experienced of us, when something doesn’t quite fit, and we realise that we have an error. If it wasn’t for our citations then we’d be in a desperate state trying to recall how we got into that position. This is something that many beginners learn through unhappy experience, and it reinforces the notion that citing our sources is important to all of us.

The essential difference for this type of hobbyist is that they would need something that linked back directly to the relevant online source record. They would not be interested in the nuances of a properly punctuated reference-note citation, or the addition of analytical notes. They would just want some note that could be clicked on, and followed, in order to see the relevant record. The format of this electronic citation (more correctly, an electronic bookmark) may not follow the conventions of our reference-note, source-list, and source-label citations, although it would be entirely feasible to generate one from it if necessary.

So, maybe the best way to educate beginners about the need for citations would be to make it click-easy to add an electronic citation when using each online record to compile information into their tree. Ancestry’s member trees are examples that already use a similar electronic mechanism. Although they can add references to sources hosted by Ancestry, or narrative and images provided by the user, they are less able to reference online records hosted by other providers. This is a problem because we use so many online sources, and yet some of them are desperately bad at providing associated citations, whether electronic or printed. This wouldn’t require much of a standard since a basic URL could do most of it, although the ContextObject employed by the OpenURL would make it more manageable.
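
As a minimal sketch of what such an electronic citation might hold, assuming nothing more than a URL plus a few descriptive fields, the following shows a bookmark from which a rough printed-style note can be generated when needed. The class name, field names, and the layout of the generated note are all hypothetical; this is not Ancestry's mechanism, nor the OpenURL ContextObject, merely an illustration of the principle.

```python
from dataclasses import dataclass
from datetime import date

# Month abbreviations spelled out so the output does not depend on locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


@dataclass(frozen=True)
class ElectronicCitation:
    """A clickable bookmark to an online record (hypothetical structure)."""
    url: str        # direct link back to the online source record
    title: str      # human-readable label for the record
    provider: str   # organisation hosting the record
    accessed: date  # when the record was consulted

    def as_reference_note(self) -> str:
        """Derive a rough printed-style note from the bookmark.

        The layout here is illustrative only and follows no style guide.
        """
        when = (f"{self.accessed.day} "
                f"{_MONTHS[self.accessed.month - 1]} {self.accessed.year}")
        return (f'"{self.title}", {self.provider} '
                f"({self.url} : accessed {when}).")


# example.org is a placeholder address, not a real record provider.
bookmark = ElectronicCitation(
    url="https://example.org/records/123",
    title="1881 census page",
    provider="Example Records",
    accessed=date(2014, 6, 30),
)
note = bookmark.as_reference_note()
```

The point of the sketch is the direction of travel: the bookmark is the primary form for the hobbyist, and the printed form is derived from it on request rather than typed by hand.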

What happens, though, if the aforementioned sharing goes beyond mere family and friends? Many user-owned online trees have public visibility, and unified online trees must have public visibility. Are simple electronic citations then enough? In the latter case, they would clearly be inadequate for a user’s peers to ascertain the accuracy of their claims. In principle, that sort of collaborative environment isn’t too far from the published works of research listed earlier; the ones where traditional citations are expected. The subtlety of this point may disguise its significance: extrapolating the mechanism of a user-owned tree to create a mechanism for a unified tree cannot work. Those electronic citations would be good at connecting to an online source, and being able to visit it. A traditional reference-note citation would be able to describe many more source types, including multiple sources in the same note, and be able to add analytical notes on the availability, reliability, or objectivity of the information. Although such a reference-note citation could be compiled into a digital representation, it’s the positioning of the source reference that causes the problems.

Citing each source when making a reasoned stepwise argument that leads to a specific conclusion is something that requires narrative. It is not the same as plainly citing a death certificate as evidence of a date of death, or a census page as evidence of a birth date. Online trees do not accommodate anything like a narrative report, but they could. Back in What to Share, and How - Part II, I considered an approach to collaboration where the unit of sharing included narrative, citations, lineage, timelines, and geography.

[1] I fall into this category myself. My fully sourced and documented data is held locally, and my online tree exists purely to attract distant relatives in order to share with them. This implies that an online tree should not be judged by its number of source references but this is an increasingly difficult stance to take.

Wednesday 11 June 2014

Bootstrapping a Data Standard

There haven’t been many discussions about data standards for a while now. What is the practicality of their development? Are there conflicting requirements? What would be a good way to proceed?

In 2012, FHISO (Family History Information Standards Organisation) were created to develop the first industry standard for the representation of genealogical data. This was to be a collaborative project, incorporating interested parties from across the wider community, and result in a freely available, open standard with international applicability. Although they elected their first Board in 2013, they have recently gone “radio silent” until they have sorted out some serious logistical problems with key Board members.[1] What lies ahead for them, though? In The Commercial Realities of Data Standards, I discussed some of the conflicting requirements of a data standard, and also differentiated the terms data model and file format.

To many people, the issue is simply one of developing an agreed file format, and so there shouldn’t be much difficulty. In reality, it is the data model — the abstract representation of the data — that requires the effort to define. Once defined, any number of file formats can be used to represent it physically.
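
The distinction can be sketched in a few lines: one abstract model, two physical representations. The `Person` class and its fields are assumptions made purely for illustration, and the second serialiser only imitates the general shape of GEDCOM lines (INDI, NAME, BIRT, DATE) rather than claiming conformance to it.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Person:
    """A deliberately tiny data model (hypothetical, for illustration)."""
    ident: str
    given: str
    surname: str
    birth_date: str  # kept as plain text to keep the sketch short


def to_json(person: Person) -> str:
    """One possible file format for the model: JSON."""
    return json.dumps(asdict(person), sort_keys=True)


def to_gedcom_like(person: Person) -> str:
    """Another possible file format: GEDCOM-style numbered lines."""
    lines = [
        f"0 @{person.ident}@ INDI",
        f"1 NAME {person.given} /{person.surname}/",
        "1 BIRT",
        f"2 DATE {person.birth_date}",
    ]
    return "\n".join(lines)


john = Person(ident="I1", given="John", surname="Smith",
              birth_date="1 JAN 1900")
as_json = to_json(john)
as_gedcom = to_gedcom_like(john)
```

Both outputs carry exactly the same information because both are driven by the same model; swapping the syntax costs a few lines, whereas changing the model would affect everything.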

So why don’t we run with GEDCOM? It was never developed as a standard but it has achieved the status of a de facto standard. Well, we all know that it has serious faults and limitations, and some example analyses can be found on the BetterGEDCOM wiki at Shortcomings of GEDCOM and on Louis Kessler’s blog at Nine Necessities in a GEDCOM Replacement [2]. It also has a weak specification resulting in ambiguous implementation, and a proprietary syntax that would be at odds with the modern world of the Semantic Web. What it has in its favour is widespread acceptance, although this is weakened slightly by selective implementation amongst the various products.

There is some mileage in this possibility but I will come to it later. Almost without exception, the criticisms of GEDCOM are “detail changes”, such as changing the representation of a place, or a person, or a date, or a source citation. By far the biggest issues, though, are in terms of “structural changes”. You see, GEDCOM is a lineage-linked format designed specifically to represent biological lineage. No matter how you interpret the term genealogy[3], this obviously limits the scope of that representation. Any representation of historical data must support multi-person events as a core entity type[4], and this is an absolute necessity for the representation of family history, but GEDCOM is sadly lumbered with mere single-person events.
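
To make the structural difference concrete, here is a minimal sketch of an event as a top-level entity that several people share through named roles, rather than a fact hung off one person. The class names, role names, and sample people are hypothetical and are not drawn from any existing standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Person:
    ident: str
    name: str


@dataclass
class Event:
    """A top-level event shared by any number of people (hypothetical)."""
    kind: str
    when: str
    # Maps a role name (e.g. "bride") to the idents of people in that role.
    roles: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, role: str, person: Person) -> None:
        self.roles.setdefault(role, []).append(person.ident)

    def participants(self) -> Set[str]:
        return {ident for idents in self.roles.values() for ident in idents}


def events_for(person: Person, events: List[Event]) -> List[Event]:
    """Find every event a person took part in, whatever their role."""
    return [e for e in events if person.ident in e.participants()]


groom = Person("P1", "John Smith")
bride = Person("P2", "Mary Jones")
witness = Person("P3", "Ann Brown")

marriage = Event(kind="marriage", when="1900-06-09")
marriage.add("groom", groom)
marriage.add("bride", bride)
marriage.add("witness", witness)

witnessed = events_for(witness, [marriage])
```

The witness is connected to the marriage without being related to the couple, and any evidence for the marriage attaches once, to the event, instead of being duplicated on each person.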

However, with the addition of some other features, such as:

Places as top-level entities, and a possible focus for history.
Structured narrative (i.e. incorporating mark-up).
Group entities (see military example).

then the representation can avoid what I’ve previously called the “lineage trap” and become applicable to those other types of micro-history that are currently struggling for any type of comprehensive representation. This includes One-Name Studies, One-Place Studies, personal historians, house histories, place histories (as opposed to the history of a place in terms of its people), and organisational histories. That lineage trap occurs when we artificially confine the historical representation to the events of a family. In reality, genealogical research needs that freedom, and yet there is little cost involved in addressing the greater generality. It is mainly a matter of adopting the right perspective when defining the data model.

Although this sounds excellent, and it could unify the disparate parts of the historical-research community, there are a couple of obstacles to the approach. The first is the effect on existing products (both desktop and online) and whether their creators would want to increase the scope to this extent. Imagine, for instance, a very simple tree-building product. Let’s suppose that it merely holds the details of people, including the dates of their vital events (birth, marriage, and death), and links to their offspring. Exporting to a representation with an historical focus should not be a problem since the necessary parts for depicting lineage would be a subset within it. However, what should it do if it tried to import a contribution generated by a more powerful product; one that included the elements mentioned above? Indeed, the lineage part might be empty if it represented, say, the history of a place. Unless that data representation was accepted by a very significant part of the genealogical software world then the extra effort involved might be dismissed as unjustifiable.

Another obstacle is that GEDCOM, in conjunction with tree-orientated products, has effectively compromised the structure of existing data.[5] Evidence will have been associated directly with people, possibly with duplication being necessary, rather than with the relevant events. This compromise may be beyond refactoring, meaning that the data would have to be exported as-is rather than restructured to take advantage of a better representation.

So what about a two-tier standard? Many standards have different levels of scope to which conformity can be targeted, and I have recently criticised the ISO 8601 date standard for not doing this. One basic idea would be to revamp GEDCOM in order to fix the known issues, and to invest some structural changes in order to make it conceptually closer to a true historical standard. If an historical standard were defined now then GEDCOM would be incompatible with it since it does not have an event-based structure (in addition to its lineage). Creating a “GEDCOM+” would effectively build a bridge between the existing, very old, GEDCOM and that comprehensive standard for historical data since the new data models would be aligned. It would only be the scope of the data models that would differ.

Being able to factor-out enough of the data model of an historical standard, and then apply it to the lineage-linked GEDCOM in order to give it proper representations for its vital events, would be an interesting challenge. Event-linked variations of GEDCOM have been proposed before, but the important goal here would be ensuring a smooth path from the GEDCOM+ to the full historical standard, and ensuring that the latter is a superset of the former.

This idea was actually kicked around within FHISO since there are some interesting advantages. Although FHISO had very significant support within the industry, it still needed to establish its credibility. Working on a GEDCOM+ could have provided that opportunity, and allowed it to deliver a proper, documented standard in a realistic timeframe while the bigger standard could be developed to a longer timescale. Providing a newer GEDCOM would have also made more friends in the industry, and in the general community, since the incompatibilities between products are still a major gripe of genealogists and family historians at all levels.

Things never run that smoothly, though. GEDCOM is a proprietary format and the name is owned by The Church of Jesus Christ of Latter-day Saints (LDS Church). While several developers have created variations of it, having their development work condoned, and being given permission to use the same name, are very unlikely scenarios. FamilySearch, the genealogical arm of the Church, were one of the most visible absences from the FHISO membership. Although the developers of GEDCOM-X were supportive, and understood that what they were doing and what FHISO were doing was fundamentally different, the Church as a whole is very complex and that same level of understanding was not pervasive. Had FamilySearch been a member then that work could have been done with their involvement. There might still have been a political issue with it appearing that FHISO was working for FamilySearch but that would have been surmountable. It would have been possible to proceed anyway, and simply use a name other than GEDCOM, but then that might have introduced further fragmentation into the industry, and we certainly don’t need that.

Development of any successful new standard has to consider the upgrade path; the path by which existing software can embrace the new standard, and by which existing data can be migrated. If you believe, as I do, that the future needs more than a representation of lineage, and that there’s no fundamental reason to exclude other forms of micro-history, then that automatically introduces a gulf between that new representation and GEDCOM. Bootstrapping a new standard by initially working on an enhanced GEDCOM model is the best way to proceed. The data syntax of that representation is irrelevant (it could use the old proprietary syntax if there was benefit to existing vendors) but its data model would help to bridge the old and new worlds.

[1] The issues are health-related, and I have been assured that FHISO has not gone away — as some have supposed — and that it will come back as a more organised and productive organisation.

[2] While Louis’s suggestions make mostly good sense, I am not citing the list as one that I entirely agree with. In particular, suggestion 7 (“No Extensions”) needs a note. I agree that schema extensions should be avoided, but certain types of extension are essential for generality and would not limit conformity. See both Extensibility and Digital Freedom.

[3] See “What is Genealogy?”, Blogger.com, Parallax View, 1 May 2014 (http://parallax-viewpoint.blogspot.com/2014/05/what-is-genealogy.html).

[4] See “Eventful Genealogy”, Blogger.com, Parallax View, 3 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy.html); Also, the follow-up "Eventful Genealogy - Part II", 6 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/eventful-genealogy-part-ii.html).

[5] See “Evidence and Where to Stick It”, Blogger.com, Parallax View, 24 Nov 2013 (http://parallax-viewpoint.blogspot.com/2013/11/evidence-and-where-to-stick-it.html)

Friday 6 June 2014

Is the ISO Date Standard Bad?

Most genealogists will have come across the ISO date standard. If not then I’ll introduce it to you, and explain why it’s important to us. I want to question, though, whether it is bad for genealogists and for technology in general.

Most genealogical data will be concerned with dates rather than times. Although Time Zones (TZ) and Daylight Saving Time (DST) are usually applied to local clock times, they can also apply to local calendar dates. The importance of this to family historians is going to be slim at best but it needed to be said before we look at the ISO standard.

Having machine-readable copies of our dates is essential when software is applied to genealogical data, or to historical data in general. We take it for granted that databases store the dates of our vital events in some internal format that facilitates sorting, searching, and collation. With the growing amount of data appearing on the Internet, an international standard is also essential so that searches can be performed across disparate data without having to worry about which country it was created by, or which format it is represented in.

It is understandable that some people fear this conversion of data to a machine-readable representation, often citing that evidence may not be that clear. However, the verbatim evidential form complements a normalised digital version. Neither one supersedes the other, and they’re both essential for different reasons.

The ISO 8601 date standard[1] was conceived as an unambiguous way of storing and exchanging dates, times, and dates plus times in combination (hereinafter referred to as datetimes). Most people who are aware of it will immediately think of the YYYY-MM-DD numeric representation of dates, e.g. 2014-06-09 for 9th June 2014. A numeric representation is important because, believe it or not, there are countries who do not speak English ☺. This particular layout achieves two things: (i) it avoids the UK/US difference in the way we order our day and month fields, and (ii) it makes the representation textually sortable because the bigger units are at the head.
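
Both properties are easy to demonstrate. The sketch below uses only Python's standard library, and the sample dates are made up for the purpose.

```python
from datetime import datetime

# (i) The same text means different dates under UK and US conventions.
text = "06/09/2014"
uk_reading = datetime.strptime(text, "%d/%m/%Y").date()  # day first
us_reading = datetime.strptime(text, "%m/%d/%Y").date()  # month first

# (ii) A plain text sort of YYYY-MM-DD strings is also a chronological
# sort, because the biggest unit comes first and fields are fixed-width.
iso_dates = ["2014-06-09", "1999-12-31", "2014-01-15"]
iso_sorted = sorted(iso_dates)

# The same three dates written day-first do not sort chronologically.
uk_dates = ["09/06/2014", "31/12/1999", "15/01/2014"]
uk_sorted = sorted(uk_dates)
```

Here `uk_reading` is 6 September 2014 while `us_reading` is 9 June 2014, and the text sort of the day-first strings places 1999 last.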

Times are represented in the format hh:mm:ss, and when combined with dates to represent a datetime then the two parts are separated by a ‘T’ character, i.e. YYYY-MM-DDThh:mm:ss. The separating hyphen and colon characters may be omitted if data size is perceived as an issue. All of the date, time, and datetime representations allow truncation from the tail of the string in order to describe values of coarser granularity.[2] For instance, 20:12 (i.e. 8:12pm) or 2014-08 (i.e. August 2014).
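
A truncated date cannot be held in an ordinary date object without inventing the missing fields, so software has to keep track of which parts were actually supplied. The following is a minimal sketch, assuming only the hyphenated YYYY, YYYY-MM, and YYYY-MM-DD forms; the `PartialDate` type and `parse_partial_date` function are hypothetical names, and no range checking is attempted.

```python
import re
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    """A date whose trailing fields may be absent (hypothetical type)."""
    year: int
    month: Optional[int]
    day: Optional[int]


# Year, optionally followed by -MM, optionally followed by -DD.
_PATTERN = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_partial_date(text: str) -> PartialDate:
    """Parse YYYY, YYYY-MM or YYYY-MM-DD, recording what was omitted."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported date form: {text!r}")
    year, month, day = match.groups()
    return PartialDate(
        int(year),
        int(month) if month is not None else None,
        int(day) if day is not None else None,
    )


august = parse_partial_date("2014-08")
full = parse_partial_date("2014-06-09")
```

Keeping the absent fields as `None`, rather than defaulting them to 1, preserves the difference between "August 2014" and "1 August 2014".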

So far, this sounds good, right? If the standard had rounded the specification off about now then it would have been great. Unfortunately, there’s a lot of unrelated stuff dumped in there, and an undue level of “flexibility”.

A decimal fraction can be applied to the seconds field, or the minutes or hours fields in one of the truncated forms, but the number of decimal places is “by mutual agreement” between sender and receiver. For instance 12.34 (hh.hh format), 21:10.217 (hh:mm.mmm format), or 23:59:59.9 (hh:mm:ss.s format).

The Gregorian calendar was introduced during 1582 but the standard allows proleptic application (i.e. to dates before the calendar was defined) “by mutual agreement” between sender and receiver. The date may also be extended to include more digits using a +YYYYYY… representation, although the number of digits is “by mutual agreement”.

The standard supports week dates which use week-numbers and days-of-the-week rather than month-numbers and days-of-the-month: YYYY-Www-D where the ‘W’ is a fixed designator. For instance: 2014-W10-2, meaning the second day (Tuesday) of the 10th week of 2014.
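
Python's standard library happens to understand ISO week numbering, so the example can be checked directly. This sketch assumes Python 3.8 or later for `date.fromisocalendar`; the `to_week_date` helper is a name of my own.

```python
from datetime import date

# Convert the week date 2014-W10-2 to an ordinary calendar date.
tuesday = date.fromisocalendar(2014, 10, 2)


def to_week_date(d: date) -> str:
    """Format a calendar date in the YYYY-Www-D week-date form."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return f"{iso_year:04d}-W{iso_week:02d}-{iso_weekday}"


round_trip = to_week_date(tuesday)

# The week-numbering year can differ from the calendar year near
# New Year: Monday 30 December 2013 belongs to week 1 of 2014.
year_edge = to_week_date(date(2013, 12, 30))
```

The last line shows one reason week dates do not mix well with ordinary dates: the year field itself may disagree.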

The standard supports ordinal dates which use days-of-the-year: YYYY-DDD. Although both week dates and ordinal dates are separately sortable, that capability breaks down if they are mixed with each other or with basic dates.
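
That breakdown is simple to show. The sketch below formats an ordinal date with a helper of my own naming and then sorts a mixture of the three representations as plain text.

```python
from datetime import date


def to_ordinal_date(d: date) -> str:
    """Format a calendar date in the YYYY-DDD ordinal form."""
    return f"{d.year:04d}-{d.timetuple().tm_yday:03d}"


march_4 = to_ordinal_date(date(2014, 3, 4))  # day 63 of the year

# Three dates in 2014, each in a different ISO 8601 representation.
mixed = [
    "2014-W30-1",  # week date: Monday 21 July
    "2014-06-09",  # calendar date: 9 June
    march_4,       # ordinal date: 4 March
]
text_order = sorted(mixed)

# The true chronological order, written out by hand for comparison.
chronological = [march_4, "2014-06-09", "2014-W30-1"]
```

The text sort places 9 June ahead of 4 March because the hyphen in "06-09" compares lower than the "3" in "063", which has nothing to do with the dates themselves.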

The standard supports an optional UTC[3] designator of ‘Z’ (i.e. “Zulu time”), or a UTC offset (±hh or ±hh:mm), appended to a time.
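
A short sketch shows why an offset can matter to a calendar date as well as to a clock time. It uses Python's standard `datetime.fromisoformat` with a numeric offset; I have avoided the ‘Z’ designator in the input because older Python versions do not accept it there. The sample time is invented.

```python
from datetime import datetime, timezone

# Half past eleven at night, in a place five hours behind UTC.
local = datetime.fromisoformat("2014-06-09T23:30:00-05:00")

# The same instant expressed in UTC falls on the following day.
utc = local.astimezone(timezone.utc)

local_day = local.date().isoformat()
utc_day = utc.date().isoformat()
```

Both values describe one instant, yet `local_day` is 2014-06-09 and `utc_day` is 2014-06-10, which is the slim but real relevance of time zones to dates mentioned earlier.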

The standard supports time intervals using one of the forms: start/end (i.e. YYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss), start/Pduration (e.g. YYYY-MM-DDThh:mm:ss/PYYYY-MM-DDThh:mm:ss), or Pduration/end (i.e. PYYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss).

The standard supports recurring time intervals by prefixing “Rnn/” before one of the aforementioned time-interval representations, where the ‘nn’ is the repeat count. The standard does not stipulate the number of digits in this count.

Right, so you’re now aware of the complexity of this standard. It’s not just about a standard representation of a date and/or time. The standard was originally designed to replace older standards on numeric date/time representations (ISO 2014), week dates (ISO 2015), ordinal dates (ISO 2711), and a number of time-related standards. It was revised in 2000, and again in 2004, partly because the complexity had led to ambiguities.

This complexity is bad because most applications are only interested in specific parts of the standard — usually the basic representations of dates and/or times. I know of no software that implements the entire standard, and that means a statement such as “ISO 8601 compliant” is meaningless. Which parts has it implemented? Which options has it selected?

There are also many instances of the clause “by mutual agreement”:

In the acceptance of year values from 0000 to 1582.
In the acceptance of more than four digits in the year field.
In the decimal places of a fractional time.
In the omission of the separating ‘T’ between a date and a time in a datetime.
In the full range of valid terms in a time interval.

The standard may be acting as a guide in these situations but “by mutual agreement” is basically the contract established between two pieces of software when there is no standard. In particular, on the Internet there is no specific receiver with which mutual agreement can be formed and so that degree of flexibility is inappropriate there.

The W3C discussion note at W3CDTF examined the need for a subset of the overloaded ISO 8601 standard that could be used on the Internet. Only the basic representational part of the standard was used, and that was enough to satisfy the requirements of data exchange.

The US Library of Congress Extended Date Time Format (EDTF) actually subsets the various features defined by ISO 8601 (e.g. “Level 0”) so that implementation can be to a selected level for which an agreed designation exists. The EDTF sacrifices some of the ISO 8601 flexibility but also extends it in order to address issues such as uncertain date components (e.g. you know the year and the day but are unsure of the month).

So if the ISO standard were similarly subsetted then would that be the answer? It would certainly help but the standard is deficient in a number of other ways.

There is no support for quarter dates. For instance, representing the period January through March as Quarter 1. This is essential for certain records such as the index of vital events compiled by the GRO of England and Wales. Although the local registrations will involve specific day-based dates, the index is compiled on a quarterly basis. Citing an entry therefore needs a way of representing the relevant quarter. This is also another instance of the difference between granularity and imprecision already mentioned in note [2]. I can only assume that this was an oversight of the standard since a format of YYYY-Qq (e.g. 1956-Q2) is consistent with the standard as it exists now, and it follows the precedent already set by week dates.
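
Supporting such a format would not be onerous. As a minimal sketch, assuming the YYYY-Qq layout suggested above (which is my proposal and not part of ISO 8601:2004), a quarter can be expanded into the range of days that it covers. The `quarter_range` function is a hypothetical name.

```python
import re
from datetime import date, timedelta
from typing import Tuple

_QUARTER = re.compile(r"^(\d{4})-Q([1-4])$")


def quarter_range(text: str) -> Tuple[date, date]:
    """Return the first and last day covered by a YYYY-Qq value."""
    match = _QUARTER.match(text)
    if match is None:
        raise ValueError(f"not a quarter date: {text!r}")
    year, quarter = int(match.group(1)), int(match.group(2))
    start = date(year, 3 * (quarter - 1) + 1, 1)
    if quarter == 4:
        end = date(year, 12, 31)
    else:
        # The day before the first day of the next quarter.
        end = date(year, 3 * quarter + 1, 1) - timedelta(days=1)
    return start, end


start, end = quarter_range("1956-Q2")
```

The quarter itself remains the stored value; the range is only derived when a search or comparison needs it, so the granularity of the index entry is not misrepresented as a specific day.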

Perhaps the most lacking support that’s relevant to historical data is support for non-Gregorian calendars. There are many other calendar systems in the world (both ancient and modern) and these may be based on solar cycles, lunar cycles, astronomical cycles, or regnal years. I am aware of no digital representations of dates from any of these calendars and this has serious repercussions. The prevailing notion amongst developers of software technology, and related standards, is that they can all be converted to the Gregorian calendar and represented using (some part of) ISO 8601. This breaks down in practice, though, because exact conversions are not always possible. Indeed, the conversion may be dependent upon other factors such as the precise location of the event. At the very least, the conversion has to introduce some imprecision into the Gregorian equivalent, but converting such a date prematurely will set in stone an association that may change as new evidence or better research becomes available. What is needed is a general scheme that can represent dates from different calendar systems using a similar numerical approach to the Gregorian case. Any conversion would then be done on-the-fly, if and when necessary, without breaking a golden rule by distorting the evidence to fit the technology. More on this another time though.
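
The principle of storing the date as recorded, and converting only on demand, can be sketched as follows. This is not the general scheme alluded to above; it is a minimal illustration under stated assumptions: the calendar labels and the `CalendarDate` type are hypothetical, only the Julian calendar is handled, and that is deliberately the easy case where an exact day-for-day conversion exists (via the Julian Day Number). Issues such as differing year-start conventions are ignored.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CalendarDate:
    """A date kept in its original calendar (hypothetical structure)."""
    calendar: str  # "julian" or "gregorian"; labels are illustrative
    year: int
    month: int
    day: int


def _julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a Julian-calendar date (integer arithmetic)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def to_gregorian(cd: CalendarDate) -> date:
    """Convert on demand; the stored value itself is never altered."""
    if cd.calendar == "gregorian":
        return date(cd.year, cd.month, cd.day)
    if cd.calendar == "julian":
        jdn = _julian_to_jdn(cd.year, cd.month, cd.day)
        # Python's day 1 (0001-01-01, proleptic Gregorian) is JDN 1721426.
        return date.fromordinal(jdn - 1721425)
    raise ValueError(f"no conversion available for {cd.calendar!r}")


as_recorded = CalendarDate("julian", 1582, 10, 4)
equivalent = to_gregorian(as_recorded)
```

For calendars where no exact conversion exists, the function would have to return a range, or decline, rather than a single day; the stored `CalendarDate` remains faithful to the evidence either way.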

[1] Data elements and interchange formats — Information interchange — Representation of dates and times, International Standard, ISO 8601:2004(E), 3rd ed. 1 Dec 2004; online copies obtained from http://dotat.at/tmp/ISO_8601-2004_E.pdf (accessed 3 Jun 2014).

[2] The standard describes this as “reduced accuracy”, but there's a difference between imprecision and granularity in this field. Saying that a photo was taken in 1942/1943 is a case of imprecision but when talking about '19th century newspapers' then that's a case of granularity.

[3] UTC stands for Coordinated Universal Time (http://en.wikipedia.org/wiki/Coordinated_Universal_Time). For most intents and purposes, it can be considered to be the same as GMT (Greenwich Mean Time), or “Zulu time”.