Friday 6 June 2014

Is the ISO Date Standard Bad?

Most genealogists will have come across the ISO date standard. If not then I’ll introduce it to you, and explain why it’s important to us. I want to question, though, whether it is bad for genealogists and for technology in general.



Most genealogical data will be concerned with dates rather than times. Although Time Zones (TZ) and Daylight Saving Time (DST) are usually applied to local clock times, they can also apply to local calendar dates. The importance of this to family historians is going to be slim at best but it needed to be said before we look at the ISO standard.

Having machine-readable copies of our dates is essential when software is applied to genealogical data, or to historical data in general. We take it for granted that databases store the dates of our vital events in some internal format that facilitates sorting, searching, and collation. With the growing amount of data appearing on the Internet, an international standard is also essential so that searches can be performed across disparate data without having to worry about which country created it, or which format it is represented in.

It is understandable that some people fear this conversion of data to a machine-readable representation, often pointing out that the evidence may not be clear enough to support one. However, the verbatim evidential form complements a normalised digital version. Neither one supersedes the other, and they’re both essential for different reasons.

The ISO 8601 date standard[1] was conceived as an unambiguous way of storing and exchanging dates, times, and dates plus times in combination — hereinafter referred to as datetimes. Most people who are aware of it will immediately think of the YYYY-MM-DD numeric representation of dates, e.g. 2014-06-09 for 9th June 2014. A numeric representation is important because — believe it or not — there are countries who do not speak English ☺. This particular layout achieves two things: (i) it avoids the UK/US difference in the way we order our day and month fields, and (ii) it makes the representation textually sortable because the bigger units are at the head.
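The sortability claim is easy to check. The following is a minimal Python sketch, using only the standard library and some arbitrary sample dates of my own choosing, that compares a plain text sort of YYYY-MM-DD strings with a sort of the real date values, and then shows the same dates in UK day-first form failing the same test.

```python
from datetime import date

# Sample dates as ISO 8601 calendar-date strings, deliberately out of order.
iso_strings = ["2014-06-09", "1956-12-01", "2014-01-31", "1899-07-04"]

# Sort once as plain text and once as real date values.
text_sorted = sorted(iso_strings)
date_sorted = [d.isoformat() for d in sorted(date.fromisoformat(s) for s in iso_strings)]

# The same dates written day-first (UK style) for comparison.
uk_strings = [date.fromisoformat(s).strftime("%d/%m/%Y") for s in iso_strings]
uk_text_sorted = sorted(uk_strings)

print(text_sorted == date_sorted)  # text order matches chronological order
print(uk_text_sorted)              # day-first text order is not chronological
```

The text sort only works because every field is zero-padded to a fixed width and the largest unit comes first.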

Times are represented in the format hh:mm:ss, and when combined with a date to represent a datetime, the two parts are separated by a ‘T’ character, i.e. YYYY-MM-DDThh:mm:ss. The separating hyphen and colon characters may be omitted if data size is perceived as an issue. All of the date, time, and datetime representations allow truncation from the tail of the string in order to describe values of coarser granularity.[2] For instance, 20:12 (i.e. 8:12pm) or 2014-08 (i.e. August 2014).
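To make the truncation idea concrete, here is a small Python sketch that accepts a calendar date at year, month, or day granularity and keeps track of which fields were actually supplied. The names PartialDate and parse_partial_date are hypothetical, and the sketch handles only the hyphenated date forms, not times or the separator-free variants.

```python
import re
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    """A calendar date whose trailing fields may be absent (hypothetical type)."""
    year: int
    month: Optional[int]
    day: Optional[int]

    @property
    def granularity(self) -> str:
        if self.month is None:
            return "year"
        return "month" if self.day is None else "day"


# YYYY, YYYY-MM or YYYY-MM-DD; each later field requires the earlier ones.
_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_partial_date(text: str) -> PartialDate:
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised date form: {text!r}")
    year, month, day = (int(g) if g is not None else None for g in match.groups())
    # Coarse range checks only; days-per-month is not validated in this sketch.
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range: {text!r}")
    if day is not None and not 1 <= day <= 31:
        raise ValueError(f"day out of range: {text!r}")
    return PartialDate(year, month, day)


print(parse_partial_date("2014-08").granularity)
```

Keeping the missing fields as None, rather than defaulting them to 1, preserves the distinction between "August 2014" and "1st August 2014".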

So far, this sounds good, right? If the standard had rounded the specification off about now then it would have been great. Unfortunately, there’s a lot of unrelated stuff dumped in there, and an undue level of “flexibility”.

A decimal fraction can be applied to the seconds field, or the minutes or hours fields in one of the truncated forms, but the number of decimal places is “by mutual agreement” between sender and receiver. For instance 12.34 (hh.hh format), 21:10.217 (hh:mm.mmm format), or 23:59:59.9 (hh:mm:ss.s format).
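A receiver has to cope with the fraction landing on whichever field happens to be last. The following Python sketch, with a hypothetical function name, normalises the three example forms to seconds since midnight. It uses Decimal so that no precision is silently lost, and it also accepts a comma as the decimal sign, which the standard permits.

```python
from decimal import Decimal


def seconds_since_midnight(text: str) -> Decimal:
    """Convert hh[.h], hh:mm[.m] or hh:mm:ss[.s] to seconds since midnight."""
    parts = text.replace(",", ".").split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"not a recognised time form: {text!r}")
    # Only the last (smallest) field present may carry a fraction.
    if any("." in part for part in parts[:-1]):
        raise ValueError(f"fraction on a non-final field: {text!r}")
    weights = (Decimal(3600), Decimal(60), Decimal(1))
    return sum((Decimal(p) * w for p, w in zip(parts, weights)), Decimal(0))


for example in ("12.34", "21:10.217", "23:59:59.9"):
    print(example, "->", seconds_since_midnight(example))
```

Note that nothing in the string itself says how many decimal places to expect, which is exactly the part left to mutual agreement.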

The Gregorian calendar was introduced during 1582 but the standard allows proleptic application (i.e. to dates before the calendar was defined) “by mutual agreement” between sender and receiver. The date may also be extended to include more digits using a +YYYYYY… representation, although the number of digits is “by mutual agreement”.

The standard supports week dates which use week-numbers and days-of-the-week rather than month-numbers and days-of-the-month: YYYY-Www-D where the ‘W’ is a fixed designator. For instance: 2014-W10-2, meaning the second day (Tuesday) of the 10th week of 2014.
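Python's standard library already understands ISO week numbering, so the example can be checked directly. This is a minimal sketch assuming Python 3.8 or later (for date.fromisocalendar); the two helper names are hypothetical.

```python
import re
from datetime import date

_WEEK_DATE = re.compile(r"(\d{4})-W(\d{2})-([1-7])")


def week_date_to_calendar(text: str) -> date:
    """Convert a YYYY-Www-D week date to a calendar date."""
    match = _WEEK_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a week date: {text!r}")
    year, week, weekday = map(int, match.groups())
    return date.fromisocalendar(year, week, weekday)


def calendar_to_week_date(d: date) -> str:
    """Convert a calendar date to its YYYY-Www-D week date."""
    year, week, weekday = d.isocalendar()
    return f"{year:04d}-W{week:02d}-{weekday}"


print(week_date_to_calendar("2014-W10-2"))        # 2014-03-04
print(calendar_to_week_date(date(2013, 12, 30)))  # 2014-W01-1
```

The second call shows a wrinkle worth knowing: the year in a week date is the week-numbering year, so the last days of December can belong to week 1 of the following year.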

The standard supports ordinal dates which use days-of-the-year: YYYY-DDD. Although both week dates and ordinal dates are separately sortable, that capability breaks down if they are mixed with each other or with basic dates.
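The sorting breakdown can be demonstrated with three strings. The sketch below, assuming Python 3.8 or later and a hypothetical helper to_calendar_date, sorts a mixed list once as text and once by the calendar date each string denotes.

```python
from datetime import date, datetime


def to_calendar_date(text: str) -> date:
    """Interpret a calendar date, ordinal date, or week date (extended forms only)."""
    if "W" in text:                      # YYYY-Www-D
        year, week, weekday = text.split("-")
        return date.fromisocalendar(int(year), int(week[1:]), int(weekday))
    if len(text) == 8:                   # YYYY-DDD
        return datetime.strptime(text, "%Y-%j").date()
    return date.fromisoformat(text)      # YYYY-MM-DD


# 25 December, day 63 (4 March), and Monday of week 2 (6 January), all in 2014.
mixed = ["2014-12-25", "2014-063", "2014-W02-1"]

text_order = sorted(mixed)
chronological_order = sorted(mixed, key=to_calendar_date)

print(text_order)           # ['2014-063', '2014-12-25', '2014-W02-1']
print(chronological_order)  # ['2014-W02-1', '2014-063', '2014-12-25']
```

Any collection that may mix the three forms therefore has to be parsed before it can be ordered, which defeats much of the point of a textually sortable layout.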

The standard supports an optional UTC[3] designator of ‘Z’ (i.e. “Zulu time”), or a UTC offset (±hh or ±hh:mm), appended to a time.
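As an illustration of what the designator buys you, the following Python sketch parses a full datetime followed by ‘Z’, ±hh, or ±hh:mm into a timezone-aware value. The function name parse_zoned is hypothetical, and the sketch deliberately accepts only the complete YYYY-MM-DDThh:mm:ss form.

```python
import re
from datetime import datetime, timedelta, timezone

# A complete extended-format datetime followed by Z, ±hh or ±hh:mm.
_ZONED = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(Z|[+-]\d{2}(?::\d{2})?)"
)


def parse_zoned(text: str) -> datetime:
    match = _ZONED.fullmatch(text)
    if match is None:
        raise ValueError(f"not a zoned datetime: {text!r}")
    local = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    zone = match.group(2)
    if zone == "Z":
        offset = timedelta(0)
    else:
        sign = -1 if zone[0] == "-" else 1
        hours = int(zone[1:3])
        minutes = int(zone[4:6]) if len(zone) > 3 else 0
        offset = sign * timedelta(hours=hours, minutes=minutes)
    return local.replace(tzinfo=timezone(offset))


# The same instant written two ways compares equal once parsed.
print(parse_zoned("2014-06-09T20:12:00Z") == parse_zoned("2014-06-09T21:12:00+01"))
```

An offset records only the difference from UTC at that moment; it says nothing about the time zone or DST rules that produced it.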

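The standard supports time intervals using one of the forms: start/end (i.e. YYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss), start/Pduration (i.e. YYYY-MM-DDThh:mm:ss/PYYYY-MM-DDThh:mm:ss), or Pduration/end (i.e. PYYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss). The durations here are shown in the standard's alternative, field-aligned format; its primary duration format uses designators instead, as in PnYnMnDTnHnMnS (e.g. P1Y2M10DT2H30M).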
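The three interval forms all resolve to a start and an end. Here is a rough Python sketch of that resolution, with hypothetical function names. It makes two simplifying assumptions: durations use the designator notation (PnDTnHnMnS) rather than the field-aligned form, and only days, hours, minutes, and seconds are accepted, because years and months have no fixed length.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# Designator-style duration limited to fixed-length units (an assumption of this sketch).
_DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_duration(text: str) -> timedelta:
    match = _DURATION.fullmatch(text)
    if match is None or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    if not parts:
        raise ValueError(f"empty duration: {text!r}")
    return timedelta(**parts)


def parse_interval(text: str) -> Tuple[datetime, datetime]:
    """Resolve start/end, start/duration, or duration/end to (start, end)."""
    left, sep, right = text.partition("/")
    if not sep:
        raise ValueError(f"not an interval: {text!r}")
    if left.startswith("P"):
        end = datetime.fromisoformat(right)
        return end - parse_duration(left), end
    start = datetime.fromisoformat(left)
    if right.startswith("P"):
        return start, start + parse_duration(right)
    return start, datetime.fromisoformat(right)


print(parse_interval("2014-06-09T20:00:00/PT12H"))
```

Supporting years and months in a duration would need calendar arithmetic relative to the start or end, which is one reason implementations tend to diverge here.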

The standard supports recurring time intervals by prefixing “Rnn/” before one of the aforementioned time-interval representations, where the ‘nn’ is the repeat count. The standard does not stipulate the number of digits in this count.
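A recurring interval is really a compact instruction to generate a series. The Python sketch below expands one narrow case, a repeat count, a start datetime, and a whole number of days written as PnD, into explicit (start, end) pairs. The function name is hypothetical, and the restriction to day-based durations is an assumption made to keep the example short.

```python
import re
from datetime import datetime, timedelta
from typing import List, Tuple

# Rnn/start/PnD only; the count may have any number of digits.
_RECURRING = re.compile(r"R(?P<count>\d+)/(?P<start>[^/]+)/P(?P<days>\d+)D")


def expand_recurring(text: str) -> List[Tuple[datetime, datetime]]:
    match = _RECURRING.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported recurring interval: {text!r}")
    count = int(match["count"])
    start = datetime.fromisoformat(match["start"])
    step = timedelta(days=int(match["days"]))
    # Each repetition begins where the previous one ended.
    return [(start + i * step, start + (i + 1) * step) for i in range(count)]


for begin, end in expand_recurring("R3/2014-06-09T09:00:00/P7D"):
    print(begin.isoformat(), "to", end.isoformat())
```

Because the digit count is unspecified, "R3" and "R003" have to be treated as the same thing, which the integer conversion takes care of.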

Right, so you’re now aware of the complexity of this standard. It’s not just about a standard representation of a date and/or time. The standard was originally designed to replace older standards on numeric date/time representations (ISO 2014), week dates (ISO 2015), ordinal dates (ISO 2711), and a number of time-related standards. It was revised in 2000, and again in 2004, partly because the complexity had led to ambiguities.

This complexity is bad because most applications are only interested in specific parts of the standard — usually the basic representations of dates and/or times. I know of no software that implements the entire standard, and that means a statement such as “ISO 8601 compliant” is meaningless. Which parts has it implemented? Which options has it selected?

There are also many instances of the clause “by mutual agreement”:

  • In the acceptance of year values from 0000 to 1582.
  • In the acceptance of more than four digits in the year field.
  • In the decimal places of a fractional time.
  • In the omission of the separating ‘T’ between a date and a time in a datetime.
  • In the full range of valid terms in a time interval.

The standard may be acting as a guide in these situations but “by mutual agreement” is basically the contract established between two pieces of software when there is no standard. In particular, on the Internet there is no specific receiver with which mutual agreement can be formed and so that degree of flexibility is inappropriate there.

The W3C discussion note on date and time formats (W3CDTF) examined the need for a subset of the overloaded ISO 8601 standard that could be used on the Internet. Only the basic representational part of the standard was used, and that was enough to satisfy the requirements of data exchange.
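The W3CDTF profile is small enough to fit in a single pattern: six levels of granularity from YYYY down to fractional seconds, with a mandatory time zone designator (Z or ±hh:mm) whenever a time is present. The following Python sketch checks syntax only, not whether the field values are in range, and the function name is hypothetical.

```python
import re

# YYYY | YYYY-MM | YYYY-MM-DD | YYYY-MM-DDThh:mmTZD
#      | YYYY-MM-DDThh:mm:ssTZD | YYYY-MM-DDThh:mm:ss.sTZD
_W3CDTF = re.compile(
    r"\d{4}"
    r"(?:-\d{2}"
    r"(?:-\d{2}"
    r"(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)


def is_w3cdtf(text: str) -> bool:
    """Return True if text has the shape of one of the six W3CDTF forms."""
    return _W3CDTF.fullmatch(text) is not None


for candidate in ("1997-07", "1997-07-16T19:20+01:00", "1997-W10-2", "19970716"):
    print(candidate, is_w3cdtf(candidate))
```

Compare this with the earlier sections: no week dates, no ordinal dates, no intervals, no separator-free forms, and nothing left to mutual agreement except the number of fractional digits.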

The US Library of Congress Extended Date Time Format (EDTF) actually subsets the various features defined by ISO 8601 (e.g. “Level 0”) so that implementation can be to a selected level for which an agreed designation exists. The EDTF sacrifices some of the ISO 8601 flexibility but also extends it in order to address issues such as uncertain date components (e.g. you know the year and the day but are unsure of the month).

So if the ISO standard were similarly subsetted then would that be the answer? It would certainly help but the standard is deficient in a number of other ways.

There is no support for quarter dates. For instance, representing the period January through March as Quarter 1. This is essential for certain records such as the index of vital events compiled by the GRO of England and Wales. Although the local registrations will involve specific day-based dates, the index is compiled on a quarterly basis. Citing an entry therefore needs a way of representing the relevant quarter. This is also another instance of the difference between granularity and imprecision already mentioned in note [2]. I can only assume that this was an oversight of the standard since a format of YYYY-Qq (e.g. 1956-Q2) is consistent with the standard as it exists now, and it follows the precedent already set by week dates.
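To show how little would be needed, here is a Python sketch that reads the suggested YYYY-Qq form and returns the first and last day of the quarter. Bear in mind that this format is only the proposal made above, not part of ISO 8601, and that the function name is hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Tuple

# Proposed (non-standard) quarter-date form: YYYY-Qq with q in 1..4.
_QUARTER = re.compile(r"(\d{4})-Q([1-4])")


def quarter_bounds(text: str) -> Tuple[date, date]:
    """Return the first and last calendar day of the given quarter."""
    match = _QUARTER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a quarter date: {text!r}")
    year, quarter = int(match.group(1)), int(match.group(2))
    first_month = 3 * (quarter - 1) + 1
    start = date(year, first_month, 1)
    # The quarter ends the day before the next quarter begins.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, first_month + 3, 1)
    return start, next_start - timedelta(days=1)


print(quarter_bounds("1956-Q2"))
```

Like the other representations, quarter dates in this form sort correctly as text amongst themselves, e.g. 1956-Q1 before 1956-Q2 before 1957-Q1.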

Perhaps the most significant omission that’s relevant to historical data is support for non-Gregorian calendars. There are many other calendar systems in the world, both ancient and modern, and these may be based on solar cycles, lunar cycles, astronomical cycles, or regnal years. I am aware of no digital representations of dates from any of these calendars and this has serious repercussions. The prevailing notion amongst developers of software technology, and related standards, is that they can all be converted to the Gregorian calendar and represented using some part of ISO 8601. This breaks down in practice, though, because exact conversions are not always possible. Indeed, the conversion may be dependent upon other factors such as the precise location of the event. At the very least, the conversion has to introduce some imprecision into the Gregorian equivalent, but converting such a date prematurely will set in stone an association that may change as new evidence or better research becomes available. What is needed is a general scheme that can represent dates from different calendar systems using a similar numerical approach to the Gregorian case. Any conversion would then be done on-the-fly, if and when necessary, without breaking a golden rule by distorting the evidence to fit the technology. More on this another time though.
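Purely as an illustration of the store-as-recorded, convert-on-demand idea, the following Python sketch tags each date with its calendar and converts only when asked. Everything about it is an assumption: the CalendarDate type, the "calendar:YYYY-MM-DD" text form, and the choice of the Julian calendar, which is used here only because its conversion is simple arithmetic (via the Julian Day Number) and so sidesteps the harder cases described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CalendarDate:
    """A date kept in its original calendar (hypothetical representation)."""
    calendar: str  # e.g. "gregorian" or "julian"
    year: int
    month: int
    day: int

    def __str__(self) -> str:
        return f"{self.calendar}:{self.year:04d}-{self.month:02d}-{self.day:02d}"


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian-calendar date (standard integer formula)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def to_gregorian(value: CalendarDate) -> date:
    """Convert on demand; the stored value itself is never altered."""
    if value.calendar == "gregorian":
        return date(value.year, value.month, value.day)
    if value.calendar == "julian":
        # Python's ordinal 1 is proleptic Gregorian 0001-01-01, which is JDN 1721426.
        return date.fromordinal(julian_to_jdn(value.year, value.month, value.day) - 1721425)
    raise ValueError(f"no conversion available for calendar {value.calendar!r}")


recorded = CalendarDate("julian", 1752, 9, 2)
print(recorded, "->", to_gregorian(recorded))
```

The recorded value stays exactly as the evidence gave it, and a calendar with no reliable conversion simply raises an error rather than yielding a misleading Gregorian date.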





[1] Data elements and interchange formats — Information interchange — Representation of dates and times, International Standard, ISO 8601:2004(E), 3rd ed. 1 Dec 2004; online copies obtained from http://dotat.at/tmp/ISO_8601-2004_E.pdf (accessed 3 Jun 2014).
[2] The standard describes this as “reduced accuracy”, but there's a difference between imprecision and granularity in this field. Saying that a photo was taken in 1942/1943 is a case of imprecision but when talking about '19th century newspapers' then that's a case of granularity.
[3] UTC stands for Coordinated Universal Time (http://en.wikipedia.org/wiki/Coordinated_Universal_Time). For most intents and purposes, it can be considered to be the same as GMT (Greenwich Mean Time), or “Zulu time”.
