Most genealogists will have come across the ISO date
standard. If not then I’ll introduce it to you, and explain why it’s important
to us. I want to question, though, whether it is bad for genealogists and for
technology in general.
Most genealogical data will be concerned with dates rather
than times. Although Time Zones (TZ) and Daylight Saving Time (DST) are usually
applied to local clock times, they can also apply to local calendar dates. The
importance of this to family historians is going to be slim at best but it
needed to be said before we look at the ISO standard.
Having machine-readable copies of our dates is essential
when software is applied to genealogical data, or to historical data in
general. We take it for granted that databases store the dates of our vital
events in some internal format that facilitates sorting, searching, and
collation. With the growing amount of data appearing on the Internet, an
international standard is also essential so that searches can be performed
across disparate data without having to worry about which country it was
created by, or which format it is represented in.
It is understandable that some people fear this conversion of
data to a machine-readable representation, often objecting that the evidence may
not be that clear-cut. However, the verbatim evidential form complements a normalised
digital version. Neither one supersedes the other, and they’re both essential
for different reasons.
The ISO 8601 date standard[1]
was conceived as an unambiguous way of storing and exchanging dates, times, and
dates plus times in combination — hereinafter referred to as datetimes. Most people who are aware of it will immediately think
of the YYYY-MM-DD numeric representation of dates, e.g. 2014-06-09 for 9th
June 2014. A numeric representation is important because, believe it or not,
there are countries that do not speak English ☺. This particular layout achieves
two things: (i) it avoids the UK/US difference in the way we order our day and
month fields, and (ii) it makes the representation textually sortable because
the bigger units are at the head.
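The sortability point can be checked with a few lines of Python. This is only a sketch, and the sample dates are invented for illustration.

```python
# Sample dates, invented for illustration only.
iso_dates = ["2014-06-09", "1956-12-01", "2014-01-31", "1899-07-04"]
uk_dates = ["09/06/2014", "01/12/1956", "31/01/2014", "04/07/1899"]  # DD/MM/YYYY

# A plain text sort of ISO dates is also a chronological sort,
# because the largest unit (the year) comes first.
sorted_iso = sorted(iso_dates)
print(sorted_iso)

# The same text sort applied to DD/MM/YYYY orders by day-of-month first,
# so the result is not chronological.
sorted_uk = sorted(uk_dates)
print(sorted_uk)
```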
Times are represented in the format hh:mm:ss. When a time is combined with a
date to form a datetime, the two parts are separated by a ‘T’ character, i.e.
YYYY-MM-DDThh:mm:ss. The separating hyphen and colon characters may be omitted
if data size is perceived as an issue. All of the date, time, and datetime
representations allow truncation from the tail of the string in order to
describe values of coarser granularity,[2] for instance 20:12 (i.e. 8:12pm) or
2014-08 (i.e. August 2014).
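As a rough illustration of truncated dates, the following Python sketch reads YYYY, YYYY-MM, or YYYY-MM-DD and reports the granularity it found. The function name and the returned tuple layout are my own assumptions, not anything defined by the standard, and the sketch does not check that the month or day values are in range.

```python
import re

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD (hyphenated form only).
_DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def parse_reduced_date(text):
    """Return (granularity, year, month, day); missing parts are None."""
    match = _DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised date: {text!r}")
    year, month, day = (int(g) if g else None for g in match.groups())
    if day is not None:
        granularity = "day"
    elif month is not None:
        granularity = "month"
    else:
        granularity = "year"
    return granularity, year, month, day

print(parse_reduced_date("2014-08"))     # ('month', 2014, 8, None)
print(parse_reduced_date("2014-06-09"))  # ('day', 2014, 6, 9)
```

Keeping the granularity alongside the numbers means that "August 2014" is never silently padded out to the first of the month.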
So far, this sounds good, right? If the standard had rounded
the specification off about now then it would have been great. Unfortunately,
there’s a lot of unrelated stuff dumped in there, and an undue level of
“flexibility”.
A decimal fraction can be applied to the seconds field, or
the minutes or hours fields in one of the truncated forms, but the number of
decimal places is “by mutual agreement” between sender and receiver. For
instance, 12.34 (hh.hh format), 21:10.217 (hh:mm.mmm format), or 23:59:59.9
(hh:mm:ss.s format).
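To make the decimal-hours case concrete, here is a small Python sketch that expands a value such as 12.34 into hours, minutes, and seconds. The function name is hypothetical, and `Decimal` is used so that the arithmetic is not disturbed by binary floating-point rounding.

```python
from decimal import Decimal

def decimal_hours_to_hms(text):
    """Convert a decimal-hours string such as '12.34' to (h, m, s)."""
    hours = Decimal(text)
    whole_hours = int(hours)
    minutes = (hours - whole_hours) * 60
    whole_minutes = int(minutes)
    seconds = (minutes - whole_minutes) * 60
    return whole_hours, whole_minutes, seconds

h, m, s = decimal_hours_to_hms("12.34")
print(h, m, s)  # 12 20 24.00
```

So 12.34 in hh.hh format is the same instant as 12:20:24, which shows how two very different-looking strings can be equivalent under the standard.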
The Gregorian calendar was introduced during 1582 but the
standard allows proleptic application (i.e. to dates before the calendar was
defined) “by mutual agreement” between sender and receiver. The date may also
be extended to include more digits using a +YYYYYY… representation, although
the number of digits is “by mutual agreement”.
The standard supports week
dates which use week-numbers and days-of-the-week rather than month-numbers
and days-of-the-month: YYYY-Www-D where the ‘W’ is a fixed designator. For
instance: 2014-W10-2, meaning the second day (Tuesday) of the 10th
week of 2014.
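Python's standard library can translate between week dates and ordinary calendar dates, which makes the example above easy to check. This sketch assumes Python 3.8 or later, where `date.fromisocalendar` is available.

```python
from datetime import date

# ISO year 2014, week 10, weekday 2 (Monday is 1).
d = date.fromisocalendar(2014, 10, 2)
print(d.isoformat())    # 2014-03-04
print(d.isoweekday())   # 2

# Going the other way: format a calendar date as a week date.
iso_year, iso_week, iso_weekday = date(2014, 3, 4).isocalendar()
week_date = f"{iso_year}-W{iso_week:02d}-{iso_weekday}"
print(week_date)        # 2014-W10-2
```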
The standard supports ordinal
dates which use days-of-the-year: YYYY-DDD. Although both week dates and
ordinal dates are separately sortable, that capability breaks down if they are
mixed with each other or with basic dates.
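The sorting problem with mixed representations can be seen in a short Python sketch. The helper name and the three sample dates are invented for illustration.

```python
from datetime import date, timedelta

def ordinal_to_date(text):
    """Convert an ordinal date 'YYYY-DDD' to a datetime.date."""
    year, day_of_year = text.split("-")
    return date(int(year), 1, 1) + timedelta(days=int(day_of_year) - 1)

print(ordinal_to_date("2014-045").isoformat())  # 2014-02-14

# A calendar date (4 March), a week date (3 March), and an ordinal
# date (14 February), all valid under the standard.
mixed = ["2014-03-04", "2014-W10-1", "2014-045"]

# A text sort puts 4 March first and 3 March last.
text_sorted = sorted(mixed)
print(text_sorted)  # ['2014-03-04', '2014-045', '2014-W10-1']
```

Chronologically the order should be 2014-045, 2014-W10-1, 2014-03-04, so the text sort is only trustworthy when a single representation is used throughout.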
The standard supports an optional UTC[3]
designator of ‘Z’ (i.e. “Zulu time”), or a UTC offset (±hh or ±hh:mm), appended
to a time.
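The following Python sketch shows an offset and the ‘Z’ designator in use, and also illustrates the earlier remark that time zones can affect the calendar date as well as the clock time. It assumes Python 3.7 or later, where the `%z` directive of `strptime` accepts both `Z` and `+hh:mm`; the sample values are invented.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%dT%H:%M:%S%z"  # %z accepts 'Z' and '+hh:mm' on Python 3.7+

# Half past midnight on 9 June, one hour ahead of UTC.
local = datetime.strptime("2014-06-09T00:30:00+01:00", FORMAT)
as_utc = local.astimezone(timezone.utc)

print(local.date().isoformat())   # 2014-06-09
print(as_utc.isoformat())         # 2014-06-08T23:30:00+00:00

# The same instant written with the 'Z' designator.
zulu = datetime.strptime("2014-06-08T23:30:00Z", FORMAT)
print(local == zulu)              # True
```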
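The standard supports time intervals using one of the forms:
start/end (i.e. YYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss),
start/Pduration (i.e. YYYY-MM-DDThh:mm:ss/PYYYY-MM-DDThh:mm:ss),
or Pduration/end (i.e. PYYYY-MM-DDThh:mm:ss/YYYY-MM-DDThh:mm:ss).
The durations are shown here in a date-like layout; the standard's primary
duration form uses unit designators, e.g. P1Y2M10DT2H30M.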
The standard supports recurring time intervals by prefixing “Rnn/”
before one of the aforementioned time-interval representations, where the ‘nn’
is the repeat count. The standard does not stipulate the number of digits in
this count.
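A first step in handling these strings is simply to pull them apart. The Python sketch below does only that: the function name and return layout are hypothetical, the individual terms are not validated, and an ‘R’ field with no digits is rejected rather than interpreted.

```python
def split_interval(text):
    """Split an interval string into its repeat count and two terms.

    Returns (repeat, first, second); repeat is None for a non-recurring
    interval. Each term is tagged 'duration' if it starts with 'P',
    otherwise 'datetime'. The terms themselves are not validated.
    """
    fields = text.split("/")
    repeat = None
    if fields[0].startswith("R"):
        repeat = int(fields[0][1:])  # raises ValueError if no digits follow
        fields = fields[1:]
    if len(fields) != 2:
        raise ValueError(f"expected two terms separated by '/': {text!r}")

    def tag(term):
        kind = "duration" if term.startswith("P") else "datetime"
        return kind, term

    return repeat, tag(fields[0]), tag(fields[1])

print(split_interval("2014-06-09T20:12:00/2014-06-10T08:00:00"))
print(split_interval("R05/2014-06-09T20:12:00/P0000-00-01T00:00:00"))
```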
Right, so you’re now aware of the complexity of this
standard. It’s not just about a standard representation of a date and/or time. The
standard was originally designed to replace older standards on numeric
date/time representations (ISO 2014), week dates (ISO 2015), ordinal dates (ISO
2711), and a number of time-related standards. It was revised in 2000, and
again in 2004, partly because the complexity had led to ambiguities.
This complexity is bad because most applications are only
interested in specific parts of the standard — usually the basic
representations of dates and/or times. I know of no software that implements the
entire standard, and that means a statement such as “ISO 8601 compliant” is
meaningless. Which parts has it implemented? Which options has it selected?
There are also many instances of the clause “by mutual
agreement”:
- In the acceptance of year values from 0000 to 1582.
- In the acceptance of more than four digits in the year field.
- In the decimal places of a fractional time.
- In the omission of the separating ‘T’ between a date and a time in a datetime.
- In the full range of valid terms in a time interval.
The standard may be acting as a guide in these situations
but “by mutual agreement” is basically the contract established between two
pieces of software when there is no standard. In particular, on the Internet
there is no specific receiver with which mutual agreement can be formed and so
that degree of flexibility is inappropriate there.
The W3C discussion note on date and time formats (W3CDTF) examined the need for a
subset of the overloaded ISO 8601 standard that could be used on the Internet. Only
the basic representational part of the standard was used, and that was enough
to satisfy the requirements of data exchange.
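Because the W3CDTF subset is so small, it can be approximated by a single regular expression. The Python sketch below checks only the shape of the string, following the six layouts listed in the W3C note (year; year and month; complete date; and complete date with a time to the minute, second, or fraction of a second, each time followed by a zone designator). It does not check that field values are in range, and the function name is my own.

```python
import re

# Shape-only check for the W3CDTF layouts. A time, when present,
# must be followed by 'Z' or a +hh:mm / -hh:mm offset.
_W3CDTF_RE = re.compile(
    r"\d{4}"
    r"(?:-\d{2}"
    r"(?:-\d{2}"
    r"(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)

def looks_like_w3cdtf(text):
    """Return True if text has the shape of a W3CDTF date or datetime."""
    return _W3CDTF_RE.fullmatch(text) is not None

for sample in ["2014", "2014-06-09", "2014-06-09T20:12:30.5+01:00",
               "20140609", "2014-W10-2", "2014-06-09T20:12"]:
    print(sample, looks_like_w3cdtf(sample))
```

The last three samples are rejected: the first two are valid ISO 8601 but outside the subset, and the third lacks the zone designator that the note requires whenever a time is given.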
The US Library of Congress Extended Date Time Format (EDTF) actually subsets the various
features defined by ISO 8601 (e.g. “Level 0”) so that implementation can be to
a selected level for which an agreed designation exists. The EDTF sacrifices
some of the ISO 8601 flexibility but also extends it in order to address issues
such as uncertain date components (e.g. you know the year and the day but are
unsure of the month).
So if the ISO standard were similarly subsetted then would
that be the answer? It would certainly help but the standard is deficient in a
number of other ways.
There is no support for quarter dates, such as representing the period
January through March as Quarter 1. This is essential for certain records
such as the index of vital events compiled by the GRO (General Register Office)
of England and Wales. Although the local registrations will involve specific
day-based dates, the index is compiled on a quarterly basis. Citing an entry
therefore needs a way of representing the relevant quarter. This is also
another instance of the difference between granularity and imprecision already mentioned
in note [2].
I can only assume that this was an oversight of the standard since a format of
YYYY-Qq (e.g. 1956-Q2) is consistent with the standard as it exists now, and it
follows the precedent already set by week
dates.
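Since YYYY-Qq is my own suggested layout rather than part of the standard, the following Python sketch is purely hypothetical. It expands a quarter such as 1956-Q2 into the first and last calendar days that it covers.

```python
import re
from datetime import date, timedelta

def quarter_to_range(text):
    """Expand 'YYYY-Qq' into (first_day, last_day) as datetime.date values."""
    match = re.fullmatch(r"(\d{4})-Q([1-4])", text)
    if match is None:
        raise ValueError(f"not a quarter date: {text!r}")
    year, quarter = int(match.group(1)), int(match.group(2))
    first_month = 3 * (quarter - 1) + 1
    first_day = date(year, first_month, 1)
    # The last day is one day before the start of the following quarter.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, first_month + 3, 1)
    return first_day, next_start - timedelta(days=1)

first, last = quarter_to_range("1956-Q2")
print(first.isoformat(), last.isoformat())  # 1956-04-01 1956-06-30
```

The range is only an interpretation aid; the citation itself should still carry the quarter, because that is the granularity of the index.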
Perhaps the most serious gap for historical data is the lack of support for
non-Gregorian calendars. There are many other calendar systems in the world,
both ancient and modern, and these may be based on solar cycles,
lunar cycles, astronomical cycles, or regnal years. I am aware of
no digital representations of dates from any of these calendars and this has
serious repercussions. The prevailing notion amongst developers of software
technology, and related standards, is that they can all be converted to the
Gregorian calendar and represented using (some part of) ISO 8601. This breaks
down in practice, though, because exact conversions are not always possible. Indeed,
the conversion may be dependent upon other factors such as the precise location
of the event. At the very least, the conversion has to introduce some
imprecision into the Gregorian equivalent, but converting such a date
prematurely will set in stone an association that may change as new evidence or
better research becomes available. What is needed is a general scheme that can
represent dates from different calendar systems using a similar numerical approach
to the Gregorian case. Any conversion would then be done on-the-fly, if and
when necessary, without breaking a golden rule by distorting the evidence to fit
the technology. More on this another time though.
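As a very rough illustration of the idea, the Python sketch below stores a date together with a tag naming its calendar, and converts only when asked. Everything here is an assumption made for the example: the class, the tag values, and the choice of the Julian calendar, which is used because its conversion is simple arithmetic via the Julian Day Number. It says nothing about which calendar was actually in use at a given place and time.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class CalendarDate:
    """A date kept in its original calendar (hypothetical structure)."""
    calendar: str  # hypothetical tag, e.g. "gregorian" or "julian"
    year: int
    month: int
    day: int

def julian_to_gregorian(d):
    """Convert a Julian-calendar CalendarDate (CE years) to a Gregorian date."""
    if d.calendar != "julian":
        raise ValueError("this sketch only converts Julian dates")
    # Standard integer formula for the Julian Day Number of a Julian date.
    a = (14 - d.month) // 12
    y = d.year + 4800 - a
    m = d.month + 12 * a - 3
    jdn = d.day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python's proleptic Gregorian ordinal is offset from the JDN by 1721425.
    return date.fromordinal(jdn - 1721425)

recorded = CalendarDate("julian", 1582, 10, 5)
print(recorded)                                   # stored as written
print(julian_to_gregorian(recorded).isoformat())  # 1582-10-15
```

The recorded value is never overwritten, so a later change of mind about the conversion does not touch the evidence.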
[1] Data elements and
interchange formats — Information
interchange — Representation
of dates and times,
International Standard, ISO 8601:2004(E), 3rd ed. 1 Dec 2004; online
copies obtained from http://dotat.at/tmp/ISO_8601-2004_E.pdf (accessed 3 Jun 2014).
[2] The standard
describes this as “reduced accuracy”, but there's a
difference between imprecision and granularity in this field. Saying that a
photo was taken in 1942/1943 is a case of imprecision, but talking about
'19th century newspapers' is a case of granularity.
[3] UTC stands for
Coordinated Universal Time (http://en.wikipedia.org/wiki/Coordinated_Universal_Time).
For most intents and purposes, it can be considered to be the same as GMT
(Greenwich Mean Time), or “Zulu time”.