Places need a unique form of reference if we want to share
or correlate them, but what should it be? Should their names be standardised?
Should we rely on some system of geographical coordinates? Should we use a fabricated
reference, or a surrogate key, to use the
language of data modelling?
I have previously written about place names, and how they
should be recorded as they were originally written, in A
Place for Everything. That post tried to put to rest some age-old questions
such as whether to record their old or new name, how to deal with alternative
names/spellings, how to represent jurisdictional changes, and basically to make
Places a top-level entity in our data so that we can attach evidence and
narrative to them. The same debates will persist though. You might say: there be dragons in all directions.
One of the frequent issues is the suggested importance of
using coordinates over variable names, and I want to focus on that because I
admit that I don’t quite get it. A strong advocate of using coordinates is
prolific blogger James Tanner, who has mentioned them recently at: Finding
Your Ancestors Exact Location and The
Challenge of Changing Place Names, and less recently in 2009 at: Even
More on Standardization of Place.
Before I go any further, I want to make sure that my
preferred terminology is clear and unambiguous since this is an area where
related but dissimilar terms are used interchangeably:
- Postal Address – A sequence of terms that directs traditional mail (e.g. letters and packages) to a particular recipient.
- Location – A fixed geographical point or area, usually referenced by its coordinates.
- Place – A named point or area deemed to have significance to humans.
The first of these is easy to define, and easy to
distinguish from the other two. Obviously postal addresses do not apply to
every place or location. For instance, it may be an historical one, or it may
no longer exist, or it could be the name of a vehicle in transit such as a
ship, or post may never be delivered there. Conversely, an address such as a
“P. O. Box Number” is an abstract collection point for the recipient and does
not correspond to a physical place or location. Distinguishing a Place from a
Location is a little more subtle (see difference-between-location-and-place
for instance). In order to illustrate the difference, consider a major urban
redevelopment where streets are torn down and remade in a different fashion. A
household on the new street constitutes a different Place to a household on the old street, even though they may be at
the same physical Location. It
therefore makes sense to talk about the “location of a place”.
So do I believe coordinates are important? Absolutely! For
instance, knowing which English counties are contiguous with a given one
doesn’t tell me anything about which direction they are, or how far away they
are. This makes a big difference if you’re near the boundary of such a region.
My main issue with coordinates is that it should not be the
responsibility of genealogical users to measure or estimate them, and then to record
them in their own data. Genealogists are rarely qualified cartographers! I
strongly believe that the coordinates for each place should be recorded and
made freely-available by a relevant authoritative body, and I will return to
this in a moment.
I have never seen an historical data source that gives
coordinates for my ancestors. Every single case has involved me looking for a
named place on an old map, or in a street/trade directory. Even though some of
the names may have had dubious spellings, identification has always involved
the name. I often use maps but have never had cause to use latitude &
longitude values. As a rural dweller myself, though, I admit there will be
cases where coordinates are the best method of recording some location but they
will not be the majority.
As well as it being hard to obtain accurate coordinates for
places that no longer exist, the granularity of coordinates is even more problematic. Many records
do not pinpoint a household. They might reference a street, or a village/town,
or even the relevant county. In general, the larger the place then the harder
it is to geocode it. A simple record of its mid-point is virtually useless, but
then geocoding the whole perimeter as some approximate polygon needs
considerably more attention.
When people talk about coordinates (especially in the US)
they are mostly referring to latitude & longitude values, and these can be
stored in standardised ways. ISO 6709:2008 supports point location
representation through the use of XML but, recognizing the need for
compatibility with the previous version of the standard, ISO 6709:1983, it also
allows for the use of a single alphanumeric string. However, different
coordinate systems are in common use, such as the Ordnance
Survey National Grid of Great Britain. Yes, it is possible to convert from
one to the other (e.g. using the Ordnance Survey geograph tool at http://www.geograph.org.uk/showmap.php?gridref=SK57034525)
but what’s the point if all the local maps use the alternative system? Being
able to navigate directly from your data to the corresponding point on a
digital map is a worthy goal, but it does not mandate storing private
coordinates in your data, or relying on one specific coordinate system.
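To illustrate how mechanical the grid-reference side of this is, here is a minimal Python sketch that turns a National Grid reference such as the SK57034525 used in the link above into numeric eastings and northings. It uses the commonly published letter-pair arithmetic for the grid; the function name parse_grid_ref is my own invention rather than part of any official library. Going on from eastings and northings to latitude & longitude needs a datum transformation, which is deliberately left out here.

```python
def parse_grid_ref(gridref: str) -> tuple[int, int]:
    """Return (easting, northing) in metres for an OS National Grid
    reference such as 'SK57034525'. The result is the south-west
    corner of the square that the reference identifies."""
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]

    # The letter 'I' is never used in grid letters.
    if len(letters) != 2 or not all("A" <= c <= "Z" and c != "I" for c in letters):
        raise ValueError(f"bad grid letters in {gridref!r}")
    if len(digits) % 2 or len(digits) > 10:
        raise ValueError(f"bad grid digits in {gridref!r}")
    if digits and not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad grid digits in {gridref!r}")

    # Each letter indexes a 5x5 block (A..Z without I), so close the gap.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Index of the 100 km square, counted from the grid's false origin.
    # Note: no check is made that the square actually covers Great Britain.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Split the digits into equal easting/northing halves and scale to metres.
    half = len(digits) // 2
    scale = 10 ** (5 - half)
    east_part = int(digits[:half]) * scale if half else 0
    north_part = int(digits[half:]) * scale if half else 0

    return e100km * 100_000 + east_part, n100km * 100_000 + north_part


print(parse_grid_ref("SK57034525"))
```

The point of the sketch is that this sort of conversion belongs in shared software or a shared service, not in each genealogist's private data.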
There have been a number of announcements in this general
field recently:
- Dallan Quass, of WeRelate, discussed design considerations for the ‘REST Service for Place Standardization’ on rootsdev@googlegroups.com, and started enquiring about potential data sources.
- Justin York announced the launch of an Open Place Database (OPD) at http://blog.genealogysystems.com/2013/12/watercooler-wednesday-15-introducing.html. One of the hurdles this project hopes to tackle is geocoding irregularly-shaped places (see http://gis.stackexchange.com/questions/79510/how-can-i-geocode-to-a-shape-instead-of-a-coordinate/79519).
- Rob Hoare, on having seen the OPD announcement, suggested he might divert more effort to his own open database for place names (http://openplacenames.com/).
These initiatives should be praised and supported by the
whole genealogical community, if not beyond.
I have not done any work myself on coordinates within my
STEMMA® project but I have investigated how to represent hierarchical Place
entities, including the use of time-dependent parents to cope with
jurisdictional changes. There are two main ways to proceed here since there are
multiple potential hierarchies: (a) actually implement multiple concurrent parents
for each entity such that you can follow hierarchies of different types (e.g.
administrative or religious), or (b) adopt an attributed hierarchy where you implement one main hierarchy, and
then give each Place various properties representing parents of a different
type (e.g. a religious parish).
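As a rough illustration of option (b), the Python sketch below gives each Place one main chain of time-dependent parents plus a set of named properties pointing at parents of other types. This is not the actual STEMMA representation: the class names, field names, keys (P1, P2, and so on), the place name and the years are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParentLink:
    """One parent in the main hierarchy, valid for a range of years.
    A start or end of None means the range is open at that side."""
    parent_key: str
    start_year: Optional[int] = None
    end_year: Optional[int] = None

    def covers(self, year: int) -> bool:
        after_start = self.start_year is None or self.start_year <= year
        before_end = self.end_year is None or year <= self.end_year
        return after_start and before_end


@dataclass
class Place:
    key: str                      # surrogate key, not the name
    name: str
    parents: list[ParentLink] = field(default_factory=list)
    # Parents of other types, held as simple properties (hypothetical names).
    other_parents: dict[str, str] = field(default_factory=dict)

    def parent_in(self, year: int) -> Optional[str]:
        """Key of the main-hierarchy parent in the given year, if any."""
        for link in self.parents:
            if link.covers(year):
                return link.parent_key
        return None


# Invented data: a town that changed its main parent in 1900.
town = Place(
    key="P3",
    name="Exampleton",
    parents=[ParentLink("P1", end_year=1899), ParentLink("P2", start_year=1900)],
    other_parents={"religious": "P9"},
)

print(town.parent_in(1850), town.parent_in(1950), town.other_parents["religious"])
```

The trade-off shows up even at this size: only the main hierarchy can be walked by date, while the other parent types are flat properties with no history of their own.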
STEMMA is currently experimenting with the second route but I
admit that there are difficulties. One particular problem in England is the
identification of a ‘county’. The historic
counties of England were established for administration by the Normans, and
in most cases were based on earlier kingdoms and shires established by the
Anglo-Saxons. These geographic counties existed before the local government
reforms of 1965 and 1974. Counties are now primarily an administrative
division composed of districts and boroughs. The overall administrative
hierarchy is described at Subdivisions_of_England.
However, the ceremonial
counties are still used as geographical entities. The postal
counties of the UK, now known officially as the former postal counties,
were postal subdivisions in routine use by the Royal Mail until 1996. The registration counties
were a statistical unit for the registration of births, marriages, and deaths
and for the output of census information.
A hard example that occurs frequently in my own data is the
town of Ilkeston. Although technically in the administrative county of
Derbyshire, it is actually closer to Nottingham than it is to Derby. It
therefore appears in the Basford Registration District of the Nottinghamshire
Registration County. So which is the more important type of county to use in
the primary hierarchy?
STEMMA also represents streets, and even individual
households, as Place entities. Without any link to an authoritative database, I
experimented with recording the end-points of streets in terms of which other
street they connected to. This resulted in a network rather than a map but was
still of some use. Unfortunately, it didn’t include street intersections, and
had to be adjusted for some real-life practical cases: a cul-de-sac only has one end-point (easy enough), but I also grew up
on a street which looped around and joined on to itself. This constitutes one
of the few cases I know of where all the streets intersecting at a junction are
the same one, and this alone meant dropping several types of consistency check.
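The street network just described can be sketched in a few lines of Python. This is only a reconstruction of the idea under my own assumptions, not the code I actually used: the class, the function and the street names are all made up. Each street records the street it meets at each end, and the only checks that survive are the ones the cul-de-sac and the self-joining loop do not break.

```python
from dataclasses import dataclass


@dataclass
class Street:
    name: str
    # Names of the streets met at each end: one entry for a cul-de-sac,
    # two otherwise. A street may legitimately list itself (a loop).
    ends: tuple[str, ...]


def check_network(streets: dict[str, Street]) -> list[str]:
    """Return a list of problems; an empty list means the network passed."""
    problems = []
    for street in streets.values():
        if not 1 <= len(street.ends) <= 2:
            problems.append(f"{street.name}: expected 1 or 2 end-points")
        for other in street.ends:
            if other not in streets:
                problems.append(f"{street.name}: unknown street {other}")
        # A "no street may connect to itself" check would go here,
        # but the loop case means it has to be dropped.
    return problems


# Invented example: a main road, a cul-de-sac and a self-joining loop.
network = {
    "High Street": Street("High Street", ("Mill Lane", "Loop Road")),
    "Mill Lane": Street("Mill Lane", ("High Street",)),
    "Loop Road": Street("Loop Road", ("High Street", "Loop Road")),
}

print(check_network(network))
```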
The issue of house numbers has a particularly British
eccentricity to it. US streets may be nicely orthogonal, with numbers allocated
in blocks, but not so in places like England. It is widely reported that we generally
have even numbers on one side of a street and odd numbers on the other, but
they’re often out-of-synchronisation with each other, and occasionally even run
in opposite directions. Even if an authoritative source didn’t record the
location of every household, knowing which way the house numbers went could be
helpful in knowing how neighbourly two houses were.
So what would my answer to the original question be,
assuming that I had sufficient resources? That would have to be a Place Authority, meaning an
Internet-based service supporting all aspects of historical and modern places.
Although such a service would have a single URL, the database itself would have
to be federated such that it could be populated and maintained by multiple appropriate
authoritative bodies. This is essential for a worldwide authority but it could
even apply within national borders, such as within the US. In effect, the service
would be hierarchical, and the conceptual database would be geographically distributed,
thus constituting a Shared Nothing
Architecture. Incoming requests to the main URL would be delegated, as
appropriate, to one or more subordinate nodes, each of which must support a
standard request/response interface. For instance, given a vague (not
country-specific) place-name search, the results returned from multiple nodes
would be merged before responding to the client software.
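The delegate-and-merge behaviour can be sketched in Python as follows. Everything here is hypothetical: the class names, the surrogate keys and the in-memory dictionaries standing in for remote nodes are assumptions made for the example, and a real service would be making network requests through an agreed interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PlaceHit:
    key: str    # surrogate key issued by the owning node (invented here)
    name: str
    node: str   # which subordinate node answered


# The "standard request/response interface": a name in, a list of hits out.
SearchFn = Callable[[str], list[PlaceHit]]


def make_node(node: str, records: dict[str, str]) -> SearchFn:
    """Build an in-memory stand-in for a remote node (key -> place name)."""
    def search(name: str) -> list[PlaceHit]:
        wanted = name.casefold()
        return [PlaceHit(key, place, node)
                for key, place in records.items()
                if place.casefold() == wanted]
    return search


class PlaceAuthority:
    def __init__(self) -> None:
        self._nodes: dict[str, SearchFn] = {}

    def register(self, country: str, search: SearchFn) -> None:
        """Bring a new subordinate node online."""
        self._nodes[country] = search

    def search(self, name: str, country: Optional[str] = None) -> list[PlaceHit]:
        # A country-specific query goes to one node; a vague one goes to all.
        if country is None:
            targets = list(self._nodes.values())
        else:
            targets = [self._nodes[country]] if country in self._nodes else []
        merged: dict[str, PlaceHit] = {}
        for node_search in targets:
            for hit in node_search(name):
                merged.setdefault(hit.key, hit)   # drop duplicate keys
        return sorted(merged.values(), key=lambda h: (h.name, h.key))


authority = PlaceAuthority()
authority.register("GB", make_node("GB", {"gb:0001": "Boston", "gb:0002": "Ilkeston"}))
authority.register("US", make_node("US", {"us:0001": "Boston"}))

print(authority.search("Boston"))
print(authority.search("Boston", country="US"))
```

Because a node is just something that answers the agreed request, a new database can be registered without any change to the clients, which is the property the federated design relies on.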
This federated implementation would allow new databases to
be brought online as and when they became available, and would anticipate a
take-up based on a demonstrable proof-of-concept being implemented somewhere.
This sketch does not specify the precise nature of the
database – including the choice of hierarchical model – or of the request
types. However, I can easily outline some functional requirements:
- Support for lookup of a place name, with or without a specific area of focus such as a country.
- Support for global place types along the lines of ISO 3166-2 or Nomenclature of Units for Territorial Statistics (NUTS) but incorporating historical ones too.
- Information about all the accepted names of the current place, including any dates for historical ones, and including any dual-language representations such as in Canada, Ireland, and Wales.
- Information about the parent place of any given place, including parents of different types and the dates of any changes.
- Acceptance of a client locale and a date range during queries.
- Request for a formatted Place Hierarchy Path, as explained previously in A Place for Everything.
- Provision of coordinates for a place, and of its boundaries.
- Knowledge of abbreviations used in the client's locale, and of common misspellings of a name.
- Acceptance of evidence-based user submissions to the underlying database(s).
- Provision of historical narrative and/or external links pertaining to the place.
So, returning to the question in the heading, I believe the
unambiguous place identifier should be a fabricated surrogate key, and that both
place names (including all their variations and jurisdictional changes) and
coordinates should be maintained by a Place Authority for each respective key.