Places need a unique form of reference if we want to share or correlate them, but what should it be? Should their names be standardised? Should we rely on some system of geographical coordinates? Should we use a fabricated reference, or “surrogate key” in the language of data modelling?
I have previously written about place names, and how they should be recorded as they originally were, in A Place for Everything. That post tried to put to rest some age-old questions such as whether to record the old or new name, how to deal with alternative names and spellings, and how to represent jurisdictional changes. Its basic aim was to make Places a top-level entity in our data so that we can attach evidence and narrative to them. The same debates will persist though. You might say: there be dragons in all directions.
One of the frequent issues is the suggested importance of using coordinates over variable names, and I want to focus on that because I admit that I don’t quite get it. A strong advocate of using coordinates is prolific blogger James Tanner who has mentioned them recently at: Finding Your Ancestors Exact Location and The Challenge of Changing Place Names, and less recently in 2009 at: Even More on Standardization of Place.
Before I go any further, I want to make sure that my preferred terminology is clear and unambiguous since this is an area where related but dissimilar terms are used interchangeably:
- Postal Address – A sequence of terms that direct traditional mail (e.g. letters and packages) to a particular recipient.
- Location – A fixed geographical point or area, usually referenced by its coordinates.
- Place – A named point or area deemed to have significance to humans.
The first of these is easy to define, and easy to distinguish from the other two. Obviously postal addresses do not apply to every place or location. For instance, it may be an historical one, or it may no longer exist, or it could be the name of a vehicle in transit such as a ship, or post may never be delivered there. Conversely, an address such as a “P. O. Box Number” is an abstract collection point for the recipient and does not correspond to a physical place or location. Distinguishing a Place from a Location is a little more subtle (see difference-between-location-and-place for instance). In order to illustrate the difference, consider a major urban redevelopment where streets are torn down and remade in a different fashion. A household on the new street constitutes a different Place to a household on the old street, even though they may be at the same physical Location. It therefore makes sense to talk about the “location of a place”.
So do I believe coordinates are important? Absolutely! For instance, knowing which English counties are contiguous with a given one doesn’t tell me anything about which direction they are, or how far away they are. This makes a big difference if you’re near the boundary of such a region.
My main issue with coordinates is that it should not be the responsibility of genealogical users to measure or estimate them, and then to record them in their own data. Genealogists are rarely qualified cartographers! I strongly believe that the coordinates for each place should be recorded and made freely available by a relevant authoritative body, and I will return to this in a moment.
I have never seen an historical data source that recorded coordinates for my ancestors. Every single case has involved me looking for a named place on an old map, or in a street/trade directory. Even though some of the names may have had dubious spellings, identification has always involved the name. I often use maps but have never had cause to use latitude & longitude values. As a rural dweller myself, though, I admit there will be cases where coordinates are the best method of recording a location, but they will not be the majority.
Accurate coordinates are hard to obtain for places that no longer exist, but their granularity is even more problematic. Many records do not pinpoint a household. They might reference a street, or a village/town, or even just the relevant county. In general, the larger the place, the harder it is to geocode. A simple record of its mid-point is virtually useless, but geocoding the whole perimeter as an approximate polygon needs considerably more attention.
When people talk about coordinates (especially in the US) they are mostly referring to latitude & longitude values, and these can be stored in standardised ways. ISO 6709:2008 supports point location representation through the use of XML but, recognizing the need for compatibility with the previous version of the standard, ISO 6709:1983, it also allows for the use of a single alphanumeric string. However, different coordinate systems are in common use, such as the Ordnance Survey National Grid of Great Britain. Yes, it is possible to convert from one to the other (e.g. using the Ordnance Survey geograph tool at http://www.geograph.org.uk/showmap.php?gridref=SK57034525) but what’s the point if all the local maps use the alternative system? Being able to navigate directly from your data to the corresponding point on a digital map is a worthy goal, but it does not mandate storing private coordinates in your data, or relying on one specific coordinate system.
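To make the "single alphanumeric string" idea concrete, here is a minimal Python sketch of the decimal-degree form as I understand it: a mandatory sign, two integer digits of latitude, three of longitude, and a terminating solidus. The function names are my own, and the sketch assumes nothing about altitude, degrees-minutes-seconds variants, or coordinate reference system identifiers, all of which the standard also covers.

```python
from __future__ import annotations

import re


def to_iso6709(lat: float, lon: float) -> str:
    """Format a point as a decimal-degree string such as +52.00000-001.15000/."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    # Sign is always written; latitude is zero-padded to 2 integer digits,
    # longitude to 3, each with 5 decimal places (a choice made for this sketch).
    return f"{lat:+09.5f}{lon:+010.5f}/"


# Accepts the decimal-degree form only; the trailing solidus is optional here.
_POINT = re.compile(r"^([+-]\d{2}(?:\.\d+)?)([+-]\d{3}(?:\.\d+)?)/?$")


def from_iso6709(text: str) -> tuple[float, float]:
    """Parse the decimal-degree form back into (latitude, longitude)."""
    match = _POINT.match(text)
    if match is None:
        raise ValueError(f"not a decimal-degree point string: {text!r}")
    return float(match.group(1)), float(match.group(2))
```

A string like this is easy to store, but it says nothing about which named place it belongs to, which is why I would rather see it maintained by an authority than typed into private data.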
There have been a number of announcements in this general field recently:
- Dallan Quass, of WeRelate, discussed design considerations for the ‘REST Service for Place Standardization’ on firstname.lastname@example.org, and started enquiring about potential data sources.
- Justin York announced the launch of an Open Place Database (OPD) at http://blog.genealogysystems.com/2013/12/watercooler-wednesday-15-introducing.html. One of the hurdles this project hopes to tackle is geocoding irregularly-shaped places (see http://gis.stackexchange.com/questions/79510/how-can-i-geocode-to-a-shape-instead-of-a-coordinate/79519).
- Rob Hoare, on having seen the OPD announcement, suggested he might divert more effort to his own open database for place names (http://openplacenames.com/).
These initiatives should be praised and supported by the whole genealogical community, if not beyond.
I have not done any work myself on coordinates within my STEMMA® project but I have investigated how to represent hierarchical Place entities, including the use of time-dependent parents to cope with jurisdictional changes. There are two main ways to proceed here since there are multiple potential hierarchies: (a) actually implement multiple concurrent parents for each entity such that you can follow hierarchies of different types (e.g. administrative or religious), or (b) adopt an attributed hierarchy where you implement one main hierarchy, and then give each Place various properties representing parents of a different type (e.g. a religious parish).
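As a rough illustration of time-dependent, typed parents, here is a small Python sketch. It is not STEMMA's actual representation: the class names, the `kind` labels, the keys, and the year-only dates are all assumptions made for brevity. "Exampleton" and its two counties are invented to show a boundary change at the 1974 reforms, while the Ilkeston links reflect the administrative and registration parents discussed in this post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParentLink:
    parent_key: str               # key of the parent Place (hypothetical keys)
    kind: str                     # e.g. "administrative", "registration"
    start: Optional[int] = None   # first year the link applies; None = open-ended
    end: Optional[int] = None     # last year the link applies; None = open-ended

    def covers(self, year: int) -> bool:
        after_start = self.start is None or self.start <= year
        before_end = self.end is None or year <= self.end
        return after_start and before_end


@dataclass
class Place:
    key: str
    name: str
    parents: List[ParentLink] = field(default_factory=list)

    def parent_in(self, year: int, kind: str = "administrative") -> Optional[str]:
        """Return the key of the parent of the given kind in the given year."""
        for link in self.parents:
            if link.kind == kind and link.covers(year):
                return link.parent_key
        return None


# Two concurrent parents of different types (undated here, for simplicity).
ilkeston = Place("ilkeston", "Ilkeston", [
    ParentLink("derbyshire", "administrative"),
    ParentLink("basford-rd", "registration"),
])

# An invented place whose administrative parent changes in 1974.
exampleton = Place("exampleton", "Exampleton", [
    ParentLink("county-a", "administrative", end=1973),
    ParentLink("county-b", "administrative", start=1974),
])
```

This is closest to option (a). Option (b) would promote one `kind` to be the main hierarchy and keep the other links as plain properties of the Place.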
STEMMA is currently experimenting with the second route but I admit that there are difficulties. One particular problem in England is the identification of a ‘county’. The historic counties of England were established for administration by the Normans, and in most cases were based on earlier kingdoms and shires established by the Anglo-Saxons. These geographic counties existed before the local government reforms of 1965 and 1974. Counties are now primarily an administrative division composed of districts and boroughs. The overall administrative hierarchy is described at Subdivisions_of_England. However, the ceremonial counties are still used as geographical entities. The postal counties of the UK, now known officially as the former postal counties, were postal subdivisions in routine use by the Royal Mail until 1996. The registration counties were a statistical unit for the registration of births, marriages, and deaths and for the output of census information.
A hard example that occurs frequently in my own data is the town of Ilkeston. Although technically in the administrative county of Derbyshire, it is actually closer to Nottingham than it is to Derby. It therefore appears in the Basford Registration District of the Nottinghamshire Registration County. So which is the more important type of county to use in the primary hierarchy?
STEMMA also represents streets, and even individual households, as Place entities. Without any link to an authoritative database, I experimented with recording the end-points of streets in terms of which other street they connected to. This resulted in a network rather than a map but was still of some use. Unfortunately, it didn’t include street intersections, and had to be adjusted for some real-life practical cases: a cul de sac only has one end-point (easy enough), but I also grew up on a street which looped around and joined on to itself. This constitutes one of the few cases I know of where all the streets intersecting at a junction are the same one, and this alone meant dropping several types of consistency check.
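The end-point network can be sketched in a few lines of Python. This is a reconstruction of the idea rather than the code I actually used, and every street name below is invented. Each street simply lists the street met at each of its ends, and the self-joining case shows why a "must join a different street" check has to be optional.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Street:
    name: str
    ends: Tuple[str, ...]   # street met at each end; a cul de sac has one entry


def check(streets: Dict[str, Street], allow_self_join: bool = True) -> List[str]:
    """Return a list of human-readable consistency problems."""
    problems: List[str] = []
    for street in streets.values():
        if not 1 <= len(street.ends) <= 2:
            problems.append(f"{street.name}: expected 1 or 2 end-points")
        for other in street.ends:
            if other not in streets:
                problems.append(f"{street.name}: unknown street {other}")
            elif other == street.name and not allow_self_join:
                problems.append(f"{street.name}: end joins the street itself")
    return problems


# Invented example data: a triangle of streets, a cul de sac, and a loop.
streets = {s.name: s for s in [
    Street("High Street", ("Church Lane", "Mill Lane")),
    Street("Church Lane", ("High Street", "Mill Lane")),
    Street("Mill Lane", ("High Street", "Church Lane")),
    Street("Orchard Close", ("Mill Lane",)),            # cul de sac
    Street("The Loop", ("Mill Lane", "The Loop")),      # loops back onto itself
]}
```

With `allow_self_join=False` the loop is reported as a problem. Accepting it means giving up that check for every street, which is the kind of compromise described above.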
The issue of house numbers has a particularly British eccentricity to it. US streets may be nicely orthogonal, with numbers allocated in blocks, but not so in places like England. It is widely reported that we generally have even numbers on one side of a street and odd numbers on the other, but they’re often out-of-synchronisation with each other, and occasionally even run in opposite directions. Even if an authoritative source didn’t record the location of every household, knowing which way the house numbers went could be helpful in knowing how neighbourly two houses were.
So what would my answer to the original question be, assuming that I had sufficient resources? That would have to be a Place Authority, meaning an Internet-based service supporting all aspects of historical and modern places. Although such a service would have a single URL, the database itself would have to be federated such that it could be populated and maintained by multiple appropriate authoritative bodies. This is essential for a worldwide authority but it could even apply within national borders, such as within the US. In effect, the service would be hierarchical, and the conceptual database would be geographically distributed, thus constituting a Shared Nothing Architecture. Incoming requests to the main URL would be delegated, as appropriate, to one or more subordinate nodes, each of which must support a standard request/response interface. For instance, given a vague (not country-specific) place-name search, the results returned from multiple nodes would be merged before responding to the client software.
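The delegate-and-merge behaviour might look something like the following Python sketch. Everything here is hypothetical: the `PlaceAuthority` class, the idea of registering nodes by country code, the key format, and the in-memory stand-ins for what would really be remote services. It only shows the shape of the dispatch, not a proposed interface.

```python
from typing import Callable, Dict, List, Optional

PlaceHit = Dict[str, str]                # e.g. {"key": ..., "name": ..., "within": ...}
Node = Callable[[str], List[PlaceHit]]   # a subordinate node answers a name search


def make_node(records: List[PlaceHit]) -> Node:
    """Build an in-memory stand-in for one subordinate node."""
    def search(name: str) -> List[PlaceHit]:
        wanted = name.casefold()
        return [r for r in records if r["name"].casefold() == wanted]
    return search


class PlaceAuthority:
    """Front door that delegates a search to one node, or to all of them."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Node] = {}

    def register(self, country: str, node: Node) -> None:
        self._nodes[country] = node

    def search(self, name: str, country: Optional[str] = None) -> List[PlaceHit]:
        if country is None:
            targets = list(self._nodes.values())      # vague search: ask every node
        else:
            targets = [self._nodes[country]] if country in self._nodes else []
        merged: Dict[str, PlaceHit] = {}
        for node in targets:
            for hit in node(name):
                merged.setdefault(hit["key"], hit)    # de-duplicate on the place key
        return sorted(merged.values(), key=lambda h: h["key"])


authority = PlaceAuthority()
authority.register("GB", make_node(
    [{"key": "gb:0001", "name": "Boston", "within": "Lincolnshire"}]))
authority.register("US", make_node(
    [{"key": "us:0001", "name": "Boston", "within": "Massachusetts"}]))
```

A search for "Boston" with no country returns both places, each from its own node, while adding `country="US"` consults only the US node. New nodes can be registered without the others knowing anything about them.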
This federated implementation would allow new databases to be brought online as and when they became available, and would anticipate a take-up based on a demonstrable proof-of-concept being implemented somewhere.
This sketch does not specify the precise nature of the database – including the choice of hierarchical model – or of the request types. However, I can easily outline some functional requirements:
- Supporting a lookup of a place name, with or without a specific area of focus such as a country
- Support for global place types along the lines of ISO 3166-2 or Nomenclature of Units for Territorial Statistics (NUTS) but incorporating historical ones too.
- Information about all the accepted names of the current place, including any dates for historical ones, and including any dual-language representations such as in Canada, Ireland, and Wales.
- Information about the parent place of any given place, including parents of different types and the dates of any changes.
- Acceptance of a client locale and a date range during queries.
- Request for a formatted Place Hierarchy Path, as explained previously in A Place for Everything.
- Providing coordinates of your place, and its boundaries.
- Knowledge of abbreviations used in your locale, and common misspellings of a name.
- Acceptance of evidence-based user submissions to the underlying database(s).
- Provision of historical narrative and/or external links pertaining to the place.
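The name-related requirements above can be illustrated with a small Python sketch in which each place is known only by an opaque key, and its names carry a language and optional dates. The class names and keys are invented for illustration, and dates are reduced to years. The example data uses two familiar cases: the successive names of Saint Petersburg, and the English and Welsh names of Cardiff.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PlaceName:
    text: str
    lang: str
    start: Optional[int] = None   # first year in use; None = open-ended
    end: Optional[int] = None     # last year in use; None = open-ended

    def in_use(self, year: Optional[int]) -> bool:
        if year is None:
            return True
        return ((self.start is None or self.start <= year)
                and (self.end is None or year <= self.end))


class NameIndex:
    """Maps opaque surrogate keys to all the names a place has carried."""

    def __init__(self) -> None:
        self._names: Dict[str, List[PlaceName]] = {}

    def add(self, key: str, name: PlaceName) -> None:
        self._names.setdefault(key, []).append(name)

    def keys_for(self, text: str, year: Optional[int] = None) -> List[str]:
        """Find place keys by name, optionally restricted to a given year."""
        wanted = text.casefold()
        return [key for key, names in self._names.items()
                if any(n.text.casefold() == wanted and n.in_use(year) for n in names)]

    def names_in(self, key: str, year: int) -> List[str]:
        """List the names in use for a place in a given year."""
        return [n.text for n in self._names.get(key, []) if n.in_use(year)]


index = NameIndex()
index.add("place:0001", PlaceName("Saint Petersburg", "en", 1703, 1914))
index.add("place:0001", PlaceName("Petrograd", "en", 1914, 1924))
index.add("place:0001", PlaceName("Leningrad", "en", 1924, 1991))
index.add("place:0001", PlaceName("Saint Petersburg", "en", 1991, None))
index.add("place:0002", PlaceName("Cardiff", "en"))
index.add("place:0002", PlaceName("Caerdydd", "cy"))
```

The key never changes, however many names are attached to it, so a record citing "Leningrad" in 1950 and one citing "Saint Petersburg" in 1900 can both resolve to `place:0001`.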
So, returning to the question in the heading, I believe the unambiguous place identifier should be a fabricated surrogate key, and that both place names (including all their variations and jurisdictional changes) and coordinates should be maintained by a Place Authority for each respective key.