Thursday 26 December 2013

Proof of the Pudding

What can and cannot be proved? Can we prove anything at all? How close is the genealogical process to that of mathematics or the sciences?

Like a number of people with a mathematical or scientific background, I questioned the use of the term “genealogical proof” when I first came into the field. Unfortunately for me, I have since criticised it a number of times, and have been gently rebuked by long-standing genealogists who insisted it “is our jargon” and “It is currently what it is”.

Without wanting to appear belligerent, I will compare and contrast some interpretations of the term proof in order to see just how far apart genealogy and science really are in this vein. We all know there are entrenched attitudes, but there are misconceptions too.

Let’s first look at the dictionary definitions of the terms fact[1] and proof[2], respectively:

Fact: A thing that is known or proved to be true.
Proof: Evidence or argument establishing a fact or the truth of a statement.

These are quite specific, and it would seem that much day-to-day usage lies on the edge of, or even outside of, these descriptions. For instance, providing proof of your identity doesn’t necessarily prove beyond any doubt who you are, otherwise identity theft would not exist. The problem arises because terms such as truth and fact imply something real and immutable, and hence incontrovertible. That level of proof is more accurately referred to as an absolute proof.

Mathematics is actually the only discipline where an absolute proof is possible. A mathematical proof is reasoned not from evidence but from first principles, or from other proofs. A proof typically begins by describing what is to be proved, and concludes with the initialism QED[3] to indicate completion. The body of a mathematical proof may draw on smaller proofs (theorems, lemmas, or even equations) to reach its goal in a hierarchical fashion, and the overall presentation is of non-sequential, labelled connections rather than plain narrative. Between 1910 and 1913, Alfred North Whitehead and Bertrand Russell published a three-volume work on the foundations of mathematics called Principia Mathematica. It was an attempt to derive all mathematical truths from a well-defined set of axioms and inference rules in symbolic logic.

So what about scientific proof? Well, you may be surprised to hear that there’s no such thing. In a scientific context, it is an oxymoron since mathematics is a key tool of science and so they share the same interpretation of proof. Just as genealogy strives to explain the past from the available evidence, so (pure-)science tries to explain the universe from experimental evidence. Their common evidence-based aspirations lead them to the same limitation: you can never prove anything absolutely, but you can certainly disprove something.

Elizabeth Shown Mills, in her remarkable book on evidence analysis, states that “…there is no such thing as proof that can never be rebutted”[4], which is entirely in keeping with the aforementioned evidence-based limitation. Mills acknowledges that the past is not directly accessible, and that any new piece of evidence could disprove our conclusions. The book clearly distinguishes the genealogical usage of words such as proof, and fact[5] (effectively just information from a source, and usually an assertion or a claim), from any absolute interpretation. However, while differentiating a proof in genealogy from that in law and social sciences, it makes the following statement about science[6]:

Unlike science, however, genealogy accepts no margin of error. A single error in identity or kinship will be multiplied exponentially with each generation beyond the error. Errors will occur. But family historians today approach their work with the mindset that erring is unacceptable.

This is an unfamiliar view of science to me, and I have to assume that it is referring to the applied sciences, such as engineering and technology. For example, if you were building a skyscraper and your foundations were not level then the error would be compounded the higher you went. It would then be a matter of acceptable tolerances as to whether the project succeeded. From the perspective of pure science, though, the statement would be wholly wrong, and it is in pure science that the closest analogy to the genealogical ideal lies.

So let’s look at our separate vocabularies. In science, a hypothesis is a proposed explanation of some phenomenon, and is therefore based primarily on some subset of evidence relating to that phenomenon. The hypothesis becomes a theory when it has been tested against a much greater set of evidence. At this point, the terms are almost identical to those presented by Mills. However, even if a scientific theory is tested against all available evidence then it does not mean that anything has been proved — there is always a chance it could be overturned later, and the history of science is full of such cases. Mills places the definition of ‘proof’ in this spot, and it is distinguished from ‘theory’ by the concept of aggregated evidence.

Remember that this is still a compare-and-contrast exercise in aligning our concepts and vocabularies, and there isn’t a great divergence so far. It is with evidence, though, that our greatest differences become manifest. Science is about the here-and-now whereas genealogy is about the been-and-gone. What this means is that genealogy only has a finite set of evidence available, and although more of that set may be discovered over time, no evidence outside of that set will ever be found. It also means that evidence cannot be created on demand in order to solve a particular problem, or to support/refute a given proposition. On the other hand, in science — technology permitting — an experiment can be conceived purposely to test a given theory, or to separate two competing theories.

The idea of competing theories is something we have in common. It is possible in both fields to have more than one theory which explains the available evidence, no matter how deeply that evidence is scrutinised. Whereas science can usually conduct a specific experiment to disprove some of the candidate theories, and so support the remainder, genealogy can only search for more items of evidence that already exist. If they don’t exist somewhere now then they never will in the future either. Part of this process of testing our theories may involve extrapolating from them, and checking what they would predict if they were true. This gives focus to the areas where we need more evidence, and is again something common to both fields.

Even the concept of Occam’s Razor is something we have in common. Put simply, the least complicated explanation is probably the correct one. This doesn’t make it absolutely true, and certainly doesn’t constitute a proof in anyone’s vocabulary, but it can help focus our research.

One area where science meets genealogy, thus showing the relevance of this post, is DNA analysis. DNA analysis compares the genetic code of someone with either specific individuals or some ethnic group. There are three types of tests: Y-Chromosome (Y-DNA), which looks at a male’s descent along his direct paternal line; mitochondrial DNA (mtDNA), which looks at the descent of either sex along their direct maternal line; and autosomal (atDNA), which covers all types of descent. It is important to understand that DNA testing only matches certain sequences in our genetic profile, and the relevance of the results therefore depends on how long those sequences are, how rare they are, and how far back you’re looking (more intermediate generations means the results are less significant). Hence, just as with any other type of evidence (scientific or historical), there can be no absolute proof. DNA testing can support the proposition that two people are related, or completely disprove it, but it can never prove it beyond doubt. When done correctly, DNA testing does provide an objective test that is not susceptible to bias or personal agendas, but our own acceptance and interpretation is a different matter. Note that when DNA evidence is presented in court, it is also supplemented by a risk factor indicating the chances of a false positive.

One short-lived concept that illustrates the dangers of an ambiguous interpretation of proof is Definitive Tangible Proof[7] (DTP). This was originally described as “…which can be, but is not limited to, birth/baptism records, church records, marriage records, land deals (purchases/sales), death records/certificates, grave markers, tax rolls, probate records, military records, Wills, etc.”. Rather than proof, this concept was really describing certain sources of evidence. However, no source is ever definitive, and all those listed can, and often do, contain errors. These are not special cases, and none can be taken at face value. When used in a proof argument (i.e. a written argument helping to ascertain some truth) then all such evidence must be assessed in the same way, including a consideration of the nature of the information, the nature of the underlying source, and how that evidence correlates with other evidence to support or contradict a claim.

So, in summary, I agree that historical research has a large element of precision in such areas as finding all available evidence, analysing and correlating that evidence, resolving any conflicts, and writing it up clearly and unambiguously. However, whereas science reserves the term proof for the absolute case, and doesn’t attempt to push any ideas beyond the status of theory, genealogy employs the word proof in the context of the less-precise disciplines. Despite attempts to define proof for the genealogical context, I believe this disparity of precision is at the root of many of our confusions.

[1] Oxford Dictionaries Online (accessed 24 Dec 2013), s.v. “fact”.
[2] Oxford Dictionaries Online (accessed 24 Dec 2013), s.v. “proof”.
[3] Q.E.D. is an initialism of the Latin phrase quod erat demonstrandum, which means “what was to be demonstrated”. However, the Latin was itself a translation of a Greek phrase, and translating directly from the original Greek results in “what was required to be proved”.
[4] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore, Maryland: Genealogical Pub. Co., 2009), p.17.
[5] Mills, Evidence Explained, p.18.
[6] Mills, Evidence Explained, p.19.
[7] This concept appeared online in May 2013. Although it has since been taken down, there are several posts still visible on the Internet that use the term as it was originally defined.

Thursday 19 December 2013

Place Names or Coordinates?

Places need a unique form of reference if we want to share or correlate them, but what should it be? Should their names be standardised? Should we rely on some system of geographical coordinates? Should we use a fabricated reference, or surrogate key, to use the language of data modelling?

I have previously written about place names, and how they should be recorded as they originally were, at A Place for Everything. That post tried to put to rest some age-old questions such as whether to record their old or new name, how to deal with alternative names/spellings, and how to represent jurisdictional changes. Basically, it argued for making Places a top-level entity in our data so that we can attach evidence and narrative to them. The same debates will persist though. You might say: there be dragons in all directions.

One of the frequent issues is the suggested importance of using coordinates over variable names, and I want to focus on that because I admit that I don’t quite get it. A strong advocate of using coordinates is prolific blogger James Tanner who has mentioned them recently at: Finding Your Ancestors Exact Location and The Challenge of Changing Place Names, and less recently in 2009 at: Even More on Standardization of Place.

Before I go any further, I want to make sure that my preferred terminology is clear and unambiguous since this is an area where related but dissimilar terms are used interchangeably:

  • Postal Address – A sequence of terms that direct traditional mail (e.g. letters, packages, etc) to a particular recipient.
  • Location – A fixed geographical point or area, usually referenced by its coordinates.
  • Place – A named point or area deemed to have significance to humans.

The first of these is easy to define, and easy to distinguish from the other two. Obviously postal addresses do not apply to every place or location. For instance, it may be an historical one, or it may no longer exist, or it could be the name of a vehicle in transit such as a ship, or post may never be delivered there. Conversely, an address such as a “P. O. Box Number” is an abstract collection point for the recipient and does not correspond to a physical place or location. Distinguishing a Place from a Location is a little more subtle (see difference-between-location-and-place for instance). In order to illustrate the difference, consider a major urban redevelopment where streets are torn down and remade in a different fashion. A household on the new street constitutes a different Place to a household on the old street, even though they may be at the same physical Location. It therefore makes sense to talk about the “location of a place”.

So do I believe coordinates are important? Absolutely! For instance, knowing which English counties are contiguous with a given one doesn’t tell me anything about which direction they are, or how far away they are. This makes a big difference if you’re near the boundary of such a region.

My main issue with coordinates is that it should not be the responsibility of genealogical users to measure or estimate them, and then to record them in their own data. Genealogists are rarely qualified cartographers! I strongly believe that the coordinates for each place should be recorded and made freely-available by a relevant authoritative body, and I will return to this in a moment.

I have never seen an historical data source with the coordinates of my ancestors on it. Every single case has involved me looking for a named place on an old map, or in a street/trade directory. Even though some of the names may have had dubious spellings, identification has always involved the name. I often use maps but have never had cause to use latitude and longitude values. As a rural dweller myself, though, I admit there will be cases where coordinates are the best method of recording some location but they will not be the majority.

As well as it being hard to obtain accurate coordinates of non-existent places, their granularity is even more problematic. Many records do not pinpoint a household. They might reference a street, or a village/town, or even the relevant county. In general, the larger the place then the harder it is to geocode it. A simple record of its mid-point is virtually useless, but then geocoding the whole perimeter as some approximate polygon needs considerably more attention.

When people talk about coordinates (especially in the US) they are mostly referring to latitude and longitude values, and these can be stored in standardised ways. ISO 6709:2008 supports point location representation through the use of XML but, recognizing the need for compatibility with the previous version of the standard, ISO 6709:1983, it also allows for the use of a single alphanumeric string. However, different coordinate systems are in common use, such as the Ordnance Survey National Grid of Great Britain. Yes, it is possible to convert from one to the other (e.g. using the Ordnance Survey geograph tool), but what’s the point if all the local maps use the alternative system? Being able to navigate directly from your data to the corresponding point on a digital map is a worthy goal, but it does not mandate storing private coordinates in your data, or relying on one specific coordinate system.
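To make the “single alphanumeric string” idea concrete, here is a minimal Python sketch of how a latitude/longitude pair might be written in the signed decimal-degrees form that I understand ISO 6709 to allow (a mandatory sign, two integer digits for latitude, three for longitude, and a trailing solidus). The function name and the default precision are my own assumptions, and the standard itself should be checked before relying on this.

```python
def iso6709_point(lat: float, lon: float, places: int = 4) -> str:
    """Format decimal degrees as a signed, zero-padded point string.

    Hypothetical helper: the layout follows my reading of the ISO 6709
    string form, not a verified implementation of the standard.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    if places < 0:
        raise ValueError("places must not be negative")
    # Field width = sign + integer digits (+ '.' and decimals when present).
    fraction = places + 1 if places else 0
    lat_width = 3 + fraction   # sign + 2 digits
    lon_width = 4 + fraction   # sign + 3 digits
    return f"{lat:+0{lat_width}.{places}f}{lon:+0{lon_width}.{places}f}/"


# Approximate coordinates, for illustration only.
example = iso6709_point(52.97, -1.31)
print(example)  # +52.9700-001.3100/
```

Note that nothing in this string says which household, street, or town is meant, which is exactly why I would rather such values were looked up from an authority than typed in by each user.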

There have been a number of announcements in this general field recently:

These initiatives should be praised and supported by the whole genealogical community, if not beyond.

I have not done any work myself on coordinates within my STEMMA® project but I have investigated how to represent hierarchical Place entities, including the use of time-dependent parents to cope with jurisdictional changes. There are two main ways to proceed here since there are multiple potential hierarchies: (a) actually implement multiple concurrent parents for each entity such that you can follow hierarchies of different types (e.g. administrative or religious), or (b) adopt an attributed hierarchy where you implement one main hierarchy, and then give each Place various properties representing parents of a different type (e.g. a religious parish).
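As a rough illustration of option (b), the following Python sketch gives each Place one main list of dated parent links plus a dictionary of dated parents of other types. This is not STEMMA’s actual representation: the class names, field names, place keys, and years are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParentLink:
    """A dated link from a place to one of its parents."""
    parent_key: str
    start_year: Optional[int] = None  # None means "no known start"
    end_year: Optional[int] = None    # None means "still current"

    def covers(self, year: int) -> bool:
        after_start = self.start_year is None or self.start_year <= year
        before_end = self.end_year is None or year <= self.end_year
        return after_start and before_end


@dataclass
class Place:
    key: str
    name: str
    # The one main hierarchy (e.g. administrative).
    main_parents: List[ParentLink] = field(default_factory=list)
    # Parents of other types, held as attributes of the place.
    other_parents: Dict[str, List[ParentLink]] = field(default_factory=dict)

    def parent_at(self, year: int, kind: Optional[str] = None) -> Optional[str]:
        """Return the key of the parent in force in the given year, if any."""
        links = self.main_parents if kind is None else self.other_parents.get(kind, [])
        for link in links:
            if link.covers(year):
                return link.parent_key
        return None


# Entirely hypothetical data: a village that changed county in 1974.
village = Place(
    key="pExampleton",
    name="Exampleton",
    main_parents=[
        ParentLink("pOldshire", end_year=1973),
        ParentLink("pNewshire", start_year=1974),
    ],
    other_parents={"religious": [ParentLink("pStMarysParish")]},
)

print(village.parent_at(1900))               # pOldshire
print(village.parent_at(1980))               # pNewshire
print(village.parent_at(1900, "religious"))  # pStMarysParish
```

The weakness of this route shows up as soon as you have to decide which hierarchy deserves to be the main one.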

STEMMA is currently experimenting with the second route but I admit that there are difficulties. One particular problem in England is the identification of a ‘county’. The historic counties of England were established for administration by the Normans, and in most cases were based on earlier kingdoms and shires established by the Anglo-Saxons. These geographic counties existed before the local government reforms of 1965 and 1974. Counties are now primarily an administrative division composed of districts and boroughs. The overall administrative hierarchy is described at Subdivisions_of_England. However, the ceremonial counties are still used as geographical entities. The postal counties of the UK, now known officially as the former postal counties, were postal subdivisions in routine use by the Royal Mail until 1996. The registration counties were a statistical unit for the registration of births, marriages, and deaths and for the output of census information.

A hard example that occurs frequently in my own data is the town of Ilkeston. Although technically in the administrative county of Derbyshire, it is actually closer to Nottingham than it is to Derby. It therefore appears in the Basford Registration District of the Nottinghamshire Registration County. So which is the more important type of county to use in the primary hierarchy?

STEMMA also represents streets, and even individual households, as Place entities. Without any link to an authoritative database, I experimented with recording the end-points of streets in terms of which other street they connected to. This resulted in a network rather than a map but was still of some use. Unfortunately, it didn’t include street intersections, and had to be adjusted for some real-life practical cases: a cul de sac only has one end-point (easy enough), but I also grew up on a street which looped around and joined on to itself. This constitutes one of the few cases I know of where all the streets intersecting at a junction are the same one, and this alone meant dropping several types of consistency check.

The issue of house numbers has a particularly British eccentricity to it. US streets may be nicely orthogonal, with numbers allocated in blocks, but not so in places like England. It is widely reported that we generally have even numbers on one side of a street and odd numbers on the other, but they’re often out-of-synchronisation with each other, and occasionally even run in opposite directions. Even if an authoritative source didn’t record the location of every household, knowing which way the house numbers went could be helpful in knowing how neighbourly two houses were.

So what would my answer to the original question be, assuming that I had sufficient resources? That would have to be a Place Authority, meaning an Internet-based service supporting all aspects of historical and modern places. Although such a service would have a single URL, the database itself would have to be federated such that it could be populated and maintained by multiple appropriate authoritative bodies. This is essential for a worldwide authority but it could even apply within national borders, such as within the US. In effect, the service would be hierarchical, and the conceptual database would be geographically distributed, thus constituting a Shared Nothing Architecture. Incoming requests to the main URL would be delegated, as appropriate, to one-or-more subordinate nodes, each of which must support a standard request/response interface. For instance, given a vague (not country-specific) place-name search, the results returned from multiple nodes would be merged before responding to the client software.

This federated implementation would allow new databases to be brought online as and when they became available, and would anticipate a take-up based on a demonstrable proof-of-concept being implemented somewhere.
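The delegate-and-merge behaviour can be sketched in a few lines of Python. Here each subordinate node is just an in-memory function rather than a real network service, and the node interface, record fields, and surrogate keys are all assumptions made for the sake of the illustration.

```python
from typing import Callable, Dict, List, Optional

Match = Dict[str, str]
# A node accepts (name, optional country filter) and returns its matches.
Node = Callable[[str, Optional[str]], List[Match]]


def make_node(country: str, records: List[Match]) -> Node:
    """Build a stand-in for one national database behind the main URL."""
    def search(name: str, country_filter: Optional[str] = None) -> List[Match]:
        if country_filter is not None and country_filter != country:
            return []  # this request is not for this node
        wanted = name.casefold()
        return [dict(r, country=country) for r in records
                if r["name"].casefold() == wanted]
    return search


def federated_search(nodes: List[Node], name: str,
                     country: Optional[str] = None) -> List[Match]:
    """Delegate the request to every node and merge the responses."""
    merged: List[Match] = []
    for node in nodes:
        merged.extend(node(name, country))
    return sorted(merged, key=lambda m: (m["country"], m["key"]))


# Hypothetical surrogate keys; only the place names are real.
nodes = [
    make_node("GB", [{"key": "gb-0001", "name": "Boston"},
                     {"key": "gb-0002", "name": "Ilkeston"}]),
    make_node("US", [{"key": "us-0001", "name": "Boston"}]),
]

print([m["key"] for m in federated_search(nodes, "Boston")])
# ['gb-0001', 'us-0001']
```

Bringing a new national database online would then amount to adding one more node to the list, with no change to the client software.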

This sketch does not specify the precise nature of the database – including the choice of hierarchical model – or of the request types. However, I can easily outline some functional requirements:

  • Support for a lookup of a place name, with or without a specific area of focus such as a country.
  • Support for global place types along the lines of ISO 3166-2 or Nomenclature of Units for Territorial Statistics (NUTS) but incorporating historical ones too.
  • Information about all the accepted names of the current place, including any dates for historical ones, and including any dual-language representations such as in Canada, Ireland, and Wales.
  • Information about the parent place of any given place, including parents of different types and the dates of any changes.
  • Acceptance of a client locale and a date range during queries.
  • Request for a formatted Place Hierarchy Path, as explained previously in A Place for Everything.
  • Providing coordinates of your place, and its boundaries.
  • Knowledge of abbreviations used in your locale, and common misspellings of a name.
  • Acceptance of evidence-based user submissions to the underlying database(s).
  • Provision of historical narrative and/or external links pertaining to the place.

So, returning to the question in the heading, I believe the unambiguous place identifier should be a fabricated surrogate key, and that both place names (including all their variations and jurisdictional changes) and coordinates should be maintained by a Place Authority for each respective key.

Saturday 14 December 2013

Digital Freedom

Our digital data has many sets of named types, such as event types. These sets can become a straitjacket if they are rigidly predefined, but are extensible sets at odds with the concept of a data standard? The answer is a resounding ‘No!’.

If you wanted to create a custom event type of, say, ‘Military Service’ then would your software let you? If it did then would that custom type be accepted by someone else using the same product, or by someone else using an entirely different product? The answer will be ‘No’ to at least one of these questions, but there is no good reason for it. It makes sense to predefine useful and common event types such as Birth, Death, Baptism, etc., but a finite list will ultimately be inadequate. There will always be some less-common event type that doesn’t fit, or you may require special event types for a different culture, or you may simply want the freedom to define your own event types in order to represent your personal history.

I want to explain how easily sets of extensible types, and other tag-names or tag-values[1], could be implemented in software. This is primarily for people who aren’t software professionals, although they might find it interesting too.

When a system defines a closed set of predefined types, options, or terms, then it is referred to as a controlled vocabulary. When that predefined set can be extended or enhanced then it is referred to as a partially controlled vocabulary.

As a simple example of a controlled vocabulary, let’s look at date formatting. Many systems now differentiate four basic date styles: {Short, Medium, Long, Full}. The software has a default recipe for how to format a date for your locale in each style. Although you can probably tailor any one of those default recipes to your own personal preferences, there are no other style names available for selection.
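The difference between the two kinds of vocabulary can be shown with a small Python sketch. The closed set is modelled as an enumeration, which rejects anything not predefined, while the partially controlled set accepts registered additions. The class and method names are hypothetical, and the sketch deliberately ignores the question of clashes between different people’s custom names.

```python
from enum import Enum


class DateStyle(Enum):
    """Controlled vocabulary: a closed set with no way to add members."""
    SHORT = "Short"
    MEDIUM = "Medium"
    LONG = "Long"
    FULL = "Full"


class EventTypes:
    """Partially controlled vocabulary: predefined terms plus custom ones."""
    PREDEFINED = frozenset({"Birth", "Death", "Baptism"})

    def __init__(self) -> None:
        self._custom = set()

    def register(self, name: str) -> None:
        if name in self.PREDEFINED:
            raise ValueError(f"{name!r} is already predefined")
        self._custom.add(name)

    def __contains__(self, name: str) -> bool:
        return name in self.PREDEFINED or name in self._custom


events = EventTypes()
print("Military Service" in events)  # False
events.register("Military Service")
print("Military Service" in events)  # True

try:
    DateStyle("Huge")  # not one of the four predefined styles
except ValueError:
    print("No such date style")
```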

For genealogical data, there may be many applications of such vocabularies, both controlled and partially controlled: event types, properties (aka “facts” to everyone else), place types, role names, status values, name types, name parts, qualitative assessments (e.g. primary/secondary, original/derivative, etc), family types, and sex.

Sex is quite interesting – no, seriously! If someone defined a controlled vocabulary of just {Male, Female, Unknown} then you might wonder about other variations of sex, gender, and lifestyle. However, sex and gender are different concepts, and the birth sex is different again to some variant adopted as part of a later event. See Sex and Gender for more details.

So how can we have both a predefined set of types and retain the ability to create custom ones, whilst also avoiding clashes with anyone else’s custom types or future predefined types? This is really the crux of the problem, and it splits the practical applications into two categories. If the types are part of a passive set, such as event types, then extensibility is not only simple but custom types could be loaded by any other compliant application. However, if the types have structural or procedural connotations then they cannot be loaded by another application without it having knowledge of the associated structure or procedure. An example of the latter category is the record types used to store the data.

GEDCOM allows custom record types (aka “tags”) but it merely recommends that their names have a leading underscore character. The specification document for the GEDCOM 5.5 release[2] contains the following explanatory paragraph:

To ensure all transmitted information in the Lineage-Linked GEDCOM is uniformly identified the standardized tags cannot be placed in any other context than shown in Chapter 2. It is legal to extend the context of the form, but only by using user-defined tags which must begin with an underscore. This will not violate the lineage-linked GEDCOM standard unless the context for the grammar of the Lineage-Linked GEDCOM Form is violated. The use of the underscore in the user tag name is to signal a nonstandard construct is being used. This notifies the reading system of a discrepancy and will avoid future conflicts with tags that may be standardized in subsequent GEDCOM releases.

This may have prevented custom tags from clashing with GEDCOM ones reserved in later releases, but it never prevented clashes between alternative customisations. Simply using an underscore prefix is clearly not a workable solution. Also, any program designed around the official GEDCOM tags could do nothing more than ignore custom tags. An example list of predefined and custom tags may be found at

The solution to this comes from the world of XML in the form of XML namespaces. Although I will talk about XML a little here, this general approach could be applied to any data representation. A namespace is simply a named container for a set of tag-names (i.e. element or attribute names). By attributing each set of tag-names to its embracing namespace, no two names will ever clash and so the overall vocabulary can be extended through the inclusion of new namespaces.

Let’s briefly look at how XML represents namespaces internally. It firstly defines a short prefix for each namespace name, and then applies that prefix to all associated tag-names to distinguish them from each other, and from names in the default namespace which has no prefix. For example:

<root xmlns:my=""
      xmlns:your="">
     <my:table>
          <my:tr>
               <my:td> Apples </my:td>
               <my:td> Bananas </my:td>
          </my:tr>
     </my:table>
     <your:table>
          <your:item> Coffee Table </your:item>
          <your:width> 80 </your:width>
          <your:length> 120 </your:length>
     </your:table>
</root>

Here, my:table is distinct from your:table as they belong to separate namespaces. The xmlns attributes associate each prefix with its respective namespace name.
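For anyone who wants to see this distinction in action, the following Python sketch parses a document of the same shape using the standard library’s ElementTree module. The two namespace names are placeholders of my own under example.com, not the ones used in the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names, chosen only for this illustration.
MY_NS = "http://example.com/my-tables"
YOUR_NS = "http://example.com/your-furniture"

DOC = f"""
<root xmlns:my="{MY_NS}" xmlns:your="{YOUR_NS}">
  <my:table>
    <my:tr><my:td>Apples</my:td><my:td>Bananas</my:td></my:tr>
  </my:table>
  <your:table>
    <your:item>Coffee Table</your:item>
    <your:width>80</your:width>
    <your:length>120</your:length>
  </your:table>
</root>
""".strip()

root = ET.fromstring(DOC)

# The parser replaces each prefix with its full namespace name, so the two
# 'table' elements end up with different qualified names.
tags = [child.tag for child in root]
print(tags)

# Searches use a prefix-to-namespace map; the prefixes are local shorthand.
ns = {"my": MY_NS, "your": YOUR_NS}
cells = [td.text for td in root.findall("my:table/my:tr/my:td", ns)]
item = root.find("your:table/your:item", ns).text
print(cells, item)
```

The prefixes themselves carry no meaning: a different document could bind a different prefix to the same namespace name and the parsed result would be identical.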

The XML namespace name is technically a URI but not a URL[3]. This basically means that it is not designed to be dereferenced or to access any associated resource. It is simply a unique identifier which distinguishes one namespace from another. The http: prefix, which confuses many people, is simply indicating that the namespace name is derived from a network domain name that you, or your organisation, owns. In other words, no separate registration scheme is required here.

The syntax of a URI actually allows namespace names to be derived from unique roots other than domain names, such as email addresses, but they are rarely seen in practice. Another advantage of a URI over, say, a UUID (which is simply a string of letters and digits with no visible semantics) is that several can be created from the same root, such as “”. This allows you to create namespaces for distinct sets of identifiers, and support versioning of those sets.

In the XML case, its namespaces also support new structural information being added to a data schema using something called XML Schema Definition (XSD). This allows each namespace to define a grammar for its contributions to the underlying XML syntax. For instance, in the above example, it could specify what elements can exist below your:table, how many of each there can be, and what ordering is required. Although I give an outline example on the STEMMA® site at Extended Schemas, I’m not particularly in favour of this level of extensibility.

So, coming back to the original topic, how does this help with genealogical types? Strictly speaking, XML’s namespaces only apply to its tag-names, although the principle has been extended since XML’s conception to include tag-values too. For instance:

<Dataset Name='Example'
     xmlns:MyEv=''>
     <Event Key='eExample'>
          <Type> MyEv:FamilyOuting </Type>
          ... etc ...

One of the earliest examples of this approach that I am aware of is the Simple Object Access Protocol (SOAP). STEMMA also follows this route and its page at Extended Vocabularies enumerates all of its own controlled and partially controlled vocabularies.
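Since an XML parser only resolves prefixes on tag-names, an application that uses prefixed tag-values has to resolve them itself. A minimal Python sketch of that step is shown below. The function name, the namespace names, and the rule that an unprefixed value belongs to a core vocabulary are my own assumptions for illustration, not a description of how SOAP or STEMMA actually do it.

```python
from typing import Dict, Tuple

# Hypothetical namespace names for the illustration.
CORE_NS = "http://example.com/core-vocabulary"
PREFIXES = {"MyEv": "http://example.com/my-events"}


def resolve_type(value: str, prefixes: Dict[str, str],
                 default_ns: str = CORE_NS) -> Tuple[str, str]:
    """Split a tag-value such as 'MyEv:FamilyOuting' into (namespace, name)."""
    value = value.strip()
    prefix, sep, local = value.partition(":")
    if not sep:
        # No prefix: treat the value as a predefined term.
        return (default_ns, value)
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return (prefixes[prefix], local)


print(resolve_type(" MyEv:FamilyOuting ", PREFIXES))
# ('http://example.com/my-events', 'FamilyOuting')
print(resolve_type("Birth", PREFIXES))
# ('http://example.com/core-vocabulary', 'Birth')

# Someone else's 'FamilyOuting' lives in a different namespace, so the two
# custom types remain distinct even though the local names are identical.
theirs = resolve_type("Ev:FamilyOuting", {"Ev": "http://example.org/their-events"})
print(theirs != resolve_type("MyEv:FamilyOuting", PREFIXES))  # True
```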

I’ve deliberately picked on event types as an illustration because I’ve already advocated much greater use of events, both protracted and hierarchical, in order to model the real-life events in our personal histories[4]. Hence, if we wanted to define an event for a “family outing” that we had evidence for, or distinguish the civil registration of a birth from the birth itself by using a separate event type[5], or create a new type for some culturally-dependent event, then we should not be constrained by a predefined list.

An equally good illustration could have involved Properties (aka “facts”) since the items of extracted evidence that we may want to record will depend strongly on the nature of the information source, and on the relevant culture. The STEMMA example at Multi-Role Events includes both custom Roles and custom Properties.

What I’ve described here is simply a mechanism. The internals of a data representation would be hidden by a good product, and you wouldn’t be creating these files by hand. Someone is going to ask, though, about foreign-language versions, and it’s worth emphasising that what you enter and what you see are not merely copies of what’s stored in your data. Having a simple mapping of the programmatic term (e.g. MyEv:FamilyOuting) to a readable string for the locale of the current end-user (e.g. “Family Outing”) is one of the few pieces of configuration necessary in a compliant product.
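That mapping could be as simple as a lookup table keyed by locale. The Python sketch below shows one possible shape for it, with a fallback to a default locale and finally to the raw programmatic term. The table contents, the function name, and the fallback order are all assumptions for the sake of illustration.

```python
# Hypothetical per-locale labels for programmatic terms.
LABELS = {
    "en": {"MyEv:FamilyOuting": "Family Outing"},
    "fr": {"MyEv:FamilyOuting": "Sortie en famille"},
}


def display_label(term: str, locale: str, fallback: str = "en") -> str:
    """Return a readable label for a programmatic term in the user's locale."""
    language = locale.split("-")[0]  # e.g. 'fr-CA' falls back to 'fr'
    for candidate in (locale, language, fallback):
        label = LABELS.get(candidate, {}).get(term)
        if label:
            return label
    # Nothing configured: show the stored term rather than nothing at all.
    return term


print(display_label("MyEv:FamilyOuting", "fr-CA"))  # Sortie en famille
print(display_label("MyEv:FamilyOuting", "de"))     # Family Outing
print(display_label("MyEv:Picnic", "en"))           # MyEv:Picnic
```

The stored data never changes when the end-user’s locale does; only the label presented to them is different.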

By way of contrast, the mark-up employs support for “external enumerations”. These allow its core vocabulary to be supplemented by external ones which must be accessible via real URLs. The aforementioned document describes these external vocabularies as “controlled” (i.e. closed) and specifies criteria for their viability, essentially removing any freedom from their creation.

[1] I’m using the generic terms tag-name and tag-value here to represent the name and value of a datum, respectively. I am aware that the term ‘tag’ has specific meaning elsewhere. For instance, in GEDCOM it’s synonymous with its record names. In XML, it refers to the name of an element in angle brackets, with or without a ‘/’ character, e.g. <x>, </x>, and <x/>.
[2] “Appendix A: Lineage-Linked GEDCOM Tag Definition” in The GEDCOM Standard: Release 5.5 (Family History Department of The Church of Jesus Christ of Latter-day Saints, 2 Jan 1996).
[3] Uniform Resource Identifiers (URI), Uniform Resource Locators (URL), and Uniform Resource Names (URN) are often confused. The URI represents a general class of resource identifier which includes both URL and URN. The URN always begins with a urn: scheme prefix and has a restricted syntax designed for the hierarchical naming of resources. Its NID term, which follows the scheme prefix, has to be registered with the IANA for it to be official.
[4] See “Eventful Genealogy”, Parallax View, 3 Nov 2013.
[5] Many people in Britain are guilty of confusing these by taking the year and quarter from the GRO civil registration index and recording them as the date of the vital event itself.