Parallax View ®: August 2013

Saturday, 31 August 2013

Are we a Genealogical Community?

Thomas MacEntee has reopened the subject of ‘the state of the genealogical community’ at: house-divided-house-doors. This is a frequent talking point with well-voiced opinions on all sides debating our similarities, differences, talents, and ethics. Despite there having been a number of recent threads on this subject, Thomas has invited bloggers to present their considered thoughts.

It has probably never been the case that genealogy only consisted of genealogical researchers. In earlier times, transcribers, archivists, librarians, etc., were deemed to have their own disciplines, irrespective of whether they contributed to genealogy. However, genealogy has many more facets now, and we have a very diverse set of participants as a result. A large number relate to the application of IT and the mushrooming technological support. This includes software designers, software vendors, data standards, those digitising records, and online content providers. A large number result from the hobbyists, and especially those taking advantage of the increased availability of data online. There are also more writers now, including bloggers and those writing for the popular magazines. Are we a community, though? This doesn’t have a black-and-white answer.

One very emotive subject that we need to move beyond is that of licensing our industry. There are those — primarily in the US — who strongly believe that genealogical research should be licensed, and that this will guarantee a more reliable level of professionalism. In all industries, though, licensing is designed purely to create a differential. In fields such as law or medicine, this is essential because of the serious damage that can be caused by a sloppy effort. This doesn’t mean that such cases can never happen but it does mean that they would incur serious repercussions. In other words, a licence comes with a huge level of responsibility. In our field, that differential would be misplaced, and it would work against the concept of a single community. You only have to look at the logistical practicalities such as researchers looking at records in other states, or other countries, and foreign researchers looking at US records, in order to appreciate that it cannot work.

Although there are grey areas, a professional (noun) is widely held to be someone engaged in an activity for financial gain, such as their main paid occupation. To be professional (adjective) is taken to be conformance to the technical or ethical standards of a profession. Someone engaged without financial gain, though, is often considered to be an amateur, but that doesn’t accurately represent people working for non-profit organisations or academics. The root of many of these discussions is the specific activity of paid genealogical research, and we should not lose sight of that focus when we make recommendations.

What we do have is a series of courses and qualifications around the world, including the BCG certification. In any field, qualifications indicate that you’re serious about a subject, and that you’ve put in time and effort to study and have reached an accepted level of expertise. It’s true that someone could have invested the same effort independently, or they may have a lifetime of experience, but without having received any qualification. In effect, a course, or wherever your experience and expertise came from, is separate from a qualification. However, those letters after your name are telling your prospective clients about your attitude to your subject. In professions where you are contracted by companies, your track record, and even word-of-mouth, can be more important than qualifications but personal clients need an upfront indication to assess you by. I strongly believe that innate talents, and especially personal commitment, also play an important role but those courses and qualifications are good things and must be supported.

An inevitable question is whether qualifications lead to elitism. In all walks of life there will be people who wear their qualifications or uniform as a status symbol; people who expect to be judged solely by these adornments rather than by what they say and do. This is human nature and cannot be avoided. Luckily, it is rare and it should not be considered an inherent consequence of there being qualifications, or it being especially associated with our field. A more likely situation works in reverse and involves people assuming that qualifications automatically mean you must be right. This can easily be interpreted as your conclusions being The Truth rather than the result of a reasoned analysis of available evidence. Just as in pure science, theories may be replaced by better theories, or revised in the light of new evidence. No one is ever guaranteed to be right.

Maybe the term ‘community’ is misleading because we are so very diverse in the parts we play. It’s really a case of circles within circles since there will always be different ways that we can specialise. What we don’t need are artificial barriers, or slights from one of those circles to another such as between genealogical researchers and software designers. Those circles are not mutually exclusive and so unhelpful remarks can actually become totally wrong where people have a foot in more than one camp. Just for a moment, let’s try looking outwards from our ‘community’ rather than inwards. Genealogy and family history, irrespective of whether you consider them to be the same or different, are part of the bigger circle of micro-history alongside One-Name Studies, One-Place Studies, personal historians (as in APH), house histories, etc. It would be rare for any of us to have never crossed into those fields. Micro-history, in turn, is part of history in general. One of my first blog-posts was to recount the experiences of Dr Nick Barratt when he suggested this relationship to a conference of academic historians: Are Genealogists Historians Too? The reaction is a perfect example of what we don’t want.

I believe there is a sea change about to take hold of genealogy, and it may take some people by surprise. I feel the traditional focus on family trees, and even family history, will be replaced by an insatiable public appetite to reclaim our public history. This will include the histories of our towns and villages, personal recollections, recordings, narrative, etc. I totally agree with Dr Barratt that these histories are essential for the general appreciation of history by ordinary people. Unfortunately, there is an absolute dearth of software support to help people in this direction. Currently in the UK, TV program makers are beginning to see that “real history” is more accessible than celebrity history or academic history. Very soon, our circles are going to get a whole lot bigger so let’s get things in perspective!

Monday, 26 August 2013

The Commercial Realities of Data Standards

Are we Modelling Data or Commerce?

The tone of certain Internet posts and related discussions has recently moved from genealogical data formats to genealogical data models. What does this mean, though, and where will it all end? What commercial factors might be at play here?

It is no coincidence that James Tanner has recently discussed this same topic on his blog at What happened to GEDCOM?, Are data-sharing standards possible?, and What is a genealogical Data Model? but I want to give a very different perspective here.

A data format (more correctly called a ‘serialisation format’) is a specific physical representation of your data in a file using a particular data syntax. A data model, on the other hand, is a description of the shape and structure of the data without using a specific syntax. In order to illustrate this, consider how a person is related to their biological parents. To state that each person is linked to just one father and to one mother might be part of a data model specification. However, the nature of that linkage, and what it might look like, is only relevant when discussing a specific data format.
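To make the distinction concrete, here is a minimal sketch in Python. The `Person` class plays the part of the data model (one optional father, one optional mother), while the two functions write the same person out in two different formats. The class, the field names, the identifiers, and the level-numbered layout are all invented for illustration; neither output is real GEDCOM or STEMMA syntax.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """Data model: a person is linked to at most one father and one mother."""
    id: str
    name: str
    father_id: Optional[str] = None
    mother_id: Optional[str] = None


def to_json(person: Person) -> str:
    """One possible data format: the parent links become JSON keys."""
    return json.dumps({
        "id": person.id,
        "name": person.name,
        "father": person.father_id,
        "mother": person.mother_id,
    }, sort_keys=True)


def to_tagged_lines(person: Person) -> str:
    """Another possible format: level-numbered lines (a made-up layout)."""
    lines = [f"0 PERSON {person.id}", f"1 NAME {person.name}"]
    if person.father_id:
        lines.append(f"1 FATHER {person.father_id}")
    if person.mother_id:
        lines.append(f"1 MOTHER {person.mother_id}")
    return "\n".join(lines)


# Hypothetical identifiers; the same model instance yields two serialisations.
child = Person(id="P3", name="Ann Example", father_id="P1", mother_id="P2")
print(to_json(child))
print(to_tagged_lines(child))
```

The rule about parents lives in `Person`; what the link looks like in a file is decided separately by each serialiser.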

The most widely accepted function of these data formats and data models is for the exchange of genealogical data between people, and hence between different software products, possibly running on different types of machine and in different locales. Sharing our data is a fundamental tenet of genealogy — if it wasn’t possible then it wouldn’t work.

STEMMA® describes these data formats, and itself, as source formats in order to emphasise their fundamental nature. They are essentially an unambiguous textual representation from which any number of indexed forms can be generated. This, in turn, is analogous to the source code for a programming language from which a compiled form can be generated unambiguously for different machine architectures. Note that no data format or data model makes any specification about indexes or database schemas. They’re the prerogative of the designers of the software products that process such data.

OK, so much for the principle but what do we have at the moment? The representation that we’re all most familiar with is GEDCOM (GEnealogical Data COMmunication) which is a data format developed in the mid-1980s by The Church of Jesus Christ of Latter-day Saints (aka LDS Church or Mormon Church) as an aid to their research. This format gets regular criticism but it has been in constant use ever since it was first developed, despite not having been updated since about 1995. A later XML-based GEDCOM 6.0 was proposed but never released. GEDCOM is termed a de facto standard because it is not recognised by any standards body, and it came about by being the only player in town. In fact, it is still pretty much the only player in town. There are other data formats and data models but they’re usually self-serving — meaning that they were conceived in order to support a specific proprietary product — or they’re niche R&D projects. A number of projects have started and failed to define a more modern standard, including GenTech, BetterGEDCOM, and OpenGen.

Why is a proper data standard so important? Well, there are several reasons:

Unambiguous specification. Standards are written in a clear and precise manner, and without using marketing-speak or peacock terms.
International applicability. Standards have to address a global market rather than just the domestic market of the author.
Longevity. Data in a standard format can always be resurrected (or converted) because the specification is open — not locked away on some proprietary development machine.

So what is wrong with the GEDCOM that we currently have? Unfortunately, GEDCOM is quite limited in its scope, and is acknowledged to be focused on biological lineage rather than generic family history. Some of the better documented issues with GEDCOM include:

No support for multi-person events resulting in duplication and redundancy.
Loosely-defined support for source citations resulting in incompatible implementations.
No top-level support for Places.
No ordering for Events other than by date (which may not be known).
No support for interpersonal relations outside of traditional marriages.
Use of the ANSEL character standard which has recently been administratively withdrawn (14th February 2013). Although Unicode support was added in v5.3, and UTF-8 proposed in a later v5.5.1 draft, this has still not been implemented by some vendors.
No support for narrative text, as in transcribed evidence or reasoning.
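The first item in that list is perhaps the easiest to picture in code. The following is only a sketch of one way a format could avoid the duplication: the event is a single shared entity and each participant is attached to it by role. The `Event` class, the role names, and the identifiers are hypothetical and are not taken from GEDCOM or from any other specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """A single shared event; people are attached to it by role."""
    id: str
    kind: str
    date: str
    participants: Dict[str, List[str]] = field(default_factory=dict)


def events_for(person_id: str, events: List[Event]) -> List[str]:
    """Return the ids of all events in which the person plays any role."""
    return [
        e.id for e in events
        if any(person_id in ids for ids in e.participants.values())
    ]


# One marriage record, stored once, however many people were involved.
wedding = Event(
    id="E1", kind="marriage", date="1881-06-04",
    participants={"bride": ["P2"], "groom": ["P1"], "witness": ["P7", "P8"]},
)
events = [wedding]
print(events_for("P7", events))  # the witness finds the same single event
```

Without a shared entity, the same date, place, and citation would have to be copied onto every participant and kept in step by hand.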

The specification document is not of an international standards quality which has resulted in some ambiguous interpretations. Perhaps more importantly, though, vendors have provided selective implementations. They have tended to implement only the parts relevant to their own product, and this obviously impacts the ability to import data from other products.

So, if a new standard is so obviously essential for users to share their data then what are the obstacles preventing it? Why has this not happened years ago, or even a decade ago? Are there some commercial realities that might be at play here?

Another potential function of a data standard is for the long-term storage and preservation of our data. It would be great for users to be able to create a safe backup copy of all their data (with nothing lost) in a standard format, and even to have the option of bequeathing that to an archive when they can no longer continue with it. By contrast, bequeathing a proprietary database may be useless if the associated product becomes extinct. The first obstacle, though, is that no standard format currently exists with sufficient power and scope. The second problem is that vendors might get uneasy about it since it would then provide a slick and reliable method of moving from their product to a different product.

Software development will continue to evolve, though, and products must become more powerful and more usable. This results in a wider separation between the data that the products use internally and the data that can be exported to other products. In effect, GEDCOM has become a throttled exchange mechanism — almost an excuse that means vendors don’t have to share everything. Whether intentionally or not, all products are becoming islands because they cannot share accurately and fully with every other product.

Another potential factor is the cost of assimilating some newer and more powerful data model. Every product has its own internal data model that supports its operational capabilities. If a standard model is widely different to that in terms of concepts or scope then it could result in unjustifiable development costs for a small vendor. Although we’re talking about external data models — those used for exchange or preservation — there is bound to be some impact on the internal data models of existing products, especially during data import. Software development is an expensive process and a true open-source development might be more agile in this respect.

In June 2013, Ryan Heaton of FamilySearch published a post on GitHub that disputed the possibility of an all-encompassing data model, describing it as a myth: GEDCOM X Media Types. His arguments are a little vague because he tries to distinguish the GEDCOM X Conceptual Model as narrowly focused on data exchange, but this is precisely the area under discussion. Representation of data outside of any product, whether for exchange or preservation, is exactly what we’re talking about. In some respects genealogy needs to evolve and generally grow up. We continue to get mired in the distinction between genealogical data and family history data, and religiously refer to ‘genealogical data models’ to the exclusion of other forms of micro-history. One goal of the STEMMA Data Model was to cater for other types of micro-history data, including that for One-Name Studies, One-Place Studies, personal historians (as in APH), house histories, etc. It is therefore evident that I completely believe in the possibility of a single representational model.

You may be wondering why I didn’t mention FHISO (Family History Information Standards Organisation) above. FHISO are relatively new in the field of data standards, and grew out of the older BetterGEDCOM wiki project. Their remit includes all data standards connected with genealogy, and this even includes the possibility of “fixing” GEDCOM so that it at least works as it was intended to. When FHISO looks at a more powerful replacement, though — something that will still be a necessity — then how smoothly will that go? FHISO has significant industry support but when it comes to adoption of a new standard, what will be the incentive to vendors? The advantages to users are clear and obvious but will commercial realities leave them with less than they want?

A possibility that has occurred in several other industry sectors is where a large, generic software organisation comes out with a killer product — one designed to kill off all the smaller products currently fighting for market share. Sometimes this happens by acquisition-and-enhancement of an existing product and sometimes through completely new development. This is a very real possibility and I guarantee that our industry is being examined already, especially as genealogy and other forms of micro-history become ever more popular. If there’s money to be made then some large company will develop a product to satisfy as many users as possible (geographically and functionally), and employ their powerful marketing to sell it deeper and further than anything else. In some ways this might be good since we finally get a data standard that isn’t as constraining as GEDCOM, and without any of the silly arguing. On the other hand, we won’t have had any say in its functional requirements. As in those other industry cases, it will begin as a de facto standard and then be submitted for something like an ISO designation because the company knows the ropes, and has done that several times before.

Is that something we — users and vendors — really want to risk? It’s time to look at the bigger picture.

Monday, 19 August 2013

A Place for Everything

Why do people get in such a muddle with references to places? A common question on genealogical forums is what to record when the modern name of a place differs from its historical name. Similarly when the enclosing region, such as a county, has been changed since the recording of the document in-hand. Do you alter it to the modern designation, or what? Which is correct?

Well, the obvious answer is that both are correct within their respective timeframes. At the root of this difficulty is the prevailing notion that a place, as represented in your data, is simply the place-name[1]. Just for a moment, let us examine what would happen if you were recording a reference to a person rather than a place. In the details of a marriage, you would not simply write the name of the bride and the groom – you would link the marriage details to the respective person-entities in your data. Those person-entities would contain all the known names used by those people, and the events in their lives, and maybe even biographical notes, documents such as certificates for vital events, photographs, and parentage. If the source had a dubious version of their name, or an obvious spelling error, then it would not be a problem – you would record what was in the source, usually with some explanatory annotation, and still link it to the corresponding person-entity.

The concept of treating places in a parallel fashion to people may sound odd depending on what software you currently use, and whether you’ve ever conducted any place studies. However, there are strong analogies that can be leveraged to good effect, and the STEMMA® R&D project has pushed this concept further than most. If places are represented as top-level entities in your data then you can attach documents to them (e.g. maps, land deeds), attach photographs, add events to form a timeline, define the coordinates of the associated location[2], and add historical narrative. The similarities in the handling of names, especially multiple names, misspellings, alternative spellings, foreign-language forms, and time dependencies, are striking, and a synopsis can be found at Person & Place Name Similarities.
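As a rough sketch of what "a place as a top-level entity" might look like, consider the Python below. It keeps the name exactly as written in a source separate from the place-entity that the name is judged to refer to, just as you would for a person. The class names, fields, file path, and the town itself are invented for illustration and do not reflect STEMMA's actual syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PlaceEntity:
    """A place in its own right, not just a name string."""
    id: str
    names: List[str] = field(default_factory=list)        # all known names
    coordinates: Optional[Tuple[float, float]] = None     # (lat, lon) if known
    documents: List[str] = field(default_factory=list)    # maps, deeds, photos
    timeline: List[str] = field(default_factory=list)     # ids of events there


@dataclass
class PlaceReference:
    """What a source says, linked to the entity it is believed to mean."""
    as_written: str
    place: PlaceEntity
    note: str = ""


# A hypothetical town with two accepted spellings.
town = PlaceEntity(id="PL1", names=["Exampleton", "Exampletown"])

# The register misspells the name; record it verbatim and still link it.
ref = PlaceReference(as_written="Exampelton", place=town,
                     note="Spelling as found in the register")

town.documents.append("maps/exampleton-1850.png")
```

The misspelling stays with the evidence, while maps, photographs, and events accumulate on the entity.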

A contrived example that demonstrates how people can be replaced by places when recording historical data may be found at Case Study - Places. There is some analogy in the area of parentage too, although there are fundamental differences. A person has just two biological parents and that fact is fixed forever. Each place has an enclosing region of geographical or administrative importance but this can change over time.

This nicely brings me on to the subject of place-hierarchies. We appreciate that a simple place-name may be ambiguous so we expect it to be qualified in some way. For instance, a town called Americus exists in both the US states Georgia and Indiana. The linking of each place to a parent place creates a place-hierarchy, and the printed form of that place-hierarchy is termed a place-hierarchy-path in STEMMA. For instance:

[Americus, Indiana, US]

[Americus, Georgia, US]

There are some important things to note here:

· The printed form isn’t the place-hierarchy itself. It is only one representation of it. The true place-hierarchy would be encoded in some way in your data.

· The printed direction (small-to-large or vice versa), the separating characters (commas in this example), and any enclosing delimiters are all culturally-dependent options.

· The individual items in each place-hierarchy are all place-entities themselves. STEMMA allows these to be anything from a single household up to a whole country.

The more astute people reading this — at least those who haven’t fallen asleep yet — will realise that place-entities exist in a place-hierarchy, whilst place-names exist in a place-hierarchy-path. This is an important differentiation to make for a hierarchy since we have already shown that a place-entity is not the same as a mere place-name.

On 12th August 2013, James Tanner took issue with standardised place names in his blog: Wherein I once again take on the threat of standardized place names and rather unfairly blamed this on programmers. The post was quickly followed up with another one giving a specific software example: examples-of-attempts-at-name. His original post leads with an example of five different references to the same place:

Allen's Camp, Yavapai, Arizona Territory, United States
St. Joseph, Yavapai, Arizona Territory, United States
St. Joseph, Apache, Arizona Territory, United States
St. Joseph, Navajo, Arizona Territory, United States
St. Joseph, Navajo, Arizona, United States
Joseph City, Navajo, Arizona, United States

These are all examples of place-hierarchy-paths but alternative place-names are accepted, and even advocated above. The only standardisation that is necessary is that of the place-hierarchy itself, i.e. which place-entities appear in the hierarchy for each country, and in which order. For display purposes, control over the choice of alternative place-names might be a customisable option, but during input the software should consult all the known place-names for each place-entity. In effect, James’s issue wasn’t so much with standardised names as with a poor software implementation. In defence of programmers everywhere, that could even be blamed on product management.
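The following Python sketch shows how one set of place-entities, each with dated names and dated parents, could generate every path in the list above, and how input could be checked against all known names. The `Place` class and helper functions are hypothetical, and the years are approximate placeholders chosen only to make the example work; they should be checked against proper sources before being relied upon.

```python
from dataclasses import dataclass, field


def pick(items, year):
    """Return the first value whose [start, end) interval contains `year`."""
    for value, start, end in items:
        if (start is None or start <= year) and (end is None or year < end):
            return value
    return None


@dataclass(eq=False)
class Place:
    names: list = field(default_factory=list)    # [(name, start, end), ...]
    parents: list = field(default_factory=list)  # [(Place, start, end), ...]

    def path(self, year, sep=", "):
        """Render the place-hierarchy-path, small-to-large, for one year."""
        parts, node = [], self
        while node is not None:
            parts.append(pick(node.names, year) or "?")
            node = pick(node.parents, year)
        return sep.join(parts)


def matches(place, text):
    """Check typed input against every known name, whatever its period."""
    return text.casefold() in {name.casefold() for name, _, _ in place.names}


# Years below are illustrative assumptions, not verified history.
us = Place(names=[("United States", None, None)])
arizona = Place(names=[("Arizona Territory", None, 1912), ("Arizona", 1912, None)],
                parents=[(us, None, None)])
yavapai = Place(names=[("Yavapai", None, None)], parents=[(arizona, None, None)])
apache = Place(names=[("Apache", None, None)], parents=[(arizona, None, None)])
navajo = Place(names=[("Navajo", None, None)], parents=[(arizona, None, None)])
town = Place(
    names=[("Allen's Camp", 1876, 1878), ("St. Joseph", 1878, 1923),
           ("Joseph City", 1923, None)],
    parents=[(yavapai, None, 1879), (apache, 1879, 1895), (navajo, 1895, None)],
)

for year in (1877, 1878, 1880, 1900, 1915, 1930):
    print(year, town.path(year))
```

There is only one entity for the town throughout; the differing paths fall out of the dates, and the separator or direction of the printed form remains a display option.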

So can everyone do this? It depends on the software you’re using whether you can create a true place-hierarchy as in the illustration below. This depicts the bottom part of a place-hierarchy and attempts to show that all the constituent place-entities could have multiple names and time-dependent parentage, in addition to citing local resources such as photographs and external resources such as historical details. If your software doesn’t support places as top-level entities then you can simulate things to some extent by consolidating the information for each place, say in a document or folder, and linking evidential place-names to your cache for the respective entity. This overcomes the leading question of this post, although it doesn’t help you simulate a full place-hierarchy.

Place-hierarchies could deliver so much but the concept is stymied by lack of understanding about real use cases, lack of analysis regarding their true capabilities, poor software implementations, lack of corresponding standards, and no collaboration. Otherwise, everything is great!

[1] I deliberately use a hyphenated form for this term, and several other terms in this post, in order to emphasise that it describes a single concept that is under discussion, and to avoid ambiguities.

[2] The terms place and location are often used interchangeably, and attempts to differentiate them have not gained ground in family history. My own distinction between them, and postal address, may be found at: Place Names. This satisfies a need to be precise about two different concepts.

Tuesday, 13 August 2013

Family Units

We all accept the concept of a family unit in our genealogical data, but what is it? As well as exploring the variants that exist, I want to make a case for this attempted colouring of our data being subjective and not directly substantiated by records.

The concept of a ‘family’ is impossible to pin down without some stricter subdivisions of the term. From the point of view of genealogy (as opposed to family history), it is often considered to be the parents and their unmarried children, and this has influenced the design of data formats (including GEDCOM) and the software that processes them.

However, as we all know from our own research, this isn't always the case. One or both of the parents may be missing. The group may no longer be living together. Either or both of the parents may have remarried – bringing previous children with them. There may be older generations living with them, or siblings of the parents (i.e. aunts and uncles to the children). The guardians may be foster or adoptive parents. The biological parents may not be married, or may even have spouses elsewhere that they are trying to avoid in order to take advantage of a poor man's divorce[1].

Wikipedia nicely defines a family as a group of people affiliated by consanguinity (blood relationships), affinity, or co-residence. This includes many different possibilities in a single sentence. The article discusses some of our more common family notions, including:

Matrifocal. A mother and her children.
Conjugal (or Nuclear family). A husband, his wife, and children.
Consanguineal (or Extended family). In which parents and children co-reside with other members of a parent's family.
Blended (or Step-family). Families with mixed parents. For instance, where one or both parents remarried, bringing children of the former family into the new family.

Biological relationships are fixed and finite, meaning we each have just one progenitive mother and father[2], whereas all other types of relationship are time-dependent and possibly overlapping. The concept of Marriage is also dependent upon both culture and life-style so the general case of a family unit must be based on some sociological grouping such as living together.

Co-residence alone is insufficient to assume the family tag – long-term co-residents may be boarders, lodgers, or staff, or may be people forced together through necessity. You can’t even go through a census return and simply remove anyone with a role of visitor, boarder, lodger, servant, etc., because there are definitely cases where those people are part of the associated biological family. Conversely, although some members may have to live outside the household, say for work, they may still be considered family.

We can’t even assume that a family unit is a sociological group supporting each other emotionally and/or financially without knowledge of their situation. In effect, retrospectively applying this tag to an historical group may need more supporting evidence than can be yielded by birth certificates and census returns alone.

Different societies may also have different traditions or different concepts of a “family unit”, and the aforementioned Wikipedia article discusses some of these.

OK, so what’s the solution? I struggled with this in the initial design of STEMMA® but eventually decided to take a step back. I didn’t dispute the need to put people into groups but it wasn’t my responsibility to define a one-size-fits-all concept either. Instead of trying to solve this specific problem, I generalised the concept of a group in order to accommodate anyone’s definition of a family unit, or any other type of person grouping. The STEMMA Group element can have a variety of types, including the flavours of family described above, and may even be used to model custom groupings of people. A Group allows Persons to be associated with it in a time-dependent way, e.g. from the time of a parent’s marriage, or until the time of a child’s marriage. An example may be found under Data Model.
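To give a feel for time-dependent membership, here is a minimal Python sketch. It is not STEMMA's Group syntax; the `Group` class, its fields, the people, and the years are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Group:
    """A typed grouping of people whose membership varies over time."""
    name: str
    kind: str
    # (person_id, since_year, until_year); None = unbounded, until exclusive
    memberships: List[Tuple[str, Optional[int], Optional[int]]] = field(
        default_factory=list)

    def add(self, person_id: str, since: Optional[int] = None,
            until: Optional[int] = None) -> None:
        self.memberships.append((person_id, since, until))

    def members_at(self, year: int) -> Set[str]:
        """Return the ids of everyone who belongs to the group in `year`."""
        return {
            pid for pid, since, until in self.memberships
            if (since is None or since <= year)
            and (until is None or year < until)
        }


# A hypothetical household: the parents from their marriage onwards,
# a daughter from her birth until her own marriage.
household = Group(name="Example household", kind="conjugal")
household.add("father", since=1880)
household.add("mother", since=1880)
household.add("daughter", since=1882, until=1903)

print(sorted(household.members_at(1890)))  # ['daughter', 'father', 'mother']
print(sorted(household.members_at(1905)))  # ['father', 'mother']
```

The `kind` field is left as free text here so that any definition of a family unit, or any other grouping, can be accommodated.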

The Group syntax also allows derived groups to be created using SET operators such as Union. Do you remember Venn diagrams from school? Well, they’re a useful way of thinking about groups of people and how the groups may be used to derive other groups. For instance:
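As a rough sketch of the idea, the Python below derives new groups from existing ones using the language's built-in set operators rather than STEMMA's own Group syntax. The group names and person identifiers are hypothetical.

```python
# Three hypothetical groups of person ids.
paternal = frozenset({"P1", "P2", "P3", "P4"})
maternal = frozenset({"P3", "P4", "P5"})
household_1891 = frozenset({"P3", "P4", "P9"})

# Union: everyone related on either side.
all_relatives = paternal | maternal

# Intersection: people who belong to both families.
shared = paternal & maternal

# Difference: members of the household who appear in neither family,
# e.g. a boarder or servant who still needs investigating.
non_relatives = household_1891 - all_relatives

print(sorted(all_relatives))
print(sorted(shared))
print(sorted(non_relatives))
```

Each derived group corresponds to a region of a Venn diagram drawn over the original groups.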

[1] In England and Wales, an Act of Parliament, the Offences Against the Person Act 1861, contained a clause in section 57, Bigamy, which allowed for a presumption of death if separated for seven years or more.

"Provided that nothing in this section contained shall extend ... to any person marrying a second time, whose husband or wife shall have been continually absent from such person for the space of seven years then last past, and shall not have been known by such person to be living within that time".

Lack of knowledge was all that was required here, and there was no obligation to go and find them. This became informally known as “the seven year rule” or “a poor man’s divorce”.

[2] Actually, technology is capable of engineering children with DNA from three or more “parents” (see uk-government-ivf-dna-three-people).