Tuesday, 8 July 2014

Happy Families

The issue of how to handle unusual or exotic families is a frequent one in genealogy, and especially when trying to enter data into a software product or into an online tree. I want to examine some of the common questions, and try to inject some clarity into the subject.

Last month, I was a participant in an exchange asking whether a particular software product[1] could represent a legally married lesbian couple. That exchange looked at some of the practical aspects before considering more unusual combinations of people.  One of the main sources of confusion — in this exchange and in other threads — is our innate tendency to conflate a number of different concepts, such as marriage and family. This tendency is a natural consequence of our own cultural upbringing and of our desire to document “family history” without questioning what a family actually is.

Once we understand that certain concepts are actually independent of each other then we’re less mired in our own cultural norms, and any unusual scenarios become clearer. In terms of digital representation, this amounts to a “Just right” as opposed to a “Too parochial” or a “Too complex”.

Although this software could accommodate a same-sex couple, it retained the concept of a father and a mother tag in the context of an associated family. Rather than this being an attempt to impose traditional structures, it was required in order to control how the couple were represented in charts. For instance, who is on the left and who is on the right, or possibly who is coloured blue and who is coloured pink.

This requirement raises an interesting question of whether such a chart would be showing biological lineage or a family unit. With biological lineage then the sex of the parents (male/female) is important, and could be used as the basis for colour-coding. With a family unit then gender roles are more important and such a colour-coding would be too simplistic — the difference between sex and gender often being confused[2].

This same software uses the terms “Partner 1” and “Partner 2” in reports, as opposed to charts, but this is unnecessary. I originally used this same approach in STEMMA® for the roles of marriage events until I tried to address cases with more than two partners; I now just have a single role of “Partner”.

Polygamy is a marriage that includes more than two participants. When a man has more than one wife then it is called polygyny, and when a woman has more than one husband then it is called polyandry. In both cases there is no marriage bond between the multiple wives or multiple husbands. If polygamy is illegal then such a relationship is termed bigamy. Polygamy has traditionally been associated with positions of wealth or power. Although historically not uncommon, the practice has been outlawed in many countries – some quite recently (Hong Kong in 1971). It was widespread in African countries and, although now in decline, it is still performed.

There are some technical categories of a family unit[3] but does a polygamous group even have the concept of family? Are all the adults considered parents of all the children, or are there sub-families?

We should all know that a marriage doesn’t automatically define a family unit, nor create a framework for one. The parents may be unmarried, or not the biological parents, but we still fall into this trap. It’s also a dynamic concept since people enter and leave a family group. If your software provides such a concept then how does it handle a scenario where the parents have died, or where the children were forcibly separated from the parents? I am personally very cautious about attaching this tag to any group of people since it needs more supporting evidence than any official record can provide. Proof of co-residency is not proof of family. Unless you have first-hand knowledge, or some written/oral testimony, then it’s a presumptuous label. You might make a case for it being obvious in the case of biological parents but this breaks down in the exotic cases, and if it’s so obvious in those simple cases then why do we need the label?

Going back to the original lesbian couple, if one of them was the biological mother of their child there might have been a sperm donor, assuming that the child wasn’t from a previous relationship, and so the biological connections are obviously different from the family connections. Even this attempt at objectivity can break down, though, and there are cases where the donor male forms part of an extended family, and might even be included on a combined birth certificate[4].

What about information on the biological parents? Surely that part is straightforward since we all have just two biological parents; one of each sex. Well, the future may change this as it will soon be possible to have three or more biological parents using a mechanism known as “mitochondrial transfer”. The technique is intended to prevent mitochondrial diseases including muscular dystrophy and some heart and liver conditions[5]. There would still be just one couple contributing to the child’s XY sex chromosomes but some other type of connection would be required for the secondary genetic contributions.

So what’s the best approach?

If representing biological lineage then a child can only be born to one genetic father and mother, irrespective of whether they were both present at the conception, and ignoring the possibility of "gene injection" from a secondary genetic donor. The latter scenario can be handled by a diminutive form of the normal biological-parent links.

Separate from lineage is the concept of a family unit, whether it includes adoptive/foster parents, guardians, same-sex couples, or something more exotic. This can be modelled using a Group entity to connect specific people.

Separate again is the concept of a bonding ceremony (e.g. marriage), whether it includes same or dissimilar sexes, or more than two partners. This can be modelled using an Event entity.

All three of these are independent concepts. If we try to merge or confuse them then we will ultimately run into cultural and life-style differences that we will find hard to represent.
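For readers who think in code, the separation might be sketched roughly as follows. This is only an illustration in Python and not STEMMA's actual syntax: the class names, the field names (such as genetic_parents), and the people involved are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    # Biological lineage only: links to genetic parents (hypothetical field).
    genetic_parents: list["Person"] = field(default_factory=list)


@dataclass
class Group:
    # A family unit: any set of people, however they are related.
    label: str
    members: list[Person] = field(default_factory=list)


@dataclass
class Event:
    # A bonding ceremony: every participant shares the single role "Partner".
    kind: str
    partners: list[Person] = field(default_factory=list)


# Invented people, loosely following the couple-plus-donor scenario.
alice = Person("Alice")
beth = Person("Beth")
donor = Person("Donor")
child = Person("Child", genetic_parents=[alice, donor])

marriage = Event("Marriage", partners=[alice, beth])
family = Group("Household", members=[alice, beth, child])

# Three separate questions, three separate answers.
lineage_names = [p.name for p in child.genetic_parents]
family_names = [p.name for p in family.members]
partner_names = [p.name for p in marriage.partners]
```

Nothing in the Event forces a Group to exist, and nothing in the Group implies a genetic link, which is the point of keeping them apart.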

For any software people, a STEMMA example may be found at Nature and Nurture.

[1] Gramps – Genealogical Research Software (accessed 7 Jul 2014).
[2] See “No Sex Please, We're Genealogists!”, Parallax View, 10 May 2014.
[3] See “Family Units”, Parallax View, 13 Aug 2013.
[4] Catherine Rolfsen, “Della Wolf is B.C.'s 1st child with 3 parents on birth certificate”, CBC News, 6 Feb 2014, online (accessed 7 Jul 2014).
[5] "Three-parent baby", Wikipedia (accessed 7 Jul 2014). Press coverage: Ian Sample, “Three-person IVF: UK government backs mitochondrial transfer”, The Guardian, 28 Jun 2013, online (accessed 7 Jul 2014). Matt Smith, “FDA considering 3-parent embryos”, CNN, 28 Feb 2014, online (accessed 7 Jul 2014).

Monday, 30 June 2014

Citations for Online Trees

I want to use this post to expand on some comments that I recently made on two of James Tanner’s blog-posts: The Issue of Source Citations and The Challenge of Genealogical Complexity. The gist of my comments was that there are distinct endeavours in genealogy and that the requirements of source references are not the same for each of them.

There may be a spectrum of endeavours, as James himself suggested, but let’s begin with the premise that there are people who create online family trees for the sole purpose of their own enjoyment, and of sharing with friends and family. I know this to be true through the class I used to give, although I can’t reliably comment on whether they constitute a majority or a minority. Distinct from these people are those who want to conduct rigorous research and generate sound, written conclusions.

On the face of it, this sounds to be a gross generalisation. I’m sure there are thorough researchers who also create online family trees, although I would argue that the medium of an online family tree is, by its very nature, a restricted format and so how could all the details of that research be adequately accommodated there?

The issue of source citations is a differentiator that has been picked on repeatedly here. There are people in the latter group who bemoan the endeavours of some people in the former group because they have no source references; no citations of any kind. This may be because their Web site offers limited functionality, or because those people are simply copy-and-pasting from other trees, or because their trees exist purely for “cousin bait”[1], or because they don’t really appreciate what citations are and what they can offer.

If you’re from a literary or academic background then you will be aware of citations, and of their essentiality, but there are many people who are unfamiliar with them. Indeed, some may immediately think of traffic citations, and consider them to be some sort of punishment for not doing their family tree correctly.

But a citation is a citation and we should all be including them, right? Well, not necessarily. The goals may be different for those different endeavours, and there are certainly more ways of representing them than with our traditional printed forms.

In the early days of the Graphical User Interface (GUI) — the computer display that we all now take for granted — one new software product, which I wasn’t involved with, wanted to help executives to get more involved with their PCs. You see, in those days, only the typing pool used keyboards. On the basis that an executive could at least master the use of a mouse, they painted a graphical display on the screen from which he/she could select characters with their mouse. Yes, you guessed it, … they painted a picture of a QWERTY keyboard and it did nothing to help those executives; they were just as bamboozled as before. They might have been better painting a simple A-Z and 0-9 sequential list, but a keyboard is a keyboard, no matter what the requirements are.

Let’s look at the goals normally associated with citing our sources:

  • Intellectual honesty (not claiming prior work as your own).
  • Allowing your sources to be independently assessed by the reader.
  • Allowing the strength of your information sources to be assessed.

These are fine if you’re writing an article for a journal, or a book, or a research report, but are they all relevant to a hobbyist who just wants to share with their family? Well, if that was the extent of their sharing then I would say that none of them are particularly important to that hobbyist. Much more important is simply being able to recall where they got an item of evidence from. We’ve all been there, even the most experienced of us, when something doesn’t quite fit, and we realise that we have an error. If it wasn’t for our citations then we’d be in a desperate state trying to recall how we got into that position. This is something that many beginners learn through unhappy experience, and it reinforces the notion that citing our sources is important to all of us.

The essential difference for this type of hobbyist is that they would need something that linked back directly to the relevant online source record. They would not be interested in the nuances of a properly punctuated reference-note citation, or the addition of discursive notes. They would just want some note that could be clicked on, and followed, in order to see the relevant record. The format of this electronic citation may not follow the conventions of our reference-note, source-list, and source-label citations, although it would be entirely feasible to generate one from it if necessary.

So, maybe the best way to educate beginners about the need for citations would be to make it click-easy to add an electronic citation when using each online record to compile information into their tree. Ancestry’s member trees are examples that already use a similar electronic mechanism. Although they can add references to sources hosted by Ancestry, or narrative and images provided by the user, they are less able to reference online records hosted by other providers. This is a problem because we use so many online sources, and yet some of them are desperately bad at providing associated citations, whether electronic or printed. This wouldn’t require much of a standard since a basic URL could do most of it, although the ContextObject employed by the OpenURL would make it more manageable.
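As a rough idea of how little such an electronic citation needs to hold, here is a minimal Python sketch. The field names, the example URL, and the layout of the generated note are all assumptions for illustration; this is not Ancestry's mechanism, and it deliberately ignores the OpenURL ContextObject.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ElectronicCitation:
    # Hypothetical fields; a real scheme would need agreement among providers.
    url: str
    title: str
    provider: str
    accessed: date

    def reference_note(self) -> str:
        # Derive a rough printed-style note from the stored link.
        when = f"{self.accessed.day} {self.accessed.strftime('%b %Y')}"
        return f"{self.title}, {self.provider} ({self.url} : accessed {when})."


# Placeholder record details, invented for the example.
cite = ElectronicCitation(
    url="https://example.org/records/123",
    title="Example census page",
    provider="Example Records",
    accessed=date(2014, 6, 30),
)
note = cite.reference_note()
```

The clickable link is the primary thing stored; the printed-style note is only derived from it when someone asks for one.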

What happens, though, if the aforementioned sharing goes beyond mere family and friends? Many user-owned online trees have public visibility, and unified online trees must have public visibility. Are simple electronic citations then enough? In the latter case, they would clearly be inadequate for a user’s peers to ascertain the accuracy of their claims. In principle, that sort of collaborative environment isn’t too far from the published works of research listed earlier; the ones where traditional citations are expected. The subtlety of this point may disguise its significance: extrapolating the mechanism of a user-owned tree to create a mechanism for a unified tree cannot work. Those electronic citations would be good at connecting to an online source, and being able to visit it. A traditional reference-note citation would be able to describe many more source types, including multiple sources in the same note, and be able to add discursive notes on the availability, reliability, or objectivity of the information. Although such a reference-note citation could be compiled into a digital representation, it’s the positioning of the source reference that causes the problems.

Citing each source when making a reasoned stepwise argument that leads to a specific conclusion is something that requires narrative. It is not the same as plainly citing a death certificate as evidence of a date of death, or a census page as evidence of a birth date. Online trees do not accommodate anything like a research report, but they could. Back in What to Share, and How - Part II, I considered an approach to collaboration where the unit of sharing included narrative, citations, lineage, timelines, and geography.

[1] I fall into this category myself. My fully sourced and documented data is held locally, and my online tree exists purely to attract distant relatives in order to share with them. This implies that an online tree should not be judged by its number of source references but this is an increasingly difficult stance to take.

Wednesday, 11 June 2014

Bootstrapping a Data Standard

There haven’t been many discussions about data standards for a while now. What is the practicality of their development? Are there conflicting requirements? What would be a good way to proceed?

In 2012, FHISO (Family History Information Standards Organisation) were created to develop the first industry standard for the representation of genealogical data. This was to be a collaborative project, incorporating interested parties from across the wider community, and result in a freely available, open standard with international applicability. Although they elected their first Board in 2013, they have recently gone “radio silent” until they have sorted out some serious logistical problems with key Board members.[1] What lies ahead for them, though? In The Commercial Realities of Data Standards, I discussed some of the conflicting requirements of a data standard, and also differentiated the terms data model and file format.

To many people, the issue is simply one of developing an agreed file format, and so there shouldn’t be much difficulty. In reality, it is the data model — the abstract representation of the data — that requires the effort to define. Once defined, any number of file formats can be used to represent it physically.
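The distinction can be shown with a small Python sketch in which one record type, standing in for the data model, is written out in two different file syntaxes. The PersonRecord fields and the sample values are invented for the example and are not taken from GEDCOM or any proposed standard.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class PersonRecord:
    # The data model: what is recorded, independent of any file syntax.
    key: str
    name: str
    birth_year: int


def to_json(rec: PersonRecord) -> str:
    # One possible physical representation.
    return json.dumps(asdict(rec), sort_keys=True)


def to_xml(rec: PersonRecord) -> str:
    # Another physical representation of exactly the same model.
    elem = ET.Element("person", key=rec.key)
    ET.SubElement(elem, "name").text = rec.name
    ET.SubElement(elem, "birth_year").text = str(rec.birth_year)
    return ET.tostring(elem, encoding="unicode")


rec = PersonRecord(key="p1", name="Jane Doe", birth_year=1850)
as_json = to_json(rec)
as_xml = to_xml(rec)
```

Deciding what a PersonRecord contains is the hard part; the two serialisers are almost mechanical once that is settled.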

So why don’t we run with GEDCOM? It was never developed as a standard but it has achieved the status of a de facto standard. Well, we all know that it has serious faults and limitations, and some example analyses can be found on the BetterGEDCOM wiki at Shortcomings of GEDCOM and on Louis Kessler’s blog at Nine Necessities in a GEDCOM Replacement[2]. It also has a weak specification resulting in ambiguous implementation, and a proprietary syntax that would be at odds with the modern world of the Semantic Web. What it has in its favour is widespread acceptance, although this is weakened slightly by selective implementation amongst the various products.

There is some mileage in this possibility but I will come to it later. Almost without exception, the criticisms of GEDCOM are “detail changes”, such as changing the representation of a place, or a person, or a date, or a source citation. By far the biggest issues, though, are in terms of “structural changes”. You see, GEDCOM is a lineage-linked format designed specifically to represent biological lineage. No matter how you interpret the term genealogy[3], this obviously limits the scope of that representation. Any representation of historical data must support multi-person events as a core entity type[4]; this is an absolute necessity for the representation of family history. GEDCOM, though, is sadly lumbered with mere single-person events.
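To make the idea of a multi-person event concrete, here is a small Python sketch in which a single event record is shared by everyone who took part, and each person's timeline is derived from it. The SharedEvent class, the role names, and the people are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SharedEvent:
    # One event record, referenced by everyone who took part in it.
    kind: str
    year: int
    participants: dict[str, str] = field(default_factory=dict)  # person -> role


def events_for(person: str, events: list[SharedEvent]) -> list[str]:
    # A person's timeline is derived by looking them up in the shared events.
    return [
        f"{e.year} {e.kind} ({e.participants[person]})"
        for e in events
        if person in e.participants
    ]


# Invented events and people.
wedding = SharedEvent(
    "Marriage", 1875, {"John": "Partner", "Mary": "Partner", "Tom": "Witness"}
)
census = SharedEvent("Census", 1881, {"John": "Head", "Mary": "Wife"})

johns = events_for("John", [wedding, census])
toms = events_for("Tom", [wedding, census])
```

Because the wedding exists once, a correction to its date or its evidence is made in one place, and a witness with no place in the lineage is still recorded.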

However, with the addition of some other features, such as:

  • Places as top-level entities, and a possible focus for history.
  • Structured narrative (i.e. incorporating mark-up).
  • Group entities (see military example).

then the representation can avoid what I’ve previously called the “lineage trap” and become applicable to those other types of micro-history that are currently struggling for any type of comprehensive representation. This includes One-Name Studies, One-Place Studies, personal histories, house histories, place histories (as opposed to the history of a place in terms of its people), and organisational histories. That lineage trap occurs when we artificially confine the historical representation to the events of a family. In reality, genealogical research needs that freedom, and yet there is little cost involved in addressing the greater generality. It is mainly a matter of adopting the right perspective when defining the data model.

Although this sounds excellent, and it could unify the disparate parts of the historical-research community, there are a couple of obstacles to the approach. The first is the effect on existing products, both desktop and online, and whether their creators would want to increase the scope to this extent. Imagine, for instance, a very simple tree-building product. Let’s suppose that it merely holds the details of people, including the dates of their vital events (birth, marriage, and death), and links to their offspring. Exporting to a representation with an historical focus should not be a problem since the necessary parts for depicting lineage would be a subset within it. However, what should it do if it tried to import a contribution generated by a more powerful product; one that included the elements mentioned above? Indeed, the lineage part might be empty if it represented, say, the history of a place. Unless that data representation was accepted by a very significant part of the genealogical software world then the extra effort involved might be dismissed as unjustifiable.
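The import problem can be illustrated with a short Python sketch of what such a simple product might do when handed richer data: keep the part it understands and report the rest. The record layout and the type names (person, place, group) are hypothetical and are not taken from any existing format.

```python
def import_lineage_subset(records: list[dict]) -> tuple[list[dict], list[str]]:
    """Keep what a simple tree product understands; report what it skipped."""
    supported = {"person"}  # assumed scope of the simple product
    kept, skipped = [], []
    for rec in records:
        if rec["type"] in supported:
            kept.append(rec)
        else:
            skipped.append(rec["type"])
    return kept, skipped


# A contribution from a more powerful product (invented content).
richer_file = [
    {"type": "person", "name": "Jane Doe", "children": []},
    {"type": "place", "name": "Example Village"},
    {"type": "group", "name": "Example Regiment"},
]
kept, skipped = import_lineage_subset(richer_file)

# The history of a place: the lineage part is empty.
place_history = [{"type": "place", "name": "Example Village"}]
kept_none, skipped_all = import_lineage_subset(place_history)
```

Export is the easy direction because lineage is a subset of the richer model; on import, the product can do no better than discard, or at best preserve unread, whatever lies outside its scope.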

Another obstacle is that GEDCOM, in conjunction with tree-orientated products, has effectively compromised the structure of existing data.[5] Evidence will have been associated directly with people, possibly with duplication being necessary, rather than with the relevant events. This compromise may be beyond refactoring, meaning that the data would have to be exported as-is rather than restructured to take advantage of a better representation.

So what about a two-tier standard? Many standards have different levels of scope to which conformity can be targeted, and I have recently criticised the ISO 8601 date standard for not doing this. One basic idea would be to revamp GEDCOM in order to fix the known issues, and to invest in some structural changes in order to make it conceptually closer to a true historical standard. If an historical standard were defined now then GEDCOM would be incompatible with it since it does not have an event-based structure (in addition to its lineage). Creating a “GEDCOM+” would effectively build a bridge between the existing, very old, GEDCOM and that comprehensive standard for historical data since the new data models would be aligned. It would only be the scope of the data models that would differ.

Being able to factor-out enough of the data model of an historical standard, and then apply it to the lineage-linked GEDCOM in order to give it proper representations for its vital events, would be an interesting challenge. Event-linked variations of GEDCOM have been proposed before, but the important goal here would be ensuring a smooth path from the GEDCOM+ to the full historical standard, and ensuring that the latter is a superset of the former.

This idea was actually kicked around within FHISO since there are some interesting advantages. Although FHISO had very significant support within the industry, it still needed to establish its credibility. Working on a GEDCOM+ could have provided that opportunity, and allowed it to deliver a proper, documented standard in a realistic timeframe while the bigger standard could be developed to a longer timescale. Providing a newer GEDCOM would have also made more friends in the industry, and in the general community, since the incompatibilities between products are still a major gripe of genealogists and family historians at all levels.

Things never run that smoothly, though. GEDCOM is a proprietary format and the name is owned by The Church of Jesus Christ of Latter-day Saints (LDS Church). While several developers have created variations of it, having their development work condoned, and being given permission to use the same name, are very unlikely scenarios. FamilySearch, the genealogical arm of the Church, were one of the most visible absences from the FHISO membership. Although the developers of GEDCOM-X were supportive, and understood that what they were doing and what FHISO were doing was fundamentally different, the Church as a whole is very complex and that same level of understanding was not pervasive. Had FamilySearch been a member then that work could have been done with their involvement. There might still have been a political issue with it appearing that FHISO was working for FamilySearch but that would have been surmountable. It would have been possible to proceed anyway, and simply use a name other than GEDCOM, but then that might have introduced further fragmentation into the industry, and we certainly don’t need that.

Development of any successful new standard has to consider the upgrade path; the path by which existing software can embrace the new standard, and by which existing data can be migrated. If you believe, as I do, that the future needs more than a representation of lineage, and that there’s no fundamental reason to exclude other forms of micro-history, then that automatically introduces a gulf between that new representation and GEDCOM. Bootstrapping a new standard by initially working on an enhanced GEDCOM model is the best way to proceed. The data syntax of that representation is irrelevant (it could use the old proprietary syntax if there was benefit to existing vendors) but its data model would help to bridge the old and new worlds.

[1] The issues are health-related, and I have been assured that FHISO has not gone away — as some have supposed — and that it will come back as a more organised and productive organisation.
[2] While Louis’s suggestions make mostly good sense, I am not citing the list as one that I entirely agree with. In particular, suggestion 7 (“No Extensions”) needs a note. I agree that schema extensions should be avoided, but certain types of extension are essential for generality and would not limit conformity. See both Extensibility and Digital Freedom.
[3] See “What is Genealogy?”, Parallax View, 1 May 2014.
[4] See “Eventful Genealogy”, Parallax View, 3 Nov 2013; also, the follow-up "Eventful Genealogy - Part II", 6 Nov 2013.
[5] See “Evidence and Where to Stick It”, Parallax View, 24 Nov 2013.