Thursday, 26 March 2020

A Tree By Any Other Name Would Smell As Sweet


Given that the majority of genealogists are currently working on "their tree", it may be worth just taking a moment to understand what that means, and also what we think it means.

We may take it for granted that a family tree is a straightforward goal, and that its visualisation is equally straightforward. If so, then we are going to be surprised.

Graph Theory

Mathematically, the concept of a tree is defined as part of graph theory, so let's just identify a few useful and accurate terms.

  • Vertices (or nodes) are the items being connected in the graph. Think of them as the persons in your family tree.
  • Edges (or links) are the connections between the vertices.
  • Path is a sequence of edges that joins a sequence of vertices.
  • Directed edge is one that has a specific direction. Graphs are usually directed or undirected according to the nature of all their edges.
  • Tree is an undirected graph in which any two vertices are connected by exactly one path. In other words, there is always a unique path to get from one vertex to any other.
  • Forest is an undirected graph in which any two vertices are connected by at most one path. In other words, the graph may have disjoint tree segments.
  • Layered graph drawing is a representation (not a graph type) in which the vertices of a directed graph are drawn in horizontal rows or layers to represent some common attribute (e.g. families or generations in a family tree).
  • Acyclic means that a directed graph has no directed cycles. In other words, there is no path, following the direction of the edges, that will loop you back to where you were.
  • Semi-directed cycle (or semi-cycle) is where the vertices of a loop are connected by directed edges but do not form a directed cycle. For instance, if three vertices, A, B, C, are connected by A→B, B→C, A→C then it would constitute a semi-directed cycle (C→A, in place of A→C, would have completed a directed cycle).
  • DAG is a directed acyclic graph. Such graphs are frequently used for temporal ordering (i.e. events, including lineage ones) because of the unidirectional nature of time.
Note that, mathematically, the use of the term 'tree' is not simply a comment on a visualisation looking like the branches (or the roots) of a real tree.
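To make these definitions concrete, here is a minimal Python sketch. It assumes a graph is supplied as a list of (from, to) vertex pairs with no self-loops; the function names has_directed_cycle and is_undirected_tree are purely illustrative and do not come from any genealogy product.

```python
from collections import defaultdict


def has_directed_cycle(edges):
    """True if following edge directions can return you to a vertex."""
    children = defaultdict(list)
    vertices = set()
    for a, b in edges:
        children[a].append(b)
        vertices.update((a, b))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {v: WHITE for v in vertices}

    def visit(v):
        colour[v] = GREY
        for w in children.get(v, ()):
            if colour[w] == GREY:  # back to a vertex on the current path
                return True
            if colour[w] == WHITE and visit(w):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in vertices)


def is_undirected_tree(edges):
    """True if, ignoring direction, every pair of vertices has exactly one path."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    if not neighbours:
        return False
    # A tree on n vertices is connected and has exactly n - 1 edges.
    edge_count = sum(len(n) for n in neighbours.values()) // 2
    if edge_count != len(neighbours) - 1:
        return False
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(neighbours)


semi_cycle = [("A", "B"), ("B", "C"), ("A", "C")]
directed_cycle = [("A", "B"), ("B", "C"), ("C", "A")]
```

With these, the semi-directed cycle from the list above is acyclic (so it is a DAG) but is not a tree once direction is ignored, whereas replacing A→C with C→A produces a directed cycle.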

Basic Family Trees

Although we expect a family tree to display in a top-down approach, where biological parents point to children, we cannot guarantee that the underlying data has a specific representation for the physical union between two people. Trying to equate that biological element with marriage is far too naive for real lineage. The data format known as GEDCOM is well-known to have a "family" concept that embraces two spouses (tellingly termed the husband and wife) and their associated children, but the fallacy is clear for all to see: a generic family is extremely hard to define, and the implied "nuclear family" is an idealised concept. Worse still, it is using the social concept of a family when it's the biological concept of a union that is meaningful for a "lineage-linked format". The format is also known to have had interpretational difficulties with the notion of a family, and has tried various ways to include adopted children, thus straying from a pure lineage-based linkage. In fact, all we can guarantee is that each person has just one progenitive father and one progenitive mother,[1] even if they're unknown.

Probably the simplest family tree is one where we show direct ancestors, known as an ancestry chart or pedigree chart. Because each vertex connects to at most two vertices on the level above (the two biological parents), it also constitutes a binary tree.
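As a rough sketch of why a pedigree chart is binary, the following Python models each person with optional father and mother links and lists the ancestors generation by generation. The Person class, the function name, and the sample names are hypothetical, and unknown parents are shown as None placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    father: Optional["Person"] = None  # None when unknown
    mother: Optional["Person"] = None  # None when unknown


def ancestors_by_generation(person, depth):
    """Return one row per generation; row n always has 2**n slots."""
    rows = [[person]]
    for _ in range(depth):
        next_row = []
        for p in rows[-1]:
            # An unknown person contributes two unknown parents.
            next_row.append(p.father if p else None)
            next_row.append(p.mother if p else None)
        rows.append(next_row)
    return [[p.name if p else None for p in row] for row in rows]


grandfather = Person("Grandfather")
subject = Person("Subject", father=Person("Father", father=grandfather),
                 mother=Person("Mother"))
chart = ancestors_by_generation(subject, 2)
```

Each row doubles in width regardless of how many ancestors are actually known, which is the defining shape of a binary tree.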


Figure 1 - Binary pedigree chart.

But note that this representation (generated here by the SVG Family-Tree Generator, but not uncommon) has a single upward edge connecting to a bound pair of parent vertices. This is useful because it provides a handle to select details of the parents' specific union, and it helps with the visualisation (particularly in cases of step- or half-siblings) as simply having two independent edges pointing to each person's parent vertices would rapidly become hard to follow.

The converse of this illustration, usually called a descendancy chart (and sometimes incorrectly referred to as a descent-type pedigree chart, although pedigree is about blood-line ancestors), is where we show the children of a common ancestor and their spouse(s), and then the children of the children, etc.


Figure 2 - Simple descendancy chart.

Complications

The first thing to note is that Fig.1 and Fig.2 represent extreme cases. Suppose that we were interested in our direct ancestors, but also their siblings and the children of their siblings. For instance:


Figure 3 - Chart showing ancestors, their siblings, and their children.

This small illustration works, but in general it would not be possible to display such relationships without lines all crossing over each other. Whether you want to do this depends on whether "your tree" is primarily for people carrying your surname, starting from some root ancestor. This rather sexist approach is still quite common, despite the fact that surnames do not carry our genetics.

An important issue is pedigree collapse, where an edge crosses over to other branches to create a semi-directed cycle. The following illustration is of a first-cousin marriage.


Figure 4 - First-cousin marriage.

The fact that there is, now, no unique path to get from these married cousins to their grandparents means that the chart is technically not a tree, although it is still a DAG.
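A small Python sketch can illustrate the loss of uniqueness. It assumes that each couple is modelled as a single "union" vertex, in the spirit of the bound pair of parents in Fig.1; the vertex names and the count_paths helper are hypothetical.

```python
from functools import lru_cache

# Directed edges: union -> child, and person -> the union they belong to.
EDGES = {
    "grandparents": ["parent_1", "parent_2"],  # two siblings
    "parent_1": ["union_1"],
    "parent_2": ["union_2"],
    "union_1": ["cousin_1"],
    "union_2": ["cousin_2"],
    "cousin_1": ["cousins_union"],
    "cousin_2": ["cousins_union"],             # the first-cousin marriage
}


def count_paths(edges, source, target):
    """Count directed paths from source to target (the graph must be a DAG)."""
    @lru_cache(maxsize=None)
    def count(v):
        if v == target:
            return 1
        return sum(count(w) for w in edges.get(v, []))
    return count(source)


paths_to_marriage = count_paths(EDGES, "grandparents", "cousins_union")
paths_to_cousin = count_paths(EDGES, "grandparents", "cousin_1")
```

Under these assumptions there are two routes between the grandparents and the cousins' union, but only one between the grandparents and either cousin alone, and no route ever loops back on itself.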

We've mentioned that non-biological parents would create problems if placed on a chart depicting lineage, but why is that? Well, such parents are not exclusive of biological parents, and it's not uncommon for someone to have had foster parents and adoptive parents in addition to their biological parents. They're still part of the family history, irrespective of any personal preference to the contrary, but they need specialised tools for their visualisation.

Sequential marriages are relatively common, but related to this are half-siblings, step-siblings, non-marital unions, and non-paternity events (NPEs). The following illustration depicts a man who was married twice, having a son with the first wife and a daughter with the second. At some point, he had also had a non-marital union with a woman resulting in an illegitimate daughter (note that the green circle is changed, here, to reflect this status). Also, the man's second wife was previously married and had an associated son, plus a daughter that was the result of an NPE (note the dashed line reflecting this).


Figure 5 - Half-siblings, step-siblings, sequential marriages, non-marital unions, and NPEs.
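The relationships in this scenario can be derived from the data rather than stored explicitly. The following Python sketch assumes each child records a set of two biological parents, and that unions (marital or otherwise) are recorded separately; all identifiers are hypothetical labels for the people described above, and the classification rules are a simplification.

```python
# Biological parents of each child in the scenario.
PARENTS = {
    "son_1": {"man", "wife_1"},
    "daughter_2": {"man", "wife_2"},
    "daughter_3": {"man", "partner"},           # non-marital union
    "son_0": {"first_husband", "wife_2"},       # wife_2's earlier marriage
    "daughter_0": {"unknown_father", "wife_2"}, # NPE
}

# Unions between adults, irrespective of marital status.
UNIONS = [
    {"man", "wife_1"},
    {"man", "wife_2"},
    {"man", "partner"},
    {"first_husband", "wife_2"},
]


def relationship(a, b):
    """Classify two children by the biological parents they share."""
    shared = PARENTS[a] & PARENTS[b]
    if len(shared) == 2:
        return "full siblings"
    if len(shared) == 1:
        return "half-siblings"
    # No shared parent: step-siblings if a parent of one is
    # in a union with a parent of the other.
    for pa in PARENTS[a]:
        for pb in PARENTS[b]:
            if {pa, pb} in UNIONS:
                return "step-siblings"
    return "unrelated"
```

For example, relationship("son_1", "daughter_2") gives "half-siblings" (they share the man), while relationship("son_1", "son_0") gives "step-siblings" (no shared parent, but the man and his second wife are in a union).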

Finally, suppose we have cause to include people who are not related by blood or by marriage. I suppose this could include the families of adoptive parents or guardians, but a bigger example would be anyone performing a one-name or one-place study.


Figure 6 - Disjoint trees.

Note that this illustration, which shows a neighbouring family, is technically called a forest as it consists of disjoint trees.
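Finding the disjoint segments in a body of data is a standard connected-components search. The Python sketch below ignores edge direction and groups people who are linked by any chain of connections; the function name and the sample names are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group vertices into sets that are mutually reachable, ignoring direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:  # breadth-first walk of one segment
            v = queue.popleft()
            component.add(v)
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components


links = [("Ann", "Bob"), ("Ann", "Cat"), ("Dan", "Eve")]
families = connected_components(links)
```

Here the neighbouring family (Dan and Eve) forms its own segment because no connection joins it to the first one.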

Databases

So, the connections between people are manifold in number and type, and the naive picture of everyone forming a single tree from some root ancestors (or possibly even Adam and Eve) is entirely unrealistic. Storing these connections in the data is not a problem, in principle, although there is no universal standard, and what we have is unlikely to have defined unambiguous ways of handling all the scenarios that we've highlighted. The real problem is in their visualisation!

What many people do not notice is that their genealogy software, be it desktop or online, usually presents just a workable section of the stored data at once. If you had 10,000+ people in your tree then it would look rather like a crocheted football field if presented all at once, but to present just a few generations around some person of interest — especially during maintenance of that tree — is much more useful, and easier. Such software allows you to navigate from one person of interest to another, and so will continue to support a naive impression of your "family tree".
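One plausible way to select such a workable section is a breadth-first search that stops after a fixed number of steps from the person of interest. This Python sketch is only an illustration of the idea, not a description of how any particular product works; the function name, its parameters, and the sample data are assumptions.

```python
from collections import defaultdict, deque


def neighbourhood(edges, focus, max_steps):
    """Return {person: steps from focus} for everyone within max_steps links."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    distance = {focus: 0}
    queue = deque([focus])
    while queue:
        v = queue.popleft()
        if distance[v] == max_steps:
            continue  # do not expand beyond the requested radius
        for w in neighbours.get(v, ()):
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    return distance


# A chain of four linked people: A - B - C - D
chain = [("A", "B"), ("B", "C"), ("C", "D")]
nearby = neighbourhood(chain, "B", 1)
```

Navigating to a different person simply means re-running the search with a new focus, so the full dataset never has to be drawn at once.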

Of course, we'll continue to use the term "family tree", and we'll continue to think of it as looking like a real tree, with branches and roots, but if you could assimilate the underlying data as a computer would then you would realise how different and complex it really is.


[1] Actually, technology is capable of engineering children with DNA from three or more “parents” (see uk-government-ivf-dna-three-people).