Thursday, 17 December 2020

STEMMA Latest

If you are involved in the software representation of family history then you might justifiably ask what happened to STEMMA.

STEMMA is a private project to look at the digital representation of micro-history, including but not limited to family history. Work on it commenced in 2011, and the first specification appeared online in 2012. Primary goals were a distancing from the ubiquitous "build your family tree" notion of genealogy, avoidance of the use of trees as "wardrobes of hangers" upon which all and sundry can be placed, and stemming the software trend of digesting everything to name-value pairs (usually described as "facts").

Its significance in relation to recent work by both FHISO and FamilySearch is low because it still treads a quite different path. It is not interested only in people, or in biological lineage, and its wider choice of historical subjects (including places, groups, and animals) has allowed software to leverage their similarities and orthogonality, such as hierarchical relationships and the handling of multiple names.

STEMMA does not use a database, which means that it is its own import/export format, and so is better-suited to long-term data storage because it does not require a separate backup format. More than this, though, a STEMMA file can be considered a "non-sequential document". Opening a STEMMA attachment will begin with a prescribed landing page, but the choice of where to navigate to from that page is a decision to be made by the reader. There is no sequential or hierarchical ordering of any pages, but there are many internal semantic links that can be followed based on some conscious rationale, and these might uncover lineage relationships (including trees), images, places, maps, or even narrative — all of which are integral parts of the data.

It is shocking that so few products accommodate narrative text, either for documenting the research process, authored works, story telling, memories, documenting biographical details, or for the transcription of old documents; and I do not know whether to laugh or cry when I hear software people still talking about programmatically generating text to lace together their recorded "facts". Software does not "understand" your research process, and should merely help you with its organisation and analysis. Human beings do not understand computer-speak and should not be presented with whatever it is that software people think is so convenient for their endeavours. Text is here to stay, and it should be an essential component of your data!

The latest public version of STEMMA is still V4.1, although a number of small changes in specification and direction have occurred internally. Little work has taken place on the informational sub-model that was to support a dynamic research process (see Our Days of Future Passed — Part I, and its follow-up parts II and III), but a spin-off of the associated experimentation was the SVG Family-Tree Generator (SVG-FTG) that is now a separate product, and soon to get a major upgrade. Work has focused, instead, on the conclusional sub-model, and one of the changes involves place hierarchies.

STEMMA recognises that places are not the same as point locations that can be given specific coordinates. Even if such a point is relaxed to be a closed polygon (say for describing the boundary of a town or village) or an open polygon (say for mapping a street, noting that European streets are rarely straight), then a place still has an identity that is independent of its location or its name(s). Boundaries and names may change over time, but a place is still a place. Originally, it was hoped that each place would have a unique parent place that was appropriate to the nature of the hierarchy. For instance, that an administrative place would have an administrative parent, or an ecclesiastical place would have an ecclesiastical parent, but the reality is too messy. The difference between geographical and administrative relationships can be vague, especially when including local administration in addition to national administration. As a result, each place is now deemed to have a single (but time-dependent) canonical parent, but the hierarchy is no longer of a specific type (i.e. it may vary depending upon the level in the hierarchy). The scheme still makes use of related entity linkages to connect items across hierarchies, such as registration districts (for civil registration of births) to ecclesiastical parishes (for baptisms), but the whole field remains challenging.
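As a rough illustration of a "single but time-dependent canonical parent", the following Python sketch models a place whose parent changes on a given year. It is not STEMMA's actual schema: the class names, the field names, the use of whole years, and the fictional place keys are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParentLink:
    parent: "Place"
    start: int            # first year the link applies (inclusive)
    end: Optional[int]    # last year it applies (inclusive); None = open-ended


@dataclass
class Place:
    key: str                                   # identity, independent of name or location
    parents: list = field(default_factory=list)  # ParentLink items, non-overlapping in time

    def parent_at(self, year: int) -> Optional["Place"]:
        """Return the one canonical parent that applies in the given year, if any."""
        for link in self.parents:
            if link.start <= year and (link.end is None or year <= link.end):
                return link.parent
        return None


def chain_at(place: Place, year: int) -> list:
    """Walk upwards through the canonical parents as they stood in that year."""
    chain = [place.key]
    parent = place.parent_at(year)
    while parent is not None:
        chain.append(parent.key)
        parent = parent.parent_at(year)
    return chain


# Entirely fictional places: a village that moves from an old county
# to a new district in 1974, both of which sit inside the same country.
country = Place("pCountry")
old_county = Place("pOldCounty", [ParentLink(country, 1800, None)])
new_district = Place("pNewDistrict", [ParentLink(country, 1974, None)])
village = Place("pVillage", [
    ParentLink(old_county, 1800, 1973),
    ParentLink(new_district, 1974, None),
])

print(chain_at(village, 1900))
print(chain_at(village, 2000))
```

The village keeps the same key throughout, which is the point: its identity does not depend on which parent, boundary, or name happens to apply at a given date.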

The only other project that I am aware of that is treading a similar path is the History Research Environment (HRE) that began in 2016. This has an incorporated not-for-profit UK company, History Research Environment Ltd, although work seems to be centred more in Australia. It comprises an open-source project to "create a free platform-independent application for the serious amateur or professional historical researcher", and was still under development at the time of writing. The main similarity with STEMMA is in the focus towards general history, and their tagline is "Towards a history of almost anything". It will be very interesting to see if they can muster commercial success in a field that might be considered specialist, and which high-profile genealogical advertising blatantly ignores.

The STEMMA project now has to share my efforts with the SVG-FTG (and other tools such as MetaProxy) as well as the publishing of my first book — not genealogy, but more on that another time.

The website for STEMMA has recently been moved from Google Sites to neocities.org. This was necessary because the old Google Sites (which was always a little clunky) is being replaced with a new one, and users have been given an ultimatum to convert, but the conversion tools are wholly inadequate for porting my old site across, despite the content being primarily textual. Google have also made a policy decision to provide no HTML editor and no way of importing the HTML of any pre-existing site. Good luck with that! It sounds like it could be another project ready to flounder.

So, the new link for the website is https://parallaxview.co/stemma. The URI http://stemma.parallaxview.co is, as before, reserved for STEMMA namespaces. The legacy URL of http://www.familyhistorydata.parallaxview.co should redirect if still used (once my DNS configuration does what I tell it).

Friday, 22 May 2020

MetaProxy (v3.0)

MetaProxy was introduced at the start of 2019 as a free Windows tool allowing meta-data such as archival descriptions, search terms, provenance, and even transcriptions, to be associated with images and other data files in your genealogical data. This article describes the new features in V3.0 of the Windows edition; these do not apply to the Mac edition.

Although the program has a small following, it is not yet well-known, and is not even considered a "genealogical tool" in some quarters. However, following some recent work to fix reported issues with Windows 'Photo Viewer' and the 'Photos' Store App under Windows 10, it was decided to give those users more control over their layout.

Information on the availability may be found on the associated summary page: MetaProxy Summary. The kit also includes a PDF user guide.

So What is New?

A particularly useful feature of MetaProxy turned out to be its collection feature, where double-clicking on a root buddy file would automatically open up a series of image data files and their individual buddy files. The new INI-file setting of Collections=False can be used to turn this off if required (the default is True), but this also allows the use of a different file type for such root buddy files. We'll here talk of *.coll for root buddy files and *.meta for normal buddy files, but the actual file extension may be chosen by the user.

Because of the similarity between the display of a collection and the use of a traditional photo album, a tiled mode has been implemented. This is controlled by two new INI-file parameters: TileH and TileV, which specify, respectively, the number of horizontal and vertical tile positions over the screen area. If both are zero (the default) then tiled mode is disabled, otherwise each will default to 1 if unspecified. This mode employs overlay mode for individual buddy files, and so it overrides any separate SideBySide setting.

If the root buddy file of the collection example in the original article (RisalpurCemetery.meta) is renamed to RisalpurCemetery.coll, then the INI-file might specify a grid of 3x2 for the display as follows:

[metaproxy]
CreateType=.meta

[.coll]
TileV=2
TileH=3

This would then result in a layout similar to the following, where each individual image data file is overlaid with its specific buddy file, and the original root buddy file (if it has no data file of its own) is tiled separately:

But the tiled mode is not just for collections. If a normal buddy file has multiple data files associated with it then they can be tiled in a similar way. For instance, given a buddy file called Test_ID-34.meta2 that's associated with two separate images (a *.jpg and a *.jpeg file) and a Word document (*.doc in this case), then an INI-file setting of:

[metaproxy]
CreateType=.meta

[.meta2]
TileH=3

would result in the following layout:

This shows the Word document and the two images spread across the width of the screen, and the buddy file overlaid on the last of the images. Where people have larger screens than the one used in this example then this becomes a convenient way to see all of the related details.

NB: if you're using the normal overlay mode (SideBySide=False setting) then specifying TileH=1 or TileV=1 will force the image viewer to occupy the full screen area rather than its default size and position.
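The defaulting rules for TileH and TileV described above can be sketched in a few lines of Python. This is only an illustration, not MetaProxy's own code: the function names, the equal-sized cells, and the row-by-row ordering of the tiles are assumptions, and the screen size is just an example.

```python
import configparser

# The two example sections from this article, combined into one INI text.
INI_TEXT = """\
[metaproxy]
CreateType=.meta

[.coll]
TileV=2
TileH=3

[.meta2]
TileH=3
"""


def tile_grid(cfg, section):
    """Return (columns, rows) for a file type, or None if tiled mode is off."""
    h = cfg.getint(section, "TileH", fallback=0)
    v = cfg.getint(section, "TileV", fallback=0)
    if h == 0 and v == 0:
        return None                    # both zero (the default): not tiled
    return (max(h, 1), max(v, 1))      # an unspecified one defaults to 1


def tile_rects(screen_w, screen_h, cols, rows):
    """Assumed layout: equal cells as (x, y, width, height), left to right, top to bottom."""
    cell_w, cell_h = screen_w // cols, screen_h // rows
    return [(c * cell_w, r * cell_h, cell_w, cell_h)
            for r in range(rows) for c in range(cols)]


cfg = configparser.ConfigParser()
cfg.read_string(INI_TEXT)

coll_grid = tile_grid(cfg, ".coll")      # 3 across, 2 down
meta2_grid = tile_grid(cfg, ".meta2")    # 3 across, 1 down (TileV defaulted)
meta_grid = tile_grid(cfg, ".meta")      # no section at all: tiled mode off
coll_rects = tile_rects(1920, 1080, *coll_grid)
```

On a 1920x1080 screen the 3x2 grid gives six cells of 640x540 each, whereas the *.meta2 case gives three full-height columns.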

Microsoft Mechanisms

While developing this tool, it became clear that Microsoft has a variety of ways for launching the viewer for data files (e.g. image viewers), and no central mechanism for finding their main windows. For instance:

1.    Normal process creation for the image viewer (or document viewer). The handle of its top-level window is then determined. Most cases fall into this category, including Microsoft Office Picture Manager, Microsoft Paint, Microsoft Word, and of course the Notepad text editor.

2.    When launching the viewer, the data file is simply handed over to an existing instance of the program, which then creates a new tab for it. Adobe Acrobat and Web browsers are examples of this.

3.    When the viewer is actually a DLL rather than an EXE, it has to be loaded into a special 'container process' called dllhost. Windows 'Photo Viewer' is an example of this.

4.    When the viewer is one of the cut-down 'Store Apps' available under Windows 10 then it follows a different set of rules, and the normal Windows APIs have limited accessibility to them. 'Photos' (aka 'Microsoft.Photos.exe') is an example of this.

Note that if the data file is shown in the tab of a single-instance viewer (case 2) then the tiled mode mentioned above will not work as intended since the viewer cannot occupy more than one tile location.

Diagnostics

In the event of problems being reported within the Facebook support group, a diagnostic log file can now be generated via the program copy called metaproxy-D.exe (also available from the same Dropbox link).

These log files should be emailed to the author in order to assist in a resolution. There's a 'Contact Form' in the right-hand panel of this blog-post.

Thursday, 30 April 2020

Weight of Evidence

A very short educational piece, this time, on the subject of evidence.

Evidence is very important to genealogists. It is information that supports, or contradicts (as in 'evidence to the contrary'), some specific claim. When we make claims, such as those about parentage, then we need supporting evidence, and we also need to explain any evidence that appears to go against our claims.

But is that everything we need to know? Well, no; not all evidence carries the same weight. In order to illustrate this particular point, I want to introduce you to the 'raven paradox', sometimes known as 'Hempel's paradox' since it was formulated by philosopher Carl Gustav Hempel in the 1940s to illustrate a contradiction between inductive logic and intuition.

Hempel starts this paradox with the proposition 'all ravens are black'. This can be turned about-face to yield the equivalent proposition 'if something is not black then it is not a raven'. The first of these is quite straightforward, and the sight of a black raven would be evidence supporting that proposition. However, that about-face proposition is less straightforward because the sight of anything that is neither a raven nor black would be evidence for it. Hence, if you were eating a green apple then it's not black and it's not a raven, and so it supports the second proposition. The paradox is that something totally unrelated to ravens, or even to birds of any kind, appears to be evidence for that first proposition: 'all ravens are black'.

So what gives? Surely, the fact that you're eating a green apple, or wearing a red hat, or any number of unrelated observations, cannot really be evidence about the colour of ravens. Philosophers have debated this paradox ever since because that's what they like to do, but the answer is relatively simple. Yes, those observations really are evidence but their weight is so weak that they're effectively insignificant.

In order to understand what's going off, we need to consider the scope of the propositions. In this particular case, where the proposition is about discrete entities (ravens) and properties that are fixed (colour), then you can imagine sets of possibilities, but a more general scheme would involve abstract mathematical spaces of possibilities. Anyway, the spaces of possibilities for black-ravens, non-black-ravens, black-other, and non-black-other are vastly different in extent. Having an observation that supports non-black-other (an astronomically huge space) is insignificant compared to one that directly supports black-ravens, even though the propositions all-ravens-are-black and if-non-black-then-not-a-raven are logically equivalent. In contrast, if we observed just one instance of non-black-ravens (the space for which we've asserted to have zero extent) then it would be hugely significant.

The lesson, here, is that the same claim can be expressed in different, but logically equivalent ways, and this has a huge bearing on the significance of an item of information supporting the claim. The weight, or significance, of some evidence depends on the scope of the claim, and some cases — such as demonstrating beyond reasonable doubt that 'if something is not black then it is not a raven' — would be impractical to pursue. Putting things another way, the concept of 'sufficient evidence' depends on the scope, or the number of possibilities, covered by the claim.
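The difference in weight can be made concrete with a small calculation. The sketch below follows one common Bayesian reading of the paradox rather than anything stated above, and every number in it is invented: it compares the claim 'all ravens are black' against a made-up rival claim in which 100 of 1,000 ravens are not black, in a world with a billion non-black things that are not ravens. The Bayes factor says how much an observation should shift belief towards the claim; a value of 1 means no shift at all.

```python
from fractions import Fraction

# Invented counts, purely for illustration.
RAVENS = 1_000
NONBLACK_RAVENS_IN_RIVAL = 100     # the rival claim: this many ravens are not black
NONBLACK_OTHERS = 10**9            # green apples, red hats, and so on


def factor_black_raven(ravens, nonblack_ravens):
    """Pick a raven at random and find it black.

    Certain under the claim; probability (ravens - nonblack) / ravens under the rival.
    """
    return Fraction(ravens, ravens - nonblack_ravens)


def factor_nonblack_other(nonblack_others, nonblack_ravens):
    """Pick a non-black thing at random and find it is not a raven.

    Certain under the claim; probability others / (others + nonblack ravens) under the rival.
    """
    return Fraction(nonblack_others + nonblack_ravens, nonblack_others)


bf_raven = factor_black_raven(RAVENS, NONBLACK_RAVENS_IN_RIVAL)
bf_apple = factor_nonblack_other(NONBLACK_OTHERS, NONBLACK_RAVENS_IN_RIVAL)

print(float(bf_raven))   # about 1.11
print(float(bf_apple))   # about 1.0000001
```

Both observations count as evidence, since both factors exceed 1, but the green apple moves the needle about a million times less than the black raven. A single non-black raven, by contrast, has probability zero under the claim and so refutes it outright.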

Thursday, 26 March 2020

A Tree By Any Other Name Would Smell As Sweet

Given that the majority of genealogists are currently working on "their tree", it may be worth just taking a moment to understand what that means, and also what we think it means.

We may take it for granted that a family tree is a straightforward goal, and that its visualisation is equally straightforward. If so then you are going to be surprised.

Graph Theory

Mathematically, the concept of a tree is defined as part of graph theory, so let's just identify a few useful and accurate terms.

Vertices (or nodes) are the items being connected in the graph. Think of them as the persons in your family tree.

Edges (or links) are the connections between the vertices.

Path is a sequence of edges that joins a sequence of vertices.

Directed edge is one that has a specific direction. Graphs are usually directed or undirected according to the nature of all their edges.

Tree is an undirected graph in which any two vertices are connected by exactly one path. In other words, there is always a unique path to get from one vertex to any other.

Forest is an undirected graph in which any two vertices are connected by at most one path. In other words, the graph may have disjoint tree segments.

Layered graph drawing is a representation (not a graph type) in which the vertices of a directed graph are drawn in horizontal rows or layers to represent some common attribute (e.g. families or generations in a family tree).

Acyclic means that a graph has no directed cycles. In other words, there is no path that will loop you back to where you were.

Semi-directed cycle (or semi-cycle) is where the vertices of a loop are connected by directed edges that do not form a directed cycle. For instance, if three vertices, A, B, C, are connected by A→B, B→C, A→C then it would constitute a semi-directed cycle (C→A, in place of A→C, would have completed a directed cycle).

DAG is a directed acyclic graph. Such graphs are frequently used for temporal ordering (i.e. events, including lineage ones) because of the unidirectional nature of time.

Note that mathematically, the use of the term 'tree' is not simply a comment on a visualisation looking like the branches (or the roots) of a real tree.
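The definitions above translate fairly directly into code. The following Python sketch is one way of testing them, using a union-find pass for forests and Kahn's algorithm for DAGs; the function names are simply chosen for the example.

```python
from collections import defaultdict, deque


def is_forest(vertices, edges):
    """Undirected graph with at most one path between any two vertices (no cycles)."""
    root = {v: v for v in vertices}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]    # path halving
            v = root[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False               # a and b were already connected: second path
        root[ra] = rb
    return True


def is_tree(vertices, edges):
    """A tree is a connected forest, i.e. no cycles and exactly |V| - 1 edges."""
    return len(edges) == len(vertices) - 1 and is_forest(vertices, edges)


def is_dag(vertices, directed_edges):
    """Kahn's algorithm: repeatedly remove vertices that have no incoming edges."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for a, b in directed_edges:
        successors[a].append(b)
        indegree[b] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return removed == len(vertices)    # anything left over sits on a directed cycle


abc = ["A", "B", "C"]
semi_cycle = [("A", "B"), ("B", "C"), ("A", "C")]
directed_cycle = [("A", "B"), ("B", "C"), ("C", "A")]

print(is_dag(abc, semi_cycle))       # a semi-cycle is still acyclic
print(is_tree(abc, semi_cycle))      # but, ignoring direction, it is not a tree
print(is_dag(abc, directed_cycle))
```

The semi-directed cycle from the definitions is the interesting case: it passes the DAG test but fails the tree test once the edge directions are ignored.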

Basic Family Trees

Although we expect a family tree to display in a top-down approach, where biological parents point to children, we cannot guarantee that the underlying data has a specific representation for the physical union between two people. Trying to equate that biological element with marriage is far too naive for real lineage. The data format known as GEDCOM is well-known to have a "family" concept that embraces two spouses (tellingly termed the husband and wife) and their associated children, but the fallacy is clear for all to see: a generic family is extremely hard to define, and the implied "nuclear family" is an idealised concept. Worse still, it is using the social concept of a family when it's the biological concept of a union that is meaningful for a "lineage-linked format". The format is also known to have had interpretational difficulties with the notion of a family, and has tried various ways to include adopted children, thus straying from a pure lineage-based linkage. In fact, all we can guarantee is that each person has just one progenitive father and one progenitive mother,[1] even if they're unknown.

Probably the simplest family tree is one where we show direct ancestors, known as an ancestry chart or pedigree chart. Because each vertex has two connected vertices on the level above, it also constitutes a binary tree.

Figure 1 - Binary pedigree chart.

But note that this representation (generated here by the SVG Family-Tree Generator, but not uncommon) has a single upward edge connecting to a bound pair of parent vertices. This is useful because it provides a handle to select details of the parents' specific union, and it helps with the visualisation (particularly in cases of step- or half-siblings) as simply having two independent edges pointing to each person's parent vertices would rapidly become hard to follow.

The converse of this illustration, usually called a descendancy chart (and sometimes incorrectly referred to as a descent-type pedigree chart, since pedigree is about blood-line ancestors), is where we show the children of a common ancestor and their spouse(s), and then the children of the children, etc.

Figure 2 - Simple descendancy chart.

Complications

The first thing to note is that Fig.1 and Fig.2 represent extreme cases. Suppose that we were interested in our direct ancestors, but also their siblings and the children of their siblings. For instance:

Figure 3 - Chart showing ancestors, their siblings, and their children.

This small illustration works, but in general it would not be possible to display such relationships without lines all crossing over each other. Whether you want to do this depends on whether "your tree" is primarily for people carrying your surname, starting from some root ancestor. This rather sexist approach is still quite common, despite the fact that surnames do not carry our genetics.

An important issue is pedigree collapse, where an edge crosses over to other branches to create a semi-directed cycle. The following illustration is of a first-cousin marriage.

Figure 4 - First-cousin marriage.

The fact that there is, now, no unique path to get from these married cousins to their grandparents means that the chart is technically not a tree, although it is still a DAG.
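This can be checked mechanically. The Python sketch below encodes a first-cousin marriage with each union as its own vertex (an assumption for the example, loosely following the "bound pair" of parents used in the charts here) and counts the downward paths from the grandparents' union to the cousins' union. The vertex labels are hypothetical.

```python
from functools import lru_cache
from graphlib import TopologicalSorter, CycleError

# Directed edges point downwards in time: person -> union -> child.
# G = grandparent, P = their children (siblings), S = the siblings' spouses,
# C = the two cousins, U = union vertices.
CHILDREN = {
    "G1": ["U0"], "G2": ["U0"],
    "U0": ["P1", "P2"],
    "P1": ["U1"], "S1": ["U1"],
    "P2": ["U2"], "S2": ["U2"],
    "U1": ["C1"], "U2": ["C2"],
    "C1": ["U3"], "C2": ["U3"],     # U3 is the first-cousin marriage
    "U3": [],
}


def is_acyclic(graph):
    """True if the directed graph has no directed cycle (i.e. it is a DAG)."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def count_paths(graph, start, goal):
    """Count the distinct directed paths from start to goal in a DAG."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == goal:
            return 1
        return sum(walk(nxt) for nxt in graph.get(node, ()))
    return walk(start)


paths_from_grandparents = count_paths(CHILDREN, "U0", "U3")
paths_from_parents = count_paths(CHILDREN, "U1", "U3")
print(is_acyclic(CHILDREN), paths_from_grandparents, paths_from_parents)
```

There are two routes from the grandparents' union to the cousins' union, one through each sibling, so the structure fails the "exactly one path" test for a tree while remaining acyclic.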

We've mentioned that non-biological parents would create problems if placed on a chart depicting lineage, but why is that? Well, such parents are not exclusive of biological parents, and it's not uncommon for someone to have had foster parents and adoptive parents in addition to their biological parents. They're still part of the family history, irrespective of any personal preference to the contrary, but they need specialised tools for their visualisation.

Sequential marriages are relatively common, but related to this are half-siblings, step-siblings, non-marital unions, and non-paternity events (NPEs). The following illustration depicts a man who was married twice, having a son with the first wife and a daughter with the second. At some point, he had also had a non-marital union with a woman resulting in an illegitimate daughter (note that the green circle is changed, here, to reflect this status). Also, the man's second wife was previously married and had an associated son, plus a daughter that was the result of an NPE (note the dashed line reflecting this).

Figure 5 - Half-siblings, step-siblings, sequential marriages, non-marital unions, and NPEs.
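The relationships in this scenario can be derived from just two kinds of stored data: each child's biological parents and the list of marriages. The Python sketch below is a simplified illustration under stated assumptions: the labels are hypothetical, the NPE daughter is given a separate unknown biological father, and "step" is taken to mean that the two children share no biological parent but have one parent each in the same marriage.

```python
from itertools import combinations

# Biological parents only (hypothetical labels for the people in the scenario).
BIO_PARENTS = {
    "son_by_wife_1":       {"man", "wife_1"},
    "daughter_by_wife_2":  {"man", "wife_2"},
    "daughter_nonmarital": {"man", "partner"},
    "wife_2_earlier_son":  {"first_husband", "wife_2"},
    "wife_2_npe_daughter": {"unknown_father", "wife_2"},
}

MARRIAGES = [
    {"man", "wife_1"},
    {"man", "wife_2"},
    {"first_husband", "wife_2"},
]


def sibling_type(a, b):
    """Classify two children as full, half, or step siblings, or unrelated."""
    shared = BIO_PARENTS[a] & BIO_PARENTS[b]
    if len(shared) == 2:
        return "full"
    if len(shared) == 1:
        return "half"
    # No shared biological parent: look for a marriage joining one parent of each.
    for couple in MARRIAGES:
        if couple & BIO_PARENTS[a] and couple & BIO_PARENTS[b]:
            return "step"
    return "unrelated"


for a, b in combinations(BIO_PARENTS, 2):
    print(f"{a} / {b}: {sibling_type(a, b)}")
```

Note that nobody in this scenario is a full sibling of anyone else, which is exactly the kind of case that a single "family" record of two spouses and their children struggles to express.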

Finally, suppose we have cause to include people who are not related by blood or by marriage. I suppose this could include the families of adoptive parents or guardians, but a bigger example would be anyone performing a one-name or one-place study.

Figure 6 - Disjoint trees.

Note that this illustration, which shows a neighbouring family, is technically called a forest as it consists of disjoint trees.

Databases

So, the connections between people are manifold in number and type, and the naive picture of everyone forming a single tree from some root ancestors (or possibly even Adam and Eve) is entirely unrealistic. Storing these connections in the data is not a problem, in principle, although there is no universal standard, and what we have is unlikely to have defined unambiguous ways of handling all the scenarios that we've highlighted. The real problem is in their visualisation!

What many people do not notice is that their genealogy software, be it desktop or online, usually presents just a workable section of the stored data at once. If you had 10,000+ people in your tree then it would look rather like a crocheted football field if presented all at once, but to present just a few generations around some person of interest — especially during maintenance of that tree — is much more useful, and easier. Such software allows you to navigate from one person of interest to another, and so will continue to support a naive impression of your "family tree".
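Selecting such a "workable section" amounts to a breadth-first search outwards from the person of interest, stopping after a fixed number of links. The Python sketch below shows the idea on a tiny invented lineage; it is not taken from any particular product, and the names and the parent-child pairs are made up.

```python
from collections import defaultdict, deque

# Invented parent-child pairs forming a single line of descent.
PAIRS = [
    ("great_grandparent", "grandparent"),
    ("grandparent", "parent"),
    ("parent", "me"),
    ("me", "child"),
    ("child", "grandchild"),
]

# Treat each link as navigable in both directions.
LINKS = defaultdict(set)
for older, younger in PAIRS:
    LINKS[older].add(younger)
    LINKS[younger].add(older)


def nearby(links, focus, max_steps):
    """Return {person: distance} for everyone within max_steps links of focus."""
    distance = {focus: 0}
    queue = deque([focus])
    while queue:
        person = queue.popleft()
        if distance[person] == max_steps:
            continue                     # do not expand beyond the limit
        for other in links.get(person, ()):
            if other not in distance:
                distance[other] = distance[person] + 1
                queue.append(other)
    return distance


one_step = nearby(LINKS, "me", 1)
two_steps = nearby(LINKS, "me", 2)
print(sorted(one_step))
print(sorted(two_steps))
```

However large the stored data, the amount shown depends only on the step limit and the local density of connections, which is why the displayed section always looks like a tidy little tree.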

Of course, we'll continue to use the term "family tree", and we'll continue to think of it as looking like a real tree, with branches and roots, but if you could assimilate the underlying data as a computer would then you would realise how different and complex it really is.

[1] Actually, technology is capable of engineering children with DNA from three or more “parents” (see uk-government-ivf-dna-three-people).