Friday, 22 May 2020

MetaProxy (v3.0)


MetaProxy was introduced at the start of 2019 as a free tool allowing meta-data such as archival descriptions, search terms, provenance, and even transcriptions, to be associated with images and other data files in your genealogical data. This article describes the new features in V3.0 of the Windows edition; these do not apply to the Mac edition.

Although the program has a small following, it is not yet well-known, and is not even considered a "genealogical tool" in some quarters. However, following some recent work to fix reported issues with Windows 'Photo Viewer' and the 'Photos' Store App under Windows 10, it was decided to give those users more control over their layout.

To re-iterate some previous details: the program and a PDF guide can be downloaded from:


The Mac version has its own Dropbox folder, but the associated Facebook support group deals with both editions: https://www.facebook.com/groups/541621946332678/.

So What is New?

A particularly useful feature of MetaProxy turned out to be its collection feature, where double-clicking on a root buddy file would automatically open up a series of image data files and their individual buddy files. The new INI-file setting of Collections=False can be used to turn this off if required (the default is True), but this also allows the use of a different file type for such root buddy files. We'll here talk of *.coll for root buddy files and *.meta for normal buddy files, but the actual file extension may be chosen by the user.

Because of the similarity between the display of a collection and the use of a traditional photo album, a tiled mode has been implemented. This is controlled by two new INI-file parameters, TileH and TileV, which specify, respectively, the number of horizontal and vertical tile positions over the screen area. If both are zero (the default) then tiled mode is disabled; otherwise, each will default to 1 if unspecified. This mode employs overlay mode for individual buddy files, and so it overrides any separate SideBySide setting.
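To make the defaulting rule concrete, here is a minimal Python sketch of how the two settings might be resolved into a grid. It is not MetaProxy's own code: the function name resolve_tile_grid, the sample INI text, and the use of Python's configparser are all assumptions made purely for illustration, and the section name simply follows the per-extension style used in this article.

```python
import configparser
from typing import Optional, Tuple

# Hypothetical INI text in the style of this article's examples.
SAMPLE_INI = """
[metaproxy]
CreateType=.meta

[.coll]
TileH=3
"""


def resolve_tile_grid(tile_h: int, tile_v: int) -> Optional[Tuple[int, int]]:
    """Return (columns, rows), or None when tiled mode is disabled.

    A value of 0 stands for 'unspecified' (the default described above).
    """
    if tile_h == 0 and tile_v == 0:
        return None  # both zero: tiled mode is off
    # Tiled mode is on, so an unspecified dimension falls back to 1.
    return (max(tile_h, 1), max(tile_v, 1))


def grid_for_section(ini_text: str, section: str) -> Optional[Tuple[int, int]]:
    """Read TileH/TileV from one INI section and resolve the grid."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    tile_h = parser.getint(section, "TileH", fallback=0)
    tile_v = parser.getint(section, "TileV", fallback=0)
    return resolve_tile_grid(tile_h, tile_v)


if __name__ == "__main__":
    print(grid_for_section(SAMPLE_INI, ".coll"))      # only TileH given
    print(grid_for_section(SAMPLE_INI, "metaproxy"))  # neither given
```

With only TileH=3 specified, the sketch yields three columns and one row; with neither specified, it reports tiled mode as disabled.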

If the root buddy file of the collection example in the original article (RisalpurCemetery.meta) is renamed to RisalpurCemetery.coll, then the INI-file might specify a grid of 3x2 for the display as follows:

[metaproxy]
CreateType=.meta

[.coll]
TileV=2
TileH=3

This would then result in a layout similar to the following, where each individual image data file is overlaid with its specific buddy file, and the original root buddy file (if it has no data file of its own) is tiled separately:


But the tiled mode is not just for collections. If a normal buddy file has multiple data files associated with it then they can be tiled in a similar way. For instance, given a buddy file called Test_ID-34.meta2 that's associated with two separate images (a *.jpg and a *.jpeg file) and a Word document (*.doc in this case), then an INI-file setting of:

[metaproxy]
CreateType=.meta

[.meta2]
TileH=3

would result in the following layout:




This shows the Word document and the two images spread across the width of the screen, and the buddy file overlaid on the last of the images. Where people have larger screens than the one used in this example then this becomes a convenient way to see all of the related details.

NB: if you're using the normal overlay mode (SideBySide=False setting) then specifying TileH=1 or TileV=1 will force the image viewer to occupy the full screen area rather than its default size and position.

Microsoft Mechanisms

While developing this tool, it became clear that Microsoft has a variety of ways for launching the viewer for data files (e.g. image viewers), and no central mechanism for finding their main windows. For instance:

1.    Normal process creation for the image viewer (or document viewer). The handle of its top-level window is then determined. Most cases fall into this category, including Microsoft Office Picture Manager, Microsoft Paint, Microsoft Word, and of course the Notepad text editor.

2.    When launching the viewer, the data file is simply handed over to an existing instance of the program, which then creates a new tab for it. Adobe Acrobat and Web browsers are examples of this.

3.    When the viewer is actually a DLL rather than an EXE, it has to be loaded into a special 'container process' called dllhost. Windows 'Photo Viewer' is an example of this. 

4.    When the viewer is one of the cut-down 'Store Apps' available under Windows 10 then it follows a different set of rules, and the normal Windows APIs have only limited access to it. 'Photos' (aka 'Microsoft.Photos.exe') is an example of this.

Note that if the data file is shown in the tab of a single-instance viewer (case 2) then the tiled mode mentioned above will not work as intended since the viewer cannot occupy more than one tile location.

Diagnostics

In the event of problems being reported within the Facebook support group, a diagnostic log file can now be generated via the program copy called metaproxy-D.exe (also available from the same Dropbox link).

These log files should be emailed to the author in order to assist in a resolution. There's a 'Contact Form' in the right-hand panel of this blog-post.

Thursday, 30 April 2020

Weight of Evidence


A very short educational piece, this time, on the subject of evidence.

Evidence is very important to genealogists. It is information that supports, or contradicts (as in 'evidence to the contrary'), some specific claim. When we make claims, such as those about parentage, then we need supporting evidence, and we also need to explain any evidence that appears to go against our claims.

But is that everything we need to know? Well, no; not all evidence carries the same weight. In order to illustrate this particular point, I want to introduce you to the 'raven paradox', sometimes known as 'Hempel's paradox' since it was formulated by philosopher Carl Gustav Hempel in the 1940s to illustrate a contradiction between inductive logic and intuition.

Hempel starts this paradox with the proposition 'all ravens are black'. This can be turned about-face to yield the equivalent proposition 'if something is not black then it is not a raven'. The first of these is quite straightforward, and the sight of a black raven would be evidence supporting that proposition. However, that about-face proposition is less straightforward because the sight of anything that is neither black nor a raven would be evidence for it. Hence, if you were eating a green apple then it's not black and it's not a raven, and so it supports the second proposition. The paradox is that something totally unrelated to ravens, or even to birds of any kind, appears to be evidence for that first proposition: 'all ravens are black'.

So what gives? Surely, the fact that you're eating a green apple, or wearing a red hat, or any number of unrelated observations, cannot really be evidence about the colour of ravens. Philosophers have debated this paradox ever since because that's what they like to do, but the answer is relatively simple. Yes, those observations really are evidence but their weight is so weak that they're effectively insignificant.

In order to understand what's going off, we need to consider the scope of the propositions. In this particular case, where the proposition is about discrete entities (ravens) and properties that are fixed (colour), then you can imagine sets of possibilities, but a more general scheme would involve abstract mathematical spaces of possibilities. Anyway, the spaces of possibilities for black-ravens, non-black-ravens, black-other, and non-black-other are vastly different in extent. Having an observation that supports non-black-other (an astronomically huge space) is insignificant compared to one that directly supports black-ravens, even though the propositions all-ravens-are-black and if-non-black-then-not-a-raven are logically equivalent. In contrast, if we observed just one instance of non-black-ravens (the space for which we've asserted to have zero extent) then it would be hugely significant.
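One way to put rough numbers on this is with a likelihood ratio, sketched below in Python. This is only a toy model under stated assumptions, not part of Hempel's formulation: the rival hypothesis is that exactly one raven is non-black, observations are made by picking at random from a known set, and the population figures (1,000 ravens, a billion non-black other things) are invented purely for illustration.

```python
from fractions import Fraction

# Toy model (all figures are illustrative assumptions):
#   H: every raven is black.
#   A: exactly `non_black_ravens` of the ravens are not black.
# A likelihood ratio above 1 means the observation favours H over A.


def ratio_black_raven(ravens: int, non_black_ravens: int) -> Fraction:
    """Pick a raven at random and find that it is black.

    P(black | H) = 1 and P(black | A) = (ravens - k) / ravens.
    """
    return Fraction(ravens, ravens - non_black_ravens)


def ratio_non_black_other(non_black_others: int, non_black_ravens: int) -> Fraction:
    """Pick a non-black thing at random and find that it is not a raven.

    P(not raven | H) = 1 and P(not raven | A) = others / (others + k).
    """
    return Fraction(non_black_others + non_black_ravens, non_black_others)


RAVENS = 1_000                    # assumed size of the raven space
NON_BLACK_OTHERS = 1_000_000_000  # assumed size of the non-black-other space

raven_lr = ratio_black_raven(RAVENS, 1)
apple_lr = ratio_non_black_other(NON_BLACK_OTHERS, 1)

if __name__ == "__main__":
    print("black raven     :", float(raven_lr))
    print("non-black other :", float(apple_lr))
```

Both ratios are above 1, so both observations count as evidence, but the green apple's ratio exceeds 1 by only one part in a billion. In the same model, a single non-black raven has probability zero under 'all ravens are black', which is why it would be decisive.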

The lesson, here, is that the same claim can be expressed in different, but logically equivalent ways, and this has a huge bearing on the significance of an item of information supporting the claim. The weight, or significance, of some evidence depends on the scope of the claim, and some cases — such as demonstrating beyond reasonable doubt that 'if something is not black then it is not a raven' — would be impractical to pursue. Putting things another way, the concept of 'sufficient evidence' depends on the scope, or the number of possibilities, covered by the claim.

Thursday, 26 March 2020

A Tree By Any Other Name Would Smell As Sweet


Given that the majority of genealogists are currently working on "their tree", it may be worth just taking a moment to understand what that means, and also what we think it means.

We may take it for granted that a family tree is a straightforward goal, and that its visualisation is equally straightforward. If so then you are going to be surprised.

Graph Theory

Mathematically, the concept of a tree is defined as part of graph theory, so let's just identify a few useful and accurate terms.

  • Vertices (or nodes) are the items being connected in the graph. Think of them as the persons in your family tree.
  • Edges (or links) are the connections between the vertices.
  • Path is a sequence of edges that joins a sequence of vertices.
  • Directed edge is one that has a specific direction. Graphs are usually directed or undirected according to the nature of all their edges.
  • Tree is an undirected graph in which any two vertices are connected by exactly one path. In other words, there is always a unique path to get from one vertex to any other.
  • Forest is an undirected graph in which any two vertices are connected by at most one path. In other words, the graph may have disjoint tree segments.
  • Layered graph drawing is a representation (not a graph type) in which the vertices of a directed graph are drawn in horizontal rows or layers to represent some common attribute (e.g. families or generations in a family tree).
  • Acyclic means that a graph has no directed cycles. In other words, there is no path that will loop you back to where you were.
  • Semi-directed cycle (or semi-cycle) is where the vertices of a loop are connected by directed edges but do not form a directed cycle. For instance, if three vertices, A,B,C, are connected by A→B, B→C, A→C then it would constitute a semi-directed cycle (C→A would have completed a directed cycle).
  • DAG is a directed acyclic graph. Such graphs are frequently used for temporal ordering (i.e. events, including lineage ones) because of the unidirectional nature of time.
Note that mathematically, the use of the term 'tree' is not simply a comment on a visualisation looking like the branches (or the roots) of a real tree.
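The tree and forest definitions above can be tested mechanically. The following Python sketch uses a union-find structure, which is just one convenient technique; the function names are hypothetical and the vertices are arbitrary labels.

```python
from typing import Hashable, Iterable, Optional, Tuple

Edge = Tuple[Hashable, Hashable]


def count_trees(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> Optional[int]:
    """Return the number of disjoint trees in an undirected graph.

    Returns None if the graph is not a forest, i.e. if some pair of
    vertices is joined by more than one path.
    """
    parent = {v: v for v in vertices}

    def find(v: Hashable) -> Hashable:
        # Follow parent links to the representative of v's group.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    trees = len(parent)  # every vertex starts as its own one-vertex tree
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return None  # a and b were already connected: a second path
        parent[root_a] = root_b
        trees -= 1
    return trees


def is_forest(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    """At most one path between any two vertices."""
    return count_trees(vertices, edges) is not None


def is_tree(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    """Exactly one path between any two vertices."""
    return count_trees(vertices, edges) == 1


if __name__ == "__main__":
    print(is_tree("ABC", [("A", "B"), ("B", "C")]))
    print(is_tree("ABC", [("A", "B"), ("B", "C"), ("A", "C")]))
    print(is_forest("ABCD", [("A", "B"), ("C", "D")]))
```

The second call adds the edge A-C to the three-vertex example, ignoring direction, and so the graph is no longer a tree or even a forest.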

Basic Family Trees

Although we expect a family tree to display in a top-down approach, where biological parents point to children, we cannot guarantee that the underlying data has a specific representation for the physical union between two people. Trying to equate that biological element with marriage is far too naive for real lineage. The data format known as GEDCOM is well-known to have a "family" concept that embraces two spouses, tellingly termed the husband and wife, and their associated children, but the fallacy is clear for all to see: a generic family is extremely hard to define, and the implied "nuclear family" is an idealised concept. Worse still, it is using the social concept of a family when it's the biological concept of a union that is meaningful for a "lineage-linked format". The format is also known to have had interpretational difficulties with the notion of a family, and has tried various ways to include adopted children, thus straying from a pure lineage-based linkage. In fact, all we can guarantee is that each person has just one progenitive father and one progenitive mother,[1] even if they're unknown.

Probably the simplest family tree is one where we show direct ancestors, known as an ancestry chart or pedigree chart. Because each vertex has two connected vertices on the level above, it also constitutes a binary tree.


Figure 1 - Binary pedigree chart.

But note that this representation (generated here by the SVG Family-Tree Generator, but not uncommon) has a single upward edge connecting to a bound pair of parent vertices. This is useful because it provides a handle to select details of the parents' specific union, and it helps with the visualisation (particularly in cases of step- or half-siblings) as simply having two independent edges pointing to each person's parent vertices would rapidly become hard to follow.

The converse of this illustration, usually called a descendancy chart (and sometimes incorrectly referred to as a descent-type pedigree chart; pedigree is about blood-line ancestors), is where we show the children of a common ancestor and their spouse(s), and then the children of the children, etc.


Figure 2 - Simple descendancy chart.

Complications

The first thing to note is that Fig.1 and Fig.2 represent extreme cases. Suppose that we were interested in our direct ancestors, but also their siblings and the children of their siblings. For instance:


Figure 3 - Chart showing ancestors, their siblings, and their children.

This small illustration works, but in general it would not be possible to display such relationships without lines all crossing over each other. Whether you want to do this depends on whether "your tree" is primarily for people carrying your surname, starting from some root ancestor. This rather sexist approach is still quite common, despite the fact that surnames do not carry our genetics.

An important issue is pedigree collapse, where an edge crosses over to other branches to create a semi-directed cycle. The following illustration is of a first-cousin marriage.


Figure 4 - First-cousin marriage.

The fact that there is, now, no unique path to get from these married cousins to their grandparents means that the chart is technically not a tree, although it is still a DAG.
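The first-cousin case can be checked with a small Python sketch. The modelling choices here are assumptions for illustration only: each union is given its own vertex (U0 to U3), in the spirit of the bound pair of parents described for Fig.1, and all of the labels are hypothetical. The graphlib module requires Python 3.9 or later.

```python
from functools import lru_cache
from graphlib import CycleError, TopologicalSorter

# Directed edges: person -> union (as a partner), union -> child.
# G = grandparent, P = their children (siblings), S = the siblings'
# spouses, C = the cousins, U = unions. All labels are hypothetical.
EDGES = [
    ("G1", "U0"), ("G2", "U0"), ("U0", "P1"), ("U0", "P2"),
    ("P1", "U1"), ("S1", "U1"), ("U1", "C1"),
    ("P2", "U2"), ("S2", "U2"), ("U2", "C2"),
    ("C1", "U3"), ("C2", "U3"),  # U3 is the first-cousin marriage
]


def is_dag(edges):
    """True if the directed graph has no directed cycle."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, set()).add(src)
        predecessors.setdefault(src, set())
    try:
        # static_order() raises CycleError if a directed cycle exists.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


def count_paths(edges, start, goal):
    """Count directed paths from start to goal (the graph must be a DAG)."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    @lru_cache(maxsize=None)
    def walk(vertex):
        if vertex == goal:
            return 1
        return sum(walk(child) for child in children.get(vertex, ()))

    return walk(start)


if __name__ == "__main__":
    print(is_dag(EDGES))                   # still a DAG
    print(count_paths(EDGES, "U0", "U3"))  # two routes to the marriage
```

There are two directed routes from the grandparents' union to the cousins' marriage, one through each sibling, so the path is not unique even though no directed cycle exists.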

We've mentioned that non-biological parents would create problems if placed on a chart depicting lineage, but why is that? Well, such parents are not exclusive of biological parents, and it's not uncommon for someone to have had foster parents and adoptive parents in addition to their biological parents. They're still part of the family history, irrespective of any personal preference to the contrary, but they need specialised tools for their visualisation.

Sequential marriages are relatively common, but related to this are half-siblings, step-siblings, non-marital unions, and non-paternity events (NPEs). The following illustration depicts a man who was married twice, having a son with the first wife and a daughter with the second. At some point, he had also had a non-marital union with a woman resulting in an illegitimate daughter (note that the green circle is changed, here, to reflect this status). Also, the man's second wife was previously married and had an associated son, plus a daughter that was the result of an NPE (note the dashed line reflecting this).


Figure 5 - Half-siblings, step-siblings, sequential marriages, non-marital unions, and NPEs.

Finally, suppose we have cause to include people who are not related by blood or by marriage. I suppose this could include the families of adoptive parents or guardians, but a bigger example would be anyone performing a one-name or one-place study.


Figure 6 - Disjoint trees.

Note that this illustration, which shows a neighbouring family, is technically called a forest as it consists of disjoint trees.

Databases

So, the connections between people are manifold in number and type, and the naive picture of everyone forming a single tree from some root ancestors (or possibly even Adam and Eve) is entirely unrealistic. Storing these connections in the data is not a problem, in principle, although there is no universal standard, and what we have is unlikely to have defined unambiguous ways of handling all the scenarios that we've highlighted. The real problem is in their visualisation!

What many people do not notice is that their genealogy software, be it desktop or online, usually presents just a workable section of the stored data at once. If you had 10,000+ people in your tree then it would look rather like a crocheted football field if presented all at once, but to present just a few generations around some person of interest — especially during maintenance of that tree — is much more useful, and easier. Such software allows you to navigate from one person of interest to another, and so will continue to support a naive impression of your "family tree".
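As a rough illustration of presenting "a few generations around some person of interest", the Python sketch below collects everyone within a given number of parent-child steps of a focus person. It is not taken from any particular product; the function name, the link format, and the sample names are all assumptions.

```python
from collections import deque


def neighbourhood(links, focus, max_steps):
    """Return {person: steps} for everyone within max_steps of focus.

    `links` is an iterable of (parent, child) pairs; steps are counted
    in either direction, so ancestors and descendants are both included.
    """
    adjacent = {}
    for parent, child in links:
        adjacent.setdefault(parent, set()).add(child)
        adjacent.setdefault(child, set()).add(parent)

    steps = {focus: 0}
    queue = deque([focus])
    while queue:
        person = queue.popleft()
        if steps[person] == max_steps:
            continue  # do not look beyond the requested range
        for other in adjacent.get(person, ()):
            if other not in steps:
                steps[other] = steps[person] + 1
                queue.append(other)
    return steps


# Hypothetical sample data: a single line of descent.
LINKS = [
    ("Grandad", "Dad"),
    ("Dad", "Me"),
    ("Me", "Child"),
    ("Child", "Grandchild"),
]

if __name__ == "__main__":
    print(sorted(neighbourhood(LINKS, "Me", 1)))
```

A breadth-first search is used so that each person is reached by the shortest route from the focus, however many other routes the stored data happens to contain.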

Of course, we'll continue to use the term "family tree", and we'll continue to think of it as looking like a real tree, with branches and roots, but if you could assimilate the underlying data as a computer would then you would realise how different and complex it really is.


[1] Actually, technology is capable of engineering children with DNA from three or more “parents” (see uk-government-ivf-dna-three-people).

Wednesday, 18 December 2019

Another Tree Can Be a Valid Source


I’m just taking a short break from my work to write about “valid sources”. I was prompted to do this after reading an article on the Family History Daily website entitled “Another Person’s Family Tree is Not a Valid Source”, posted approximately March 2018. The article is anonymous but Melanie Mayo-Laakso is the website’s founder and editor.

The thrust of the article is straightforward, and is not challenged here: that information from someone else’s tree is very likely to be inaccurate, and that you should at least verify the information in more reliable records before adding it to your own tree. This is particularly important since providers of online family trees make it oh-so-easy to copy information into your own tree, whether accurate and relevant, or not. Quoting from that article,

The issue arises from the fact that many people don’t view the information contained in a family tree any differently than they do the data found in a record source. When they are presented with individuals from a tree that appear to match their needs they see the data as existing research and very often copy the information without a thought.

The challenge presented here concerns the nature of a ‘source’, which online family trees have distorted in the minds of their users. Furthermore, I want to explain that family trees are “valid sources”, and that the difference is primarily in their degree of reliability.

First, let’s dispel some related myths:

  • A "source" is simply a source of information that you have used in some research, and not specifically information that you've followed blindly, or even that you agree with.
  • Genealogy is not just about discrete bits of information: the so-called “facts”.
  • No source is guaranteed to be factual, and all sources must be assessed with a critical eye, some more than others.
  • Many answers will never be found directly in a single source.

Why are the associated myths relevant? Well, these points suggest that there is more, in real research, than collecting discrete “facts”. Sometimes, you need to make a case that involves looking at multiple sources, and ones that may contain conflicting information. Writing up this type of inferential genealogy is what makes the difference between information (just something a source says) and evidence (something that substantiates, or refutes, a claim you have made). NB: This is not just something that professional or academic genealogists do, but people in other fields of research as well, although their terminology may differ.

Now the problem with online trees is that they circumvent this sequence, and subscribers are led to believe that “sources” yield discrete reliable "facts", and anything that doesn't yield such cannot be a "source". These trees can easily make a connection between such a discrete “fact” and some database entry, but that says nothing more than where the information came from. Very few trees — in fact, I have never seen one — include any type of narrative explaining why a cited database entry (or image) is in any way relevant, let alone analysing multiple sources to derive a considered conclusion when there are no direct answers.

Sources may be original or derivative, where a derivative may be close (e.g. a facsimile or a scan) or distant (e.g. transcribed or translated), and so at best an online tree must be considered a derivative form that compiles information from other sources. They are no less a source than any of the other derivative sources already offered by your genealogy provider, even though their accuracy may well be poorer. But no source is guaranteed to be accurate, whether it’s a database, an online image, or even a stamped birth certificate directly from the relevant government office.

Note that a source may also be an ‘authored work’, which is a form that looks at information from several other sources, and rather than simply compiling it, it analyses the information to derive specific conclusions. The nature of these works means that they have to consider all types of source, whether original or derivative, whether reliable or sloppy, whether agreeing or conflicting, whether primary or secondary information, whether official or private information, and even including authored works by other writers. To date, none of the genealogy providers have got their heads around this concept, and how it works in the rest of the research world (cf. “Research in Online Trees”), but the principles stand.

So, to summarise, in writing up your research, you can utilise whatever sources of information are relevant to your argument, as long as you evaluate them with the appropriate critical eye.