Saturday, 1 December 2018

The Future of Online Trees

Many people have written about the ills of online family trees, including me, but that won't stop me writing about them again. Yes, they're full of errors — more than some people want to admit — but is there an answer? Is there a future for them?

I first want to summarise some of those problems, and then suggest that the current situation is not sustainable. Much damage has already been done to the collective knowledge of our family histories, but also to the reputation of genealogy itself. If online trees are wrong then those errors will propagate through simplified collaboration, they will tarnish the interpretation of DNA data, and they will stymie any attempt to use AI technology to suggest further connections.

One basic problem is that users are encouraged to create their trees directly from raw digitised resources, including transcripts, held by the online provider. Although it is possible to include data from external sources, there is some debate about whether all providers acknowledge them when grading their trees. So what's the problem here? Is there any other way to create a tree?

Well, yes and no! A tree very rarely captures any of the logic employed when its relationships were determined, or of the histories of the individuals in the tree — usually prerequisite knowledge for solving the hard identity and relationship problems. Yes, they may include citations or electronic bookmarks to show where information came from, but that isn't the same as explaining why the information is correct or relevant. Also, they are incapable of supporting complex conclusions that have correlated information from multiple sources.

This sets the stage for unchecked errors, and when coupled with the ability to easily copy data from one tree to another, it means that those errors will proliferate. Not only that, the source of an error, whether known or not, will likely be permanent, even if the author has abandoned their tree.

I came upon such a case just the day before writing this article. I don't want to pick on this specific person but it typified a very real problem with online trees. I had contacted this person because their tree appeared to offer solutions to a deep problem that I was collaborating on with another person. On looking at the tree, I could see that the conclusions were very wrong (e.g. conflating two given names to create a name found nowhere in the evidence, simply to tie off an apparently anomalous reference), but there were no sources given. When I approached the owner, they admitted that they had received a gift of membership to this provider, during the period of which they copied data from a cousin's tree, and that they were no longer a fully-paid member. So even though they no longer dabbled with genealogy, their naive tree would outlive their membership and be used by other beginners, ad infinitum. Ideally, the industry should be generating something of historical value, but all we have is a cauldron of ills and woes.

Such cases are compounded by the fact that no provider I know of offers a mechanism to flag a tree, or part thereof, as tentative, under construction, or otherwise a work in progress. This would be useful for experienced researchers as well as beginners, and could go some way to stopping the proliferation of erroneous conclusions.

A more complete solution would be to make room for research narrative (not just pieces of plain text stuffed into the records for specific persons or families), and I have proposed this several times before, e.g. Feeding the Trees. The problem here is that there isn't much inclination amongst providers to do this, possibly because they don't want to be first, or because rich text (formatted with images, tables, citations, etc.) is somehow harder to handle, or simply because they don't see the need when everyone has a word processor.

All is not lost, though, because there are genealogists out there who write up their research and make it publicly accessible, usually via a blog. You must be able to see where I'm taking this, now. If there were a way to reference such research in an online tree then it would relieve the more casual user of the burden of the heavy lifting. In effect, rather than reference raw digitised records directly, and with no logic explained, they could reference work contributed by others: work that explained how conclusions were reached as well as citing the relevant sources.

Well, I have proposed this before, too: Blogs as Genealogical Sources suggested that published research (especially in blogs) could be indexed as external sources by providers. It also suggested that the indexing of an article URL would be done with the author's permission, and with no bulk text being copied or indexed by the provider, since such actions would violate copyright and divert traffic away from that author's article. This win-win scenario wouldn't require a large investment by the provider, and so it was strange that only one showed any interest and asked me for more details. Such work can already be found by researchers, but only via general-purpose search engines such as Google. How much more convenient it would be if genealogy subscribers saw such articles as another source group and were allowed to perform genealogical searches that included them.

I had heard rumours of a past RootsTech announcement by Findmypast of an initiative called "verified reference trees" but I could find no official mention of it. It sounded good but would probably be impractical — who could produce them? How many would you be likely to generate? How much of a given pedigree could one contributor realistically cover?

What I'm saying here is that online trees need a rethink if they are to have a future. Simply putting more and more digitised resources at their disposal is not bad, but there are more fundamental problems that need to be addressed, ones that will ultimately come back to bite those providers who are ignoring the future.

Will we still be working on the same trees in 20 years' time, and in the same way? Some pundits have suggested that our family histories will have all been resolved by then through collaborative unified trees. ROTFLMAO! There are already more inaccurate conclusions online than accurate ones, and it is not easy to tell which are which; citations alone do not cut it! Some other pundits have suggested DNA testing will solve everything. ROTF again! Even limiting yourself to biological lineage, and ignoring much family history, there are cases for which DNA can never help. For instance, read my conclusion at Jesson Lesson.

There are two very broad categories of user that providers need to consider: those who are experienced and likely to do a lot of the research leg work, sharing their research details with others; and those who want to play with online tools to create a simple tree of their own. Although the latter category may be able to contribute valuable knowledge and lore from recent generations, most would struggle to reconstruct older generations from raw information alone for the simple fact that it's hard.

So whose work should they use? Well, trees don't need to be officially "verified". Works of written research can be read by users, any users, who can then judge for themselves the depth of the research and the soundness of its conclusions. I'm not suggesting that these users can't use raw information, only that published research will give them more rungs on the ladder; bigger pieces of the jigsaw to assemble.

Another RootsTech announcement that I recall was that of employing AI to find new connections in the data. This is worth a paragraph of its own before leaving this subject because AI is akin to black magic for many people. Applying AI to raw information is a recipe for disaster because, as any researcher knows, such information can never be taken at face value. AI would simply be correlating information from many record groups and attempting naive identity resolution and family reconstruction. It would know nothing of real-life situations and the reasons that records may have lied or that families had moved, or of real human relationships and the reasons they may have been forged or broken. The naivety of such algorithms would ultimately manifest as the appearance of fictitious people, or of duplicates for the same person (merging of inappropriate people would also be a risk but probably to a lesser extent). There will always be the need for human researchers, and any application of AI must have their additional inputs.

So, has anyone seriously tried to envisage what genealogists will be doing in a couple of decades' time? Will the same providers still be offering online trees for users to build from raw information, and will beginners be researching the same people as we are doing now? Will all of today's trees still be online then, and will the error rate in those trees be reduced? If so then how? Product Managers beware! Your industry's future is at stake unless some visionaries step forward. Genealogy is big business at the moment, but it cannot remain as it currently is. Companies may be throwing their hat at DNA testing, but what happens when that fad passes its peak and it is taken for granted? A company's valuation is dependent on a predicted future for its market, so buyer beware also.

[see follow-up at Research in Online Trees]