Many people have written about the ills of online family trees,
including me, but that won't stop me writing about them again. Yes, they're
full of errors — more than some people want to admit — but is there an answer?
Is there a future for them?
I first want to summarise some of those problems, and then
suggest that the current situation is not sustainable. Much damage has already
been done, not only to the collective knowledge of our family histories but also to the reputation
of genealogy itself. If online trees are wrong then those errors will propagate
through simplified collaboration, they will tarnish the interpretation of DNA
data, and they will stymie any attempt to use AI technology
to suggest further connections.
One basic problem is that users are encouraged to create
their trees directly from raw digitised resources, including transcripts, held
by the online provider. Although it is possible to include data from external
sources, there is some debate about whether all providers acknowledge them when
grading their trees. So what's the problem here? Is there any other way to
create a tree?
Well, yes and no! A tree very rarely captures any of the
logic employed when its relationships were determined, or of the histories of
the individuals in the tree — usually prerequisite knowledge for solving the
hard identity and relationship problems. Yes, they may include citations or
electronic bookmarks to show where information came from, but that isn't the
same as explaining why the information is correct or relevant. Also, they are incapable
of supporting complex conclusions that have correlated information from
multiple sources.
This sets the stage for unchecked errors and, when
coupled with the ability to easily copy data from one tree to another, it
means that the errors will proliferate. Not only that, the source of an error,
whether known or not, will likely be permanent, even if the author has abandoned
their tree.
I came upon such a case just the day before writing this
article. I don't want to pick on this specific person but it typified a very
real problem with online trees. I had contacted this person because their tree
appeared to offer solutions to a deep problem that I was collaborating on with
another person. On looking at the tree, I could see that the conclusions were
very wrong (e.g. conflating two given names to create a name found nowhere in
the evidence, simply to tie off an apparently anomalous reference), and there
were no sources given. When I approached the owner, they admitted that they had received
a gift of membership to this provider, during the period of which they copied
data from a cousin's tree, and that they were no longer a fully-paid member. So
even though they no longer dabbled with genealogy, their naive tree would
outlive their membership and be used by other beginners, ad infinitum. Ideally, the industry should be generating something
of historical value, but all we have is a cauldron of ills and woes.
Such cases are compounded by the fact that no provider I
know of offers a mechanism to flag a tree, or part thereof, as tentative,
under-construction, or otherwise work-in-progress. This would be useful for
experienced researchers as well as beginners, and could go some way to stopping
the proliferation of erroneous conclusions.
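As a rough illustration of how little such a flag would need, here is a minimal sketch in Python. It assumes a hypothetical `Status` value attached to each person in a tree; the names `Status`, `Person`, and `copyable` are invented for this example and do not correspond to any provider's data model.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical confidence flag for a person or branch in a tree."""
    TENTATIVE = "tentative"
    UNDER_CONSTRUCTION = "under construction"
    ESTABLISHED = "established"


@dataclass
class Person:
    name: str
    # Assumption: entries start as tentative until the owner says otherwise.
    status: Status = Status.TENTATIVE


def copyable(people):
    """Return only the people whose owner has marked them as established."""
    return [p for p in people if p.status is Status.ESTABLISHED]


tree = [
    Person("Person A", Status.ESTABLISHED),
    Person("Person B"),  # a working guess, left flagged as tentative
]
print([p.name for p in copyable(tree)])
```

With a flag like this, a copy-to-my-tree feature could warn about, or simply skip, anything the owner had not marked as established.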
A more complete solution would be to make room for research
narrative — not just pieces of plain text stuffed into the records for specific
persons or families — and I have proposed this several times before, e.g. Feeding
the Trees. The problem here is that there isn't much inclination amongst providers
to do this, possibly because they don't want to be first, or rich-text
(formatted with images, tables, citations, etc.) is somehow harder to handle,
or simply that they don't see the need when everyone has a word processor.
All is not lost, though, because there are genealogists out
there who write up their research and make it publicly accessible, usually via
a blog. You must be able to see where I'm taking this, now. If there were a way
to reference such research in an online tree then it would relieve the
more-casual user of the burden of the heavy lifting. In effect, rather than
reference raw digitised records directly, and with no logic explained, they
could reference work contributed by others — work that explained how
conclusions were reached as well as citing the relevant sources.
Well, I have proposed this before, too: Blogs
as Genealogical Sources suggested that published research (especially in
blogs) could be indexed as external sources by providers. It also proposed that the
indexing of an article URL would be done with the author's permission, and with
no bulk text being copied or indexed by the provider, since those are actions that would
violate copyright and divert traffic away from that author's article. This
win-win scenario wouldn't require a large investment by the provider, and so it
was strange that only one showed any interest and asked me for more details.
Such work can already be found by researchers, but only via general-purpose
search engines such as Google. How much more convenient it would be if genealogy subscribers
saw such articles as another source group and were allowed to perform genealogical searches
that included them.
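To make the idea concrete, here is a minimal sketch of what an index entry for an article might hold, assuming the provider stores only the URL, some descriptive metadata, and the author's consent, and never the article text. The `ArticleEntry` fields, the `search` function, and the example.org URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArticleEntry:
    """Hypothetical index record: metadata only, never the article text."""
    url: str
    title: str
    author: str
    author_permission: bool
    surnames: frozenset = field(default_factory=frozenset)  # stored lower-case
    places: frozenset = field(default_factory=frozenset)    # stored lower-case


def search(index, surname, place=None):
    """Return URLs of permitted articles that mention a surname (and place)."""
    hits = []
    for entry in index:
        if not entry.author_permission:
            continue  # index only with the author's consent
        if surname.lower() not in entry.surnames:
            continue
        if place is not None and place.lower() not in entry.places:
            continue
        hits.append(entry.url)  # the reader is sent to the author's own page
    return hits


index = [
    ArticleEntry("https://example.org/blog/smith-family", "A Smith family puzzle",
                 "A. Blogger", True, frozenset({"smith"}), frozenset({"nottingham"})),
    ArticleEntry("https://example.org/blog/private-notes", "Private notes",
                 "B. Blogger", False, frozenset({"smith"})),
]
print(search(index, "Smith", "Nottingham"))
```

Because a search returns only the URL, the reader still has to visit the author's article to see the reasoning, which keeps both the copyright and the traffic with the author.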
I had heard rumours of a past RootsTech announcement by Findmypast
of an initiative called "verified reference trees" but I could find no official mention of it. It
sounded good but would probably be impractical — who could produce them? How
many would you be likely to generate? How much of a given pedigree could one
contributor realistically cover?
What I'm saying here is that online trees need a rethink if
they are to have a future. Simply putting more and more digitised resources at
their disposal is not bad, but there are fundamental problems that need to
be addressed, ones that will ultimately come back to bite those providers who
are ignoring the future.
Will we still be working on the same trees in 20 years' time,
and in the same way? Some pundits have suggested that our family histories will
have all been resolved by then through collaborative unified trees. ROTFLMAO! There
are already more inaccurate conclusions online than accurate ones, and it is
not easy to tell which are which — citations alone do not cut it! Some other
pundits have suggested DNA testing will solve everything. ROTF again! Even
limiting yourself to biological lineage, and ignoring much family history,
there are cases for which DNA can never help. For instance, read my conclusion
at Jesson Lesson.
There are two very broad categories of user that providers
need to consider: those who are experienced and likely to do a lot of the research
leg-work, sharing their research details with others; and those who want to
play with online tools to create a simple tree of their own. Although the
latter category may be able to contribute valuable knowledge and lore from
recent generations, most would struggle to reconstruct older generations from
raw information alone, for the simple reason that it's hard.
So whose work should they use? Well, trees don't need to be
officially "verified". Works of written research can be read by any user, who can then judge for themselves the depth of the research
and the soundness of the conclusions. I'm not suggesting that these users
can't use raw information, only that published research will give them more
rungs on the ladder; bigger pieces of the jigsaw to assemble.
Another RootsTech announcement that I recall was that of
employing AI to find new connections in the data. This is worth a paragraph of
its own before leaving this subject because AI is akin to black magic for many people. Applying AI to raw information is a recipe
for disaster because, as any researcher knows, such information can never be taken at face
value. AI would simply be correlating information from many record groups and
attempting naive identity resolution and family reconstruction. It would know
nothing of real-life situations and the reasons that records may have lied or that
families had moved, or of real human relationships and the reasons they may
have been forged or broken. The naivety of such algorithms would ultimately
manifest as the appearance of fictitious people, or of duplicates for the same
person (merging of inappropriate people would also be a risk but probably to a lesser
extent). There will always be the need for human researchers, and any
application of AI must have their additional inputs.
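The duplication risk is easy to demonstrate with a deliberately naive sketch. This is not how any provider's AI works; it is an assumed, simplistic matcher that treats records as the same person only when name and birth year agree exactly, and the three records are invented for the example.

```python
from collections import defaultdict


def naive_resolve(records):
    """Group records into 'people' by exact (name, birth year) match."""
    people = defaultdict(list)
    for rec in records:
        key = (rec["name"].lower(), rec["born"])
        people[key].append(rec["source"])
    return dict(people)


# Invented records for one woman; the last has a variant spelling
# and a birth year inferred from a rounded age.
records = [
    {"source": "baptism register", "name": "Ann Smith", "born": 1841},
    {"source": "census A", "name": "Ann Smith", "born": 1841},
    {"source": "census B", "name": "Anne Smyth", "born": 1842},
]

people = naive_resolve(records)
print(len(people))  # 2: one real person has become two
```

Loosening the match so that "Anne Smyth" is merged would cure this duplicate but would also start merging genuinely different people, which is exactly the judgement that needs a human researcher's knowledge of the family and the records.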
So, has anyone seriously tried to envisage what genealogists
will be doing in a couple of decades' time? Will the same providers still be
offering online trees for users to build from raw information, and will
beginners be researching the same people as we are doing now? Will all of
today's trees still be online then, and will the error rate in those trees be
reduced? If so then how? Product Managers beware! Your industry's future is at
stake unless some visionaries step forward. Genealogy is big business at the
moment, but it cannot remain as it currently is. Companies may be throwing
their weight behind DNA testing, but what happens when that fad passes its peak and it
is taken for granted? A company's valuation is dependent on a predicted future
for its market, so buyer beware also.
[see follow-up at Research in Online Trees]