Thursday, 6 December 2018

Research in Online Trees

My previous post, The Future of Online Trees, prompted a flurry of reaction, most of which was positive; however, it did suggest, implicitly, that many people find it hard to think beyond their entrenched views, and that my explanation may have assumed too much.

This follow-up collects some of the explanatory comments that I have since posted around the Web, and tries to make a coherent argument for what genealogical research should entail.

The previous article made several negative comments about existing online trees, including:

  • That assembling their associated conclusions directly from raw digitised information is not always easy, and gets very hard as you go further back in time (e.g. before census returns and civil registration).
  • That it's hard to tell naive trees from properly researched ones, and that, no, a bunch of citations is not a useful indicator.
  • That naive trees usually persist long after a creator may have abandoned them, and could steer new researchers down the wrong path.
  • That proof arguments (i.e. reasoned explanation), as opposed to simple proof statements (i.e. citations), are almost never provided.
The primary issue behind my article was that many genealogical conclusions require their research work to be written up in order for them to be assessed, and for that research to be cited either by trees or other research work. Proof statements alone are only applicable when the sources offer direct answers and do not conflict with each other, but identity problems and family reconstruction can require lengthy arguments that examine multiple sources. The results will often address groups of correlated people rather than just some specific person, and so it's not realistic to expect that work to be tucked away in a single person entity (on a tree) or in a single person page.

An associated issue is that the contributors to online trees (and probably the users of genealogy software in general) routinely talk about individual "claims", and the supporting sources for those claims, as though they're all independent of each other. This is a fallacy! The idea that a specific claim can be justified in isolation, and linked directly to one or more sources that give a direct answer, is a huge oversimplification of the research process, and yet this is a mindset that is hard to argue with.

One of the positive things I suggested in my previous article (possibly the only one) was that there are researchers who do publish their work online (e.g. in blogs), and that online trees could reference their research as "authored works": a recognised source category that supplements those of "original" and "derivative". There is no issue at all with representing this in GEDCOM files (the data format most often used to transfer data between two places), nor any significant issue with online providers recognising such work as a specific source category.
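As a rough illustration of the GEDCOM point, the sketch below builds a source record for a blog article and a citation that points at it. The SOUR, TITL, AUTH, PUBL, NOTE, and PAGE tags are standard GEDCOM 5.5.1 tags; everything else (the function names, the example title, author, and URL, and the use of a NOTE to carry the "authored work" category) is a hypothetical choice made for this example, since GEDCOM 5.5.1 has no dedicated tag for source category.

```python
def authored_work_record(xref, title, author, publication, url):
    """Return GEDCOM lines for a source record describing a written research article."""
    return [
        f"0 @{xref}@ SOUR",
        f"1 TITL {title}",
        f"1 AUTH {author}",
        f"1 PUBL {publication}",
        # Assumption: the category is carried in a plain note, not a standard tag.
        "1 NOTE Source category: authored work",
        f"1 NOTE {url}",
    ]


def cite(xref, where=None):
    """Return GEDCOM lines citing the source record from a person or family record."""
    lines = [f"1 SOUR @{xref}@"]
    if where:
        lines.append(f"2 PAGE {where}")
    return lines


# Hypothetical article details, for illustration only.
record = authored_work_record(
    "S1",
    "Identity of the two James Astlings",
    "Example Author",
    "Example Blog, 2018",
    "https://example.org/blog/two-james-astlings",
)
citation = cite("S1", "Section 3")

print("\n".join(record + citation))
```

The tree then carries only the pointer, while the reasoning stays in the cited article where it can be read and assessed.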

Many traditional genealogists write up their work in academic journals, but this is more about kudos than about helping a community of genealogists: few of us will be subscribers to such journals. This is a shame because they cannot distance themselves from online genealogy, nor ignore the associated problems, because we're all tarred with the same brush. If we describe our work as "genealogy" then it will be linked automatically to the prevailing impression of its most common form: online trees.

It may be hard to see what I'm getting at if you haven't participated in research in other fields. All the fields that I am aware of, such as science and medicine, rely on published works. This could be in journals or online, but by far their biggest difference from what is currently considered genealogical research is that newer works cite older works. The consensus is then built up through layers of research, each of which may support or refute previous work. There's a saying about standing on the shoulders of giants, and it makes perfect sense: someone could have spent a lifetime solving one particular deep mystery, and so to expect someone else (beginner or otherwise) to find the same answer directly from raw online information is unrealistic. I cannot think of any other area of research that works as genealogy currently does, where conclusions are either copied blindly from those of someone else or constructed independently from raw information. This is a little like a surgeon creating an independent textbook themselves by simply dissecting the evidence (a cadaver in this case). Knowledge and progress come from sharing research, and from building on the research of others. It's step-wise, progressive, and takes time. And without seeing any written research, you cannot tell whether someone reached their conclusions in 30 minutes or 30 years.

So what size of work are we talking about? Is it just a single paragraph? Well, it could be, or it could be a couple of thousand words, as with several blog articles that I've encountered. I have two unpublished works of 5000 words, myself, that I want to contribute to the community, and for posterity, but also a work-in-progress that is already at 10,000 words — such is the complexity.

A note on the use of wikis as a medium for collaborative research is necessary because they were mentioned by a few people. It is true that wikis can be, and are, used for such research purposes, but they have significant weaknesses. They are often limited in the richness of their presentation — usually amounting to more of a protracted discussion, as the old BetterGEDCOM wiki demonstrated — but genealogical research requires support for rich formatting, images, tables, and citations. Not all blogs offer this, but there are usually ways of achieving it (see Summarised Blogger Tips for instance). Wikis have little, if any, editorial control, and no attribution support beyond their confines. Also, that they constitute a confined medium — forcing people to contribute outside any personal medium or prior work — would put too many people off. By contrast, blogs are not confined, they may be linked or associated with other work by the author, and their articles have immediate attribution. Wikipedia was also mentioned as an example of successful collaboration, but it has strict rules that prevent original work or theories being presented. It relies on secondary sources, and so implicitly collects information that is already in the mainstream. This certainly doesn't prevent edit wars but it does place it apart from collaborative wiki-based genealogy.

So, my suggestion is to separate attributable research work from tree-based conclusions, and to cite such work for the harder cases rather than just some raw information. This suggestion is not rocket science so why aren't we doing it?

Saturday, 1 December 2018

The Future of Online Trees

Many people have written about the ills of online family trees, including me, but that won't stop me writing about them again. Yes, they're full of errors — more than some people want to admit — but is there an answer? Is there a future for them?

I first want to summarise some of those problems, and then suggest that the current situation is not sustainable. Much damage has already been done to the collective knowledge of our family histories, but also to the reputation of genealogy itself. If online trees are wrong then those errors will propagate through simplified collaboration, they will tarnish the interpretation of DNA data, and they will stymie any attempt to use AI technology to suggest further connections.

One basic problem is that users are encouraged to create their trees directly from raw digitised resources, including transcripts, held by the online provider. Although it is possible to include data from external sources, there is some debate about whether all providers acknowledge them when grading their trees. So what's the problem here? Is there any other way to create a tree?

Well, yes and no! A tree very rarely captures any of the logic employed when its relationships were determined, or of the histories of the individuals in the tree — usually prerequisite knowledge for solving the hard identity and relationship problems. Yes, they may include citations or electronic bookmarks to show where information came from, but that isn't the same as explaining why the information is correct or relevant. Also, they are incapable of supporting complex conclusions that have correlated information from multiple sources.

This sets the stage for errors that go unchecked, and when coupled with the ability to easily copy data from one tree to another, it means that the errors will proliferate. Not only that, the source of an error, whether known or not, will likely be permanent, even if the author abandoned their tree.

I came upon such a case just the day before writing this article. I don't want to pick on this specific person but it typified a very real problem with online trees. I had contacted this person because their tree appeared to offer solutions to a deep problem that I was collaborating on with another person. On looking at the tree, I could see that the conclusions were very wrong (e.g. conflating two given names to create a name found nowhere in the evidence, simply to tie off an apparently anomalous reference), but there were no sources given. When I approached the owner, they admitted that they had received a gift of membership to this provider, during the period of which they copied data from a cousin's tree, and that they were no longer a fully-paid member. So even though they no longer dabbled with genealogy, their naive tree would outlive their membership and be used by other beginners, ad infinitum. Ideally, the industry should be generating something of historical value, but all we have is a cauldron of ills and woes.

Such cases are compounded by the fact that no provider I know of offers a mechanism to flag a tree, or part thereof, as tentative, under-construction, or otherwise work-in-progress. This would be useful for experienced researchers as well as beginners, and could go some way to stopping the proliferation of erroneous conclusions.

A more complete solution would be to make room for research narrative — not just pieces of plain text stuffed into the records for specific persons or families — and I have proposed this several times before, e.g. Feeding the Trees. The problem here is that there isn't much inclination amongst providers to do this, possibly because they don't want to be first, or rich-text (formatted with images, tables, citations, etc.) is somehow harder to handle, or simply that they don't see the need when everyone has a word processor.

All is not lost, though, because there are genealogists out there who write up their research and make it publicly accessible, usually via a blog. You must be able to see where I'm taking this, now. If there was a way to reference such research in an online tree then it would relieve the more-casual user of the burden of the heavy lifting. In effect, rather than reference raw digitised records directly, and with no logic explained, they could reference work contributed by others: work that explained how conclusions were reached as well as citing the relevant sources.

Well, I have proposed this before, too: Blogs as Genealogical Sources suggested that published research (especially in blogs) could be indexed as external sources by providers. Also, that the indexing of an article URL would be done with the author's permission, and with no bulk text being copied or indexed by the provider (actions that would violate copyright and divert traffic away from that author's article). This win-win scenario wouldn't require a large investment by the provider, and so it was strange that only one showed any interest and asked me for more details. Such work can already be found by researchers, but only via general-purpose search engines such as Google. How much more convenient it would be if genealogy subscribers saw them as another source group and were allowed to perform genealogical searches that included them.

I had heard rumours of a past RootsTech announcement by Findmypast of an initiative called "verified reference trees" but I could find no official mention of it. It sounded good but would probably be impractical — who could produce them? How many would you be likely to generate? How much of a given pedigree could one contributor realistically cover?

What I'm saying here is that online trees need a rethink if they are to have a future. Simply putting more and more digitised resources at their disposal is not bad, but there are more fundamental problems that need to be addressed, ones that will ultimately come back to bite those providers who are ignoring the future.

Will we still be working on the same trees in 20 years' time, and in the same way? Some pundits have suggested that our family histories will have all been resolved by then through collaborative unified trees. ROTFLMAO! There are already more inaccurate conclusions online than accurate ones, and it is not easy to tell which are which; citations alone do not cut it! Some other pundits have suggested DNA testing will solve everything. ROTF again! Even limiting yourself to biological lineage, and ignoring much family history, there are cases for which DNA can never help. For instance, read my conclusion at Jesson Lesson.

There are two very broad categories of user that providers need to consider: those who are experienced and likely to do a lot of the research leg work, sharing their research details with others; and those who want to play with online tools to create a simple tree of their own. Although the latter category may be able to contribute valuable knowledge and lore from recent generations, most would struggle to reconstruct older generations from raw information alone for the simple fact that it's hard.

So whose work should they use? Well, trees don't need to be officially "verified". Works of written research can be read by users — any users — who can then judge for themselves on the depth of the research and of the soundness of its conclusions. I'm not suggesting that these users can't use raw information, only that published research will give them more rungs on the ladder; bigger pieces of the jigsaw to assemble.

Another RootsTech announcement that I recall was that of employing AI to find new connections in the data. This is worth a paragraph of its own before leaving this subject because AI is akin to black magic for many people. Applying AI to raw information is a recipe for disaster because, as any researcher knows, it can never be taken at face value. AI would simply be correlating information from many record groups and attempting naive identity resolution and family reconstruction. It would know nothing of real-life situations and the reasons that records may have lied or that families had moved, or of real human relationships and the reasons they may have been forged or broken. The naivety of such algorithms would ultimately manifest as the appearance of fictitious people, or of duplicates for the same person (merging of inappropriate people would also be a risk but probably to a lesser extent). There will always be the need for human researchers, and any application of AI must have their additional inputs.

So, has anyone seriously tried to envisage what genealogists will be doing in a couple of decades' time? Will the same providers still be offering online trees for users to build from raw information, and will beginners be researching the same people as we are doing now? Will all of today's trees still be online then, and will the error rate in those trees be reduced? If so then how? Product Managers beware! Your industry's future is at stake unless some visionaries step forward. Genealogy is big business at the moment, but it cannot remain as it currently is. Companies may be throwing their hat at DNA testing, but what happens when that fad passes its peak and it is taken for granted? A company's valuation is dependent on a predicted future for its market, so buyer beware also.

[see follow-up at Research in Online Trees]

Thursday, 15 November 2018

Organising Digital Resources

I want to take you on a brief tour of what it means to index your digital resources, and how this is a better method of organising them than creating physical connections. Although this article is primarily about online genealogical resources, it equally applies to local ones on your personal computer.

A surprising number of Web sites still try to organise resources using a physical hierarchy between their pages. For instance, one that had pages related to places might be organised according to the associated geographical hierarchy.

Figure 1 – Naive implementation of hierarchical organisation.

As explained at Impermanent Links, this is a bad method for multiple reasons including the fact that it ties the URLs to one particular layout, and that layout is not particularly useful from the perspective of the Web server (e.g. for maintenance).

An earlier article, Hierarchical Sources, explained that organising physical resources by their provenance and then indexing them according to how you want to see or access them is not only preferable but would be the archival approach. But what do I mean by "indexing"?

In order to explain further, I need to review some terminology because people from different backgrounds may use the same terms in dissimilar ways. People who are familiar with academic articles and journal submissions will understand the difference between keywords and index-terms: keywords are specific terms identified by the author — usually in the abstract rather than in the body — and which would be found through a full-text search; index-terms are categories or topics used externally to aid document retrieval. Index-terms are not usually chosen by the author and are part of a controlled vocabulary, meaning that there are no problems with synonyms, variant spellings, or name clashes. However, in the field of database indexes, the term key is a synonym of index-term, and so terms such as keyword and keyword-search may then be ambiguous.

So, to summarise things so far, it is better to organise resources according to their provenance, or their innate properties (as opposed to content), and to separately index them according to their subjects or categories. Grouping resources by provenance makes it easier to describe them (e.g. where and when they were obtained), to move/copy them, to supplement them (with related or new resources), or to support versioning. Any hierarchical organisation can then be done entirely through external indexing.

But aren't index-terms simply tags with no implied hierarchy? How would the place example (above) be handled? Index-terms are not tags; that's a description that better fits the concept of keywords. Controlled index-term vocabularies are very often defined to be hierarchical, and in the place example "Nottinghamshire" would be a term subordinate to "England". Whether the category name itself, e.g. "Place", is considered a top-level of the hierarchy is a design choice.

The definition of such a vocabulary would not only indicate which are subordinate to which, but may have associated meta-data that described each term and enumerated any alternative spellings or historical variations. Such meta-data would therefore assist in the selection of an appropriate index-term, after which resources matching that term would be unambiguous. Note, too, that if the term names had variations in other languages (e.g. occupations) then the controlled vocabulary is untouched, and those variations are enumerated in the same meta-data; whether the user selected "butcher" or "boucher", the retrieval of the resource would still use the index-term "Butcher".
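A minimal sketch of such a vocabulary, assuming a simple in-memory structure: each index-term carries a description and a list of variants, and user input is resolved to the canonical term before any retrieval takes place. The class name, function names, descriptions, and the variants other than "boucher" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class IndexTerm:
    """One entry in a controlled vocabulary, with its descriptive meta-data."""
    name: str                       # canonical index-term, e.g. "Butcher"
    description: str = ""
    variants: Tuple[str, ...] = ()  # alternative spellings or other languages


# Hypothetical occupation vocabulary.
OCCUPATIONS = [
    IndexTerm("Butcher", "Seller of meat", ("boucher",)),
    IndexTerm("Tailor", "Maker of clothes", ("taylor", "tailleur")),
]


def build_lookup(terms: Iterable[IndexTerm]) -> Dict[str, str]:
    """Map every canonical name and variant (case-insensitively) to the canonical name."""
    lookup: Dict[str, str] = {}
    for term in terms:
        for label in (term.name, *term.variants):
            lookup[label.casefold()] = term.name
    return lookup


def resolve(user_input: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the canonical index-term for the user's input, or None if unknown."""
    return lookup.get(user_input.strip().casefold())


LOOKUP = build_lookup(OCCUPATIONS)
print(resolve("butcher", LOOKUP))
print(resolve("boucher", LOOKUP))
```

Both calls resolve to the same index-term, "Butcher", so the vocabulary itself never needs to change when a new variant is added to the meta-data.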

Figure 2 – Organisation by two hierarchical indexes.

This diagram illustrates that a major advantage of this approach is that you can support multiple independent hierarchies. Organising resources according to both place and surname would not be possible using a single physical page hierarchy, and would be a maintenance nightmare if you tried to implement multiple physical hierarchies. A resource indexed by both the place terms "Epperstone" and "Screveton" could also be selected through the use of the common parent "Nottinghamshire", or even "England". Also, a resource relevant to both the surname Lincoln and the English town of Lincoln could be unambiguously indexed according to both with no confusion at all.
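The selection by a common parent can be sketched as follows, assuming each index-term is stored as a dotted path from the top of its hierarchy and each index is a plain mapping from path to resource identifiers. The function names and the resource identifiers ("doc-001" and so on) are hypothetical.

```python
from collections import defaultdict

# Two independent indexes: term path -> set of resource identifiers.
place_index = defaultdict(set)
surname_index = defaultdict(set)


def index_resource(index, term_path, resource_id):
    """Associate a resource with one index-term; the resource itself is untouched."""
    index[term_path].add(resource_id)


def select(index, term_path):
    """Return resources indexed under term_path or under any subordinate term."""
    prefix = term_path + "."
    found = set()
    for path, resources in index.items():
        if path == term_path or path.startswith(prefix):
            found |= resources
    return found


# One resource indexed by two places, as in the example above.
index_resource(place_index, "England.Nottinghamshire.Epperstone", "doc-001")
index_resource(place_index, "England.Nottinghamshire.Screveton", "doc-001")
# The town of Lincoln and the surname Lincoln live in separate indexes.
index_resource(place_index, "England.Lincolnshire.Lincoln", "doc-002")
index_resource(surname_index, "Lincoln", "doc-002")
index_resource(surname_index, "Lincoln", "doc-003")

print(select(place_index, "England.Nottinghamshire"))
print(select(place_index, "England"))
print(select(surname_index, "Lincoln"))
```

Because the two indexes are separate, the place "Lincoln" and the surname "Lincoln" never collide, and adding a further index would mean adding another mapping rather than moving any resource.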

As many hierarchical indexes could be used as necessary, and new ones could be added without touching the underlying resources.

Figure 3 – Organising by multiple hierarchical indexes.

As well as being separately hierarchical, these indexes are inclusive. That means resources can be selected based on whether they are associated with terms in different dimensions. For instance, you could select the resources relevant to all three of the Nottinghamshire village of Coddington (expressed here using the shorthand path "England.Nottinghamshire.Coddington"), "Surname.Astling", and "Occupation.Tailor". Not only that, resources can be selected using criteria involving Boolean operators, such as "England.Nottinghamshire" AND NOT "England.Nottinghamshire.Epperstone".
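These selections map naturally onto set operations. The sketch below assumes a single mapping from shorthand term paths to resource identifiers; the identifiers "r1" to "r4" and the function name are hypothetical.

```python
# Hypothetical index: shorthand term path -> resource identifiers.
INDEX = {
    "England.Nottinghamshire": {"r4"},
    "England.Nottinghamshire.Coddington": {"r1", "r2"},
    "England.Nottinghamshire.Epperstone": {"r3"},
    "Surname.Astling": {"r1", "r3"},
    "Occupation.Tailor": {"r1", "r4"},
}


def select(term_path):
    """Return resources indexed under term_path or under any subordinate term."""
    prefix = term_path + "."
    matches = (
        ids for path, ids in INDEX.items()
        if path == term_path or path.startswith(prefix)
    )
    return set().union(*matches)


# Terms from three different dimensions combined with AND (set intersection).
coddington_astling_tailors = (
    select("England.Nottinghamshire.Coddington")
    & select("Surname.Astling")
    & select("Occupation.Tailor")
)

# "England.Nottinghamshire" AND NOT "England.Nottinghamshire.Epperstone"
# (set difference).
notts_except_epperstone = (
    select("England.Nottinghamshire")
    - select("England.Nottinghamshire.Epperstone")
)

print(sorted(coddington_astling_tailors))
print(sorted(notts_except_epperstone))
```

OR would be a set union in the same way, so any Boolean expression over index-terms reduces to a combination of these three operations.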

Relational databases can easily handle this style of indexing, including multiple indexes and Boolean queries. Each match would simply yield a physical identifier for the associated resource, such as a document or page name.
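As one possible illustration, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The three-table schema, the table and column names, and the resource file names are all assumptions made for this example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE term (id INTEGER PRIMARY KEY, path TEXT NOT NULL UNIQUE);
CREATE TABLE resource_term (
    resource_id INTEGER NOT NULL REFERENCES resource(id),
    term_id INTEGER NOT NULL REFERENCES term(id),
    PRIMARY KEY (resource_id, term_id)
);
""")

# Hypothetical resources and index-terms.
conn.executemany("INSERT INTO resource VALUES (?, ?)", [
    (1, "coddington-register.pdf"),
    (2, "epperstone-map.png"),
    (3, "notts-gazetteer.html"),
])
conn.executemany("INSERT INTO term VALUES (?, ?)", [
    (1, "England.Nottinghamshire"),
    (2, "England.Nottinghamshire.Coddington"),
    (3, "England.Nottinghamshire.Epperstone"),
    (4, "Surname.Astling"),
])
conn.executemany("INSERT INTO resource_term VALUES (?, ?)", [
    (1, 2), (1, 4), (2, 3), (3, 1),
])

# Resources under :include (the term or any subordinate term) AND NOT under :exclude.
QUERY = """
SELECT r.name FROM resource r
WHERE r.id IN (
    SELECT rt.resource_id FROM resource_term rt
    JOIN term t ON t.id = rt.term_id
    WHERE t.path = :include OR t.path LIKE :include || '.%'
)
AND r.id NOT IN (
    SELECT rt.resource_id FROM resource_term rt
    JOIN term t ON t.id = rt.term_id
    WHERE t.path = :exclude OR t.path LIKE :exclude || '.%'
)
ORDER BY r.name
"""

rows = conn.execute(QUERY, {
    "include": "England.Nottinghamshire",
    "exclude": "England.Nottinghamshire.Epperstone",
}).fetchall()
names = [name for (name,) in rows]
print(names)
```

The query returns only resource names, which play the role of the physical identifiers; the resources themselves can stay wherever their provenance puts them.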

Saturday, 8 September 2018

SVG Family-Tree Generator (v5.0)

This is the official name of the free software design tool described in my previous posts Interactive Trees in Blogs Using SVG and More on SVG Family Trees. This post announces some important changes for the v5.0 release.



There has been a Facebook group for this tool for some time now, called "SVG Family-Tree Generator". The membership is significant but comparatively low for a free tool with substantial functionality. One of the reasons is probably that the tool included too many configuration options for the casual user, and not enough stuff "out of the box". This has changed for this version, and some of the new features are described below.


Another reason is probably that the tool (installation kit, documentation, and samples) was available from Dropbox by invitation only; some of the previous enquiries about it were obviously from software developers looking to make a fast buck rather than from genuine genealogists, whom I am happy to support. Now that the functionality in this version has become much more rounded, that Dropbox folder has been opened up with a public link so that anyone can download it:


Just download all the files into a local directory, say on your desktop or in your documents area, and read the 'SVG Installation.pdf' document.

Scaling and Presentation


It was always difficult to find the right magic spell to get the family trees to display with the correct size, position, and features, in all page situations. This version has made huge leaps there and it is recommended that previous subscribers rerun their tree definition files (*.txt) through the latest version to take advantage of the improvements.




The documentation was always a bit lax about which modifier keys (e.g. Shift) could be used with mouse clicks in the final browser output, and what function they each achieved. In order to help users of different browsers (especially Internet Explorer), and Mac users, a practical default usage is now documented, although new options will support reconfiguration if anyone has a need to match the conventions of some existing Web page.




Since this tool was originally designed for my own use, and for representing lineage situations in narrative research articles as opposed to conclusions in someone's database, I had no need of GEDCOM support.


After much thinking, I finally decided to implement a GEDCOM Loader native to the SVG Family-Tree Generator. You can now select GEDCOM files from disk, browse their contents, and copy-and-paste persons or families directly into the Tree Designer window. You can also convert whole GEDCOM files if you wish.


This copying or conversion of the data to SVG Family-Tree Generator includes the automatic generation of captions, tooltips (i.e. "hover text"), biographical notes, life events of many types, and the special HTML mark-up required for its Timeline support.


So what does this mean in practice? Well, if you converted a GEDCOM file directly to a *.txt tree definition file, and then generated the usual HTML output using this tool, it would immediately include all the major features such as pop-up biographical information panels, hover text, controls to pan or zoom one tree at a time (rather than a whole Web page), and timeline reports.


This is all "out of the box", with no programming involved.



As a demonstration of these features — all of which could be used to display your own trees in subscription-free Web pages, or blog posts, for your family to access — a version of the existing Timeline Demo is embedded in this article.


Shift+Click (or Alt+Click in most browsers) will select a specific person-box, or a family-circle (which then selects the two spouses and all their direct children). The 'Plus' icons in the person-boxes will also do the same as the Shift+Click operations. The 'Eye' icons will expand any thumbnail image in the person-box. The 'Select All' button selects all person-boxes and all family-circles.


The 'Show' button collects the timeline events for the selected items, sorts them, and displays them in a timeline report. The 'Dismiss' button closes the report. The 'Clear' button clears all the selected items.


Pop-up information panels, giving the full biographical details, appear by clicking on the respective person-box or family-circle, and these can be dismissed by Ctrl+Click (or CMD+Click on a Mac) on any person-box or family-circle, as appropriate. Note that clicking on a green event description in the timeline report will also show the corresponding information panel containing that event.


As can be seen, the timeline reports can either take events from a specific tree or from multiple trees, and this can be useful when trying to correlate different histories.


[Embedded Timeline Demo (interactive SVG, not reproduced here): two trees generated by Parallax View's SVG Family-Tree Generator V4.5.0. The first shows the Astling family of Coddington, Nottinghamshire, including James Astling (?–1755), Mary Hall (?–1735), James Astling (1727–1789), Elizabeth Dickinson (1743–1824), Margaret Astling (1784–1869), and Thomas Hallam (1782–1850), with baptism, marriage, and burial events for each person. The second shows Henry Proctor, Elizabeth Turton, William Stanton, Emma J. Ashbee, William H. Proctor, and Annie E. I. Stanton.]




The documentation was getting a bit weighty so it has now been split into a proper User Guide ('SVG User Guide.pdf') and a more in-depth set of program notes for people who want to get under the hood ('SVG Utility.pdf').