Tuesday 20 December 2016

Impermanent Links

Why are Web hyperlinks so unstable? Is there a specific reason for this in the field of genealogy? Could archival approaches help? Will the Internet ever learn, and will genealogy survive?

Figure 1 – Decaying URL links.

OK, the introduction is a little emotive, but the issue of unstable hyperlinks is a bane for any researcher or author wanting to cite online resources, or for anyone merely wishing to return to something of interest.

A hyperlink is simply a field that can take you to a different location, either in the current document/page or in a different one; the ones under discussion here are hyperlinks on the Internet. These are the links that connect HTML pages and so form the core of the World Wide Web.

Each page has an address, or URL (Uniform Resource Locator), by which links find their target. The vast majority of these begin with the “http://” that we’re all familiar with, and which indicates that they are using the HTTP protocol. Hence, to be more accurate, the problem is really that of unstable URLs rather than unstable hyperlinks; when the address of the target is changed (or deleted) then the links to it become broken.

Supermarket Mentality

As with supermarket shelves, there is a perception in Web design that tearing down some organised arrangement and replacing it with a different one will always have an advantage. This might include giving things a fresh look, providing easier access, or simply justifying someone's employment. The persons responsible are not looking at such changes from the same perspective as the poor Web user (or supermarket shopper) who is trying to find the same thing as they used before. This so-called Link rot is an issue affecting all Web sites, not just genealogical ones.

A particular problem with genealogical sites, and with many historical ones in general, is that we usually have no option but to specify the search criteria by which we found some item. But citing an index is not the same as citing the underlying item. When some re-indexing occurs then our previous criteria may no longer be appropriate. It is not uncommon, too, to find databases renamed, or merged, thus exacerbating the problem of reproducibility.

Although less prevalent, the loss of a Web site — say after failing to pay for its upkeep — is another potential cause of broken links. Even if its holdings are snapped up by some other site then the associated URLs, and possibly the very organisation of the resources, will have changed.

A sorry example of this involves the “Your Archives” facility that was launched in 2007 by The National Archives of the UK (TNA). Their description of it read “…providing an online platform for users to contribute their knowledge of archival sources held by The National Archives and other archives throughout the UK”, and it acquired a huge amount of information that was not available in their other collections. One of the community projects was the “Historical Streets Project” that not only had indexes of which census pages covered particular streets, but also allowed people to “…write stories about localities, properties, institutions, and businesses etc.” Over 31,000 people registered and contributed but it was then abandoned in 2012 and the facility closed. The contributions are still on their Web site but they were dumped into a read-only archived location (e.g. http://yourarchives.nationalarchives.gov.uk/index.php?title=Category:1841_census_registration_districts becoming http://webarchive.nationalarchives.gov.uk/20130221233217/http://yourarchives.nationalarchives.gov.uk/index.php?title=Category:1841_census_registration_districts) with the result that many internal links are now broken.

Link Rot

In 2010, Bugeja and Dimitrova showed that citations to online resources have a rate of decay that can be measured using a half-life, akin to radioactive decay.[1] Radioactive decay occurs exponentially, and the radioactivity reduces by a factor of two after each passage of a fixed period, known as the half-life. The same was shown to occur with the URLs in journal volumes, with 50% of the remaining working links becoming broken after each half-life.
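As a minimal sketch of what a half-life implies, the fraction of links still working after a given time can be computed from the usual exponential-decay formula. The half-life of 4 years used here is a hypothetical figure chosen only for illustration; it is not a value taken from the cited study.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of cited URLs still working after `years`, assuming exponential decay."""
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)


# Hypothetical half-life, for illustration only.
HALF_LIFE_YEARS = 4.0

for years in (0, 4, 8, 12):
    fraction = surviving_fraction(years, HALF_LIFE_YEARS)
    print(f"after {years:2d} years: {fraction:.1%} of links still work")
```

Under this assumption, half of the links survive the first half-life, a quarter survive the second, and so on.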

Figure 2 – Half-life of URL links.

Schemes and Disciplines

So where are the weak spots in all this? Well, the basic structure of a URL may be summarised as follows:

scheme:  [//hostname]  [/]path  [?query]  [#fragment]

For instance:


The scheme usually indicates the protocol: HTTP in this case.

The hostname involves a hierarchical sequence of domain names, beginning with the top-level domain or TLD (“.uk”), and operating right-to-left through a second-level domain name (“.co”), etc. Each is a subdomain of the previous domain, so “www” is a subdomain of “familyhistorydata.parallaxview.co.uk”, etc.

The path appears as a sequence of slash-separated folder names, and usually maps directly to an equivalent hierarchical path on the Web server, but not always. The last part of the path is usually a file name, but this particular example describes page objects in a database rather than page files on disk and so there’s no *.html file.

The optional fragment identifier may indicate the placement of, or direction to, a specific item, and is most often the location of a heading or anchor-point within an HTML page.

We’ll mention ‘query’ in a moment.
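The components described above can be separated mechanically. The following is a small sketch using Python's standard `urllib.parse` module; the URL is a made-up one on `example.org`, invented purely for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL, invented purely for illustration.
url = "http://www.example.org/census/1861/page?surname=Procter#household-3"

parts = urlsplit(url)
print(parts.scheme)    # http
print(parts.hostname)  # www.example.org
print(parts.path)      # /census/1861/page
print(parts.query)     # surname=Procter
print(parts.fragment)  # household-3

# Domain names read right-to-left: top-level domain first, then each subdomain.
labels = parts.hostname.split(".")[::-1]
print(labels)          # ['org', 'example', 'www']
```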

The most common problem with the URL is that it may expose aspects of the Web server’s technology and current physical organisation. The term Semantic URL describes a URL form that is cleaner and more user-friendly because it describes a conceptual organisation instead of a particular physical one. It is usually said that these decouple the user interface (UI) from the server’s implementation, but it could be argued that URLs were never designed to be directly visible in UIs. Irrespective of this, they achieve a great longevity because the server can be reorganised without breaking previous URLs.

The term used to describe stable or persistent URLs is Permalinks. To some extent, the many URL-shortening sites such as bit.ly, goo.gl, and tinyurl.com can achieve this, although they generally shorten long URLs by mapping each one to a short key (typically a hash or sequence number written in base-36 or base-62), which results in apparently random characters. Some of these sites can generate more readable names, at the expense of some extra length, but the results are still flattened with the loss of any hierarchical semantics. Their advantages lie in situations of restricted length, such as in SMS text messages, and in tracking for usage statistics, but they also get abused by spammers because you cannot see where they will redirect you.
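To show why shortened links look random, here is a rough sketch of base-62 encoding applied to a numeric key. It assumes the shortener stores each long URL against an integer identifier, which is only one possible design; the function names and alphabet ordering are hypothetical and not taken from any particular service.

```python
import string

# 10 digits + 26 lower-case + 26 upper-case letters = 62 symbols.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(number: int) -> str:
    """Write a non-negative integer using the 62-symbol alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(key: str) -> int:
    """Recover the integer from its base-62 representation."""
    number = 0
    for character in key:
        number = number * 62 + ALPHABET.index(character)
    return number


print(encode_base62(123456789))
```

The resulting key carries no trace of the original path, which is the loss of hierarchical semantics mentioned above.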

Persistent URLs (PURLs) relate to a particular redirection service that allows manageable (i.e. modifiable) mappings between your public URLs and the underlying ones. These are differentiated from Permalinks by their use of a different domain and by being aimed at lifetimes of decades rather than years.

The general perception is that we need a stable and documented address for online resources so that we can direct someone to the exact same page and displayed data. Certainly this is an issue if you’ve hard-coded URLs in printed material, or you’ve embedded a URL in a hard-coded QR code, or you’ve hard-coded either of these on something physical like a grave marker. If you can’t guarantee the persistence of the URL then you really need an intermediate one that can be managed and redirected as appropriate.
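The idea of a managed intermediate URL can be sketched as a simple lookup table: the published address never changes, while the target behind it can be updated. The class, paths, and host names below are all hypothetical, and a real redirection service would of course answer HTTP requests rather than function calls.

```python
class RedirectTable:
    """Maps stable public paths to whatever the current underlying URL is."""

    def __init__(self) -> None:
        self._targets: dict[str, str] = {}

    def register(self, public_path: str, target_url: str) -> None:
        """Create a mapping, or re-point an existing one at a new target."""
        self._targets[public_path] = target_url

    def resolve(self, public_path: str) -> str:
        """Return the current target for a published path."""
        return self._targets[public_path]


table = RedirectTable()
table.register("/grave/1234", "http://old-host.example.org/people/jp.html")

# The site is reorganised; only the mapping changes, not the printed URL.
table.register("/grave/1234", "http://new-host.example.org/person/james-procter")
print(table.resolve("/grave/1234"))
```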

This may be the general perception but it isn’t the full story. For a start, the increasingly dynamic nature of the Web means that the same URL does not display the same data, even when nothing has been rearranged. The page may allow some interaction with a back-end system, such as a database, and so what you see may depend on what you typed in the page, and possibly what you did before. This is already a problem for projects such as the Internet Archive since it cannot guarantee to restore fully operational Web sites.

For genealogy sites, the most important manifestation of this is the search operation. The vast majority of resources are indexed by one or more fields (primarily personal names) and there is no published URL format that will take you directly to the same page or data that you’ve found via your search; you have to cite the search parameters and hope that anyone following your citation will find the same information, and the same edition of that information (more on this later). So if we’re providing a URL, do we give the generic one of the content provider, or one taking you to their search dialogue, or something else?

Some resources may consist of just browsable un-indexed images, or tabulated data, and their citations might then use waypoints in order to guide someone else to the same information.

A little earlier, I indicated that URLs can also include a ‘query’ clause, although I deferred a description of it. The clause consists of an ampersand-separated list of name=value terms, and it is employed when access to the associated online resource involves an element of logic that cannot be represented by a simple hierarchical ‘path’.  A search operation is a good example, and the search parameters might be represented as individual query terms, e.g.


Google’s approach is slightly different in that they encode the complete search string, and more, in the fragment identifier, e.g.


Unfortunately, there is no requirement for the associated site to update the active URL in respect of what you’ve typed, or what local action you have requested. Such sites do not offer the opportunity to make effective use of a URL in a citation.
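Where a site does reflect the search in its URL, the name=value terms of the query clause can be recovered mechanically. This sketch uses Python's standard `urllib.parse` module on a hypothetical search URL; the host and parameter names are invented for illustration and do not belong to any real provider.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical search URL, invented purely for illustration.
url = "http://www.example.org/search?surname=Procter&forename=James&year=1861"

# parse_qs returns a list per name because a name may legitimately repeat.
query = parse_qs(urlsplit(url).query)
for name, values in query.items():
    print(name, "=", values)
```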

Another example is the OpenURL concept which uses the URL query string to encode the elements of a citation in order to retrieve a corresponding Web resource from an unspecified target. Although general-purpose in principle, it is used almost exclusively for published books and journals available online. The idea is that some resolver will parse the citation elements and return a link to a participating library or repository hosting an online copy.
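As a hedged illustration of the OpenURL idea, the sketch below encodes a few elements of the book cited in note [1] into a query string. The resolver address is hypothetical, and the key names reflect my understanding of the OpenURL 1.0 key/encoded-value format; they should be checked against the standard before being relied upon.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical resolver address; a real one belongs to a library or service.
RESOLVER = "http://resolver.example.org/openurl"

# Key names as I understand the OpenURL 1.0 KEV format (an assumption).
citation = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
    "rft.btitle": "Vanishing Act",
    "rft.aulast": "Bugeja",
    "rft.date": "2010",
}

openurl = RESOLVER + "?" + urlencode(citation)
print(openurl)

# A resolver would parse the same terms back out before looking for a copy.
print(parse_qs(urlsplit(openurl).query)["rft.btitle"])
```

Note that nothing in the query identifies where the book is held; that decision is left to the resolver.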

Archival Approach

If you cite something in an archive, you would include the codes by which it was catalogued, and you wouldn't attempt to give its precise physical location in the building (although you might mention that it had to be retrieved from some auxiliary storage area). It then seems wrong that online resources are expected to be cited using their electronic location: their URL.

If the provider made available permalinks, and they were humanly-readable semantic URLs, then it would help, but it would not mean that the data had actually been organised that way, so where is the guiding principle?

Imagine if the content providers used an archival approach to their collections of images and transcribed extracts, treating them according to the natural hierarchy associated with their provenance and source arrangement (see Hierarchical Sources). Most URLs betray the fact that genealogical data organisation is determined by software principles rather than by archival principles, and with the provenance and structure of the source data being an afterthought at best. Such hierarchical arrangements would provide the natural levels at which to include archival descriptions, including the precious source-of-the-source that is so often inaccurate or omitted.

More than this, though, it would allow the provenance of the provider’s data to be plainly visible: such information as who produced a given image, who transcribed certain details, and when was that transcription last updated (i.e. corrected)?

There’s a tendency to think of a citation as a pointer to the original information source. Of course, this is wrong! While the source-of-the-source is important, the citation is primarily a pointer to the information that you consulted, and online sources will be derivatives of those originals; images will not be exact copies, and transcriptions (including transcribed extracts) will often be inaccurate (see Anatomy of a Source).

So what approach might help, given that content providers will have digital images and/or databases of transcribed information, as opposed to physical documents or other artefacts? Well, if they followed archival principles then they would catalogue sources as they’re scanned, and as they’re transcribed, treating their digital resources as sources in their own right, undergoing accession into their digital repository.

Case Study

The case I want to use in order to illustrate my point is the decennial census of the UK, and this will involve looking at how it is currently presented by Ancestry and Findmypast.

The census returns were taken on the same day across the UK but were subject to different jurisdictions and are now stored and organised differently. Those of England and Wales are stored at The National Archives of the UK (TNA). Those of Scotland were stored at the General Register Office for Scotland, Edinburgh (GROS); however, on 1 April 2011, the GROS was merged with the National Archives of Scotland to form the National Records of Scotland. The surviving pre-1921 all-Ireland censuses are stored at the National Archives of Ireland, but later ones for the northern counties are stored at the Public Record Office of Northern Ireland (PRONI). Although not technically part of the UK,[2] those for the Crown dependencies of the Channel Islands and the Isle of Man are also held at TNA.

TNA has a comprehensive scheme for cataloguing the materials it holds, and they publish recommendations for how it should be used to cite their materials at Citing Documents. For the censuses, this amounts to using the departmental code, series number, piece number, and book number (for 1841) for the specific census item; and then internal identifiers of folio and page number to identify a specific page within that item. For instance, ‘HO 107/11/12, folio 12, page 19’ in 1841, or ‘RG 9/2460, folio 43, page 27’ in 1861. These alphanumeric codes are used by virtually all UK researchers, but they are often viewed as cryptic and unhelpful in the US where all relevant long-hand details are expected in the citation, including the county, ecclesiastical parish, registration district, and possibly more.
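Because these references have such a regular shape, they can be broken into their parts quite easily. The following sketch parses the two examples above; the pattern is my own simplification, based only on those two examples, and it does not claim to cover every form of TNA reference. The names `CensusRef` and `parse_census_ref` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CensusRef(NamedTuple):
    department: str       # e.g. "HO" or "RG"
    series: int           # e.g. 107 or 9
    piece: int
    book: Optional[int]   # present in the 1841 example only
    folio: int
    page: int


# Simplified pattern derived from the two examples in the text.
PATTERN = re.compile(
    r"(?P<department>[A-Z]+) (?P<series>\d+)/(?P<piece>\d+)(?:/(?P<book>\d+))?"
    r", folio (?P<folio>\d+), page (?P<page>\d+)"
)


def parse_census_ref(text: str) -> CensusRef:
    """Split a citation such as 'RG 9/2460, folio 43, page 27' into its parts."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised reference: {text!r}")
    parts = match.groupdict()
    return CensusRef(
        department=parts["department"],
        series=int(parts["series"]),
        piece=int(parts["piece"]),
        book=int(parts["book"]) if parts["book"] else None,
        folio=int(parts["folio"]),
        page=int(parts["page"]),
    )


print(parse_census_ref("HO 107/11/12, folio 12, page 19"))
print(parse_census_ref("RG 9/2460, folio 43, page 27"))
```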

What may not be known is that online copies of TNA’s images, and associated transcriptions, have to be indexed by these codes. I cannot provide a reference but I understand that this stipulation is in the licensing necessary from TNA. As a result, it means that they provide a much more accurate way of locating a known page than the vagaries of name-based searches, but it does not mean that additional detail cannot be provided. Indeed, the content-provider name is essential since their images and transcriptions will not be identical to others, as is identification of the person or family being referenced on the page. Citations are also about how source information supports or refutes an argument, and so any detail that establishes the nature of the source and the strength/weakness of its information will always be useful. However, failure to give all those archival codes would be doing a disservice to any reader with a UK interest!

NB: While these codes apply to those censuses held by TNA, they do not apply to the others such as the Scottish ones. This gives a small problem for the content provider if they wish to present a single census collection for the UK.

Picking on 1861 for the purposes of illustration, Ancestry has an “1861 UK Census Collection”, described as “The 1861 Census of England, Wales, Scotland, Channel Islands and Isle of Man”, which is accurate according to the contents therein, but it doesn’t contain Ireland, which was in the UK at the time. Although a search across the whole collection is the norm, it is also possible to search the individual databases in that collection by selecting from the list at the bottom of the search page:

1861 Channel Islands Census
1861 England Census
1861 Isle of Man Census
1861 Scotland Census
1861 Wales Census

For those relevant to TNA, extra input fields are presented for the piece, folio, etc.

Findmypast has an “1861 England, Wales & Scotland Census” database that presents input fields for piece, folio, etc., in all cases. The database also includes the Crown dependencies and so the title and description are both inaccurate according to the contents, even though it is not described as a “UK” collection. It is not as easy to restrict your searches to a particular country or island in this database. Back in 2015, the company came under fierce criticism, not least from Chris Paton of The British GENES Blog,[3] because they’d introduced pseudo-TNA codes to apply to these input fields for the Scottish census transcriptions (ScotlandsPeople presented the images, but Findmypast were not allowed to use them).

Such criticism was well-founded if it related to the introduction of fake TNA codes that were not distinguished from the real ones, and which would therefore create confusion. However, if Findmypast had simply introduced a system of archival cataloguing for their own materials (following TNA’s precedent) then criticism would have been ill-founded; such an approach would have been wise, and that’s effectively what I am recommending here. Unfortunately, it is unlikely that this was Findmypast’s intention.

Entering a document reference into TNA’s Web site (e.g. “RG 9/2460”) displays an archival description of the associated item, but neither Ancestry nor Findmypast can do this because they have no reference that is separate from their search functions. Let’s take a moment to look at their URLs (at the time of writing) for “James Procter” in the aforementioned 1861 census page, found directly via piece, folio, and page.

The list of people on this page in Ancestry’s “1861 UK Census Collection” corresponds to an opaque URL of:


It is opaque because there is little evidence of the original parameters that I specified. The transcribed extract for James is at the following URL, and the image URL for the page is so long that I won’t bother presenting it:

The list of people on that page in Findmypast’s “1861 England, Wales & Scotland Census” corresponds to a URL of:

This is much more readable, and the search parameters are clearly visible. However, the specific transcribed extract for James corresponds to the following:

This has lost all connection with the search parameters, and it contains undocumented codes that may or may not be persistent.

So what am I saying here? The different censuses should be accessible individually, thus respecting their provenance, and not dumped into a single amorphous database; Ancestry makes a good job of this. The provider should employ a set of persistent identifiers to organise their information according to its natural structure and provenance. Each page image and each transcribed extract (e.g. details of a given census person) should have a unique and persistent URL to access it directly via these identifiers, without having to go through the same search again — and so without having to pray that the indexing hasn’t changed.

The censuses held by TNA are interesting because the pages are already indexed by their original archival codes. These are termed natural keys since they are part of the original data and are not fabricated purely for some database indexing. When such keys are well-defined (and particularly when they’re mandated) then the provider’s identifiers should acknowledge them; however, the provider’s identifiers must be more detailed because (a) images and transcribed extracts must be two distinct derivatives of the same original information, and (b) the extracts may be at a lower level than the original source item level, or even below the individual page level, as in the case of a census.

Or, in summary, all online genealogical data should:

  • Expose persistent semantic URLs for each image and for each transcribed extract,
  • Include documented persistent identifiers in those URLs related to the provenance and natural structure of the associated source, as held by the provider,
  • Link the provider’s identifiers to corresponding archival descriptions, to the provenance of the information, and to source-of-the-source information.
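To make these recommendations a little more concrete, here is one possible shape for such URLs, built from the natural keys of the 1861 example used earlier. Everything about the layout is hypothetical: the provider host, the path order, and the `entry` component for an individual on the page are my own inventions, and no existing provider is known to work this way.

```python
from typing import Optional

# Hypothetical provider address, for illustration only.
BASE = "http://provider.example.org"

# Images and transcribed extracts are distinct derivatives of the same page.
KINDS = ("image", "transcript")


def persistent_url(kind: str, year: int, department: str, series: int,
                   piece: int, folio: int, page: int,
                   entry: Optional[int] = None) -> str:
    """Build a readable URL from the archival codes of a census page."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    url = (f"{BASE}/{kind}/census/{year}/{department}-{series}"
           f"/piece-{piece}/folio-{folio}/page-{page}")
    if entry is not None:
        # Extracts may sit below page level, e.g. one person on the page.
        url += f"/entry-{entry}"
    return url


print(persistent_url("image", 1861, "RG", 9, 2460, 43, 27))
print(persistent_url("transcript", 1861, "RG", 9, 2460, 43, 27, entry=5))
```

Trimming such a URL back to the piece level would give a natural place for the provider to show an archival description of that item.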

[1] Michael Bugeja and Daniela V. Dimitrova, Vanishing Act: The Erosion of Online Footnotes and Implications for Scholarship in the Digital Age (Duluth, Minnesota: Litwin Books, 2010).
[2] For the purposes of the British Nationality Act 1981, the “British Islands” include the United Kingdom (Great Britain and Northern Ireland), the Channel Islands, and the Isle of Man.
[3] Chris Paton, “FindmyPast, Scottish census sources, and Moby Dick”, The British GENES Blog, posted 27 Jan 2015 (http://britishgenes.blogspot.ie/2015/01/findmypast-scottish-census-sources-and.html : accessed 16 Dec 2016).

Saturday 3 December 2016

QIL: A Normative Scheme for Labelled Narrative

QIL is a scheme for labelling arbitrary segments of narrative when writing up deductive reasoning such as ‘proof arguments’. I have been using this locally for my own work as there appear to be no published recommendations in this area. Depending on their background, other writers may find this presentation useful.

Narrative is, by its very nature, sequential. When one part needs to reference another part, either forwards or backwards, then we take it for granted that there will be an associated reference point: some identifier or label that will allow us to find the target text.

Chapters and sections are obvious examples of this. For instance: “see Ch. 12, and Sec. 12.1” (actual abbreviations dependent upon your style guide). The separating space would usually be a non-breaking space in order to prevent the number being separated onto a different line. Both of these cases would usually have alternative textual titles, too, although the chapter name and section heading would be less used for frequent intra-document references.

We also take it for granted that individual non-narrative items, such as Tables, Figures, Plates, etc., can be referenced directly. Their numbering usually runs consecutively throughout a document, beginning with number 1, and separately from each other.[1] For instance: “see Figure 12”.

The QIL Goals

The goals of this scheme were:

  • Visually identifiable way of labelling and referencing arbitrary narrative sentences or paragraphs
  • Associating certain semantics with the labels by using distinct introducers
  • Ensuring that the scheme is useable in both documents created via a word-processor and in online blog posts or other Web pages
  • Ensuring that their usage in electronic documents and online makes effective use of the corresponding hyperlink support

While headings could be used to label paragraphs, it would break the flow of the surrounding narrative, and so it may not be justified simply to create a label for future reference.

Although not part of the QIL scheme, I have used the following convention for headings when collecting together information from large-scale cluster research. It allows two neighbouring heading levels to specify a person followed by the sources relating to them, or a source followed by personae mentioned in that source.
Person-based (Person → Sources)
A number of sources provide enough direct and non-conflicting evidence to create a coherent snapshot of a person with some of their lineage or history. This will usually reference multiple sources, which would be described under secondary headings.

Source-based (Source → Personae)
A given source provides references to one or more individuals (or ‘personae’) whose identities and relationships to references in other sources have yet to be determined. Those distinct personae would be described under secondary headings.

However, I also needed to be able to create much lower-level labels for deductive logic. Having identifiable labels at this level may also help keep software visualisations in-step with the written argument, but this is an area which has yet to be explored to its full potential.


Mathematics is a subject that includes an array of concepts that may need to be labelled and referenced, including equations, lemmas, and axioms. The most familiar one to the layperson will be equations, and I will briefly examine how these are handled in order to find a precedent for QIL.

When manipulating equations, such as when deriving a mathematical proof, references to prior equations will be essential. For instance: “Rearranging Eq. (1.2) and substituting Eqs. (2.1) and (2.2) gives…”. Note that the equations are typically numbered using the enclosing section number plus some sequential equation number within that section. Note also that associated references are usually preceded by the abbreviation “Eq.” or “Eqs.”.[2]

$$E = mc^2 \tag{3.2}$$

Any scheme that applies to segments of narrative must be as simple as this, and embrace the basic requirements of relative numbering and the distinguishing of references from associated labels.

QIL In Microsoft Word

The QIL scheme began in Microsoft Word using the following format:

L(n.n.n):         Label
L(n.n.n)          Reference to label

The last integer (‘.n’) was relative to the active heading-2 level numbering (the preceding ‘n.n’). The introducer, “L”, was then supplemented with a couple of other letters to provide some essential semantics, as follows:

  • Q(n.n.n): A query regarding anomalous evidence, conflicts, etc., that requires subsequent explanation or resolution
  • I(n.n.n): An inference made from evidence and/or other inferences that may be used subsequently
  • L(n.n.n): A general label for anything that may be cited later

Generating these is quite easy in Word because there are corresponding field codes: the invisible directives that can be embedded within your text. For instance, in order to generate “Q(2.3.7):” at a location within a heading-2 level of “2.3”, you could do the following:

  • Type the “Q(”
  • Ctrl+F9 to open a pair of field-code braces
  • Right-click and select ‘Edit Field…’
  • Select ‘StyleRef’, with the options of ‘Heading 2’ level and ‘Insert paragraph number’
  • Type the “.”
  • Ctrl+F9 again
  • Right-click, select ‘Edit Field…’ again, and select ‘Seq’. Change the field code to “SEQ Q”, “SEQ I”, etc., as appropriate
  • Type the final “):”

Yes, this sounds excessive but there are easier ways.

The SEQ field code takes an arbitrary identifier so you can support separate sequences running concurrently: “Q”, “I”, and “L” in this scheme. They each start from 1, and continue until reset with a code such as “SEQ Q \r1”; hence, you would need one of these when switching to a new section. Basing the numbering on the heading-1 level would also be workable, depending upon your writing style. The alternative of simply using global numbering, as in “Q(23):”, may be easier to generate, but it makes the corresponding label harder to find in your narrative.
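The numbering behaviour described here can be imitated outside Word, which may help when checking what the fields ought to produce. This is only a sketch of the logic, with three independent counters that restart in each heading-2 section; the class name `QILLabeller` is hypothetical and nothing here touches Word itself.

```python
from collections import defaultdict


class QILLabeller:
    """Imitates three concurrent SEQ sequences that reset in each section."""

    INTRODUCERS = ("Q", "I", "L")

    def __init__(self) -> None:
        self._section = None
        self._counters = defaultdict(int)

    def start_section(self, number: str) -> None:
        """Begin a new heading-2 section (e.g. '2.3') and reset every sequence."""
        self._section = number
        self._counters.clear()

    def label(self, introducer: str) -> str:
        """Return the next label for the given introducer in this section."""
        if introducer not in self.INTRODUCERS:
            raise ValueError(f"introducer must be one of {self.INTRODUCERS}")
        if self._section is None:
            raise RuntimeError("call start_section() first")
        self._counters[introducer] += 1
        return f"{introducer}({self._section}.{self._counters[introducer]}):"


labeller = QILLabeller()
labeller.start_section("2.3")
print(labeller.label("Q"))  # Q(2.3.1):
print(labeller.label("I"))  # I(2.3.1):
print(labeller.label("Q"))  # Q(2.3.2):
labeller.start_section("2.4")
print(labeller.label("Q"))  # Q(2.4.1):
```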

If we go to the ‘Word Options’ and temporarily turn on the option to see these field codes (usually under ‘Advanced > Show document content > Show field codes instead of their values’), then we would see:

Q({ STYLEREF "Heading 2" \n }.{ SEQ Q }):

An easier way to generate these labels is to use copy-and-paste; any of them can be copied to a different section, selected with the mouse, and right-click ‘Update Field’ in order to create an entirely new one. However, my recommended method is to use the Word ‘Building Blocks’ feature, formerly called ‘AutoText’.

Select examples of each of the three labels (one at a time), type Alt+F3, and save them as named Building Blocks (e.g. QLab, ILab, LLab). You could also save copies of the corresponding ‘reset’ cases (e.g. QLabR, ILabR, LLabR). You can summon these at any time by typing their name followed by F3. The beauty of this scheme is that these definitions are stored in a Word file called ‘Building Blocks.dotx’, and so will be available in future documents.

Also, showing the underlying field codes — should you ever need them — can be done more easily using Alt+F9 to toggle their display.

Before we can generate a reference to one of these labels, we need to bookmark its textual location. Select the label with the mouse, but not the trailing “:”. Using ‘Bookmark’ in the ‘Links’ area of the ‘Insert’ tab, type in a name for the bookmark, and select ‘Add’. We then have a named location (rather than just a label) that we can reference later.

To generate a reference to it, select ‘Cross-reference’ in the same ‘Links’ area, select ‘Bookmark’ and ‘Bookmark text’ in the two drop-down lists, leave ‘Insert as hyperlink’ unchecked (for now), and choose a named bookmark. This will generate a copy of the corresponding bookmarked text (e.g. Q(2.3.7)) rather than the bookmark name, which is merely a local symbolic name. Looking again at the invisible field codes would reveal something similar to:

Example reference is { REF Q_2_3_7 }.

It is tempting to just use a name such as “Q_2_3_7”, as I have been doing in this article, but the inevitable moving around of sections will require labels to be re-generated (done by Ctrl+A to select everything in the document, and F9 to update all of its fields), and that would cause the symbolic name to get out of step with the actual text. A better scheme is to adopt functional names: ones that describe the associated query, inference, or whatever. Note that there are limitations on these names: basically 1–40 characters, beginning with a letter or underscore, and containing no punctuation or whitespace.
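The limitations just listed are easy to check before typing a name into Word. The sketch below encodes them as a regular expression; it follows the rules exactly as stated in this article, restricts itself to unaccented letters for simplicity, and is not an official statement of what Word accepts. The example names are hypothetical.

```python
import re

# 1 to 40 characters: a letter or underscore, then letters, digits, or underscores.
BOOKMARK_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,39}")


def is_valid_bookmark_name(name: str) -> bool:
    """Check a proposed bookmark name against the rules described above."""
    return BOOKMARK_NAME.fullmatch(name) is not None


for candidate in ("Q_2_3_7", "Procter_birth_conflict", "2nd_query", "has space"):
    print(candidate, is_valid_bookmark_name(candidate))
```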

What we have here is a workable scheme for printed documents; we now need to look at it in the context of electronic documents and blogs (or other Web pages).

Word Hyperlinks

Word has two mechanisms for hyperlinking its bookmarks, and they are subtly different, although their terminology sounds too similar to the casual user:

  • Select ‘Cross-reference’ in the ‘Links’ area of the ‘Insert’ tab. Select ‘Bookmark’ and ‘Bookmark text’ from the two drop-down lists, and ensure that ‘Insert as hyperlink’ is checked. I’ll refer to these as “Bookmark hyperlinks”.
  • Select ‘Hyperlink’ from the same ‘Links’ area. Select ‘Place in This Document’ on the left, and select the required bookmark from the main tree panel. I’ll refer to these as “URL Hyperlinks”.

Comparing these mechanisms using the above bookmark-reference example gives the following respective field codes.

Example reference is { REF Q_2_3_7 \h }.
Example reference is { HYPERLINK \l "Q_2_3_7" }.

Now these both hyperlink to the same bookmark, but their capabilities are not the same.

  1. When you insert a bookmark hyperlink then it nicely substitutes the bookmarked text — the QIL label in our case — as opposed to some symbolic bookmark name that is more relevant to the author than to the reader. When you insert a URL hyperlink then you only get the bookmark name. Although you can change the ‘Text to display’, you may need a good memory to recall the bookmarked label. Also, any such display text is fixed, and will not be kept in-step if you move sections around, since the display text is not part of the field code. Clearly this Word feature has not been thought out very well.
  2. A bookmark hyperlink looks like normal text until you hover over it, whereas a URL hyperlink is underlined and rendered in blue like normal Web hyperlinks. If you hover over either then they nicely show you the underlying bookmark name and the instruction ‘Ctrl+Click to follow link’.
  3. Most importantly, when saving the document as “Web Page, Filtered” — required for blogs and other HTML versions — then bookmark hyperlinks are discarded, but URL hyperlinks are not!

At first, I thought that neither of these did what I needed, and that my scheme was therefore doomed without major change.

Blog Hyperlinks

I’d previously written about using bookmarks in blogs at Using Bookmarks with Blogger, although I had to update that article following the findings of this more recent one.

The HTML equivalent of a bookmark is called an “anchor”, and is represented by the <a> element. Given an example sentence of:

L(4.1.3): This resolves the query at Q(1.2.6).

the HTML version we would hope for would be similar to:

<a name="L_4_1_3">L(4.1.3)</a>: This resolves the query at <a href="#Q_1_2_6">Q(1.2.6)</a>.

Where “L_4_1_3” and “Q_1_2_6” are simply the bookmark names that I had chosen.
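The mapping from a label to its bookmark name is mechanical in these two examples, so it can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the naming rule (a letter prefix followed by the numbers joined with underscores) is inferred from the two examples above rather than being a rule that Word or Blogger imposes.

```python
import re

# Matches labels of the form used in the example, e.g. "L(4.1.3)" or "Q(1.2.6)".
LABEL = re.compile(r"^([A-Za-z]+)\((\d+(?:\.\d+)*)\)$")

def bookmark_name(label: str) -> str:
    """Turn a label such as 'Q(1.2.6)' into a bookmark name such as 'Q_1_2_6'."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"not a recognised label: {label!r}")
    prefix, numbers = match.groups()
    return prefix + "_" + numbers.replace(".", "_")

def anchor(label: str) -> str:
    """HTML for the place where a label is defined (the link target)."""
    return f'<a name="{bookmark_name(label)}">{label}</a>'

def link(label: str) -> str:
    """HTML for a reference that jumps to a label defined elsewhere."""
    return f'<a href="#{bookmark_name(label)}">{label}</a>'

print(f"{anchor('L(4.1.3)')}: This resolves the query at {link('Q(1.2.6)')}.")
```

The printed line has the same shape as the hoped-for HTML in the example sentence.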

In Using Microsoft Word with Blogger, I recommended saving Word documents as “Web Page, Filtered”, and not just “Web Page”, when using them to generate new blog posts. This is because it filters out the excessively verbose and complicated HTML that Word generates by default, and is necessary for the correct operation of feedburner (notifying people of new blog-posts).

The first thing I noticed was that the displayed document, following the new saved-as format, is not what is saved to disk; it appears to be using the full Word HTML version, which is why it all looks fine. The disk version will not have any representation of bookmark hyperlinks. Also, the anchor points stop after the first parenthesis, showing ‘<a name="L_4_1_3">L(</a>4.1.3)’ rather than the correct form in the example above. This appears to be a bug in Word 2007/2010 where the presence of the field codes representing those numbers clashes with the bookmarking, but it is not a showstopper because at least the anchor begins at the correct text location.
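Although the truncated anchors do no real harm, they could be tidied in the saved file if desired. The following Python sketch is an assumption-laden illustration: it presumes the saved markup looks exactly like the fragment quoted above (straight quotes, no extra tags between the anchor and the numbers), which may not hold for every document, and the function name is hypothetical.

```python
import re

# Matches an anchor that closes straight after the opening parenthesis,
# e.g. <a name="L_4_1_3">L(</a>4.1.3)
TRUNCATED = re.compile(
    r'<a name="(?P<name>[^"]+)">(?P<head>[A-Za-z]+\()</a>(?P<tail>[\d.]+\))'
)

def repair_anchors(html: str) -> str:
    """Move the closing </a> so that it follows the whole label."""
    return TRUNCATED.sub(r'<a name="\g<name>">\g<head>\g<tail></a>', html)

broken = '<a name="L_4_1_3">L(</a>4.1.3): This resolves the query.'
print(repair_anchors(broken))
```

Anchors that are already complete do not match the pattern and are left untouched.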

Given the flexibility that Word is endowed with, I was rather surprised by the hiccups and limitations when trying to implement this scheme. Although I mentioned some of the issues on a Word forum, I never expected any changes because — based on previous experience — you would be told ‘that’s the way it is’, ‘it works as intended’, and ‘we don’t recommend what you’re trying to achieve’. I can’t abide that sort of reaction from the fringes (the actual developers would probably be interested), but I wasn’t going to give up that easily either.

With the two types of Word hyperlink, one was much easier to use while the other was the one that was represented in the HTML. What I found was an unexpected hybrid that was not that onerous, and which seems to work in both 2007 and 2010 versions:

  • Generate a bookmark hyperlink for your chosen bookmark name
  • This deposits the bookmarked text (the QIL label) rather than the name, as required
  • If you’ve already forgotten the bookmark name in the preceding seconds, just hover over it
  • Select the deposited reference with the mouse and convert it to a URL hyperlink to the same bookmark

The interesting thing about this is that it still looks like the normal bookmark hyperlink (not underlined, and not coloured blue) but it really is a URL hyperlink and so works when saving as “Web Page, Filtered” in order to generate your blog page (see Using Microsoft Word with Blogger). The display text gets correctly updated in this case (unlike the manually inserted display-text mentioned earlier) since any text selected at the time the URL hyperlink is inserted becomes its display text, and in this scenario that contained the necessary field codes.

This scheme isn’t as complicated as it sounds, but some hoops had to be jumped through. Using macros or add-ins could make it much more streamlined.

[1] Joe Schall, "Effective Technical Writing in the Information Age: Textual References to Figures and Tables", John A. Dutton e-Education Institute - Pennsylvania State University (https://www.e-education.psu.edu/styleforstudents/c4_p11.html : accessed 28 Nov 2016).
[2] Drs. Nathan Champagne, Scott Gold, Steve Jones, Terry McConathy, and Ramu Ramachandran, “Guidelines for Equations, Units, and Mathematical Notation: An addendum to the Thesis/Dissertation Guidelines provided by the Graduate School...”, College of Engineering & Science [COES] - Louisiana Tech University (http://www.latech.edu/graduate_school/thesis_dissertations/coes_equation_guidelines.pdf : undated but parts last updated 23 Feb 2006, accessed 28 Nov 2016).

Wednesday 16 November 2016

FAN Principles Unfolded

What is the FAN Principle, and when would we use it? What generalisations of it can be employed? How does it relate to analytical methods from outside of genealogy? How do the different methods relate to our intended goals?

There is some confusion over what the FAN acronym stands for. I previously thought that it was originally Family, Associates, and Neighbours,[1] and that it later became Friends, Associates, and Neighbours,[2] possibly because it’s so easy to lapse into “Friends, Romans, and countrymen …”.[3] The historical difference between friends (or enemies) and weaker associates could be a subjective one, and so not as useful to us in directing our research; however, knowing the extent of a person’s family might also be part of the goal to which this technique is applied in the first place. More recently, the terms get merged as Friends & Family, Associates, and Neighbours, with the acronym being unofficially extended to FFAN.

The sources I gave for the two variants of FAN, above, were from the same year and the same author (Elizabeth Shown Mills) and so didn’t support my initial impression; I had to get more details. According to Elizabeth, she started using the term FAN Principle in her Advanced Research Methodology (ARM) track at the Institute of Genealogy and Historical Research (IGHR), Samford University, during the early 1990s. That ARM track commenced in 1986, but its inspiration came the previous year when a defeatist response in the APG quarterly newsletter prompted her to analyse her own successes and failures. At that time, she emphasised neighbours and associates since she didn’t believe that family needed emphasising. Also, whole-family genealogy, as opposed to direct-line genealogy, was comparatively uncommon, and what there was mostly amounted to just following males with the same surname. Her technique was much wider than this, and required individuals to be placed in their respective community context in order to find new sources of relevant information. After several years of teaching this, she hit upon the notion that “Every ancestor had their FAN Club: Friends, Associates, and Neighbours.” With the explosion of Internet genealogy during the mid-2000s, whole-family research (not the same as simple name gathering) had become positively rare, and that’s when she started referring to Friends & Family, Associates, and Neighbours.

The FAN Principle is a research method for studying individuals in the context of their FAN Club in order to widen the search for relevant information.[4]  It is employed when we have no documents that give us direct evidence about a person’s identity, origin, and/or parentage. Although commercial genealogy may suggest that you can “build your tree” directly from their records, we all know that this rarely works; it’s not long before we can’t find someone, or we can’t identify someone (usually a woman), or there are alternatives that are too close for us to distinguish. You’re then in the world of inferential genealogy where you have to study the scant information available and make an argument for what the truth might have been — the better the argument, the more reliable the conclusion.

FAN Example

There’s a concise example of the FAN Principle provided in QuickLesson 11. It describes a Mary Smith who married a James Boyd in 1853, and shows how correlating various sources relating to her FAN Club allowed her family to be identified.

In this case, because there were no sources directly identifying Mary, it begins by looking at the most obvious person in her FAN Club: her husband, and then looking at his FAN Club. This is an accepted technique for identifying women during those times as they would be conspicuous by their absence in most records.

Figure 1 – Targeted research to identify Mary Smith via her husband.

The above diagram illustrates that different sources relate to different sets of associates of the target person: James Boyd. For simplicity, it doesn’t show all the associates in this example, or all the sources; the following table includes all the sources.

Associates          Road order    Land registry    1850 census
Boyd family
Smith neighbours

Table 1 – Relationship of associates and sources for Boyd/Smith example.

By correlating the information provided by those sources, and by considering their contexts (dates, ages, occupation, etc.), an argument is made for the wife of James Boyd being Mary C. Smith, daughter of the neighbouring William and Jane Smith.

Cluster Analysis

The FAN principle is also described by the terms Cluster Research[5] or Cluster Genealogy[6], but not the term Cluster Analysis. Cluster Analysis is a long-standing research method that has been applied to many different fields. I will take a brief tour of it in order to determine what, if any, relationship exists to the FAN Principle.

Cluster analysis separates data (or objects) into groups that are meaningful, useful, or both. Methods of cluster analysis fall into two broad types: ones where the clustering is evident in the data itself (empirical) and ones where we deliberately categorise data according to some shared property (categorical).

With empirical cluster analysis, data is typically plotted in some data space (e.g. pressure against temperature) or geographical space and the distribution of points examined for clusters. Although it’s usually easy to distinguish clusters visually, there are many different algorithms for locating them and establishing their boundaries. It would then be necessary to explain the number or shape of the clusters, or even their very existence.
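As a minimal sketch of one such algorithm, the Python below groups points by chaining together any that lie within a chosen distance of each other (a simple single-linkage style of grouping). The points, the threshold, and the function name are all invented for illustration; real clustering tools offer many more refined methods.

```python
from math import dist

def threshold_clusters(points, max_gap):
    """Group points so that each one is within max_gap of another in its group."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = unvisited.pop()
        cluster, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            # Collect every unvisited point close enough to the current one.
            near = {i for i in unvisited
                    if dist(points[current], points[i]) <= max_gap}
            unvisited -= near
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

# Invented data: two tight groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (5, 15)]
print(threshold_clusters(points, max_gap=1.5))
```

The result lists the indexes of the points in each cluster. Changing `max_gap` changes the number of clusters found, which is one reason why the existence and shape of clusters still needs explaining afterwards.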

Figure 2 – Empirical cluster analysis: clusters evident in the data, and require explanation.

One example of this method might be when analysing the geographical distribution of some disease or ailment. A genealogical example might be when looking at the distribution and movements of a family.

With categorical cluster analysis, data (or objects) are separated into groups that reflect some common attribute or property. This may be an abstraction to support statistics or summarisation, or a precursor to some other type of analysis. It is this type, therefore, to which the FAN Principle is related; the groups of family, friends, neighbours, and other associates, are the clusters into which we have separated the general associates of a target person.

Figure 3 – Categorical cluster analysis: separation in preparation for some other study.

Because the conceptual clusters have been predefined in this method, we might expect to see alternative results if we change them. In fact, in both types of analysis, clusters may be strictly partitional (each object in only one cluster), hierarchical (an object in a child cluster is also in the containing parent cluster), or overlapping (an object may be in multiple, non-exclusive clusters).

Figure 4 – Overlapping, non-exclusive clusters.

The FAN Club clusters are overlapping rather than hierarchical. For instance, not all family are neighbours, and not all neighbours are family.
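The distinction can be sketched with ordinary Python sets. The names below are invented, and the two helper functions are hypothetical; the point is only that FAN Club clusters share members without one cluster sitting wholly inside another.

```python
# Hypothetical associates of a target person; the names are invented.
family = {"Ann", "Ben", "Cora"}
neighbours = {"Cora", "Dan", "Eve"}
friends = {"Ben", "Eve", "Fay"}

def is_partition(*clusters):
    """True when no member appears in more than one cluster."""
    total = sum(len(cluster) for cluster in clusters)
    return total == len(set().union(*clusters))

def is_nested(a, b):
    """True when one cluster lies wholly inside the other (hierarchical)."""
    return a <= b or b <= a

print(family & neighbours)                        # members who are both
print(is_partition(family, neighbours, friends))  # False: clusters share members
print(is_nested(family, neighbours))              # False: neither contains the other
```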

There exists a specific cluster variant that deserves a mention: graph-based. In this variant, objects in a cluster are connected to other members of the same cluster by some relationship type, and not connected to members of other clusters. This is different to the clusters of objects that share a common property. At a stretch, it might be possible to describe a family cluster in this way as all its members are connected by some sort of family relationship; however, it is not particularly relevant to the FAN Club as associations there are specifically relative to the target person rather than to other members.

FAN Club

Let’s take a deeper look at the nature of the FAN Club. We have already mentioned that its clusters are overlapping, or non-exclusive, and cited the example that not all family are neighbours and vice versa. In fact, the clusters are simply the target’s associates grouped for priority of investigation. The FAN QuickSheet contains a diagram illustrating the concept of targeted research, where concentric circles represent clusters of associates, ordered according to the strength of their connection with the target, which might be searched from the innermost outwards.

  • Target person
  • Known relatives and in-laws
  • Others of same surname
  • Associates and neighbours of target
  • Associates of those associates

First, notice that these circles are not literal interpretations of the FAN acronym; it would be a limiting folly to treat the acronym as some simple prescription. Even this diagram is just a guide, though, and we might conceive of more circles according to the nature of the particular problem and some knowledge of the potential sources. For instance, people of the same surname who are also neighbours are more likely to be family members (known or unknown) and so potentially represent a stronger connection than ones elsewhere. I once subdivided neighbours to prioritise ones in the same occupation, based on the assumption that they may have worked in the same place as the target.
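Treating the circles as an ordered list that can be changed per problem might be sketched as follows in Python. The associates and the function name are invented for illustration; only the five circle names come from the list above.

```python
# Circles from the targeted-research list, innermost first.
CIRCLES = [
    "Target person",
    "Known relatives and in-laws",
    "Others of same surname",
    "Associates and neighbours of target",
    "Associates of those associates",
]

def search_order(associates, circles=CIRCLES):
    """Sort (name, circle) pairs so that the innermost circles come first."""
    rank = {circle: position for position, circle in enumerate(circles)}
    return sorted(associates, key=lambda pair: rank[pair[1]])

# Invented associates, purely for illustration.
associates = [
    ("Neighbour A", "Associates and neighbours of target"),
    ("Cousin B", "Known relatives and in-laws"),
    ("Namesake C", "Others of same surname"),
]
for name, circle in search_order(associates):
    print(f"{circle}: {name}")
```

An extra circle, such as neighbours in the same occupation, could be introduced by passing a longer `circles` list with the new entry placed at the appropriate priority.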

So are all these associates just acquaintances? Not really; the use of the word also embraces cases of a general connection, or association, between those persons and the target. In essence, these associations are the properties shared by the members of each cluster. The associates might be acquaintances … or they might be family members who had never met, or persons of the same surname but different family, etc. A favourite of mine, also mentioned in the FAN QuickSheet, is the list of persons interred in the same or neighbouring burial plots because it often throws up surprising further connections.

The targeted-research list, above, illustrates another technique: that of looking at indirect associates, or associates-of-associates in this case. Looking at the FAN Club of an associate may be necessary in order to understand the nature of a connection, or to investigate further connections; this will begin to form a network rather than a set of connections anchored on the target person. The worked example of Mary Smith uses this technique as it looks at the FAN Club of her husband. We can envisage many variations of this, such as family-of-neighbours or family-of-family (including in-laws), but where should we stop? Well, there’s no shortage of scope but an iterative approach, customised for each case, would be more practical than trying to enumerate every possible direct and indirect associate at the outset. The strength of an indirect association depends on the product of the strengths of the individual direct associations, and so it can fall off very quickly.
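The fall-off can be seen with a little arithmetic. The Python sketch below assumes, purely for illustration, that each direct association is scored between 0 and 1; no such scale is defined by the FAN Principle itself, and the numbers are invented.

```python
from math import prod

def indirect_strength(direct_strengths):
    """Strength of a chain of associations: the product of its direct links."""
    return prod(direct_strengths)

# Invented scores: every direct association is rated 0.5.
for hops in (1, 2, 3, 4):
    print(hops, indirect_strength([0.5] * hops))
```

With every link rated at 0.5, an associate-of-an-associate is already down to 0.25, and each further hop halves the strength again.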

A recent case of mine involved the identity of a George Kirk, for whom there was no visible birth/baptism record. By identifying his father (Joseph Kirk), from his second marriage certificate, and so finding his mother (Elizabeth₂ Hutchinson), and then looking at her parents (Joshua & Elizabeth₁), and then identifying all her siblings, it was possible to show that George was actually Elizabeth₂’s own son, but baptised as the very youngest son of Joshua & Elizabeth₁ before she got married. What I’d done was to deliberately look at the family-of-family of George.

Whether we’re looking at direct or indirect associates, looking at related FAN Clubs means that we have intersecting clusters.

Figure 5 – Intersection of FAN Clubs for direct and indirect connections.

Those intersecting clusters represent the fact that there may be some shared associates. This will be far more likely for a direct associate (i.e. the FAN Club of someone in the target’s FAN Club) than for an indirect associate (e.g. the FAN Club of someone who has an associate in common with the target’s FAN Club).

All this means, of course, is that those lives are interlocked, and the history of one will affect and be affected by the history of others. Putting it another way: you cannot research an individual in isolation!

It is usually said that whole-family research is a prerequisite for cluster research; however, I will suggest that family reconstitution is a more fundamental notion because it applies to arbitrary families rather than specifically your own. The term is defined in one dictionary as follows: “The technique of compiling family trees for as many people as possible in a chosen area of study, e.g. a parish, so as to obtain detailed demographic data on matters such as age at marriage, or expectation of life”.[7] While this is a fair definition, I take issue with the emphasis on demographics, particularly from a family-history dictionary.

The concept is fundamental, therefore, because it underpins several distinct genealogical pursuits:

  • Whole-family genealogy. While I cannot find a strict definition, it can be described in terms of its differences from direct-line genealogy where only direct ancestors (maternal and paternal) are researched. Whole-family genealogy means that the siblings of every ancestor are also researched, and possibly their descendants too. Either of these may be constrained by surname, such that only direct ancestors with your surname are considered (possibly for establishing a particular pedigree), or only descendants of some single progenitor who carry your surname (usually for a so-called “your family tree”). Whole-family genealogy also means looking at the offspring of any multiple marriages, and also the marriages of the women in each generation.
  • One-name studies. Studying everyone of a given surname, including its variants. This might be worldwide, or it might be constrained by place and/or time period.
  • One-place studies. Studying the whole population of a given community, such as a village or hamlet.

Family reconstitution is essential for any of these pursuits because it is the first step in establishing the structure of some community; without that, you could not investigate the associations of a family with other families or individuals.

So let’s cast the net even further: what about micro-history? Well, all of the above pursuits are variations of micro-history, but my own use of this term would also include historical subjects other than persons, such as places, groups (e.g. regiments, companies, classes), and animals.

Genealogy is almost always about persons rather than places, or any other historical subject, but the same method would be applicable in all cases. For instance, we could analyse places in a similar manner to persons, and establish the identity of a place reference through an examination of its associations. This would force a difference between treating a place as some property for clustering persons and treating it as an independent entity with its own identity and associations. In reality, establishing the identity of a given person may first require the identity of some place to be established, and so it is artificial to think of these cases as fundamentally separate.

Link Analysis

Having found items of relevant information in the extra sources from the community context, it is then time to make an argument for the identity of some person, or for the biological relationships within some family. We’re now out of cluster analysis and into another long-standing method: Link Analysis.

The Wikipedia page on link analysis actually gives a pretty poor summary of the method; it gives the impression that it is all about large-scale software processing of connections found in bulk data. While this may be the current usage, it is a method that predates the computer age, and it was originally used as a way of visualising connections in logical deduction; see the introduction to Our Days of Future Passed, Part III.

Figure 6 – Use of Link Analysis for analysing and correlating source information.

In other words, the application of genealogy software to this method would primarily be about visualisation, and helping us keep track of sources, information, and specific references. An Internet search for images relating to “Link Analysis Software” gives many ideas for visualisation, but they all share the same fault: they are node-heavy and assume that most of the information is related to the objects (nodes) rather than to the links (edges). In a genealogical context, each link would have to embrace any quoted information (or links to associated transcriptions), the relevant source, and our analysis or deduction (in narrative form), whereas the objects would mostly represent person references (as opposed to identified people who lived).
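A link-heavy model of that kind might be sketched as below in Python. This is not taken from any existing genealogy product: the class and field names are hypothetical, and the example values are placeholders rather than real transcriptions or analysis.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class PersonReference:
    """A mention of a person in a source, not yet an identified individual."""
    ref_id: str
    name_as_written: str

@dataclass
class Link:
    """A connection between two person references; most detail lives here."""
    first: PersonReference
    second: PersonReference
    source: str
    quoted_information: str
    analysis: str  # our reasoning, in narrative form

@dataclass
class LinkChart:
    links: List[Link] = field(default_factory=list)

    def add(self, link: Link) -> None:
        self.links.append(link)

    def links_for(self, ref: PersonReference) -> List[Link]:
        """Every link that touches the given person reference."""
        return [link for link in self.links
                if ref in (link.first, link.second)]

# Placeholder content only; none of this is real transcription or analysis.
mary = PersonReference("ref-1", "Mary Smith")
james = PersonReference("ref-2", "James Boyd")
chart = LinkChart()
chart.add(Link(mary, james,
               source="(source citation)",
               quoted_information="(quoted text or link to a transcription)",
               analysis="(narrative reasoning for connecting the references)"))
print(len(chart.links_for(mary)))
```

The nodes carry almost nothing beyond a name as written, while each link carries the source, the quoted information, and the narrative reasoning, which is the reverse of the node-heavy designs criticised above.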

So where would the transition arise between clusters and links? We have already mentioned that cluster research, and the FAN Principle, employ cluster analysis in terms of categorising persons for targeted research. That research would find relevant information that could be used to create an argument for establishing someone’s identity, but it is highly unlikely that any single item will be enough to achieve this. Correlating and comparing those items is where link analysis would be employed, irrespective of whether this was done in your head, with a pencil and paper, or with some new software tool.


My original intention with this analysis was to look at the essence of the FAN Principle, and so understand how it is applied to address specific research problems; to compare this research method with certain ones outside of genealogy; and to understand the relationship between these methods and the various genealogical (or historical) goals that we may aspire to. Out of this deeper understanding of the overall landscape should come the ability to develop software that might better help in those pursuits — and particularly in their visualisations — as opposed to simply interpreting the FAN acronym literally.

What I didn’t anticipate was the level to which cluster research applies in all veins of genealogy. Just as historical context is essential for the study of historical events, so community context is essential in the study of individuals. It is probably one of the most fundamental concepts in any historical research, and it lies at the heart of many of its pursuits, in addition to its application to solving difficult identity problems. What a shame, then, that modern Internet genealogy encourages people to deal only with direct answers to simple questions; more Charlie foxtrot than cluster research!

Inferential genealogy should be a concept applicable to everyone’s research, but it has sadly become associated with the professional or the academic. I have already heaped much of the blame for this on commercial genealogy’s simplistic model (see Reaping What We Sow, Part I and Reaping What We Sow, Part II) but can anything be done to counter it? Some inferential cases may be very complicated and involve extensive associations in order to make an argument, but not all will be so complicated as to be out of the reach of the more ordinary genealogist. What is missing is a set of powerful tools for visualising the associations, and supporting our inferences by accepting written narrative at appropriate places. Of course, it would have to be a benefit rather than a chore, and the easier it was to use, the greater that benefit.

[1] Elizabeth Shown Mills, QuickSheet: The Historical Biographer's Guide to Cluster Research (The FAN Principle) (Baltimore: Genealogical Publishing Co., 2012); hereinafter cited as FAN QuickSheet.
[2] Elizabeth Shown Mills, “QuickLesson 11: Identity Problems & the FAN Principle”, Evidence Explained: Historical Analysis, Citation & Source Usage (https://www.evidenceexplained.com/content/quicklesson-11-identity-problems-fan-principle : posted 26 Aug 2012, accessed 24 Oct 2016); hereinafter cited as QuickLesson 11.
[3] From Mark Antony’s speech in William Shakespeare’s play: Julius Caesar.
[4] In genealogy, that is. The term “Fan principle” may also be a reference to the “Fan dipole” antenna design for optimised bandwidth, or the “Fan-Line Principle” that employs three fan lines in stock market predictions.
[5] Mills, FAN QuickSheet.
[6] “Cluster genealogy”, Wikipedia (https://en.wikipedia.org/wiki/Cluster_genealogy : accessed 4 Nov 2016).
[7] David Hey, The Oxford Dictionary of Local and Family History (Oxford University Press, 1997), s.v. “family reconstitution”.