Why are Web hyperlinks so unstable? Is there a specific
reason for this in the field of genealogy? Could archival approaches help? Will
the Internet ever learn, and will genealogy survive?
Figure 1 – Decaying URL links.
OK, the introduction is a little emotive, but the issue of
unstable hyperlinks is a bane for any researcher or author wanting to cite
online resources, or for anyone merely wishing to return to something of interest.
A hyperlink is simply a field that can take you
to a different location, either in the current document/page or in a different
one. In the context under discussion, though, we mean hyperlinks on the
Internet: the links that connect HTML pages and so form the core
of the World Wide Web.
Each page has an address, or URL (Uniform
Resource Locator), by which links find their target. The vast majority of these
begin with the “http://” that we’re all familiar with, and which indicates that
they are using the HTTP
protocol. Hence, to be more accurate, the problem is really that of unstable
URLs rather than unstable hyperlinks; when the address of the target is changed
(or deleted) then the links to it become broken.
As with supermarket shelves, there is a perception in Web design
that tearing down some organised arrangement and replacing it with a different
one will always have an advantage. This might include giving things a fresh
look, providing easier access, or simply justifying someone's employment. The
persons responsible are not looking at such changes from the same perspective
as the poor Web user (or supermarket shopper) who is trying to find the same
thing as they used before. This so-called Link rot is an issue
affecting all Web sites, not just genealogical ones.
A particular problem with genealogical sites, and with many
historical ones in general, is that we usually have no option but to specify
the search criteria by which we found some item. But citing an index is not
the same as citing the underlying item. When some re-indexing occurs then our
previous criteria may no longer be appropriate. It is not uncommon, too, to
find databases renamed, or merged, thus exacerbating the problem of
reproducibility.
Although less prevalent, the loss of a Web site — say after
failing to pay for its upkeep — is another potential cause of broken links.
Even if its holdings are snapped up by some other site then the associated URLs,
and possibly the very organisation of the resources, will have changed.
A sorry example of this involves the “Your Archives”
facility that was launched in 2007 by The National Archives of the UK (TNA). Their
description of it read “…providing an online platform for users to contribute
their knowledge of archival sources held by The National Archives and other
archives throughout the UK”, and it acquired a huge amount of information that
was not available in their other collections. One of the community projects was
the “Historical Streets Project” that not only had indexes of which census
pages covered particular streets, but also allowed people to “…write stories
about localities, properties, institutions, and businesses etc.” Over 31,000
people registered and contributed but it was then abandoned in 2012 and the
facility closed. The contributions are still on their Web site but they were dumped
into a read-only archived location (e.g. http://yourarchives.nationalarchives.gov.uk/index.php?title=Category:1841_census_registration_districts
becoming http://webarchive.nationalarchives.gov.uk/20130221233217/http://yourarchives.nationalarchives.gov.uk/index.php?title=Category:1841_census_registration_districts)
with the result that many internal links are now broken.
In 2010, Bugeja and Dimitrova showed that citations to
online resources have a rate of decay that can be measured using a half-life, akin to radioactive decay.[1] Radioactive
decay occurs exponentially, and the radioactivity reduces by a factor of two
after each passage of a fixed period, known as the half-life. The same was shown to occur with the URLs in journal
volumes, with half of the still-working URLs becoming broken during each successive half-life.
Figure 2 – Half-life of URL links.
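The arithmetic behind a half-life can be sketched in a few lines of Python. This is only an illustration of exponential decay: the four-year half-life below is an assumed value chosen to make the numbers easy to follow, and is not a figure taken from the study.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of links still working after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)


# Assumed half-life, purely for illustration (not a measured value).
HALF_LIFE_YEARS = 4.0

for years in (0, 4, 8, 12):
    fraction = surviving_fraction(years, HALF_LIFE_YEARS)
    print(f"after {years:>2} years: {fraction:.1%} of links still work")
```

Each additional half-life halves whatever is left, so the loss is steep at first and never quite reaches zero.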
So where are the weak spots in all this? Well, the basic
structure of a URL may be summarised as follows:
scheme: [//hostname] [/]path [?query] [#fragment]
For instance:
http://www.familyhistorydata.parallaxview.co.uk/home/document-structure/person#PERSON
The scheme usually
indicates the protocol: HTTP in this case.
The hostname involves a
hierarchical sequence of domain names, beginning with the top-level domain or
TLD (“.uk”), and operating right-to-left through a second-level domain name
(“.co”), etc. Each is a subdomain of the previous domain, so “www” is a
subdomain of “familyhistorydata.parallaxview.co.uk”, etc.
The path appears as a
sequence of slash-separated folder names, and usually maps directly to an
equivalent hierarchical path on the Web server, but not always. The last part
of the path is usually a file name, but this particular example describes page
objects in a database rather than page files on disk and so there’s no *.html
file.
The optional fragment
identifier may indicate the placement of, or direction to, a specific item, and
is most often the location of a heading or anchor-point within an HTML page.
We’ll mention ‘query’ in a moment.
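For anyone wanting to see these parts separated out, the following is a small sketch using Python's standard `urllib.parse` module on the example URL above. It simply splits the address into its components; it makes no contact with the site.

```python
from urllib.parse import urlsplit

url = "http://www.familyhistorydata.parallaxview.co.uk/home/document-structure/person#PERSON"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.familyhistorydata.parallaxview.co.uk
print(parts.path)      # /home/document-structure/person
print(parts.query)     # empty string: this URL has no query
print(parts.fragment)  # PERSON

# Domain names are read right-to-left, starting from the top-level domain.
labels = parts.hostname.split(".")[::-1]
print(labels)          # ['uk', 'co', 'parallaxview', 'familyhistorydata', 'www']
```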
The most common problem with the URL is that it may expose
aspects of the Web server’s technology and current physical organisation. The
term Semantic URL
describes a URL form that is cleaner and more user-friendly because it
describes a conceptual organisation instead of a particular physical one. It is
usually said that these decouple the user interface (UI) from the server’s
implementation, but it could be argued that URLs were never designed to be
directly visible in UIs. Irrespective of this, they achieve greater longevity
because the server can be reorganised without breaking previous URLs.
The term used to describe stable or persistent URLs is Permalinks. To some extent,
the many URL-shortening
sites such as bit.ly, goo.gl,
and tinyurl.com can achieve this, although
they generally shorten long URLs by mapping them to a short key (commonly a hash
or counter written in base-36 or base-62), which results in apparently random
characters. Some of these
sites can generate more readable names, at the expense of some extra length,
but the results are still flattened with the loss of any hierarchical
semantics. Their advantages lie in situations of restricted length, such as in
SMS text messages, and in tracking for usage statistics, but they also get abused
by spammers because you cannot see where they will redirect you.
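To show why such short names look random and carry no hierarchy, here is a minimal sketch of base-62 encoding. It assumes that a shortening service stores each long URL against a numeric key and publishes that key in base-62; the actual schemes used by the sites named above are not documented here, and the function names are my own.

```python
import string

# 62 symbols: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(number: int) -> str:
    """Write a non-negative integer key as a base-62 string."""
    if number < 0:
        raise ValueError("key must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base62(token: str) -> int:
    """Recover the integer key from its base-62 string."""
    number = 0
    for character in token:
        number = number * 62 + ALPHABET.index(character)
    return number


key = 125_000_000  # hypothetical database key for one long URL
token = encode_base62(key)
print(token, decode_base62(token) == key)
```

The token is compact, but nothing in it says what the target is or where it sits in any hierarchy.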
Persistent
URLs (PURLs) relate to a particular redirection service that allows
manageable (i.e. modifiable) mappings between your public URLs and the
underlying ones. They differ from Permalinks in that they use a
different domain and are aimed at lifetimes of decades rather than years.
The general perception is that we need a stable and documented
address for online resources so that we can direct someone to the exact same
page and displayed data. Certainly this is an issue if you’ve hard-coded URLs
in printed material, or you’ve embedded a URL in a hard-coded QR code, or you’ve hard-coded
either of these on something physical like a grave marker. If you can’t
guarantee the persistence of the URL then you really need an intermediate one
that can be managed and redirected as appropriate.
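The idea of a managed intermediate URL can be sketched as a simple lookup table. Everything here is hypothetical (the paths, the host names, and the function names); a real redirection service would answer with an HTTP redirect rather than returning a string, but the principle is the same: the published address never changes, only its target.

```python
# Published (stable) paths mapped to current (changeable) locations.
# All paths and hosts are invented for illustration.
redirects = {
    "/id/memorial/1234": "http://old-host.example.com/pages/memorial.php?id=1234",
}


def resolve(public_path: str):
    """Return the current target for a published path, or None if unknown."""
    return redirects.get(public_path)


def retarget(public_path: str, new_target: str) -> None:
    """Point an existing published path at a new location."""
    if public_path not in redirects:
        raise KeyError(f"unknown public path: {public_path}")
    redirects[public_path] = new_target


print(resolve("/id/memorial/1234"))
# The site is reorganised; the printed URL or QR code stays valid.
retarget("/id/memorial/1234", "https://new-host.example.org/memorials/1234")
print(resolve("/id/memorial/1234"))
```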
This may be the general perception but it isn’t the full
story. For a start, the increasingly dynamic nature of the Web means that the
same URL does not always display the same data, even when nothing has been rearranged.
The page may allow some interaction with a back-end system, such as a database,
and so what you see may depend on what you typed in the page, and possibly what
you did before. This is already a problem for projects such as the Internet Archive
since it cannot guarantee to restore fully operational Web sites.
For genealogy sites, the most important manifestation of
this is the search operation. The vast majority of resources are indexed by one
or more fields (primarily personal names) and there is no published URL format
that will take you directly to the same page or data that you’ve found via your
search; you have to cite the search parameters and hope that anyone following
your citation will find the same information, and the same edition of that
information (more on this later). So if we’re providing a URL, do we give the
generic one of the content provider, or one taking you to their search
dialogue, or something else?
Some resources may consist of just browsable un-indexed
images, or tabulated data, and their citations might then use waypoints in order to guide someone else
to the same information.
A little earlier, I indicated that URLs can also include a
‘query’ clause, although I deferred a description of it. The clause consists of
an ampersand-separated list of name=value
terms, and it is employed when access to the associated online resource involves
an element of logic that cannot be represented by a simple hierarchical
‘path’. A search operation is a good
example, and the search parameters might be represented as individual query terms,
e.g.
http://search.example.com?given-name=Tony&surname=Proctor
Google’s approach is slightly different in that they encode
the complete search string, and more, in the fragment identifier, e.g.
https://www.google.com/#safe=off&q=wiki+internet+archive
Unfortunately, there is no requirement for the associated
site to update the active URL in respect of what you’ve typed, or what local
action you have requested. Such sites do not offer the opportunity to make
effective use of a URL in a citation.
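The difference between the two example URLs given earlier (parameters in the query versus parameters in the fragment) can be seen by taking them apart with Python's standard `urllib.parse` module. This is only a sketch of how the two forms decompose.

```python
from urllib.parse import parse_qs, urlsplit

# Search parameters carried in the query clause.
search = urlsplit("http://search.example.com?given-name=Tony&surname=Proctor")
print(parse_qs(search.query))
# {'given-name': ['Tony'], 'surname': ['Proctor']}

# Search parameters carried in the fragment identifier instead.
google = urlsplit("https://www.google.com/#safe=off&q=wiki+internet+archive")
print(repr(google.query))          # '' : nothing in the query clause
print(parse_qs(google.fragment))
# {'safe': ['off'], 'q': ['wiki internet archive']}
```

One practical consequence is that a browser does not send the fragment to the server, so parameters held there are interpreted by scripts running in the page itself.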
Another example is the OpenURL concept which uses the
URL query string to encode the elements of a citation in order to retrieve a corresponding
Web resource from an unspecified target. Although general-purpose in principle,
it is used almost exclusively for published books and journals available
online. The idea is that some resolver will parse the citation elements and
return a link to a participating library or repository hosting an online copy.
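As a rough sketch of the idea, the citation in footnote [1] could be encoded into a query string as follows. The resolver address is hypothetical, and the key names follow my understanding of the key/encoded-value convention of OpenURL 1.0 (Z39.88-2004), so treat them as an assumption to be checked against the standard rather than as a definitive rendering.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RESOLVER = "http://resolver.example.org/openurl"  # hypothetical resolver address

# Citation elements as name=value terms (key names assumed, see above).
citation = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
    "rft.genre": "book",
    "rft.btitle": "Vanishing Act: The Erosion of Online Footnotes and "
                  "Implications for Scholarship in the Digital Age",
    "rft.au": "Michael Bugeja",
    "rft.pub": "Litwin Books",
    "rft.date": "2010",
}

openurl = RESOLVER + "?" + urlencode(citation)
print(openurl)

# A resolver would start by decoding the same elements back out of the query.
decoded = {name: values[0] for name, values in parse_qs(urlsplit(openurl).query).items()}
print(decoded == citation)
```

Note that the URL describes what is wanted rather than where it is; finding a copy is left to the resolver.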
If you cite something in an archive, you would include the
codes by which it was catalogued, and you wouldn't attempt to give its precise
physical location in the building (although you might mention that it had to be
retrieved from some auxiliary storage area). It then seems wrong that online
resources are expected to be cited using their electronic location: their URL.
If the provider made permalinks available, and they were
human-readable semantic URLs, then it would help, but it would not mean that the
data had actually been organised that way, so where is the guiding principle?
Imagine if the content providers used an archival approach
to their collections of images and transcribed extracts, treating them
according to the natural hierarchy associated with their provenance and source arrangement
(see Hierarchical
Sources). Most URLs betray the fact that genealogical data organisation is
determined by software principles rather than by archival principles, with
the provenance and structure of the source data being an afterthought at best. Such
hierarchical arrangements would provide the natural levels at which to include
archival descriptions, including the precious source-of-the-source that is so often
inaccurate or omitted.
More than this, though, it would allow the provenance of the
provider’s data to be plainly visible: such information as who produced a given
image, who transcribed certain details, and when that transcription was last
updated (i.e. corrected).
There’s a tendency to think of a citation as a pointer to
the original information source. Of course, this is wrong! While the
source-of-the-source is important, the citation is primarily a pointer to the
information that you consulted, and online sources will be derivatives of those
originals; images will not be exact copies, and transcriptions (including
transcribed extracts) will often be inaccurate (see Anatomy
of a Source).
So what approach might help, given that content
providers will have digital images and/or databases of transcribed information,
as opposed to physical documents or other artefacts? Well, if they followed
archival principles then they would catalogue sources as they’re scanned, and
as they’re transcribed, treating their digital resources as sources in their
own right, undergoing accession into their digital repository.
The case I want to use in order to illustrate my point is the
decennial census of the UK, and this will involve looking at how it is
currently presented by Ancestry and Findmypast.
The census returns were taken on the same day across the UK but were subject to different jurisdictions and are now stored and organised differently. Those of England and Wales are stored at The National Archives of the UK (TNA). Those of Scotland were stored at the General Register Office for Scotland, Edinburgh (GROS); however, on 1 April 2011, the GROS was merged with the National Archives of Scotland to form the National Records of Scotland. The surviving pre-1921 all-Ireland censuses are stored at the National Archives of Ireland, but later ones for the northern counties are stored at the Public Record Office of Northern Ireland (PRONI). Although not technically part of the UK,[2] those for the Crown dependencies of the Channel Islands and the Isle of Man are also held at TNA.
TNA has a comprehensive scheme for cataloguing the materials
it holds, and it publishes recommendations for how that scheme should be used to cite its
materials at Citing
Documents. For the censuses, this amounts to using the departmental code,
series number, piece number, and book number (for 1841) for the specific census
item; and then internal identifiers of folio and page number to identify a
specific page within that item. For instance, ‘HO 107/11/12, folio 12, page 19’
in 1841, or ‘RG 9/2460, folio 43, page 27’ in
1861. These alphanumeric codes are used by virtually all UK researchers, but
they are often viewed as cryptic and unhelpful in the US where all relevant long-hand
details are expected in the citation, including the county, ecclesiastical
parish, registration district, and possibly more.
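Because these references follow a regular pattern, they can be split into their parts mechanically. The following is a minimal sketch that handles only the two example forms quoted above; the class and function names are my own, and other citation layouts would need a more forgiving pattern.

```python
import re
from typing import NamedTuple, Optional


class CensusRef(NamedTuple):
    department: str
    series: int
    piece: int
    book: Optional[int]  # only present for 1841-style references
    folio: int
    page: int


# e.g. 'HO 107/11/12, folio 12, page 19' or 'RG 9/2460, folio 43, page 27'
PATTERN = re.compile(
    r"^(?P<dept>[A-Z]+) (?P<series>\d+)/(?P<piece>\d+)(?:/(?P<book>\d+))?"
    r", folio (?P<folio>\d+), page (?P<page>\d+)$"
)


def parse_census_ref(text: str) -> CensusRef:
    """Split a census reference of the quoted form into its components."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised census reference: {text!r}")
    book = match.group("book")
    return CensusRef(
        department=match.group("dept"),
        series=int(match.group("series")),
        piece=int(match.group("piece")),
        book=int(book) if book is not None else None,
        folio=int(match.group("folio")),
        page=int(match.group("page")),
    )


print(parse_census_ref("HO 107/11/12, folio 12, page 19"))
print(parse_census_ref("RG 9/2460, folio 43, page 27"))
```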
What may not be known is that online copies of TNA’s images,
and associated transcriptions, have to be indexed by these codes. I cannot
provide a reference but I understand that this stipulation is in the licensing
necessary from TNA. As a result, the codes provide a much more
accurate way of locating a known page than the vagaries of name-based searches,
but this does not mean that additional detail cannot be provided. Indeed, the content-provider
name is essential since their images and transcriptions will not be identical
to others, as is identification of the person or family being referenced on the
page. Citations are also about how source information supports or refutes an
argument, and so any detail that establishes the nature of the source and the
strength/weakness of its information will always be useful. However, failure to
give all those archival codes would be doing a disservice to any reader with a
UK interest!
NB: While these codes apply to those censuses held by TNA,
they do not apply to the others such as the Scottish ones. This gives a small
problem for the content provider if they wish to present a single census
collection for the UK.
Picking on 1861 for the purposes
of illustration, Ancestry has an “1861 UK Census Collection”, described as “The
1861 Census of England, Wales, Scotland, Channel Islands and Isle of Man”,
which is accurate according to the contents therein, but it doesn’t contain
Ireland, which was in the UK at the time. Although a search across the
whole collection is the norm, it is also possible to search the individual
databases in that collection by selecting from the list at the bottom of the
search page:
1861 Channel Islands Census
1861 England Census
1861 Isle of Man Census
1861 Scotland Census
1861 Wales Census
For those relevant to TNA, extra input fields are presented
for the piece, folio, etc.
Findmypast has an “1861 England, Wales & Scotland
Census” database that presents input fields for piece, folio, etc., in all
cases. The database also includes the Crown dependencies and so the title and
description are both inaccurate according to the contents, even though it is
not described as a “UK” collection. It is not as easy to restrict your searches
to a particular country or island in this database. Back in 2015, the company
came under fierce criticism, not least from Chris Paton of The British GENES Blog,[3]
because they’d introduced pseudo-TNA codes to apply
to these input fields for the Scottish census transcriptions (ScotlandsPeople presented the images, but Findmypast were not allowed
to use them).
Such criticism was well-founded if it related to the
introduction of fake TNA codes that were not distinguished from the real ones,
and which would therefore create confusion. However, if Findmypast had simply
introduced a system of archival cataloguing for their own materials (following
TNA’s precedent) then criticism would have been ill-founded; such an approach
would have been wise, and that’s effectively what I am recommending here.
Unfortunately, it is unlikely that this was Findmypast’s intention.
Entering a document reference into TNA’s Web site (e.g. “RG 9/2460”) displays an archival description of the associated item, but neither Ancestry nor Findmypast can do this because they have no reference that is separate from their search functions. Let’s take a moment to look at their URLs (at the time of writing) for “James Procter” in the aforementioned 1861 census page, found directly via piece, folio, and page.
The list of people on this page in Ancestry’s “1861 UK Census Collection” corresponds to an opaque URL of:
http://search.ancestry.co.uk/cgi-bin/sse.dll?db=uki1861&gss=sfs28_ms_db&new=1&rank=1&msT=1&_F0007B87=2460&_F0007B88=43&_F000597C=27&MSAV=1&uidh=u54
It is opaque because there is little evidence of the original
parameters that I specified. The transcribed extract for James is at the
following URL, and the image URL for the page is so long that I won’t bother presenting
it:
The list of people on that page in Findmypast’s “1861 England,
Wales & Scotland Census” corresponds to a URL of:
This is much more readable, and the search parameters are
clearly visible. However, the specific transcribed extract for James
corresponds to the following:
This has lost all connection with the search parameters, and it contains undocumented codes that may or may not be persistent.
So what am I saying here? The different censuses should be
accessible individually, thus respecting their provenance, and not dumped into
a single amorphous database; Ancestry makes a good job of this. The provider
should employ a set of persistent
identifiers to organise their information according to its natural
structure and provenance. Each page image and each transcribed extract (e.g.
details of a given census person) should have a unique and persistent URL to
access it directly via these identifiers, without having to go through the same
search again — and so without having to pray that the indexing hasn’t changed.
The censuses held by TNA are interesting because the pages
are already indexed by their original archival codes. These are termed natural keys since they
are part of the original data and are not fabricated purely for some database
indexing. When such keys are well-defined (and particularly when they’re
mandated) then the provider’s identifiers should acknowledge them; however, the
provider’s identifiers must be more detailed because (a) images and transcribed
extracts must be two distinct derivatives of the same original information, and
(b) the extracts may be at a lower level than the original source item level, or even below the individual
page level, as in the case of a census.
Or, in summary, all online genealogical data should:
- Expose persistent semantic URLs for each image and for each transcribed extract,
- Include documented persistent identifiers in those URLs related to the provenance and natural structure of the associated source, as held by the provider,
- Link the provider’s identifiers to corresponding archival descriptions, to the provenance of the information, and to source-of-the-source information.
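To make these recommendations a little more concrete, here is a sketch of how a provider might derive such URLs from the natural keys of a TNA census page. The host name, the path layout, and the function name are all hypothetical; no provider is known to use this scheme, and it is offered only to show that image and transcript can share one hierarchy while remaining distinct resources.

```python
BASE = "https://provider.example.com"  # hypothetical content provider


def semantic_url(dept, series, piece, folio, page, kind, entry=None):
    """Build a hierarchical URL for a page image or a transcribed extract.

    `entry` identifies one person's extract within the page, i.e. a level
    below the page itself, and so only applies to transcripts.
    """
    if kind not in ("image", "transcript"):
        raise ValueError("kind must be 'image' or 'transcript'")
    if kind == "image" and entry is not None:
        raise ValueError("an image covers the whole page, not one entry")
    parts = ["tna", f"{dept.lower()}{series}", str(piece),
             f"folio-{folio}", f"page-{page}"]
    if entry is not None:
        parts.append(f"entry-{entry}")
    parts.append(kind)
    return BASE + "/" + "/".join(parts)


# 'RG 9/2460, folio 43, page 27' from the 1861 census
print(semantic_url("RG", 9, 2460, 43, 27, "image"))
print(semantic_url("RG", 9, 2460, 43, 27, "transcript", entry=5))
```

Truncating such a URL at any level (piece, folio, page) gives a natural place for the provider to hang an archival description of that level.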
[1] Michael Bugeja and
Daniela V. Dimitrova, Vanishing Act: The Erosion of Online Footnotes and
Implications for Scholarship in the Digital Age (Duluth, Minnesota: Litwin
Books, 2010).
[2] For the purposes of
the British Nationality Act 1981, the
“British Islands” include the United Kingdom (Great Britain and Northern
Ireland), the Channel Islands, and the Isle of Man.
[3] Chris Paton, “FindmyPast,
Scottish census sources, and Moby Dick”, The
British GENES Blog, posted 27 Jan 2015 (http://britishgenes.blogspot.ie/2015/01/findmypast-scottish-census-sources-and.html
: accessed 16 Dec 2016).