Thursday 15 November 2018

Organising Digital Resources



I want to take you on a brief tour of what it means to index your digital resources, and how this is a better method of organising them than creating physical connections. Although this article is primarily about online genealogical resources, it applies equally to local ones on your personal computer.

A surprising number of Web sites still try to organise resources using a physical hierarchy between their pages. For instance, one that had pages related to places might be organised according to the associated geographical hierarchy.


Figure 1 – Naive implementation of hierarchical organisation.

As explained at Impermanent Links, this is a bad method for multiple reasons including the fact that it ties the URLs to one particular layout, and that layout is not particularly useful from the perspective of the Web server (e.g. for maintenance).

An earlier article, Hierarchical Sources, explained that organising physical resources by their provenance and then indexing them according to how you want to see or access them is not only preferable but would be the archival approach. But what do I mean by "indexing"?

In order to explain further, I need to review some terminology because people from different backgrounds may use the same terms in dissimilar ways. People who are familiar with academic articles and journal submissions will understand the difference between keywords and index-terms: keywords are specific terms identified by the author — usually in the abstract rather than in the body — and which would be found through a full-text search; index-terms are categories or topics used externally to aid document retrieval. Index-terms are not usually chosen by the author and are part of a controlled vocabulary, meaning that there are no problems with synonyms, variant spellings, or name clashes. However, in the field of database indexes, the term key is a synonym of index-term, and so terms such as keyword and keyword-search may then be ambiguous.

So, to summarise things so far, it is better to organise resources according to their provenance, or their innate properties (as opposed to content), and to separately index them according to their subjects or categories. Grouping resources by provenance makes it easier to describe them (e.g. where and when they were obtained), to move/copy them, to supplement them (with related or new resources), or to support versioning. Any hierarchical organisation can then be done entirely through external indexing.

But aren't index-terms simply tags with no implied hierarchy? How would the place example (above) be handled? Index-terms are not tags; that description better fits the concept of keywords. Controlled index-term vocabularies are very often defined to be hierarchical, and in the place example "Nottinghamshire" would be a term subordinate to "England". Whether the category name itself, e.g. "Place", is considered the top level of the hierarchy is a design choice.

The definition of such a vocabulary would not only indicate which terms are subordinate to which, but may have associated meta-data that describes each term and enumerates any alternative spellings or historical variations. Such meta-data would therefore assist in the selection of an appropriate index-term, after which resources matching that term would be unambiguous. Note, too, that if the term names had variations in other languages (e.g. occupations) then the controlled vocabulary is untouched, and those variations are enumerated in the same meta-data; whether the user selected "butcher" or "boucher", the retrieval of the resource would still use the index-term "Butcher".
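As a rough illustration of that idea, here is a minimal Python sketch of a controlled vocabulary whose meta-data lists variant spellings. The Term structure, the resolve function, and the sample entries (including "tailleur" as a variant of "Tailor") are assumptions made for this example, not part of any particular system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Term:
    """One entry in a controlled vocabulary (hypothetical structure)."""
    name: str                          # the canonical index-term
    parent: Optional[str] = None       # the term this one is subordinate to
    variants: Tuple[str, ...] = ()     # alternative spellings or languages


# Illustrative entries only.
VOCABULARY: List[Term] = [
    Term("Occupation"),
    Term("Butcher", parent="Occupation", variants=("boucher",)),
    Term("Tailor", parent="Occupation", variants=("tailleur",)),
]


def build_lookup(terms: List[Term]) -> Dict[str, str]:
    """Map every canonical name and variant (case-folded) to its index-term."""
    lookup: Dict[str, str] = {}
    for term in terms:
        for label in (term.name,) + term.variants:
            lookup[label.casefold()] = term.name
    return lookup


LOOKUP = build_lookup(VOCABULARY)


def resolve(user_input: str) -> Optional[str]:
    """Return the canonical index-term for a user's choice, or None."""
    return LOOKUP.get(user_input.strip().casefold())
```

With this in place, resolve("butcher") and resolve("Boucher") both yield the single index-term "Butcher", so the vocabulary itself never changes when a new variant is recorded in the meta-data.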


Figure 2 – Organisation by two hierarchical indexes.

This diagram illustrates that a major advantage of this approach is that you can support multiple independent hierarchies. Organising resources according to both place and surname would not be possible using a single physical page hierarchy, and would be a maintenance nightmare if you tried to implement multiple physical hierarchies. A resource indexed by both the place terms "Epperstone" and "Screveton" could also be selected through the use of the common parent "Nottinghamshire", or even "England". Also, a resource relevant to both the surname Lincoln and the English town of Lincoln could be unambiguously indexed according to both with no confusion at all.
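The selection by a common parent can be sketched in a few lines of Python. The resource identifiers, the dictionary layout, and the function names below are hypothetical, chosen only to mirror the examples in the text; a real implementation would hold these in an external index rather than in code.

```python
# Each place term maps to its parent term; None marks the top of the hierarchy.
PLACE_PARENT = {
    "England": None,
    "Nottinghamshire": "England",
    "Lincolnshire": "England",
    "Epperstone": "Nottinghamshire",
    "Screveton": "Nottinghamshire",
    "Lincoln": "Lincolnshire",
}

# Hypothetical resources, each indexed by (dimension, term) pairs.
RESOURCE_TERMS = {
    "doc-001": {("Place", "Epperstone"), ("Place", "Screveton")},
    "doc-002": {("Place", "Lincoln"), ("Surname", "Lincoln")},
    "doc-003": {("Surname", "Lincoln")},
}


def self_and_ancestors(term, parent_of):
    """Yield a term followed by each of its ancestors up the hierarchy."""
    while term is not None:
        yield term
        term = parent_of.get(term)


def select_by_place(place):
    """Resources indexed by the given place or by any place beneath it."""
    return sorted(
        rid
        for rid, terms in RESOURCE_TERMS.items()
        if any(
            dim == "Place" and place in self_and_ancestors(term, PLACE_PARENT)
            for dim, term in terms
        )
    )


def select_by_surname(surname):
    """Resources indexed by exactly this surname (a flat index here)."""
    return sorted(
        rid for rid, terms in RESOURCE_TERMS.items()
        if ("Surname", surname) in terms
    )
```

Because each term carries its dimension, the place Lincoln and the surname Lincoln never collide: select_by_place("Lincoln") and select_by_surname("Lincoln") consult different indexes over the same untouched resources.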

As many hierarchical indexes could be used as necessary, and new ones could be added without touching the underlying resources.


Figure 3 – Organising by multiple hierarchical indexes.

As well as being separately hierarchical, these indexes are inclusive. That means resources can be selected based on whether they are associated with terms in different dimensions. For instance, a selection might combine the Nottinghamshire village of Coddington (expressed here using the shorthand path "England.Nottinghamshire.Coddington"), "Surname.Astling", and "Occupation.Tailor". Not only that, resources can be selected using criteria involving Boolean operators, such as "England.Nottinghamshire" AND NOT "England.Nottinghamshire.Epperstone".
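One way to picture these selections is with ordinary set operations. The following Python sketch assumes each resource is indexed by full dotted paths, as in the shorthand above; the resource identifiers and the matching function are invented for illustration.

```python
# Hypothetical index: resource identifier -> the term paths it is indexed by.
INDEX = {
    "res-1": {
        "England.Nottinghamshire.Coddington",
        "Surname.Astling",
        "Occupation.Tailor",
    },
    "res-2": {"England.Nottinghamshire.Epperstone", "Surname.Astling"},
    "res-3": {
        "England.Nottinghamshire.Coddington",
        "England.Nottinghamshire.Epperstone",
    },
    "res-4": {"England.Nottinghamshire"},
}


def matching(path):
    """Resources indexed by this path or by any path subordinate to it."""
    prefix = path + "."
    return {
        rid
        for rid, paths in INDEX.items()
        if any(p == path or p.startswith(prefix) for p in paths)
    }


# Terms from three different dimensions combined with AND (set intersection).
coddington_astling_tailors = (
    matching("England.Nottinghamshire.Coddington")
    & matching("Surname.Astling")
    & matching("Occupation.Tailor")
)

# "England.Nottinghamshire" AND NOT "England.Nottinghamshire.Epperstone"
# (set difference).
notts_except_epperstone = (
    matching("England.Nottinghamshire")
    - matching("England.Nottinghamshire.Epperstone")
)
```

In this sketch AND maps to set intersection, AND NOT to set difference, and OR would map to set union. Note that a resource indexed by both Coddington and Epperstone is excluded by the second query, which is one reasonable reading of AND NOT rather than the only one.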

Relational databases can easily handle this style of indexing, including multiple indexes and Boolean queries. Each match would simply yield a physical identifier for the associated resource, such as a document or page name.
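To make that concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name, column names, page names, and the dotted-path storage are all assumptions made for this example; a production schema would more likely keep the vocabulary in its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE resource_term (
        resource  TEXT NOT NULL,   -- physical identifier, e.g. a page name
        term_path TEXT NOT NULL,   -- index-term as a dotted path
        PRIMARY KEY (resource, term_path)
    )
    """
)
conn.executemany(
    "INSERT INTO resource_term (resource, term_path) VALUES (?, ?)",
    [
        ("page-1", "England.Nottinghamshire.Coddington"),
        ("page-1", "Surname.Astling"),
        ("page-2", "England.Nottinghamshire.Epperstone"),
        ("page-3", "England.Nottinghamshire.Coddington"),
        ("page-3", "England.Nottinghamshire.Epperstone"),
        ("page-4", "England.Nottinghamshire"),
    ],
)

# Matches a path itself or anything subordinate to it ("path." prefix).
UNDER = (
    "(term_path = :{0} "
    "OR substr(term_path, 1, length(:{0}) + 1) = :{0} || '.')"
)


def and_not(include, exclude):
    """Resources under `include` that are not also under `exclude`."""
    sql = (
        "SELECT resource FROM resource_term WHERE " + UNDER.format("inc")
        + " EXCEPT "
        + "SELECT resource FROM resource_term WHERE " + UNDER.format("exc")
    )
    rows = conn.execute(sql, {"inc": include, "exc": exclude}).fetchall()
    return sorted(row[0] for row in rows)


pages = and_not("England.Nottinghamshire", "England.Nottinghamshire.Epperstone")
```

Here the SQL EXCEPT operator plays the role of AND NOT, while INTERSECT and UNION would cover AND and OR. Each row returned is just a page name, so the pages themselves can be stored, moved, or renamed independently of how they are indexed.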