I want to take you on a brief tour of what it means to index
your digital resources, and how this is a better method of organising them than
creating physical connections. Although this article is primarily about online
genealogical resources, it equally applies to local ones on your personal
computer.
A surprising number of Web sites still try to organise resources
using a physical hierarchy between their pages. For instance, one that had
pages related to places might be organised according to the associated
geographical hierarchy.
Figure 1 – Naive implementation of hierarchical
organisation.
As explained at Impermanent
Links, this is a bad method for multiple reasons including the fact that it
ties the URLs to one particular layout, and that layout is not particularly
useful from the perspective of the Web server (e.g. for maintenance).
An earlier article, Hierarchical
Sources, explained that organising physical resources by their provenance
and then indexing them according to how you want to see or access them is not
only preferable but would also be the archival approach. But what do I mean by
"indexing"?
In order to explain further, I need to review some
terminology because people from different backgrounds may use the same terms in
dissimilar ways. People who are familiar with academic articles and journal
submissions will understand the difference between keywords and index-terms: keywords are
specific terms identified by the author (usually in the abstract rather than in
the body), which would be found through a full-text search; index-terms are
categories or topics used externally to aid document retrieval. Index-terms are
not usually chosen by the author and are part of a controlled
vocabulary, meaning that there are no problems with synonyms, variant
spellings, or name clashes. However, in the field of database indexes, the term
key is a synonym of index-term, and so terms such as keyword and keyword-search may then be ambiguous.
So, to summarise things so far, it is better to organise
resources according to their provenance, or their innate properties (as opposed to
content), and to separately index them according to their subjects or categories.
Grouping resources by provenance makes it easier to describe them (e.g. where
and when they were obtained), to move/copy them, to supplement them (with
related or new resources), or to support versioning. Any hierarchical
organisation can then be done entirely through external indexing.
But aren't index-terms simply tags with no implied hierarchy?
How would the place example (above) be handled? Index-terms are not tags;
that description better fits the concept of keywords. Controlled index-term
vocabularies are very often defined to be hierarchical, and in the place
example "Nottinghamshire" would be a term subordinate to "England".
Whether the category name itself, e.g. "Place", is considered a
top-level of the hierarchy is a design choice.
The definition of such a vocabulary would not only indicate
which terms are subordinate to which, but could also carry associated meta-data that describes
each term and enumerates any alternative spellings or historical variations.
Such meta-data would therefore assist in the selection of an appropriate
index-term, after which resources matching that term would be unambiguous.
Note, too, that if the term names have variations in other languages (e.g.
for occupations) then the controlled vocabulary is untouched, and those variations
are enumerated in the same meta-data; whether the user selects
"butcher" or "boucher", the retrieval of the resource would
still use the index-term "Butcher".
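To make the idea of a controlled vocabulary with meta-data a little more concrete, here is a minimal Python sketch. The names Term, Vocabulary, resolve, and path are my own inventions for illustration, as is the sample description text; this is one possible shape for such a definition, not a standard one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Term:
    """One entry in a controlled vocabulary (hypothetical structure)."""
    name: str                        # canonical index-term, e.g. "Butcher"
    parent: Optional[str] = None     # canonical name of the superior term
    description: str = ""            # meta-data describing the term
    variants: Tuple[str, ...] = ()   # alternative spellings or languages


class Vocabulary:
    """A set of terms plus a lookup from any variant to its canonical term."""

    def __init__(self, terms: List[Term]) -> None:
        self.terms: Dict[str, Term] = {t.name: t for t in terms}
        self._lookup: Dict[str, str] = {}
        for t in terms:
            for label in (t.name, *t.variants):
                self._lookup[label.casefold()] = t.name

    def resolve(self, text: str) -> Optional[str]:
        """Return the canonical index-term for user input, or None."""
        return self._lookup.get(text.strip().casefold())

    def path(self, name: str) -> List[str]:
        """Return the chain of terms from the top level down to `name`."""
        chain: List[str] = []
        current: Optional[str] = name
        while current is not None:
            chain.append(current)
            current = self.terms[current].parent
        return list(reversed(chain))


# Illustrative data only: the category name is used as the top level here.
occupations = Vocabulary([
    Term("Occupation"),
    Term("Butcher", parent="Occupation",
         description="Prepares and sells meat", variants=("boucher",)),
])

print(occupations.resolve("boucher"))   # canonical term for the French variant
print(occupations.path("Butcher"))      # hierarchy from the top level down
```

The variants live only in the meta-data, so adding another language or a historical spelling changes the lookup but never the index-terms that resources are filed under.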
Figure 2 – Organisation by two hierarchical indexes.
This diagram illustrates that a major advantage of this
approach is that you can support multiple independent hierarchies. Organising
resources according to both place and surname would not be possible using a
single physical page hierarchy, and would be a maintenance nightmare if you
tried to implement multiple physical hierarchies. A resource indexed by both
the place terms "Epperstone" and "Screveton" could also be
selected through the use of the common parent "Nottinghamshire", or
even "England". Also, a resource relevant to both the surname Lincoln
and the English town of Lincoln could be unambiguously indexed according to
both with no confusion at all.
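The following Python sketch shows one way that two independent hierarchical indexes might sit beside the same resources. The class HierarchicalIndex, the dotted-path convention for keys, and the resource identifiers such as "doc-001" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class HierarchicalIndex:
    """Maps dotted term paths to resource identifiers (illustrative only)."""

    def __init__(self) -> None:
        self._entries: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, path: str, resource_id: str) -> None:
        """Index a resource under one full term path."""
        self._entries[path].add(resource_id)

    def select(self, path: str) -> Set[str]:
        """Return resources indexed under `path` or any subordinate term."""
        prefix = path + "."
        found: Set[str] = set()
        for entry_path, ids in self._entries.items():
            if entry_path == path or entry_path.startswith(prefix):
                found |= ids
        return found


# Two separate indexes over the same (hypothetical) resources.
place = HierarchicalIndex()
surname = HierarchicalIndex()

# One resource indexed by two places.
place.add("England.Nottinghamshire.Epperstone", "doc-001")
place.add("England.Nottinghamshire.Screveton", "doc-001")

# "Lincoln" appears in both indexes without any clash.
place.add("England.Lincolnshire.Lincoln", "doc-002")
surname.add("Lincoln", "doc-002")
surname.add("Lincoln", "doc-003")

print(place.select("England.Nottinghamshire"))  # found via the common parent
print(surname.select("Lincoln"))                # surname index only
```

Because each index is its own object, the place called Lincoln and the surname Lincoln never meet, and the resources themselves are not touched when an index entry is added or removed.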
As many hierarchical indexes could be used as necessary, and
new ones could be added without touching the underlying resources.
Figure 3 – Organising by multiple hierarchical indexes.
As well as being separately hierarchical, these indexes are
inclusive. That means resources can be selected based on whether they are associated
with terms in different dimensions. For instance, one could select the resources
relevant to all of: the Nottinghamshire village of Coddington (expressed here using the
shorthand path "England.Nottinghamshire.Coddington"), "Surname.Astling",
and "Occupation.Tailor". Not only that, resources can be selected
using criteria involving Boolean operators, such as "England.Nottinghamshire"
AND NOT "England.Nottinghamshire.Epperstone".
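Selections of this kind reduce to ordinary set operations, as the following Python sketch suggests. The INDEX contents, the resource identifiers r1 to r4, and the select function are hypothetical, and I have kept the dimension name separate from the path rather than making it the top level.

```python
from typing import Dict, Set, Tuple

# Hypothetical index: (dimension, dotted path) -> resource identifiers.
INDEX: Dict[Tuple[str, str], Set[str]] = {
    ("Place", "England.Nottinghamshire.Coddington"): {"r1", "r2"},
    ("Place", "England.Nottinghamshire.Epperstone"): {"r3"},
    ("Place", "England.Nottinghamshire"): {"r4"},
    ("Surname", "Astling"): {"r1", "r3"},
    ("Occupation", "Tailor"): {"r1", "r4"},
}


def select(dimension: str, path: str) -> Set[str]:
    """Resources indexed under `path`, or below it, in one dimension."""
    prefix = path + "."
    found: Set[str] = set()
    for (dim, entry_path), ids in INDEX.items():
        if dim == dimension and (entry_path == path
                                 or entry_path.startswith(prefix)):
            found |= ids
    return found


# Terms in different dimensions combined with AND: set intersection.
coddington_astling_tailors = (
    select("Place", "England.Nottinghamshire.Coddington")
    & select("Surname", "Astling")
    & select("Occupation", "Tailor")
)

# "England.Nottinghamshire" AND NOT "...Epperstone": set difference.
notts_except_epperstone = (
    select("Place", "England.Nottinghamshire")
    - select("Place", "England.Nottinghamshire.Epperstone")
)

print(coddington_astling_tailors)
print(notts_except_epperstone)
```

An OR between two terms would be a set union (the `|` operator) in the same style.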
Relational databases can easily handle this style of
indexing, including multiple indexes and Boolean queries. Each match would
simply yield a physical identifier for the associated resource, such as a
document or page name.
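As a rough illustration of the relational approach, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names term and resource_term, the column names, the sample page names, and the function find_resources are assumptions for the example; a real schema would probably also hold the descriptive meta-data for each term.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE term (
    id        INTEGER PRIMARY KEY,
    dimension TEXT NOT NULL,          -- e.g. 'Place', 'Surname'
    path      TEXT NOT NULL,          -- dotted path within the dimension
    UNIQUE (dimension, path)
);
CREATE TABLE resource_term (
    resource TEXT    NOT NULL,        -- physical identifier, e.g. page name
    term_id  INTEGER NOT NULL REFERENCES term(id),
    PRIMARY KEY (resource, term_id)
);
""")

con.executemany("INSERT INTO term VALUES (?, ?, ?)", [
    (1, "Place", "England.Nottinghamshire.Coddington"),
    (2, "Place", "England.Nottinghamshire.Epperstone"),
    (3, "Surname", "Astling"),
])
con.executemany("INSERT INTO resource_term VALUES (?, ?)", [
    ("page-a.html", 1), ("page-a.html", 3),
    ("page-b.html", 2), ("page-b.html", 3),
])

# (place AND surname) AND NOT excluded place, matching subordinate places too.
QUERY = """
SELECT rt.resource FROM resource_term rt JOIN term t ON t.id = rt.term_id
 WHERE t.dimension = 'Place'
   AND (t.path = :inc OR t.path LIKE :inc || '.%')
INTERSECT
SELECT rt.resource FROM resource_term rt JOIN term t ON t.id = rt.term_id
 WHERE t.dimension = 'Surname' AND t.path = :surname
EXCEPT
SELECT rt.resource FROM resource_term rt JOIN term t ON t.id = rt.term_id
 WHERE t.dimension = 'Place'
   AND (t.path = :exc OR t.path LIKE :exc || '.%')
ORDER BY 1
"""


def find_resources(include_place, surname, exclude_place):
    """Return the identifiers of resources matching the Boolean criteria."""
    params = {"inc": include_place, "surname": surname, "exc": exclude_place}
    return [row[0] for row in con.execute(QUERY, params)]


print(find_resources("England.Nottinghamshire", "Astling",
                     "England.Nottinghamshire.Epperstone"))
```

SQLite evaluates INTERSECT and EXCEPT from left to right, so the query reads as (place AND surname) AND NOT the excluded place. Storing the full path in each row is only one of several ways to represent a hierarchy in a relational table.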