Some interrelated topics to be discussed in this article:
What is a hierarchical source? How
does it relate to a hierarchical arrangement and to provenance? Is one
hierarchy enough? Does it affect our citations? Does it affect digital
organisation?
You may be thinking that these are unrelated topics but
let’s just begin with a question commonly posed in genealogical forums and
mailing lists: ‘how do I organise my media files?’. This usually translates
into ‘how do I name my files?’ or ‘how do I arrange the corresponding folder
hierarchy?’, and an example may be found at
‘How should I Organise My Digital Documents?’.
Most people have at least tried to organise their digital
artefacts (i.e. document files and media files) by surname, and then realised
how impractical that is. For instance, should a marriage certificate be organised
by the groom’s or the bride’s surname? Whose surname do you use for a group
photograph that includes several generations of relatives and in-laws? What do
you do when you have inherited a photograph of ‘woman holding a baby’ but you
haven’t yet formed a positive identification?
Back in May 2013, Sarah Ashley presented her solution at
‘Organizing Your Genealogical Documents’. This was to use a source-based scheme where the
documents were each assigned sequential 4-digit identifiers while being
scanned. Different computer folders were then used to store the different
categories of material, such as vital events, newspaper articles, photographs,
census pages, etc., and the individual files were named using the corresponding
identifiers. In the following June, Louis Kessler presented an improved version
of this at ‘Source Based Document Organization’, where he suggested a hierarchical
organisation, and also
the use of the GEDCOM REFN tag (defined as: “A description or number used to
identify an item for filing, storage, or other reference purposes”) for linking
to the relevant documents.
These are good schemes but neither one suggests how those
flat or hierarchical identifiers should be allocated, nor whether (when
material was copied from some external source) there should be any relationship
to external cataloguing of the original or an online version. Also, in the case
of physical rather than digital artefacts, how should someone deal with a
collection donated by, or inherited from, another family member?
The answer to these issues can be found in
archival science.
Archivists have been doing this for years, and they have international
standards and a well-established vocabulary. In particular,
provenance is a core principle of archival science, and it has two fundamental
concepts: respect des fonds (basically grouping records according to their
creator, or fonds) and original order (basically maintaining the same record
order as that of their creator). In effect, I’m suggesting that we should manage
our physical and digital artefacts as a micro-archive.
In May 2013, Sue Adams produced an excellent description of
this approach on her Family Folklore blog at
‘Provenance of a Personal Collection – Archival Accession, Arrangement and Description’.
She explained that archival arrangement places all the items at positions in a
hierarchy reflecting their provenance, usage, and physical structure. An
archival description would then provide
information that served to identify, manage, locate and explain the archival
materials at each level.[1]
This definition is important because the resulting catalogue should identify
information about an item rather than
information within an item, and so not
express any analysis or conclusions — more on this later.
The
International Standard Archival Description (ISAD) defines a model for the
levels in a hierarchical arrangement, and this includes fonds, sub-fonds,
series, sub-series, files, and items[2];
items being the lowest level. A fonds (silent ‘d’ and ‘s’) is
a term for a grouping of documents that have been naturally accumulated by an
individual, family, or organisation as a result of their normal activities or
work. This replaces much of the usage of the older term collection, which is now
reserved for groupings that have been assembled rather than created. The
difference is effectively whether the grouping shares a common provenance or
merely a common characteristic.
Following Sue’s lead, I’ll select an example using a reference
to a page in the 1901 census of England
[3],
as held by The National Archives of the UK (TNA): piece 3191, folio 125, page
19. Their guide to citing their documents and catalogues presents the following
general document-reference formats:
[4]
dept-code series / piece
dept-code series / piece / item
The census folio and page number are actually internal
identifiers for those items, and so are relevant to a citation for some piece
of information but not to the cataloguing of the associated source. The result might
be something like:
RG 13/3191, f.125, p.19
Note that their recommendations involve the folio
abbreviations f./ff. rather than the fo./fos. ones that some readers may be
familiar with.
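As a rough illustration of how the separate elements combine into that reference, here is a minimal Python sketch. The function name and its arguments are my own invention for this example, and the layout simply mirrors the 'RG 13/3191, f.125, p.19' form shown above rather than any official TNA tool.

```python
def format_tna_census_ref(dept_code, series, piece, folio=None, page=None):
    """Build a reference such as 'RG 13/3191, f.125, p.19'.

    dept_code, series and piece identify the catalogued unit; folio and
    page are the internal locators that belong to the citation only.
    """
    ref = f"{dept_code} {series}/{piece}"
    if folio is not None:
        ref += f", f.{folio}"
    if page is not None:
        ref += f", p.{page}"
    return ref


print(format_tna_census_ref("RG", 13, 3191, folio=125, page=19))
# RG 13/3191, f.125, p.19
```

Leaving out the folio and page yields just the catalogue-level reference ('RG 13/3191'), which reflects the distinction between cataloguing a source and citing information within it.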
Evidence Explained covers the use of citations for
multi-levelled archival arrangements, and remarks that “Your citation should
follow the practice of the archive whose material you are using”.[5]
However, it also warns that this may lead to conflicting styles when dealing
with international sources, such as whether elements should be sequenced
large-to-small or small-to-large.[6]
It has been suggested, more than once, that digital images should
contain elements of meta-data that detail their provenance, and that this would
greatly help when people have downloaded otherwise-untraceable images from
online sources. The ubiquitous copying and downloading of digital images makes
it nigh on impossible to know where they first came from, and who they should
be attributed to. There is nothing technically impossible about this. For
instance, the XMP meta-data design can be embedded in several image and document
formats without hindering applications that read them. It uses
namespaces to make it
applicable to any number of distinct meta-data sets, and it even has an
international standard: ISO 16684-1:2012. Unfortunately, XMP is still a
registered trademark of Adobe Systems Inc., and this has probably limited its
take-up. One of the safer (i.e. more portable) alternatives is something called
sidecar files, where
the meta-data is held separately from the associated data by using a second file
with a related name.
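To make the sidecar idea concrete, here is a small Python sketch. The naming rule (the image name plus '.json'), the JSON format, and the function names are all assumptions chosen for this illustration; real sidecar conventions vary between tools.

```python
import json
from pathlib import Path


def sidecar_path(image_path):
    """Return the sidecar name for a file: the same name plus '.json'."""
    p = Path(image_path)
    return p.with_name(p.name + ".json")


def write_sidecar(image_path, metadata):
    """Store the meta-data beside the image, leaving the image untouched."""
    path = sidecar_path(image_path)
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def read_sidecar(image_path):
    """Load the meta-data previously stored for the image."""
    return json.loads(sidecar_path(image_path).read_text(encoding="utf-8"))
```

Calling write_sidecar for an image named '0042.jpg' would create '0042.jpg.json' in the same folder. The obvious weakness is that the two files can become separated when only the image is copied.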
There is a good case for using something like XMP in this image-copying
scenario, but it becomes less useful for your micro-archive because (a) it will
likely contain physical as well as digital artefacts, and (b) the meta-data
will be applicable to different units in the hierarchy, and not simply to the
lowest-level items. Where a specific arrangement of the artefacts has been
created then it is more common to use a meta-data database. However, STEMMA’s
approach is to use its own file format to create machine-readable archival
descriptions. Its files are plain-text, and its Resource entities may be used
to describe each of the levels of a hierarchical arrangement. By incorporating
its inheritance mechanism, this allows the description, provenance, access
control, and any amount of meta-data to be provided for each unit in a single text
file that can be loaded and referenced by other STEMMA files.
STEMMA has two main entity types relevant to this
discussion:
- Resource
— a representation of a digital or physical artefact, or a combination of
these such as when you have a scan of an original letter in your
possession, or a digital photograph of a set of medals.
- Citation
— despite its name, this is a generalised reference to sources or
information held elsewhere. It includes the location of information within
a source, as well as the location of a source itself, and even allows for
the representation of attribution.
Both of these entities share a mechanism of
parameterisation. This means that they can both define a number of named
parameters, each having its own specific data-type. These can be used in a
similar way to the REFN tag, mentioned above, but with significant advantages.
That GEDCOM tag allows only for a single amorphous code; a code that cannot be
decomposed or reverse-engineered. Having the elements of a hierarchical
reference in separate parameters allows them to be used in a more powerful
fashion. For instance, returning to Sue’s example, the initial levels of her
arrangement might be described using the following entities:
<Resource Name='rRWC' Abstract='1'>
  <Title>Raymond Walter Coulson (1922-1997) collection</Title>
  <Params>
    <Param Name='Lev1'>CWC</Param>
    <Param Name='File'/>
    <Param Name='Folder1'>collections/${Lev1}</Param>
  </Params>
  <URL>file:${Folder1}/${File}</URL>
  <Text>
  Papers, photographs, correspondence, memorabilia and probate documents of Raymond Walter Coulson of 322 Aston Hall
  Road, Aston, Birmingham, who died intestate on 24 May 1997.
  </Text>
</Resource>
<Resource Name='rRWC_Probate' Abstract='1'>
  <Title>Probate file</Title>
  <BaseResourceLnk Name='rRWC'/>
  <Params>
    <Param Name='Lev2' Type='Integer'>1</Param>
    <Param Name='Folder2'>${Folder1}/${Lev2}</Param>
  </Params>
  <URL>file:${Folder2}/${File}</URL>
  <Text>
  Compiled by [my dad], administrator for the estate of Raymond Walter Coulson, between May 1997 and January 1998.
  </Text>
</Resource>
These effectively construct a hierarchy of Resource entities
describing the various units in that arrangement. An individual file, such as a
marriage certificate, could then be specified through its file name and the
Resource entity representing that archival unit. Yes, the folder names could
have been hard-coded, and the Resource entities crafted independently of each
other, but using the inheritance mechanism introduced in
Genealogical
Inheritance makes it more flexible and maintainable. Each Resource inherits
an accumulated set of parameters from the higher levels.
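The effect of that accumulation can be imitated in a few lines of Python. This is only a sketch of the idea and not how STEMMA software is actually implemented: the Resource class and its methods are hypothetical, and I am assuming that each parameter refers only to parameters defined before it. The names and values are taken from the two entities above.

```python
from string import Template


class Resource:
    """A unit in the arrangement, optionally inheriting from a base unit."""

    def __init__(self, name, params, url, base=None):
        self.name = name
        self.params = params  # parameter name -> template text (None if unset)
        self.url = url        # URL template using ${...} placeholders
        self.base = base      # the Resource this one inherits from, if any

    def inherited_params(self):
        """Accumulate parameters from the top of the hierarchy downwards."""
        merged = dict(self.base.inherited_params()) if self.base else {}
        merged.update(self.params)
        return merged

    def resolve(self, **supplied):
        """Expand the URL template using inherited plus supplied values."""
        raw = self.inherited_params()
        raw.update(supplied)
        resolved = {}
        # Expanded in definition order, so each parameter may refer
        # only to parameters that were defined before it.
        for key, value in raw.items():
            text = "" if value is None else str(value)
            resolved[key] = Template(text).substitute(resolved)
        return Template(self.url).substitute(resolved)


rwc = Resource(
    "rRWC",
    {"Lev1": "CWC", "File": None, "Folder1": "collections/${Lev1}"},
    "file:${Folder1}/${File}",
)
probate = Resource(
    "rRWC_Probate",
    {"Lev2": 1, "Folder2": "${Folder1}/${Lev2}"},
    "file:${Folder2}/${File}",
    base=rwc,
)

print(probate.resolve(File="marriage-certificate.jpg"))
# file:collections/CWC/1/marriage-certificate.jpg
```

Changing the top-level folder name in one place would then be reflected in every unit beneath it, which is the maintainability argument made above.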
The parameter mechanism is a general-purpose tool, and may
be used to add specific items of meta-data that you want to separate out of the
associated archival description. One of the developers independently writing
software around the STEMMA specification recently presented me with a related
question. He was transferring photographic slides to a digital organisation,
and wanted to know how to deal with dates written on each slide frame. Since
parameters can be defined freely, I pointed out that a Resource parameter could
be defined for this purpose with a specific data-type of ‘Date’. In a Citation
entity, the parameters may be used to define citation elements; those discrete
values that would be later formatted into a traditional reference-note
citation.
Since both Resource and Citation entities share this
parameterisation mechanism, it is also possible to pass parameters from one
to another. Imagine, for instance, that the citation for the aforementioned
census page had parameters for the piece, folio, and page. If you had a local
image copy of it then it could be located using the same parameter values,
either substituted into a file name or a folder hierarchy. They could even be
used to interrogate a Web site in order to summon the census image on demand
(see ‘rCensusImage’ example at
Resource).
[7]
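A minimal Python sketch of that reuse follows. The folder layout, the file-naming pattern, and the example.org address are purely hypothetical, since no real provider's URL scheme is being described; the point is only that one set of citation parameters can drive both lookups.

```python
from pathlib import PurePosixPath
from urllib.parse import urlencode

# Citation parameters for the 1901 census page discussed earlier.
citation_params = {"piece": 3191, "folio": 125, "page": 19}


def local_image_path(params, root="census/1901/RG13"):
    """Locate a local image copy (hypothetical folder and file naming)."""
    file_name = f"f{params['folio']}_p{params['page']}.jpg"
    return PurePosixPath(root) / f"piece_{params['piece']}" / file_name


def lookup_url(params, base="https://example.org/census/1901"):
    """Build a query URL for an imaginary site offering page lookup."""
    return f"{base}?{urlencode(params)}"


print(local_image_path(citation_params))
# census/1901/RG13/piece_3191/f125_p19.jpg
print(lookup_url(citation_params))
# https://example.org/census/1901?piece=3191&folio=125&page=19
```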
We’ve mentioned the hierarchy of a source inherent in its
archival arrangement, but are there any other examples of a source hierarchy? Well,
the chain of data provenance when we cite a source (that is, the relationship
between records and the individuals or organisations that have created, maintained,
reproduced, transcribed, indexed, or otherwise modified them) also constitutes a
hierarchy. When we cite a derivative source, such as an online edition, or some
database, then we usually cite the source of the source in a secondary fashion.
[8] Provenance also applies to specific
information as well as to a source or source data. A common example is when
we’re citing an author who is citing other works; ones they have consulted but
which we haven’t. We may feel that the provenance of the information is
important to our case, but we cannot directly cite what we haven’t consulted.
This scenario is covered in some detail by
Evidence
Explained[9], but
consider the case where an author hasn’t cited their source, but we believe we
have identified an earlier version of their claim or statement. This may be
very important to our case, especially if there are subtle differences, but a
simple comment in a reference note may be insufficient to encompass our
justification and reasoning.
An important point here is that these forms of hierarchy are
facets of the real world, and not some subjective notion that software might
decide to support or ignore. This issue was recently discussed on the
FHISO TSC-Public mailing list starting at
Filing
Sources. STEMMA’s Citation entity was endowed with two types of
hierarchical linkage: ParentCitationLnk, in order to model provenance (see
Cite
Seeing), and BaseCitationLnk, in order to model the structure of groupings
and the structure within a given source (see
Genealogical
Inheritance).
So, both local materials and consulted materials held
elsewhere, including any associated digital images of them, can be represented
using some combination of Resource and Citation entities. The core genealogical
data is where we would analyse those materials and form our conclusions, and that
will necessarily require links or citations to those materials. However,
materials should never be catalogued according to such conclusions since they
may change. If you’re cataloguing a photograph of ‘woman holding a baby’, or a
painting of ‘a cracked vase with daisies’, then it must be independent of
opinions or conclusions. Even their archival description must only record what
we know about the materials rather than something we’ve determined from their
contents. Our core genealogical data will also need to reference these
materials from multiple points, and in different ways — something that renders a
simple name-based arrangement redundant.
This article is placing great emphasis on both our sources
and the local artefacts in our own micro-archives, but why? Isn’t one
arrangement as good as another? Why do we need to be concerned with provenance,
or with the arrangement used by some archive? The answer to this would be
obvious to an archivist, or to an historian, but less so to most genealogists.
The problem is that the majority of genealogy — and especially where it involves
online family trees — is people-centric. The pursuit very often boils down to
that of searching for a person’s name, or the vital events of a named person,
and since the results will mostly come from online data — data that is
deliberately keyed on personal names — then it has some consequences:
- The source of the information
is an afterthought. Although some Web sites allow a researcher to tag data
with links to their relevant online content, that is merely an electronic
bookmark (in the form of a URL) and not a real citation.
- Even when a researcher
references a source, it is only in the context of a citation. The belief
that ‘the data is the answer’, as opposed to ‘the source contains a
clue’, means that any reasoning towards a considered argument
is being short-circuited.
Contrast this with the way someone might approach historical
research, where individual sources are assimilated and relevant items analysed
and correlated with information from other sources. That style of research
begins with a source rather than with a name. The origin, nature, and quality
of the source are then very important factors during its analysis.
This same point was recently raised by Jan Murphy on the
FHISO TSC-Public mailing list at
‘Format for Raw Source Content’, and I’ll leave you with her own words, which
were designed to keep the software mindset focused on the real world:
I hate to keep arguing this point
over and over again, but we are looking at documents and other source
material. We are not looking at people. We are looking at sources,
most of which (but not all) contain names.
A lot of beginning researchers,
including many of the people in the Genealogy Do-Over group, struggle to learn
how to cite their sources, and why? Because if you work in a people-centric
system the sources are always an afterthought.
** Post updated on 19 Apr 2017 to align with the changes in
STEMMA V4.1 **
[2] “Model of the levels
of arrangement of a fonds”, ISAD(G), appendix A-1, p.36.
[3] Whereas these TNA census references apply to England & Wales, they do not apply to Scotland.
Scotland has its own system (see http://www.scotlandspeople.gov.uk/)
and this has caused issues for sites such as findmypast that try to provide a UK-wide search form. The Ancestry equivalent only solicits
criteria such as piece/folio/page when specifically searching, say, the census
of England, but findmypast currently
solicits them in all UK cases, whether relevant or not. See Chris Paton’s views
on this at FindmyPast
- Scottish censuses.
[5] Elizabeth Shown Mills, Evidence
Explained: Citing History Sources from Artifacts to Cyberspace, 2nd ed.
(Baltimore, Maryland: Genealogical Pub. Co., 2009), pp.116–119.
[6] E. S. Mills, sec.3.3
“International Differences”.
[7] A reliable, non-internal URL for summoning the image of a particular census
page is a nice idea that could help when sharing data with friends and relatives,
or between a researcher and their clients, without the paranoia associated
with T&Cs or copyright.
Although the recipient would need a subscription to the site, the idea could be
adopted by other providers to create a sort of genealogical "Open
URL" variation. It would be quite easy for them to offer because they
already have form-fill functionality that achieves the same type of lookup. However, the idea is strangely ignored.
[8] E. S. Mills, p.180
under “Citing the Source of a Source”.
[9] E. S. Mills, sec.2.21
“Citing the Source of a Source”.