What is a source? When are sources independent and when are
they not? How do citations describe related sources? These questions may seem
to have obvious answers, but a detailed analysis of the relationships is
essential when working electronically with sources.
Traditional genealogical work may take these relationships
for granted, but in software they are too often reduced to a mere reference-note
citation added at the end of some narrative, or an electronic bookmark added at
some point in a tree. By ‘working electronically with sources’, I mean source-based
genealogy using a computer, and so the organisation of entities related to
both sources and citations is then a fundamental consideration. The use of
computer software as a tool during research, rather than simply for the
maintenance of some database afterwards, is still not the norm, but it will be
— one day — and not before time.
Figure 1 - Anatomy of a Source.
Let’s begin with some basic knowledge to set the scene. I
will frequently refer to terms and concepts from the works of Elizabeth Shown
Mills, and also Thomas W. Jones, which I hope will be obvious enough that I
don’t have to cite every single occurrence.
Expressed as succinctly as I can, information is semantic
data (data with meaning), and a source is the someone, something, or somewhere
from which the information was obtained. Citations are statements that identify
such sources, and these are most commonly recognised as the sentences used in
footnotes/endnotes. In truth, a footnote/endnote may contain multiple citation
sentences, each referencing a distinct source, but we’ll ignore that here for
simplicity. For the curious, an example may be found at Cite
Seeing.
A citation has a number of purposes: intellectual honesty
(not claiming prior work as your own), allowing your sources to be independently
assessed by the reader, and allowing the strength of your information sources
to be assessed. To this end, there are a number of core principles to good
citations: identification, description, and evaluation. The identification of
the source is what existing software tends to focus on most, and it amounts to
naming it and citing its location. If we include the location of the specific
information within the source then these details are conveniently summarised as
the five Ws: who, what, when,
where-is, and where-in.
One of the most important mechanisms in a citation sentence
is layers:
the segments separated by semicolons that are used to describe the provenance
of the source information, or of the source itself, and to provide analytical
notes. When only one copy of a source exists, or when it is very rare with only
a few copies existing, then citing the repository is the correct thing to do —
this contrasts with widely published materials such as books and newspapers.
When a source identifies the source of its own information then it is usually
termed the source-of-the-source,
and the corresponding layer would indicate this with a preceding “citing” or
“the author cites”. However, provenance may have been determined by independent
means, or there may be a mixture as in this example:
“Literary Miscellanea: Sketch of a Railway ‘Navvy’ ”, book extract, Bath Chronicle and Weekly Gazette (23 June 1859): p.6; citing: [Samuel Smiles,] The Story of the Life of George Stephenson [, Railway Engineer (London: John Murray, 1859)]; abridged by the author from the original and larger work: The Life of George Stephenson (1857).
A point to note is that a layer isn’t just another source;
it could also be a repository, it could be provenance details, or it could be
analytical notes. This means that layers cannot be modelled by simply linking
software entities that describe sources.
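To make that point concrete, here is a minimal sketch in Python of a citation held as a list of typed layers rather than as a chain of linked sources. The names LayerKind, Layer, and render are my own inventions for illustration, not part of any existing data model, and the layer text is abbreviated from the example above.

```python
from dataclasses import dataclass
from enum import Enum


class LayerKind(Enum):
    # Hypothetical kinds, taken from the paragraph above.
    SOURCE = "source"
    REPOSITORY = "repository"
    PROVENANCE = "provenance"
    NOTE = "analytical note"


@dataclass(frozen=True)
class Layer:
    kind: LayerKind
    text: str          # preformatted text of this layer
    lead_in: str = ""  # e.g. "citing" for a source-of-the-source


def render(layers):
    """Join the layers with semicolons, as in a reference-note sentence."""
    parts = [f"{l.lead_in}: {l.text}" if l.lead_in else l.text for l in layers]
    return "; ".join(parts) + "."


# Abbreviated layers from the Bath Chronicle example.
navvy = [
    Layer(LayerKind.SOURCE,
          "Bath Chronicle and Weekly Gazette (23 June 1859): p.6"),
    Layer(LayerKind.SOURCE,
          "Samuel Smiles, The Story of the Life of George Stephenson",
          lead_in="citing"),
    Layer(LayerKind.PROVENANCE,
          "The Life of George Stephenson (1857)",
          lead_in="abridged by the author from the original and larger work"),
]

print(render(navvy))
```

Only two of the three layers here are sources in their own right; the last is a statement of provenance, which is why a plain source-to-source link would not be enough.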
In one of the papers submitted to FHISO[1], a
set of desirable properties was presented for citations. One of these (section
2.10) is described as Canonical, and
it suggests that citations should be one-to-one with sources. Expressed
symbolically for two sources, S1 and S2, and their
respective citations, C(S1) and C(S2), this would mean
that S1 = S2 and C(S1) = C(S2) are
equivalent statements. However, real-life citations can never be that precise.
A case where this immediately breaks down is when the same
citation is expressed in different languages for users from different locales.
In other words, even for two absolutely identical sources, with identical
provenance and evaluation, there could be several different citations required
for the same information. Going further, there may be a need to support
different citation styles (e.g. CMOS, MLA), or different variants of a
reference note for the first and subsequent usage. These facts mean that any
organisation of software entities must support a one-to-many relationship
between sources and citations.
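A minimal sketch of that one-to-many relationship, assuming a hypothetical Source class that keys each preformatted citation by language, style, and mode. The class, its methods, and the placeholder strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A single source holding many renderings of its citation."""
    title: str
    # Key is (language, style, mode); value is the preformatted citation text.
    citations: dict = field(default_factory=dict)

    def add_citation(self, language, style, mode, text):
        self.citations[(language, style, mode)] = text

    def cite(self, language, style, mode):
        return self.citations[(language, style, mode)]


chronicle = Source("Bath Chronicle and Weekly Gazette, 23 June 1859")
# Placeholder strings: real text would follow the chosen style guide.
chronicle.add_citation("en", "CMOS", "first", "<full reference note>")
chronicle.add_citation("en", "CMOS", "subsequent", "<short reference note>")
chronicle.add_citation("fr", "CMOS", "first", "<full reference note, in French>")
```

Because three different strings all belong to the one source, comparing citation text can never be a reliable test of whether two sources are the same.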
Real sources are not all equivalent, even when the
associated information has a common origin. Sources may be categorised as
original, derivative, or authored works. The last of these is effectively a hybrid:
original opinions and conclusions, but based on derivative sources of information. An original source is material in its
first oral or recorded form, and a derivative source is one produced by copying
an original or by manipulating its content, such as transcripts, abstracts,
extracts, translations, and databases. Image copies are a special sub-category of
derivatives that include digital scans and photographic facsimiles. Since they
should capture the information exactly as it was then they are often treated
the same as originals. However, they are still technically derivatives since
the contrast may be lacking, or a film may be scratched, or the resolution too
low; even the loss of colour in a monochrome image may have removed essential
information. This doesn’t necessarily mean that a copy is always poorer than an
original since you could have a damaged original and a copy that was made
before the damage occurred — each case has to be evaluated on its own terms.
One of the really difficult areas to deal with — and one
I’ve been slowly building up to with this summary of basic knowledge — is where
derivatives were formed by manipulation of the content. This is because there
are so many ways that they can be formed, and the associated chain of
provenance can be difficult to determine. This subject was recently covered by
Sue Adams in a series of blog-posts culminating in The
Original in Context. For my own example, I’m choosing the baptism of an
Amelia Kirk at Nottingham St Mary in 1809. The Nottinghamshire Archive has the
original parish registers, and also the bishops’ transcripts: those hand-copied
derivatives that were sent to the diocesan centre each year. The archive also
has image copies of the parish registers on microfiche. The Nottinghamshire
Family History Society (NottsFHS) has a searchable database of transcribed
details that may be purchased on CD. Findmypast has incorporated a copy of the
NottsFHS transcriptions into their own online databases. Ideally, the information
in these derivative forms should be identical to that in the original, but that
would be a rare event indeed.
Given that the information should (in principle) be the
same, and ideally will not differ too much, how best would copies of
alternative derivatives be organised so that they can be worked on
(electronically) together? The derivation path may not be a straight line — there
could be branches — and so you may have different copies that you want to
compare and contrast. Treating them as wholly independent would be both
wasteful and inaccurate. In other words, the required organisation must
acknowledge the common origin of the information while still allowing the
sources to have their own citations, their own evaluations, and their own
resources (digital images, paper-based images, textual transcriptions, etc.).
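As a rough sketch of a branching derivation path, the Amelia Kirk sources can be held as a simple derived-from mapping and walked back to their common origin. The names are hypothetical, and one link is a guess on my part: the text above does not say what the NottsFHS transcription was made from, so I have assumed the parish register.

```python
# Each source maps to the source it was derived from (None marks the original).
DERIVED_FROM = {
    "parish register": None,
    "bishops' transcript": "parish register",
    "microfiche image": "parish register",
    "NottsFHS database": "parish register",  # assumption, not stated above
    "Findmypast database": "NottsFHS database",
}


def derivation_path(name):
    """Return the chain from a source back to its original."""
    path = [name]
    while DERIVED_FROM[path[-1]] is not None:
        path.append(DERIVED_FROM[path[-1]])
    return path


def same_origin(a, b):
    """True when two sources trace back to the same original."""
    return derivation_path(a)[-1] == derivation_path(b)[-1]


print(derivation_path("Findmypast database"))
print(same_origin("bishops' transcript", "Findmypast database"))
```

The mapping forms a tree rather than a straight line, yet every branch can still be grouped under the one original for side-by-side comparison.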
There are two main ways that source references can be
related to similar source references: by containment and by derivation. We’ve
just looked at the derivation case where the information has a common origin
but has been copied or manipulated along the way; containment is the case where
you’re looking at a part of a larger source unit. The most common example of this
may be when referring to specific pages or chapters in a book, but it can apply
to many sources, including separate households or schedules on a given census
page. One requirement, here, is that it must be possible for the software to
know that two source references are to parts of the same unit (or item, in
archival terms), or that one of them is, itself, a reference to that larger
unit. For example, that two source references are to pages in the same book, or
that one is to the book as a whole.
When working with information from different parts of a
source, each distinct where-in reference will have an associated context which
must, at least, specify the where and
when. For instance, if citing pages
from a biography then they may mention the subject when he was in a particular
city, and during a particular time frame. In a census page, information for two
different households would have been taken at the same time but they would have
distinct addresses. This ability to dissect a source, and to characterise
references according to their context, is essential when assimilating information
for later analysis.
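The containment requirement, together with the where and when context, might be sketched as follows. SourceRef, same_unit, and is_whole_unit are hypothetical names, and the census page identifier and addresses are invented placeholders rather than real records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    unit: str           # identifier of the larger source unit (book, census page)
    where_in: str = ""  # part within the unit; empty means the unit as a whole
    where: str = ""     # context: place the information relates to
    when: str = ""      # context: date the information relates to


def same_unit(a, b):
    """True when both references point into the same larger unit."""
    return a.unit == b.unit


def is_whole_unit(ref):
    """True when the reference is to the unit itself, not to a part of it."""
    return ref.where_in == ""


# Invented example: two households on one census page, plus the page itself.
page = SourceRef(unit="census-page-A")
house1 = SourceRef("census-page-A", "schedule 1", "1 Example Street", "1851-03-30")
house2 = SourceRef("census-page-A", "schedule 2", "2 Example Street", "1851-03-30")

print(same_unit(house1, house2), is_whole_unit(page))
```

The two household references share a unit and a date but carry distinct addresses, which is exactly the distinction that the context must preserve.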
I’ll first look at STEMMA’s
source-citation relationships. As of V4.0, the Source entity connects to
multiple Citation entities and/or multiple Resource entities (e.g. media
files). This structure has gradually evolved through trying to model my
existing source-based research.
Figure 2 - STEMMA Source entity.
Looking at each of the functional issues mentioned above:
Support for derivatives. The Source entity brings together source information that has
a common origin so that it can be compared and contrasted, and generally
assimilated in one place. In other words, it is not representing one unique source,
but sources with a common information origin. The <Frame> context section
embraces citations for all of those corresponding sources.
Support for containment. The SourceLet sections allow the dissection of a source into
parts with a related context, and the associated citations would refer to
specific where-in parts of the sources already identified outside of the
SourceLets. Each SourceLet can also provide where
and when contextual details for the associated information. The option to
have two tiers of SourceLets (one for related derivatives and one for those
specific where-in parts) was not taken, for the sake of simplicity.
Support for citation language, modes, and styles. The Citation entity supports sets of preformatted
citation text strings in alternative languages and citation modes (e.g. first
reference note and subsequent ones). There is currently no support for
alternative citation styles such as CMOS or MLA. These preformatted strings are
all optional but the same Citation entity also supports a mandatory set of
citation-elements, such as author, title, and publisher in the case of a book.
These may be tagged with semantic types from alternative taxonomies, such as
Dublin Core, and inclusively so if desired.
Support for analysis.
Terms such as quality, reliability, and credibility may be used to describe a source or some information
obtained from it. Analysis of the source information as a whole, including the
derivatives with the same origin, is all done within a single Source entity.
Although comments and observations on those individual derivatives can still be
made, they are not divorced from the context of that shared provenance.
Support for layers.
The Citation entity models the layers in a citation using its parent hierarchy
(see below). This is possible because its extreme flexibility allows it to
model any of the following: a simple citation, a repository, provenance
details, and even attribution. The fact that the layers are supported by the
Citation entity, and not by the Source entity, is dictated by the observation that
layers are not just other sources.
As of V4.0, the citation layers may be characterised
according to the terms in the following table. Scheduled for V4.1 is an
additional Reworked category.
Layer-type      | Comments
AbstractOf      | A brief summary or a précis of --
Citing          | Information cited by the source. Source-of-the-Source.
Comment         | Analytical comments.
ConsultedAs     | Consulted through derivative, usually online or in database
ExtractOf       | Extracted portion from --
ImagedAt        | Consulted through general image copy
MediaCopy       | Media conversion from --
Provenance      | Other provenance information, differing from ‘Citing’.
Repository      | Location of original source.
ReworkOf        | Revised, abridged, or otherwise modified from --
TranscriptionOf | Transcribed details from --
TranslationOf   | Translated details from --
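To illustrate how a parent hierarchy can carry those layer types, here is a minimal Python sketch. It is not STEMMA syntax: the Citation class, its relation field, and the layers function are assumptions of mine, and only the layer-type names come from the table. The example uses the bishops’ transcript, its parish register, and the archive from the Amelia Kirk case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LayerType(Enum):
    ABSTRACT_OF = "AbstractOf"
    CITING = "Citing"
    COMMENT = "Comment"
    CONSULTED_AS = "ConsultedAs"
    EXTRACT_OF = "ExtractOf"
    IMAGED_AT = "ImagedAt"
    MEDIA_COPY = "MediaCopy"
    PROVENANCE = "Provenance"
    REPOSITORY = "Repository"
    REWORK_OF = "ReworkOf"
    TRANSCRIPTION_OF = "TranscriptionOf"
    TRANSLATION_OF = "TranslationOf"


@dataclass
class Citation:
    text: str
    relation: Optional[LayerType] = None  # how this layer relates to its parent
    parent: Optional["Citation"] = None


def layers(citation):
    """Walk the parent chain, yielding (relation to previous layer, text)."""
    relation = None
    while citation is not None:
        yield relation, citation.text
        relation, citation = citation.relation, citation.parent


archive = Citation("Nottinghamshire Archive")
register = Citation("Parish register, Nottingham St Mary",
                    LayerType.REPOSITORY, archive)
transcript = Citation("Bishops' transcript, Nottingham St Mary",
                      LayerType.TRANSCRIPTION_OF, register)

for relation, text in layers(transcript):
    print(relation.value if relation else "-", "|", text)
```

The top of the chain is a repository rather than a source, so the same structure accommodates layers of either sort.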
In GEDCOM, there are Repository, Source Repository Citation, Source Record, and Source Citation
records. These are designed to implement a relatively straightforward, but
limited, citation→source→repository model with scope for supporting
bibliographic citations, but not for source analysis or source-based genealogy.
GEDCOM-X has a wider focus than GEDCOM, although the current
draft specification is known to be incomplete in this area at the time of
writing.
Figure 3 - GEDCOM-X SourceDescription entity.
Although the SourceDescription relates to a unique source,
there are sets of linkages to other sources related by derivation and by containment.
It’s probably too early to see how this would work in practice, but the fact
that ‘analysis documents’ are tied to individual SourceDescriptions would
suggest that some collective analysis of derivatives would be difficult to
organise. Also, the treatment of parts of a source (i.e. containment) by
distinct SourceDescriptions would seem to compound the issues of collective
analysis.
There is no specific support for layers; they cannot be
handled by those derivation links since, as I’ve already mentioned, layers are
not just other sources. Each SourceDescription may have multiple citations
supporting different languages but there is no explicit consideration of alternative
citation modes (e.g. first/subsequent), or styles (e.g. MLA). The
SourceCitation entity is not hierarchical, and there are currently no
citation-elements in either SourceCitation or SourceDescription.
The SourceReference appears to be provided solely to allow
the attribution information of a SourceDescription to be overridden.
While software developers may understand what I’ve written
here, I know there will be a significant number of other people thinking ‘I
don’t get it. My software already handles sources’. There’s probably a good
reason for this and I want to mention it in rounding-off this article.
A scenario that those same people might relate to
more closely would be adding a marriage date to their tree. They
might have found a record with the date and place recorded so they add it to
their tree and then include a citation for the record — or an electronic
bookmark if found in the online databases of the tree’s host. Where does my
“organisation of entities related to both sources and citations” fit into that?
Well, in that scenario, it has little relevance. As my second paragraph suggested,
most software only thinks about sources and citations as they appear in this
limited type of scenario. But if we’re doing source-based
genealogy then that organisation is a fundamental requirement.
Traditional genealogists, and good researchers everywhere, would
look at more than the one date and place mentioned in such a source; they would
look at other information, and the associated context of that information. If
anything caught their eye as potentially significant, or requiring further research,
then they would note it. That may happen in their heads (yikes!), or be written
with pencil & paper, or captured on their computer using a text editor
(e.g. Notepad) or some rich-media editor (e.g. Evernote). That note-taking is
essentially what I mean by “assimilation” of the source information, and my
approach to source-based genealogy is just including the note-taking and the initial
analysis into my main genealogy program. There’s no onus on me to draw
conclusions, or to attach any of it to a tree; it is a working area where
sources can be dissected, and information partially digested so that I can find
it and use it later.
I personally find this very natural since my career has
hitherto involved developing and using cutting-edge software, but I also
appreciate that the majority would not feel as comfortable relying on software
to this extent, especially when most of it appears to be form-fill data entry
of conclusions, and any basic methodology, or representation of real-life
scenarios, is denied.
** Post updated on 19 Apr 2017 to align with the changes in
STEMMA V4.1 **
[1] Luther Tychonievich, "Desirable Citation Properties", FHISO Call For Papers, CFPS 112 (http://fhiso.org/files/cfp/cfps112.pdf : accessed 15 Dec 2015); this paper was not listed on the main 'papers received' page (http://tech.fhiso.org/cfps/papers) but was referenced from other papers.