It’s time to look at how we work with our sources, and the
impact that this has on citations. I expect many people to say that we all work
differently, but do we? If we fall into a small number of distinct cases then a
computer representation of our research, as opposed to merely our conclusions,
is an achievable goal.
Whether we like it or not, and irrespective of how we
conduct our research, there are different scopes within genealogy. Some
researchers are content with establishing their lineage or pedigree; some would
like to look at the history of their family; some have a much more general
interest in history, including the backdrop to their family’s lives, and the
micro-history of places, groups, and other subjects.
It’s difficult to break these into hard categories as it is
really a greyscale dependent upon our personal goals and interests. However, it
is much easier to categorise the fruits of our research. The consensus seems to
be that this is either conclusions or
evidence-and-conclusions, with
citation of sources being the differentiating factor, but this is certainly an
oversimplification, and maybe even a misuse of those same terms.
Let’s start by analysing the most simplistic of these scopes: that
of a plain family tree or pedigree. This would consist of an assemblage of
so-called “facts”: the names, dates, and places corresponding to the family’s
vital events, and their lineage-based relationships. Without any sources,
these “facts” are merely unsubstantiated claims, but what would source
citations add to them? As I recently commented on one of James Tanner’s posts (Why
not a ranking and review system for online family tree databases), it
doesn’t necessarily make the data any more accurate. I have seen many online
trees that cite census entries, or vital events, and yet are entirely wrong;
often with clearly impossible implications. Conversely, the absence of source
citations may mean that the tree was posted as "cousin bait" rather
than being a complete genealogy. At best, we might deduce that the inclusion of
citations means (a) that the data wasn’t simply copied from another tree, and
(b) that some effort was made to include that source information. However, the
ease with which online trees can add electronic citations — more accurately
described as electronic bookmarks (see Citations
for Online Trees) — weakens that latter deduction. Also, those electronic
citations are usually constrained to online data hosted by the same provider,
and so would not be a general mechanism.
A deeper issue is that these family-tree citations — whether
electronic or in traditional reference-note form — only work because the underlying data is a mere assemblage of
“facts”. A simple list of sources might constitute a proof summary, but that assumes that the evidence from those
sources is direct and non-conflicting for each claim. Dealing with the more
complex cases is often referred to as Inferential
Genealogy, but the representation of these cases, such as my establishing the
parentage of Sarah Hunt in the latter part of My
Ancestor Changed Their Surname, cannot suffice with a plain citation, or
even with a group of plain citations. If there isn’t a direct relationship
between a “fact” and its source then you need a proof argument, and that may require a little more than just
narrative. Although you would write such a proof argument using narrative, it may
need to make correlated references to multiple subjects, such as people, and to
multiple sources of information. If the online tree allowed you to upload this
narrative as plain text then it would have to be associated with a specific
person, or family, and the essential structure and relevance to other subjects
would be lost as a result.
On the surface of it, this appears to be saying that it’s
not enough to say where your information came from, and that you must also indicate
how it relates to your claims. The issue is more subtle, though, for any
computer representation — including online trees — since the structure and
position of that proof argument is crucially important. Returning to the case
of Sarah Hunt’s parentage, an associated proof argument would be as relevant to
both of her parents as to herself, and it may include non-familial persons in
the general case, so where do you attach it? Also, simply identifying a Frances
Hunt as her mother because such-and-such wouldn’t be enough if there were
multiple people with that name in the same tree.
Ideally, we need something in between our conclusions and
the underlying sources; something that provides not only a link but also a
structured pathway explaining how and why. This is even an issue for trees that
want to cite sources that do not quite agree on someone’s date of birth.
Obviously a person should not have multiple birth events recorded, but it must
be possible to trace any selected date, or date-range, back to some correlation
of those source differences. As explained in Hierarchical
Sources, when fellow researchers examine your data, it is the combination
of your proof argument and its sources that they should be interested in, rather
than just your subjective conclusions. They would want the option to form different
or modified conclusions, and so that full
story is a fundamental issue for any type of data sharing.
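As a minimal sketch of what tracing a selected date-range back to its sources might look like, the following keeps the correlated claims and the reasoning alongside the concluded range. All of the names here (SourceClaim, ConcludedRange, correlate) are hypothetical, the census and baptism values are invented for illustration, and intersecting the ranges is only one simple way of correlating; it is not STEMMA's representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceClaim:
    """What one source implies about a birth date (hypothetical structure)."""
    source: str      # citation label, invented for this example
    earliest: date
    latest: date


@dataclass(frozen=True)
class ConcludedRange:
    """A concluded date-range that still carries its supporting claims."""
    earliest: date
    latest: date
    basis: tuple     # the SourceClaim items that were correlated
    reasoning: str   # the narrative part of the proof argument


def correlate(claims, reasoning):
    """Intersect the claimed ranges; refuse if the sources cannot all be true."""
    earliest = max(c.earliest for c in claims)
    latest = min(c.latest for c in claims)
    if earliest > latest:
        raise ValueError("sources conflict; a fuller proof argument is needed")
    return ConcludedRange(earliest, latest, tuple(claims), reasoning)


claims = [
    # Age 30 on census night, 30 Mar 1851 (invented data)
    SourceClaim("census entry", date(1820, 3, 31), date(1821, 3, 30)),
    # Baptised 14 Jan 1821; the lower bound is an assumed placeholder
    SourceClaim("baptism register", date(1820, 1, 1), date(1821, 1, 14)),
]
birth = correlate(claims, "Born before baptism; census age sets the lower bound.")
print(birth.earliest, birth.latest)
```

A fellow researcher who disagreed with the reasoning could then re-run the correlation with different claims rather than having to accept the concluded range as given.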
Moving away from the simplistic scope of a family tree, and along
that greyscale to the more historical pursuits, reveals something quite
profound: a change of emphasis and a different approach. Since such researchers
are no longer looking for discrete “facts” to add to their existing tree,
they’re much more interested in anything and everything from a given source.
This change of emphasis results in the source being the main focus — not the
tree — and so the citation of that source is formed much earlier, and is rarely
an afterthought. Having selected a relevant source, such as a diary, will, military
record, letters, or an old book, there’s typically an assimilation process
of deconstruction and interpretation — which is what I refer to by the title of
this article: source mining.
If I were to describe this source-mining process as locating
the subject references[1] in
the source, identifying their documented properties (e.g. person’s age,
person’s occupation, place type), identifying their documented relationships
(including person-to-person, person-to-place, place-to-place, etc), and then
incorporating that information into your main historical data — a process that
would require correlation with other analysed sources, and resolution of
conflicts or other differences — then how many people would identify with that
approach? Or, putting it another way, how many people’s approach would be
substantially different?
My contention is that this general approach is more common
during historical research. The converse is where you are asking a specific
question, or making a specific claim, and might be described as a goal-directed
approach. In the aforementioned case of family trees, such goals would
include finding data for a vital event, identifying a marriage partner, or
finding offspring of a couple; the tree effectively sets those goals. Rather
than suggesting that a goal-directed approach doesn’t exist, I’m suggesting
that it depends on the scope of the research, and that it becomes less common
as the research scope broadens. I believe historical researchers may still have
specific interests, such as a particular person, family, village, or event, but
would be more reliant on the serendipity of each source than answering specific
questions.
If this contention is true then it has implications for the digital
representation of our research, and for the support of any standard of
research. With no animus intended, let me select GEDCOM X as an example. This
data model made an admirable attempt at supporting a research process, but this
was primarily a goal-directed process. The page at GEDCOM
X and the Genealogical Research Process describes the first phase as:
“Question Asking: The research process begins with a focused research
question”. Irrespective of whether the model can find a way of representing the
source-mining approach, its initial design was based on a more restricted
concept of research.
Am I suggesting that the Genealogical
Proof Standard (GPS) is not applicable to source mining? Well, that is a
really good question. It is true that discussions of the GPS nearly always
present it in the context of goal-directed research, such as establishing the
truth or falsity of some claim, but its core principles would still apply to
the incorporation of the mined data into your other data. Assessment of
information based on the nature and provenance of its source, resolution of
conflicts, etc., would be just as relevant.
At the time of writing, I am working towards a
representation of mined data that I hope will provide a crucially missing piece
in the STEMMA data model. Back in STEMMA
V3.0, I introduced the concept of a References element in which the details
of the subject references were assembled into prototype subject entities describing
persons, places, groups, etc., and their relationships to each other, using the
information from a given source. Since this was dealing with the documented
properties and relationships, and it allowed me to analyse them and to
correlate them with other sources, it was a good starting point for source
mining — except that it was in entirely the wrong place! It was part of the
Event entity, but that embraced conclusions and so was too high on the structured pathway mentioned above. Source
mining is largely a bottom-up approach, and the References element was designed
to facilitate the extraction and summarisation of source information in a
manner from which inferences and conclusions could be built. For instance, a
documented name may not have been someone’s registered name, a documented age
may have been estimated, a relationship of “cousin” could have meant a lot of
different things, and to say that one place was “within” another wouldn’t
necessarily identify either of them; some analysis and correlation is required. From
this perspective, those prototype entities were fairly similar to Personae,
except that I had generalised the concept to include other subject types, and I
had endeavoured to keep their shared context. A strong criticism I have of the
accepted persona concept is that it extracts names, and other details, from different
sources and treats them all equally, and in isolation from their original
context, including the background context of the source information, the nature
of the source itself, and the documented relationships between subjects (not
just persons) in the same source.
This work is still in its early stages, but it will involve
moving those References elements into some new entity that bridges between my
sources and my conclusions. Part of the problem is that information must be
associated with its relevant context, which generally equates to a where, when, and by-whom (or as
much of it as can be deduced), but any single source may have multiple contexts.
An obvious case would be a diary, but other cases may involve information
making reference to prior events. You then have the context of the body of
information and the context of the reported or recollected events within that
body.
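One way to picture a source with multiple contexts is as a small tree, where the context of the body of information contains the contexts of the events it reports. This is only a sketch under my own assumptions: the Context class and its field names are hypothetical, and the letter used as an example is invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    """A where, when, and by-whom; any part may be unknown (None)."""
    label: str
    where: Optional[str] = None
    when: Optional[str] = None
    by_whom: Optional[str] = None
    nested: list = field(default_factory=list)  # contexts reported within this one

    def walk(self, depth=0):
        """Yield (depth, context) for this context and everything nested in it."""
        yield depth, self
        for child in self.nested:
            yield from child.walk(depth + 1)


# Invented example: a letter of 1920 recalling a wedding of 1895
letter = Context(
    label="letter body",
    when="1920",
    by_whom="the letter writer",
    nested=[Context(label="recollected wedding", when="1895")],
)

for depth, ctx in letter.walk():
    print("  " * depth + f"{ctx.label}: when={ctx.when}, by_whom={ctx.by_whom}")
```

Keeping the nesting explicit means that information about the recollected event can still be assessed in light of who reported it, and when.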
The following schematic diagram illustrates the current direction
of this work, noting that it’s still subject to revision. The source material would describe local
material (where you may have an original or an image copy) or remote material (such
as a cited work or document), or both when they’re related. The new entity
would identify the different contexts within the source and assemble prototype
subject entities, together with their documented properties and relationships.
I say documented because these would
not be conclusions at this point. Hence, if the relationship between a
particular person reference and a place reference was that of “present at” then
I wouldn’t assume that it was their residence. Those subject references would
be connected to appropriate points in any transcriptions of the material, thus
providing a connection all the way from a conclusion entity (or associated
datum), through the correlation between different sources, through the analysis
of the different contexts within a given source, and finally to an actual
textual reference.
A source sentence that I’d used as a simplistic example on a
FHISO mailing list at Entity
Relationships went as follows:
"In 1963, John Smith, of 10
Front St, brother of Simon Smith of Woodstown, married Ann Jones".
The obvious context is the year 1963. It has three person
references (John Smith, Simon Smith, and Ann Jones) and two place references
(10 Front St and Woodstown), none of which have been specifically identified at
this stage. There are relationships indicated between the person references
(brother-of, and married) but also relationships between the persons and places
(expressed here as “of”).
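The breakdown above can be sketched in code. This is a hedged illustration rather than STEMMA syntax: the class names (SubjectRef, DocumentedRelation, SourceContext) and the local keys such as p1 and pl1 are hypothetical, chosen only for this example. The relationships are stored exactly as documented, with no interpretation applied, and each reference can be located in the transcription.

```python
from dataclasses import dataclass, field

SENTENCE = ("In 1963, John Smith, of 10 Front St, brother of Simon Smith "
            "of Woodstown, married Ann Jones")


@dataclass(frozen=True)
class SubjectRef:
    """A subject as mentioned in the source; not yet identified with anyone."""
    key: str    # local label, meaningful only within this source
    kind: str   # "person" or "place" in this example
    text: str   # the wording as it appears in the source

    def span(self, transcription):
        """Return (start, end) offsets of the reference in the transcription."""
        start = transcription.index(self.text)
        return start, start + len(self.text)


@dataclass(frozen=True)
class DocumentedRelation:
    """A relationship as documented, e.g. 'of' rather than 'resided at'."""
    subject: str
    relation: str
    obj: str


@dataclass
class SourceContext:
    when: str
    refs: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def refs_of_kind(self, kind):
        return [r for r in self.refs if r.kind == kind]


ctx = SourceContext(
    when="1963",
    refs=[
        SubjectRef("p1", "person", "John Smith"),
        SubjectRef("p2", "person", "Simon Smith"),
        SubjectRef("p3", "person", "Ann Jones"),
        SubjectRef("pl1", "place", "10 Front St"),
        SubjectRef("pl2", "place", "Woodstown"),
    ],
    relations=[
        DocumentedRelation("p1", "of", "pl1"),
        DocumentedRelation("p1", "brother-of", "p2"),
        DocumentedRelation("p2", "of", "pl2"),
        DocumentedRelation("p1", "married", "p3"),
    ],
)

for ref in ctx.refs:
    start, end = ref.span(SENTENCE)
    print(ref.key, ref.kind, repr(SENTENCE[start:end]))
```

Note that nothing here says who John Smith is, or which Woodstown is meant; those identifications would come later, from correlation with other sources.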
What this representation may not deal with is the concept of
multi-tier
personae, and the equivalent for non-person subject types. They would be
relevant to the ‘correlation’ item in the diagram but I’m yet to be convinced
that a formal entity representation is any better than a narrative explanation.
I’m suggesting that research is primarily source-based,
working upwards from the information we find in each source, and that the
converse case, where we have a specific question, is less frequent, except for
family trees where the acquisition of details for vital events constitutes such
questions. Consequences of that goal-directed case are that the source details
(including its assessment as well as its citation) become a secondary consideration,
and also that other information from the same source may be ignored. The sharing of
research information, as opposed to just conclusions, must take account of this
bottom-up approach, but there are currently no comprehensive mechanisms for
this level of sharing. GEDCOM only shares conclusions, and those attempts that have
been made to share research have become mired in the goal-directed notion.
Whilst there may be a lot of variety in how readers approach their own source-based
research, the digital representation would try to encapsulate the core elements
in a manner that could be built upon to support correlation with other sources
and the generation of conclusions.
[1] This is STEMMA
terminology for references to subjects
in a source, i.e. person references, place references, group references, etc.
The distinction between subject
references and subject entities (in
the digital representation) was proposed as a clean break from the more
contentious evidence person and conclusion person on the FHISO mailing
list at: The
Preferred Vocabulary. This
suggestion sank without a trace as it was attached to some unrelated post, but
it also needs rethinking as there are three distinct representations rather
than just two; prototype subjects being the third.