This is the final part of my trilogy on the philosophy
behind STEMMA V4.0. Part
I covered its application to both arboreal (tree) genealogy and event-based
genealogy, while Part
II covered narrative genealogy; I now want to expand on its support for source-based genealogy.
Source-based genealogy is both a research orientation and an
organising principle where the source is the primary focus. Most software, and especially most Web sites, is focused on conclusions; users are asked to provide names, dates, and locations without having to indicate where or how
their information originated. At best, they might be given the opportunity to
retrospectively add some citation or electronic bookmark. When starting with
the source, though, all the relevant resources (images, documents, artefacts)
can be organised according to the source’s provenance and structure, a citation
can be created as soon as information is acquired, and the information can be
assimilated before you decide how and where to use it.
People like Louis Kessler have advocated source-based
genealogy for several years, and the term itself has displaced the more
naïve notion of evidence-based genealogy.
Since evidence is only information that we believe supports or refutes some claim, a focus on evidence alone would ignore the source of the information,
and any context or other information therein. For instance, imagine that you
have used certain information from a source to substantiate a particular claim.
What happens if you later feel that the same source might help with a different
claim, made elsewhere? Do you have to assimilate the contents all over again in
order to decide whether that’s true or not? How would you become aware of its
possible relevance?
Let’s think about how source-based genealogy might work, conceptually, and especially about the assimilation phase. Anyone who remembers studying textbooks in preparation for an
examination may also recall annotating pieces of text: underlining phrases or
circling sections that we believed were going to be important, and which we
wanted to ensure that we fully grasped. What we were doing was reinforcing our
mental model, or mind-map, and creating structure and order from the text.
I’m sure you’ve all seen detective films, or TV series,
where someone solves a complex puzzle using notes and images on a pin-board
with string connecting the pieces together. This technique really does exist,
and it’s called link
analysis. It’s a type of graphic organiser used
to evaluate relationships (or connections) between nodes of various types, and
it has been used for investigation of criminal activity (fraud detection,
counterterrorism, and intelligence), computer security analysis, search engine
optimization, market research, and medical diagnosis. Although most online
sources present it in terms of software implementations, it is much older,
possibly going back before WWII.[1]
Figure 1 - Conceptual link diagram.
This rather imaginative depiction of the concept illustrates
some of the advantages and benefits quite nicely. The information of interest
in each source is somehow marked out and used to build concepts, prototype subjects[2], conjectures, and steps of logic — whatever the researcher wants from them. Because of the
freedom in choosing those pieces, this method is as useful to my
non-goal-directed Source
Mining as it is to goal-directed research. Essentially, one approach is
aimed more at collecting material to paint a history or biography of subjects,
whereas the other is focused on solving a specific problem or proving one or more claims.
When those pieces are connected all the way up to the
conclusion entities in your data, they also provide a trail by which a user could drill down[3] in order to see how a conclusion was arrived at, what information was used as
evidence, and where that information originally came from.
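By way of illustration only (the structure and names below are mine, not part of any data model), the whole idea boils down to a graph of typed nodes joined by labelled connections, through which someone can drill down from a prototype subject to the underlying source fragments:

```python
# A minimal, purely illustrative sketch of a link chart: typed nodes
# (source fragments, prototype subjects, conjectures) joined by labelled
# connections. None of these names are STEMMA terms; they are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str          # e.g. "fragment", "prototype-person", "conjecture"
    label: str
    links: list = field(default_factory=list)   # (relationship, target) pairs

    def link(self, relationship, target):
        self.links.append((relationship, target))

# Two fragments of information, each marked out from a different source.
frag1 = Node("fragment", "1861 census: 'John Smith, 34, framework knitter'")
frag2 = Node("fragment", "baptism 1827: 'John, son of Wm. Smith'")

# A prototype subject built from those fragments, and a conjecture about it.
john = Node("prototype-person", "John Smith of Nottingham?")
conjecture = Node("conjecture", "census John = baptised John (age fits 1827)")

john.link("evidenced-by", frag1)
conjecture.link("draws-on", frag1)
conjecture.link("draws-on", frag2)
john.link("supported-by", conjecture)

# Drilling down from the prototype person back to the underlying fragments.
def drill_down(node, depth=0):
    print("  " * depth + f"[{node.kind}] {node.label}")
    for rel, target in node.links:
        print("  " * (depth + 1) + f"({rel})")
        drill_down(target, depth + 2)

drill_down(john)
```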
When a marked item is a person reference, it is
effectively the same as what’s often termed a persona,
and the ability to connect personae from different sources provides a way of
supporting multi-tier personae.
Although I have previously been critical of the use of personae because of the
separation of the person reference from its original source context —
including place, date, and relationship context that might be instrumental in
establishing the identity behind the reference — this approach retains a
connection to the source information, and even allows that connection to follow the subsequent use of a persona. This is particularly important because it appears
to be an inclusive handling of certain contrasting approaches advocated by
other data researchers: Tom Wetmore has long believed in personae, and Louis
Kessler in source-based genealogy, but those approaches have sometimes appeared
to be contradictory and have led to strong differences of opinion.
Part I of this series introduced STEMMA’s Source
entity as joining together citations and resources (such as transcriptions,
images, documents, and artefacts) for a given source of information, but it
encompasses much more than this. The semantic mark-up, described in Part II,
allows arbitrary items of information to be labelled in a transcription. This
includes subject references (person, place, animal, and group), event
references, date references, and any arbitrary word or phrase. Those labelled
items, called profiles, can be linked
together using simple building-blocks in order to add structure, interpretation,
or deduction to them. STEMMA doesn’t mandate the use of a visual link chart, or link diagram, since that would be something for software products
to implement, but it does include the essential means of representation.
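As a rough sketch of the concept (ordinary Python rather than STEMMA’s actual XML mark-up, with hypothetical names throughout), labelled spans of a transcription each acquire a profile, and those profiles are then connected with simple building blocks:

```python
# Conceptual sketch only: labelled spans of a transcription become profiles
# that can be linked together. The class and field names are hypothetical,
# not STEMMA's actual mark-up vocabulary.
from dataclasses import dataclass, field
from typing import List, Tuple

transcription = "John Smith married Mary Brown at St Mary's on 4 June 1851"

@dataclass
class Profile:
    kind: str                     # person / place / date / event / phrase
    text: str                     # the marked span, kept as written
    links: List[Tuple[str, "Profile"]] = field(default_factory=list)

def mark(kind, span):
    """Label an arbitrary span of the transcription as a profile."""
    assert span in transcription
    return Profile(kind, span)

groom  = mark("person", "John Smith")
bride  = mark("person", "Mary Brown")
church = mark("place",  "St Mary's")
date   = mark("date",   "4 June 1851")
event  = mark("event",  "married")

# Simple building blocks: connect profiles to add structure and interpretation.
event.links += [("principal", groom), ("principal", bride),
                ("where", church), ("when", date)]

for rel, p in event.links:
    print(f"{rel}: [{p.kind}] {p.text}")
```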
The Source entity normally specifies the place and date
associated with the information, but these can also be linked to profiles if
they have to be associated with some interpretation, or even some logic; for instance, when information relates to a place whose identity cannot be resolved beyond doubt from the place reference alone.
When a source comprises discrete or disjointed parts —
such as a book’s pages, or a multi-page census household — or it contains
anterior (from a previous time)
references — such as a diary, chronological narrative, or recollections during
a story — then smaller sets of linked profiles can be grouped within the Source
entity using SourceLet elements, and these may have their own place and date
context. Each of those discrete parts may have its own separate transcription and specific citation, although all are related to the same parent
source by containment — a subject for
a future presentation. The network of linked profiles can bring together information
and references from these different SourceLets for analysis.
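A crude way to picture this, using an invented census household and invented field names rather than real STEMMA elements, is a parent source containing separately citable parts, each with its own transcription and its own place and date context:

```python
# Hypothetical sketch of a multi-part source: each part (a "SourceLet" in
# STEMMA terms) carries its own transcription and place/date context while
# remaining contained within the parent source. All data here is invented.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Part:                          # stands in for a SourceLet
    citation: str
    place: Optional[str]
    date: Optional[str]
    transcription: str

@dataclass
class Source:
    citation: str                    # citation for the parent source
    parts: List[Part] = field(default_factory=list)

# A census household that spans two schedule pages: one parent source,
# two contained parts, each separately citable.
household = Source("1861 census, district 12, household 87 (invented example)")
household.parts.append(Part("... page 12", "Basford, Nottinghamshire",
                            "7 Apr 1861", "John Smith, head, 34 ..."))
household.parts.append(Part("... page 13", "Basford, Nottinghamshire",
                            "7 Apr 1861", "Sarah Smith, daur, 2 ..."))

for part in household.parts:
    print(part.citation, "->", part.transcription)
```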
The Source entity is a good tool for assimilating the
information from a given source in a general and re-usable way. However, that
information may need correlating with similar information from other sources,
and this process may need to be repeated for different problems and with
different goals. STEMMA accomplishes this with a related Matrix
entity[4]
that carries those networks of linked profiles outside of their source context
and allows them to be worked on together.
Figure 2 - Mechanics of a link diagram.
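In the same spirit as Figure 2, here is a very rough sketch (hypothetical names and structure, not the real Matrix definition) of profiles from two different sources being pulled together and correlated for one specific problem:

```python
# Illustrative sketch of correlation outside the individual sources: a
# "matrix", in the sense of a surrounding medium (see footnote 4), collects
# profiles from several sources so they can be worked on together.
# The names and structure are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Profile:
    source: str                      # which source the profile came from
    kind: str
    text: str

@dataclass
class Matrix:
    goal: str                                     # the problem being worked on
    profiles: List[Profile] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)  # written reasoning

    def correlate(self, *profiles, note):
        self.profiles.extend(profiles)
        self.notes.append(note)

census_john  = Profile("1861 census",  "person", "John Smith, 34")
baptism_john = Profile("1827 baptism", "person", "John, son of Wm. Smith")

m = Matrix(goal="Is the census John the John baptised in 1827?")
m.correlate(census_john, baptism_john,
            note="Ages agree (34 in 1861 fits a birth around 1827); same parish.")

print(m.goal)
for p in m.profiles:
    print(" -", p.source, ":", p.text)
for n in m.notes:
    print("reasoning:", n)
```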
Notice, as usual, that the building of these networks, and
the association of them with corresponding conclusion entities, is independent
of the relationship those entities have with their respective hierarchies and
events (Part I), and narrative (Part II). In other words, the four main approaches
to genealogy that I identified (arboreal, event-based, narrative, and
source-based) can be inclusive of each other.
Part I introduced the concept of Properties — items of
extracted and summarised information — that could be associated directly with
subject entities. The Source entity, which is part of the informational
sub-model rather than the conclusional sub-model, can also achieve this but
with additional flexibility. Because it works directly with source information, and is not obliged to make any final conclusions, references to incidental people, or otherwise unidentified subjects, can still be assimilated but
left in the Source entity for possible future use. The power of this should be
obvious where, say, a referenced person later turns out to be a relative or
in-law. Another difference is the vocabulary used to describe data and
relationships. The Property mechanism uses a normalised computer vocabulary so
that information can be consistently categorised as name, occupation,
residence, etc., and relationships can be categorised precisely as things like
spouse, mother, son, and so on. In the Source entity, though, what you record
and what you call it are free choices; if you encounter a relationship provided
as grandchild, nephew/niece, or cousin, where the interpretation may not be obvious,
then you can keep it as-written and work on it. For the masochists, a
comparison of these two mechanisms being applied to the same source may be
found at: Census
Roles.
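To make that contrast concrete, here is a toy comparison (the field names and values are invented for illustration, and belong to neither mechanism’s real vocabulary):

```python
# Sketch of the two vocabularies; field names invented for illustration.

# Property mechanism: a normalised, computer-readable categorisation.
property_style = {
    "name": "Mary Brown",
    "relationship-to-head": "Daughter",     # drawn from a controlled list
}

# Source entity: keep whatever the source actually said and interpret later.
source_style = {
    "as-written": "grand dauter",           # verbatim, spelling and all
    "note": "Grandchild, but through which child of the head is not yet clear",
}

print(property_style["relationship-to-head"], "|", source_style["as-written"])
```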
It might be said that ignorance of prior work is bad during
any type of research and development, but my software history is full of such
cases where it has yielded a route to genuine innovation.[5]
When I finally decided to look at whether other data models had addressed this
approach, I was surprised to find that the old GenTech project from 1998
had documented a similar approach. GenTech and STEMMA had both tried to build a
network of extracted information and evidence from “source fragments” — and
actually used this same term — but the similarities applied mainly to the
intention rather than to the implementation.
The GenTech
data model V1.1 is hard to read because it has no real examples, it
presupposes a concrete database
implementation — which I’m not alone in pointing out to be inappropriate in a
data model specification — and it talks exclusively about evidence, and analysing
evidence, rather than information.
The latter point is technically incorrect when assimilating data from source
fragments since the identification of evidence, or the points at which
information can be considered evidence, is dependent upon the researcher and
the process being applied rather than some black-and-white innate distinction.
GenTech’s ASSERTION statement is the core building-block for
its network. This is simply a 5-tuple[6]
comprising {subject-1 type/id, subject-2 type/id, value} that relates two
“subjects”. Those subjects are limited to its PERSONA, EVENT, GROUP, and
CHARACTERISTIC entities — concepts which differ from STEMMA’s use of the same
terms — and there are some seemingly arbitrary rules governing which of them can be specified together. This restricted vocabulary means that it does not clearly indicate
how its CHARACTERISTICs are associated with a particular time-and-place context
(I couldn’t even work out how); it has no orthogonal treatment of other
historical subjects (STEMMA
terminology), such as place, group, or animal; and it cannot handle source
fragments with arbitrary words and phrases. By contrast STEMMA’s profiles can deal with source fragments
containing references to persons, places, groups, animals, events, dates, or
arbitrary pieces of text. Its <Link> element is the low-level
building-block that connects these together, and to other profiles, but with
much more freedom. For instance, the equivalent of a multi-tier persona is
achieved by simply connecting two prototype-person profiles. GenTech uses its
GROUP entity to achieve this, and effectively overloads that entity for
grouping PERSONA and CHARACTERISTICs rather than using it only to model
real-world groups.
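To show the shape of the two approaches side by side, going only by the descriptions above rather than the full specifications, a sketch might look like this:

```python
# Rough renderings only; consult the GenTech V1.1 and STEMMA specifications
# for the real definitions.
from collections import namedtuple

# GenTech's ASSERTION as described above: a 5-tuple relating two typed
# "subjects" to a value.
Assertion = namedtuple("Assertion",
                       ["subject1_type", "subject1_id",
                        "subject2_type", "subject2_id", "value"])

# e.g. persona 17 is asserted to have the characteristic 'age 34'.
a = Assertion("PERSONA", 17, "CHARACTERISTIC", 102, "age 34")

# The STEMMA-style equivalent of a multi-tier persona, sketched as two
# prototype-person profiles joined by a link (names hypothetical).
census_profile  = {"kind": "prototype-person", "text": "John Smith, 34"}
baptism_profile = {"kind": "prototype-person", "text": "John, son of Wm. Smith"}
link = {"from": census_profile, "to": baptism_profile,
        "reason": "Believed to reference the same person; the ages agree."}

print(a)
print(link["reason"])
```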
Some other philosophical differences include the fact that
STEMMA profiles represent snapshots of information and knowledge at a
particular point in the assimilation process (or correlation process, in the
case of Matrix); the actual information effectively flows between those
profiles. This is hard to describe in detail, and I may save it for a later
post. Another difference is that the profiles allow steps of logic and
reasoning to be represented in natural language; the connections are not just a
bunch of data links or database record IDs. That text would be essential if
some user wanted to drill down to understand where a claim or figure originated,
and STEMMA allows multi-lingual variants to be written in parallel. Reading GenTech’s
section 1.4.2 suggests (to me) that its ASSERTION may have more in common with
STEMMA’s Property mechanism than with its Source and Matrix entities.
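As a tiny sketch of that idea (the real representation of parallel language variants in STEMMA is richer than this), the reasoning lives in ordinary prose rather than in bare record IDs:

```python
# Sketch: a step of reasoning kept as natural-language text, with parallel
# language variants, rather than as data links or "logic gates".
reasoning_step = {
    "conclusion": "The two Johns are the same person",
    "explanation": {
        "en": "The census age of 34 in 1861 fits a baptism in 1827, and "
              "both records place the family in the same parish.",
        "de": "Das Alter von 34 Jahren im Zensus von 1861 passt zu einer "
              "Taufe 1827, und beide Quellen verorten die Familie im "
              "selben Kirchspiel.",
    },
}
print(reasoning_step["explanation"]["en"])
```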
An interesting corollary is that conclusions are easily represented
in software data models, and they will usually employ precise
taxonomies/ontologies to characterise data (such as a date-of-birth, or a
biological father), or equivalent structures (such as a tree). In effect, these
conclusions are designed to be read by software in order to populate a database
or to graphically depict, say, biological lineage. Source information, on the
other hand, cannot be categorised to that extent — it was originally intended
to be humanly readable, it must be assimilated and correlated by a human, and all
analysis must be understood later by other humans.
There have been a number of attempts to represent the
logical analysis of source information using wholly computerised elements (see FHISO papers received
and Research
Process, Evidence & GPS) but these are far removed from employing real
text. As a result, they lose that possibility of drilling down from a
conclusion to get back to a written human analysis, and to the underlying
information fragments in the sources. While allowing analytic notes to be added
directly to data items might be one simplistic provision, connecting notes and
concepts together to build structure must
have written human explanation, not “logic gates” and other such notions. One
reason for these overtly computerised approaches could be that software
designers feel an onus on them to support “proof” in the mathematical sense
rather than in the genealogical sense, a result of possibly misunderstanding the terminology (see Proof
of the Pudding).
So what have I accomplished in this trilogy? I have given
insights into a published data model for micro-history that has orthogonal
treatment of subjects and inclusive handling of hierarchies, events, narrative,
and sources. Did I expect more? Well, I did hope for more, when I first started
the project, but it was not to be. There are people in the industry that I have
failed to engage, for whatever reason, and that means that the model will
probably finish as another curiosity on the internet shelf, along with
efforts such as GenTech and DeadEnds. Complacency and blind acceptance have left
genealogy in a deep rut. In the distant future — when technophobe
is simply a word in the dictionary — people of all ages will look back at this
decade and laugh at what our industry achieved with its software. When
paper-based genealogy (using real records) has probably gone down the same
chute as paper-based libraries, and we're left with software
genealogy and online records, then we'll wish that we had built bigger bridges
between those worlds. If our attitudes and perceptions don’t rise above the
horizon then we'll never see the setting sun until it’s too late, and we might
as well say: RIP Genealogy!
[1] Nancy Roberts & Sean F. Everton, "Strategies for
Combating Dark Networks", Journal of
Social Structure (JoSS), volume 12 (https://www.cmu.edu/joss/content/articles/volume12/RobertsEverton.pdf
: accessed 24 Oct 2015), under
“Introduction”; publication date uncertain but its References section suggests
2011.
[4] "An environment or material in which something develops; a
surrounding medium or structure", from Oxford
Dictionaries Online (http://www.oxforddictionaries.com/definition/american_english/matrix
: accessed 28 Oct 2015), s.v. “matrix”, alternative 1.
[5] Modern software
development is less about ground-up development and more about creating
solutions by bolting together off-the-shelf (e.g. open-source or proprietary) components.
Both have merit but the pendulum is currently in the other quadrant.