Parallax View ®: 2015

Saturday, 19 December 2015

Organising Photographs

The question of how to organise your photographs in your file store, or even your general genealogical files, is a frequent one. Everyone has their own preferred scheme, but I want to try and add a different perspective on this subject. The suggestions I will make will use Windows as an example, but similar techniques will be possible elsewhere.

The ultimate brick wall that everyone will stub their toes on is that there are multiple ways of categorising their photographs, and a simple filename cannot adequately embrace them all. For instance, naming them by person (but which person if there’s a group), or by surname (again, which surname in, say, a wedding group), by event, by place, or by date. Every attempt to achieve this using just the filename will be a compromise of some sort.

Figure 1 - There are different ways of grouping the same pictures.

I have written, before, that my own choice is to group the files by their provenance, and then rely on a software application to present them in other ways, and to associate them with descriptions, stories, timelines, and so on: Hierarchical Sources. There are some issues with using a specialised application, but we will come to that in a moment.

Keywords

Another option is to use keywords, such that each picture can have an arbitrary number of keywords relating to personal names, surnames, places, events, dates, or whatever you want. This is good for finding related pictures in a large set, but is it ideal for organising them in a browsable way? Although products such as Adobe Photoshop Lightroom organise pictures by keyword, this is just another type of specialised application, and so has the same issues that I hinted at above; the keywords really need to be a core feature of the operating system.

Windows has a feature called Tags which are effectively user-defined keywords that can be added to your files, but prior to Windows 7 only Microsoft Office documents supported them, and there weren’t many tools to make use of them. Before I talk about them, let me first present a bargain-basement equivalent that would work under, say, Windows XP (yes, there are still people who use XP). For illustration, let’s assume we have a folder containing the following three image files:

Ann_Jones_1985.jpg

Jane_Smith_1980.jpg

Joan_James_1983.jpg

By dividing-up a filename using a character such as underscore, you’re effectively providing sets of keywords. Their order is not really important as files can be found no matter where the relevant keywords appear. Words can be grouped together to create compound keywords by using a different character, such as Joan-Smith_John-James_Marriage_1970.jpg.
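As a minimal sketch of that naming convention (in Python, and not part of any tool mentioned here), the function below splits a filename into its keywords. The function name and the choice to turn the hyphen of a compound keyword into a space are assumptions made purely for illustration.

```python
import os

def filename_keywords(filename, sep="_", joiner="-"):
    """Split a filename into keywords; 'joiner' binds words into one compound keyword."""
    # Drop any folder path and the extension, leaving only the descriptive stem.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Each separator-delimited part is one keyword; a hyphenated part stays together.
    return [part.replace(joiner, " ") for part in stem.split(sep) if part]

print(filename_keywords("Joan-Smith_John-James_Marriage_1970.jpg"))
# ['Joan Smith', 'John James', 'Marriage', '1970']
```

Because the result is just a list, the order of the parts in the filename has no effect on whether a given keyword is found.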

The old Windows XP Search box could search on multiple filename parts, separated by semicolons, and this would achieve a search-by-keyword.

Figure 2 - Searching by keyword under Windows XP.

In this example, the search is looking for all files with either “Jane” or “James” in their name. The actual ordering of the keywords in a filename might be chosen so that the default sorting achieves some vaguely useful grouping.

So how did this change under Windows 7? For a start, its new-style Search box allows Boolean operators so that you can now type “Jane OR James” (equivalent to the XP example, above) or “Jane AND James” (for which none of the three example files would have matched).
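The difference between the two operators can be sketched in a few lines of Python. This is only a model of the behaviour described above, not how Windows implements its search: it assumes a case-blind "appears anywhere in the filename" test, and the function names are hypothetical.

```python
FILES = ["Ann_Jones_1985.jpg", "Jane_Smith_1980.jpg", "Joan_James_1983.jpg"]

def matches(filename, term):
    """Case-blind test of whether a term appears anywhere in the filename."""
    return term.lower() in filename.lower()

def search_any(files, terms):
    """Model of 'Jane OR James': keep files matching at least one term."""
    return [f for f in files if any(matches(f, t) for t in terms)]

def search_all(files, terms):
    """Model of 'Jane AND James': keep files matching every term."""
    return [f for f in files if all(matches(f, t) for t in terms)]

print(search_any(FILES, ["Jane", "James"]))  # the Jane and Joan_James files
print(search_all(FILES, ["Jane", "James"]))  # empty: no single file has both
```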

Another change in Windows 7 was that the support for file Tags was greatly increased. The file Properties dialog, on its Details tab, will show a Tags field if the current file-type supports them. Clicking to the right of the Tags label allows you to enter multiple keywords, separated by semicolons, and these are then hidden away inside the file’s meta-data.

Figure 3 - Entering keyword Tags in Windows 7.

In the Search box, where we had previously typed separate filename parts, we can now use terms such as “tag:Jane”, and it will then search for files with those Tags rather than ones with particular words in their filenames.

Figure 4 - Searching by Tags in Windows 7.

Again, we can use the Boolean operators to say something like “tag:Jane OR tag:James”. OK, so what are the advantages of this scheme over the bargain-basement one using just the filename? Both schemes allow Boolean operators, and both operate case-blind. However, those Tags are discrete items of meta-data and so leave you to name the file any way you want. Also, the Tag names are matched as complete words and so there’s no risk of an accidental match, such as “Ann” matching “Anne” and “Anna”, etc.
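The accidental-match point can be illustrated with a small Python sketch. It assumes that filename matching behaves like a case-blind substring test and that Tag matching compares complete words; both functions are hypothetical stand-ins rather than anything Windows exposes.

```python
def filename_match(filename, term):
    """Substring test: the term may match inside a longer word."""
    return term.lower() in filename.lower()

def parse_tags(field):
    """Split a semicolon-separated Tags field into individual tags."""
    return [t.strip() for t in field.split(";") if t.strip()]

def tag_match(tags, term):
    """Whole-word test: the term must equal one of the tags, ignoring case."""
    return term.lower() in {t.lower() for t in tags}

tags = parse_tags("Anne; Jones; 1985")
print(filename_match("Anne_Jones_1985.jpg", "Ann"))  # True: 'Ann' is inside 'Anne'
print(tag_match(tags, "Ann"))                        # False: no tag is exactly 'Ann'
print(tag_match(tags, "anne"))                       # True: case is ignored
```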

Windows 7 also allows you to sort your files by their Tags — look on the View menu, under Sort by — but keywords are still primarily a way of finding content rather than presenting it. If the advantages are so great for organising pictures, or any files, by their provenance, and for relying on a specialised application to present them in a much richer fashion — with the added context of stories, timelines, and so on — then why don’t we all do it that way?

This subject came up in a Google Hangout in the DearMYRTLE's Genealogy Community, hosted by Pat Richley-Erickson (aka DearMYRTLE), on 19 Jan 2015. Twice during that Hangout — once at 15:00 into the recording, and then later at 35:40 — Pat made the astute observation that relatives (and especially the younger ones) will just want to browse some “cool old photographs” and not mess around with a specialised application. It’s sad but true that if there isn’t a description directly visible when they open the file then they won’t find the details. Remember that in the traditional family albums there would usually have been something written under each picture.

The technology is there to put a description inside each picture — in that same meta-data area where the Tags live — and this could even include an optional “wire frame” diagram that could be overlaid to identify individuals in the picture. That could have relevant links for each of those people to the data held in your specialised genealogy application. You would probably have to write your own picture-viewer application in order to see all that content, but you would then be back to the same problem again.

Proxies

When you click on a file, your operating system checks what application is registered for opening a file of that type. Although you may use the same application for all your image file-types, it is possible to make the association type-specific; for instance, using Microsoft Paint (mspaint.exe) for *.bmp files and Microsoft Office Picture Viewer (ois.exe) for *.jpg files. However, each association is fixed for a given file-type.

It is possible, though, to go via an intermediary application to make an intelligent choice for you. This would mean the vendor of your genealogy application producing a very tiny proxy application that looks at the image file you’re trying to open, and then determines whether to load it in the default image viewer (for that file-type) or in their own genealogy application.

Figure 5 - Using a proxy viewer.

I have written a sample C program that demonstrates this principle using alternative viewers for plain text files: proxy.c. This looks to see if the image filename ends in some genealogical identifier of the form: “ID-identifier” (e.g. Joan_James_1983_ID-1AF92G.jpg). It would be just as feasible to involve the folder path in its decision-making, or even to look inside the file’s meta-data for an identifier there, but this scheme was simpler.

If this sample proxy finds such an identifier then it launches a specialised viewer with the arguments: <filename> <identifier>, and in all other cases it launches a default viewer with the single argument: <filename>. When configured correctly, those young relatives could happily click on images anywhere on the computer, and they would see each one in the appropriate viewer, depending on whether it is part of your genealogy collection or not.
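The decision itself is small enough to sketch. The Python below is not the proxy.c program; it is an independent illustration that assumes the identifier is introduced by "_ID-", consists of letters and digits, and sits immediately before the file extension. The viewer names in the usage lines are placeholders.

```python
import re

# Assumed form: ..._ID-<letters/digits>.<extension> at the very end of the name.
ID_PATTERN = re.compile(r"_ID-([A-Za-z0-9]+)\.[^.]+$")

def build_command(filename, default_viewer, special_viewer):
    """Return the argument list the proxy would launch for this file."""
    found = ID_PATTERN.search(filename)
    if found:
        # Genealogy file: specialised viewer gets <filename> <identifier>.
        return [special_viewer, filename, found.group(1)]
    # Anything else: default viewer gets just <filename>.
    return [default_viewer, filename]

print(build_command("Joan_James_1983_ID-1AF92G.jpg", "viewer.exe", "genviewer.exe"))
print(build_command("Jane_Smith_1980.jpg", "viewer.exe", "genviewer.exe"))
```

The returned list is in the form accepted by Python's subprocess.run, so launching the chosen viewer would be a single further call.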

Yes, this would need some help from your genealogy application to ensure that your files have the correct identifier in their name, and hiding that information inside each file’s meta-data would be cleaner. What about the configuration, though? The proxy has to take over a number of file associations (one for each of the image types you’re interested in), and remember what their default viewers were so that it can invoke them when necessary. Well, that turns out to be quite easy: during installation, it would simply displace the existing default viewer for each file-type, and pass that to the proxy as either another argument or via a command-line option. This also serves as a way of saving the file-path of each default viewer. A later uninstall would then have those previous file-paths available so that it could put things back exactly as they were.
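The install and uninstall bookkeeping can be sketched as follows. A plain Python dictionary stands in for the operating system's table of file associations (on Windows these actually live in the registry, which this sketch does not touch), and the "--default" option name is a hypothetical choice.

```python
def install(associations, extensions, proxy_path):
    """Point each extension at the proxy, carrying the displaced viewer as an option."""
    for ext in extensions:
        previous = associations[ext]
        # The old command is kept inside the new one, so nothing else needs saving.
        associations[ext] = [proxy_path, "--default"] + previous

def uninstall(associations, proxy_path):
    """Restore every association that the proxy had taken over."""
    for ext, command in associations.items():
        if command[:2] == [proxy_path, "--default"]:
            associations[ext] = command[2:]

assoc = {".jpg": ["ois.exe"], ".bmp": ["mspaint.exe"]}
install(assoc, [".jpg", ".bmp"], "proxy.exe")
print(assoc[".jpg"])   # ['proxy.exe', '--default', 'ois.exe']
uninstall(assoc, "proxy.exe")
print(assoc[".jpg"])   # ['ois.exe']
```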

This approach can also be applied to non-image file-types, such as Word documents. This could make the difference between a machine that happens to hold your genealogy data, and a “genealogy machine”. Who knows, maybe someone will do this now.

Thursday, 10 December 2015

Our Days of Future Passed — Part III

This is the final part of my trilogy on the philosophy behind STEMMA V4.0. Part I covered its application to both arboreal (tree) genealogy and event-based genealogy, while Part II covered narrative genealogy; I now want to expand on its support for source-based genealogy.

Source-based genealogy is both a research orientation and an organising principle where the source is the primary focus. The majority of software, and especially Web sites, are focused on conclusions; users are asked to provide names, dates, and locations without having to indicate where or how their information originated. At best, they might be given the opportunity to retrospectively add some citation or electronic bookmark. When starting with the source, though, all the relevant resources (images, documents, artefacts) can be organised according to the source provenance and structure, a citation can be created as soon as information is acquired, and the information can be assimilated before you decide how and where to use it.

People like Louis Kessler have advocated source-based genealogy for several years, and the term itself has displaced the more naïve notion of evidence-based genealogy. Since evidence is only information that we believe supports or refutes some claim then a focus on that alone would ignore the source of the information, and any context or other information therein. For instance, imagine that you have used certain information from a source to substantiate a particular claim. What happens if you later feel that the same source might help with a different claim, made elsewhere? Do you have to assimilate the contents all over again in order to decide whether that’s true or not? How would you become aware of its possible relevance?

Link Analysis

Let’s think how source-based genealogy might work, conceptually, and especially the assimilation phase. Anyone who remembers studying text books in preparation for an examination may also recall annotating pieces of text: underlining phrases or circling sections that we believed were going to be important, and which we wanted to ensure that we fully grasped. What we were doing is reinforcing our mental model, or mind-map, and creating structure and order from the text.

I’m sure you’ve all seen detective films, or TV series, where someone solves a complex puzzle using notes and images on a pin-board with string connecting the pieces together. This technique really does exist, and it’s called link analysis. It’s a type of graphic organiser used to evaluate relationships (or connections) between nodes of various types, and it has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, and medical diagnosis. Although most online sources present it in terms of software implementations, it is much older, possibly going back before WWII.[1]

Figure 1 - Conceptual link diagram.

This rather imaginative depiction of the concept illustrates some of the advantages and benefits quite nicely. The information of interest in each source is somehow marked-out, and used to build any of: concepts, prototype subjects[2], conjectures, and logic steps — whatever the researcher wants from them. Because of the freedom in choosing those pieces, this method is as useful to my non-goal-directed Source Mining as it is to goal-directed research. Essentially, one approach is aimed more at collecting material to paint a history or biography of subjects, whereas the other is focused on solving a specific problem or proving one-or-more claims.

When those pieces are connected all the way up to the conclusion entities in your data, then it also provides a trail by which some user could drill-down[3] in order to see how a conclusion was arrived at, what information was used as evidence, and where that information originally came from.
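A drill-down of this kind is essentially a walk over a graph of "derived from" connections. The Python sketch below is only a conceptual illustration with invented node labels; it says nothing about how STEMMA stores such links.

```python
# Each node lists the nodes it was derived from; source fragments have none.
# All labels are hypothetical placeholders.
DERIVED_FROM = {
    "conclusion C1": ["profile P1"],
    "profile P1": ["fragment F1 (source S1)", "fragment F2 (source S2)"],
    "fragment F1 (source S1)": [],
    "fragment F2 (source S2)": [],
}

def drill_down(node, graph, depth=0):
    """Return (depth, node) pairs tracing a conclusion back to its source fragments."""
    trail = [(depth, node)]
    for origin in graph.get(node, []):
        trail.extend(drill_down(origin, graph, depth + 1))
    return trail

for depth, node in drill_down("conclusion C1", DERIVED_FROM):
    print("  " * depth + node)
```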

When a marked item is a person reference then it is effectively the same as what’s often termed a persona, and the ability to connect personae from different sources provides a way of supporting multi-tier personae. Although I have previously been critical of the use of personae because of the separation of the person reference from its original source context — including place, date, and relationship context that might be instrumental in establishing the identity behind the reference — this approach retains a connection to the source information, and even allows it to follow the subsequent use of a persona. This is particularly important because it appears to be an inclusive handling of certain contrasting approaches advocated by other data researchers: Tom Wetmore has long believed in personae, and Louis Kessler in source-based genealogy, but those approaches have sometimes appeared to be contradictory and have led to strong differences of opinion.

Source and Matrix Entities

Part I of this series introduced STEMMA’s Source entity as joining together citations and resources (such as transcriptions, images, documents, and artefacts) for a given source of information, but it encompasses much more than this. The semantic mark-up, described in Part II, allows arbitrary items of information to be labelled in a transcription. This includes subject references (person, place, animal, and group), event references, date references, and any arbitrary word or phrase. Those labelled items, called profiles, can be linked together using simple building-blocks in order to add structure, interpretation, or deduction to them. STEMMA doesn’t mandate the use of a visual link chart, or link diagram, since that would be something for software products to implement, but it does include the essential means of representation.

The Source entity normally specifies the place and date associated with the information, but these can also be linked to profiles if they have to be associated with some interpretation, or even some logic. For instance, when information relates to a place whose identity cannot be resolved beyond doubt from the mere place reference.

When a source is comprised of discrete or disjointed parts — such as a book’s pages, or a multi-page census household — or it contains anterior (from a previous time) references — such as a diary, chronological narrative, or recollections during a story — then smaller sets of linked profiles can be grouped within the Source entity using SourceLet elements, and these may have their own place and date context. Each of those discrete parts may have their own separate transcriptions and specific citations, although they’re related to the same parent source by containment — a subject for a future presentation. The network of linked profiles can bring together information and references from these different SourceLets for analysis.

The Source entity is a good tool for assimilating the information from a given source in a general and re-usable way. However, that information may need correlating with similar information from other sources, and this process may need to be repeated for different problems and with different goals. STEMMA accomplishes this with a related Matrix entity [4] that carries those networks of linked profiles outside of their source context and allows them to be worked on together.

Figure 2 - Mechanics of a link diagram.

Notice, as usual, that the building of these networks, and the association of them with corresponding conclusion entities, is independent of the relationship those entities have with their respective hierarchies and events (Part I), and narrative (Part II). In other words, the four main approaches to genealogy that I identified (arboreal, event-based, narrative, and source-based) can be inclusive of each other.

Compare and Contrast

Part I introduced the concept of Properties — items of extracted and summarised information — that could be associated directly with subject entities. The Source entity, which is part of the informational sub-model rather than the conclusional sub-model, can also achieve this but with additional flexibility. Because it is working directly with source information, and not obliged to make any final conclusions, it means that references to incidental people, or otherwise unidentified subjects, can still be assimilated but left in the Source entity for possible future use. The power of this should be obvious where, say, a referenced person later turns out to be a relative or in-law. Another difference is the vocabulary used to describe data and relationships. The Property mechanism uses a normalised computer vocabulary so that information can be consistently categorised as name, occupation, residence, etc., and relationships can be categorised precisely as things like spouse, mother, son, and so on. In the Source entity, though, what you record and what you call it are free choices; if you encounter a relationship provided as grandchild, nephew/niece, or cousin, where the interpretation may not be obvious, then you can keep it as-written and work on it. For the masochists, a comparison of these two mechanisms being applied to the same source may be found at: Census Roles.

It might be said that ignorance of prior work is bad during any type of research and development, but my software history is full of such cases where it has yielded a route to genuine innovation.[5] When I finally decided to look at whether other data models had addressed this approach, I was surprised to find that the old GenTech project from 1998 had documented a similar approach. GenTech and STEMMA had both tried to build a network of extracted information and evidence from “source fragments” (and actually used this same term), but the similarities applied mainly to the intention rather than to the implementation.

The GenTech data model V1.1 is hard to read because it has no real examples, it presupposes a concrete database implementation — which I’m not alone in pointing out to be inappropriate in a data model specification — and it talks exclusively about evidence, and analysing evidence, rather than information. The latter point is technically incorrect when assimilating data from source fragments since the identification of evidence, or the points at which information can be considered evidence, is dependent upon the researcher and the process being applied rather than some black-and-white innate distinction.

GenTech’s ASSERTION statement is the core building-block for its network. This is simply a 5-tuple[6] comprising {subject-1 type/id, subject-2 type/id, value} that relates two “subjects”. Those subjects are limited to its PERSONA, EVENT, GROUP, and CHARACTERISTIC entities (concepts which differ from STEMMA’s use of the same terms), and there are some seemingly arbitrary rules for which can be specified together. This restricted vocabulary means that it does not clearly indicate how its CHARACTERISTICs are associated with a particular time-and-place context (I couldn’t even work out how); it has no orthogonal treatment of other historical subjects (STEMMA terminology), such as place, group, or animal; and it cannot handle source fragments with arbitrary words and phrases. By contrast, STEMMA’s profiles can deal with source fragments containing references to persons, places, groups, animals, events, dates, or arbitrary pieces of text. Its <Link> element is the low-level building-block that connects these together, and to other profiles, but with much more freedom. For instance, the equivalent of a multi-tier persona is achieved by simply connecting two prototype-person profiles. GenTech uses its GROUP entity to achieve this, and effectively overloads that entity for grouping PERSONA and CHARACTERISTICs rather than using it only to model real-world groups.

Some other philosophical differences include the fact that STEMMA profiles represent snapshots of information and knowledge at a particular point in the assimilation process (or correlation process, in the case of Matrix); the actual information effectively flows between those profiles. This is hard to describe in detail, and I may save it for a later post. Another difference is that the profiles allow steps of logic and reasoning to be represented in natural language; the connections are not just a bunch of data links or database record IDs. That text would be essential if some user wanted to drill-down to understand where a claim or figure originated, and STEMMA allows multi-lingual variants to be written in parallel. Reading GenTech’s section 1.4.2 suggests (to me) that its ASSERTION may have more in common with STEMMA’s Property mechanism than with its Source and Matrix entities.

An interesting corollary is that conclusions are easily represented in software data models, and they will usually employ precise taxonomies/ontologies to characterise data (such as a date-of-birth, or a biological father), or equivalent structures (such as a tree). In effect, these conclusions are designed to be read by software in order to populate a database or to graphically depict, say, biological lineage. Source information, on the other hand, cannot be categorised to that extent: it was originally intended to be humanly-readable, it must be assimilated and correlated by a human, and all analysis must be understood later by other humans.

There have been a number of attempts to represent the logical analysis of source information using wholly computerised elements (see FHISO papers received and Research Process, Evidence & GPS) but these are far removed from employing real text. As a result, they lose that possibility of drilling-down from a conclusion to get back to a written human analysis, and to the underlying information fragments in the sources. While allowing analytic notes to be added directly to data items might be one simplistic provision, connecting notes and concepts together to build structure must have written human explanation, not “logic gates” and other such notions. One reason for these overtly computerised approaches could be that software designers feel an onus on them to support “proof” in the mathematical sense rather than in the genealogical sense; a result of possibly misunderstanding the terminology (see Proof of the Pudding).

Concluding Remarks

So what have I accomplished in this trilogy? I have given insights into a published data model for micro-history that has orthogonal treatment of subjects and inclusive handling of hierarchies, events, narrative, and sources. Did I expect more? Well, I did hope for more, when I first started the project, but it was not to be. There are people in the industry that I have failed to engage, for whatever reason, and that means that the model will probably finish as another curiosity on the internet shelf, along with efforts such as GenTech and DeadEnds. Complacency and blind acceptance have left genealogy in a deep rut. In the distant future — when technophobe is simply a word in the dictionary — people of all ages will look back at this decade and laugh at what our industry achieved with its software. When paper-based genealogy (using real records) has probably gone down the same chute as paper-based libraries, and we're left with software genealogy and online records, then we'll wish that we had built bigger bridges between those worlds. If our attitudes and perceptions don’t rise above the horizon then we'll never see the setting sun until it’s too late, and we might as well say: RIP Genealogy!

[1] Nancy Roberts & Sean F. Everton, "Strategies for Combating Dark Networks", Journal of Social Structure (JoSS), volume 12 (https://www.cmu.edu/joss/content/articles/volume12/RobertsEverton.pdf : accessed 24 Oct 2015), under “Introduction”; publication date uncertain but its References section suggests 2011.

[2] In STEMMA terms: persons, places, groups, or animals.

[3] A BI process where selecting a summarised datum or a hierarchical field — usually with a click in a GUI tool — revealed the underlying data from which it was derived. The term was used routinely during the early 1990s when a new breed of data-driven OLAP product began to emerge.

[4] "An environment or material in which something develops; a surrounding medium or structure", from Oxford Dictionaries Online (http://www.oxforddictionaries.com/definition/american_english/matrix : accessed 28 Oct 2015), s.v. “matrix”, alternative 1.

[5] Modern software development is less about ground-up development and more about creating solutions by bolting together off-the-shelf (e.g. open-source or proprietary) components. Both have merit but the pendulum is currently in the other quadrant.

[6] An n-tuple (abbreviated to “tuple”) is a generalisation of double, triple, etc., that describes any finite ordered list. The term was popularised in mathematics, and more recently in OLAP technologies such as Holos and later Microsoft OLAP.

Wednesday, 2 December 2015

Our Days of Future Passed — Part II

After bringing things up-to-date regarding STEMMA V4.0 in my previous post, Our Days of Future Passed — Part I, I now want to expand on its support for narrative genealogy.

When asked what this is, most people would respond that it is story telling. It is true that recounting stories of our own experience or recollection would be a part of this, but there is still more. Unfortunately, the terminology for distinguishing types of authored work is something of a minefield with many different terms being applied inconsistently. I will keep with the following terms in this article, which I hope will be meaningful and acceptable to readers:

Narrative essays: typically contain personal non-fictional storytelling for the purpose of sharing an experience, recollections, or a point of view.

Narrative reports: write-up of research, analysis of information, conclusions, etc., using a narrative format. That is, weaving the research process into a description of the events uncovered. This type of narrative would generally be from the point of view of source information rather than personal experience.

Research report: formalised report of a specific research assignment, usually for a client. It might record everything that was searched, who & what was searched for, everything that was found, everything you hoped to find but couldn’t (with reasons), all the negative results, analysis, and a research plan for future work.

For completeness, research notes are the sum total of everything we know about a person, or other subject, expressed as raw records with separate commentary, and in an easily accessible typescript form. This is the accepted meaning, but I suggest that this disorganised concept stems from inadequate digital representation.

Let’s just focus on the first two of these for now: narrative essays and narrative reports, the difference between which might be blurred when researching something that has its roots in living memories. Where would we write them? The main answers would probably be: blogs, dedicated Web sites, or word-processor documents, but are they sufficient?

The problem with those approaches is that your narrative is then disconnected from any other type of data, and not described by a data model. On its own, it probably doesn’t need a data model, but if you want to integrate it with anything more structured — which basically means all that multi-linked data that I described in Part I — then it is essential. Before I can explain that integration, I must first describe some of the advantages and disadvantages of narrative.

Narrative Essays

Natural language is a very rich means of expression, and it can be used to describe events, circumstances, objects, people, emotions, analysis, evidence, and conclusions. Furthermore, this can be done objectively, in a matter-of-fact way, or with the elegance and beauty of seductive prose — whichever is appropriate for the material. Remember, too, that having some natural-language content is essential if you want to share your history with friends, relatives, or peers. If anyone believes that template-generated sentences, where software inserts discrete data values into some stock template, constitutes narrative then they need to read more books. There are no rules for whether these contributions would have to be about a particular person, family, or other subject; they could cover any historical topic, and even describe your research and reasoning in arriving at your picture of the past.

However, narrative also has a disadvantage: it is sequential. While a master of the art could take you on a journey using just their words, it is still a pre-prepared journey that has to be followed in sequence; you cannot easily navigate your own way around the information in the story. By contrast, the multi-linked entities in Part I would allow you to navigate their hierarchies (e.g. lineage), events, timelines, and geography — together, and with no restrictions.

Now imagine that both of these were combined, and that you could, for instance, navigate from a person reference, or a place reference, found in some narrative, to its respective hierarchy, then to a nearby entity in that same hierarchy, and finally to a mention of the new entity in some other narrative.

Figure 1 - Navigating between narrative and hierarchies.

STEMMA effectively integrates separate narrative articles (“pieces of non-fictional prose that is an independent part of a publication”) with its multi-linked data describing hierarchies, events, geography, sources, etc., and this allows the freedom to navigate between all of them. Note that the links between subject references (in the narrative) and subject entities can be considered bi-directional, and so can be navigated in either direction.
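Bi-directional links of this kind amount to keeping two lookups in step: one from each narrative to the entities it mentions, and one from each entity back to the narratives mentioning it. The Python sketch below illustrates the navigation path described above; the class, the identifiers, and the one-level "parent" hierarchy are all hypothetical and are not STEMMA structures.

```python
from collections import defaultdict

class LinkIndex:
    """Keeps narrative-to-entity and entity-to-narrative lookups in step."""
    def __init__(self):
        self.entities_in = defaultdict(set)    # narrative id -> entity keys
        self.narratives_of = defaultdict(set)  # entity key -> narrative ids

    def add_reference(self, narrative_id, entity_key):
        self.entities_in[narrative_id].add(entity_key)
        self.narratives_of[entity_key].add(narrative_id)

def related_narratives(index, parent, narrative_id):
    """Other narratives reached by stepping one level up from each referenced entity."""
    found = set()
    for entity in index.entities_in[narrative_id]:
        neighbour = parent.get(entity)
        if neighbour is not None:
            found |= index.narratives_of[neighbour]
    return found - {narrative_id}

index = LinkIndex()
index.add_reference("essay-1", "person-A")
index.add_reference("essay-2", "person-B")
PARENT = {"person-A": "person-B"}  # person-A's parent is person-B

print(related_narratives(index, PARENT, "essay-1"))  # {'essay-2'}
```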

In effect, I’m saying that narrative supplements that multi-linked structured data, and that you cannot truly represent the past without it. I recently found a real-world analogy to this synergy after talking with Brian Miller, CEO of http://history-to-share.com/. His company produce ceramic outdoor plaques that can be associated with a gravestone, or other memorial marker. They include a QR code that allows a passer-by to scan it and see stories of that person’s history, and so breathe life into what would otherwise be simple names and dates, possibly with brief relationship details such as “wife of”, “husband of”, etc.

Source Information

If you’re serious about genealogy then you will be interested in original documents — including copies or derivatives thereof — and maybe even authored works by other people. Just as narrative held in a separate location is not making good use of it, then neither is keeping only images or other facsimiles of those documents. What I’m about to talk about, here, is transcription, and including those transcriptions with the rest of the data.

It’s tempting to think that a transcription is simply text, and so not fundamentally different to narrative. However, there are many additional issues to consider, such as unusual or erroneous spelling, unknown or uncertain words, insertions and deletions, emphasis, marginal notes, footnotes, and so on. Capturing the essence of these in a transcription is essential if you plan to study it, or even to understand it properly.

Electronic documents use a system of mark-up, analogous to original manuscript mark-up, that embeds information or instructions within the text. This may be presentational mark-up that gives instructions on how to present something (e.g. that a word or phrase should be in italics), or semantic mark-up that associates meaning or other information with a word or phrase (e.g. that a phrase is actually a hyperlink that must take you to a given URL). STEMMA uses such a system in its narrative support, and a large part of it is common to both authored work and transcription, but a smaller part is also specific to transcription. Since authored work will often need to quote transcribed text then both features are actually provided by the same rich-text narrative tool. For the masochists amongst you, an example showing STEMMA’s mark-up being applied to an evidence-of-age document may be found at: Transcription Anomalies.

Semantic Mark-Up

In order to introduce semantic mark-up, let’s begin by looking at a person reference such as “Tony Proctor”. When producing, say, a narrative essay then the mark-up allows the author to ‘generate a reference to the Person entity whose identifying key is such-and-such’. There are options allowing a choice of formal/informal name, or even some custom description of the person. As well as inserting the selected name into the text, this also marks it as a person reference.

This approach can be applied to all of the STEMMA subject types: person, animal, place, and group. It can even be applied to the names of events, or to raw dates.
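The person-reference case described above can be sketched in a few lines. This is a hedged illustration only: the `Person` fields, the `person_reference` function, its `style` and `custom` parameters, and the key "P1" are all assumptions made for the example, and the dictionary it returns merely stands in for whatever mark-up STEMMA actually emits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical Person entity with an identifying key and two name forms."""
    key: str
    formal_name: str
    informal_name: str


def person_reference(person: Person, style: str = "formal",
                     custom: Optional[str] = None) -> dict:
    """Build a marked person reference for insertion into narrative text.

    A custom description takes priority; otherwise the chosen name style
    is used, falling back to the formal name.
    """
    if custom is not None:
        text = custom
    elif style == "informal":
        text = person.informal_name
    else:
        text = person.formal_name
    # The text is what the reader sees; kind and key are the semantic part.
    return {"kind": "person", "key": person.key, "text": text}


tony = Person(key="P1", formal_name="Anthony Proctor", informal_name="Tony Proctor")
formal = person_reference(tony)
informal = person_reference(tony, style="informal")
described = person_reference(tony, custom="the author")
```

Whichever wording is chosen, the reference carries the same entity key, so all three variants lead back to the same Person.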

Figure 2 - Relationship between subject entities and narrative.

Note that this diagram illustrates how each of the subject entities is still independently connected to its respective hierarchy and shared events. The subjects might be referenced in many separate narrative articles, but these could be found directly from those hierarchies and events, or vice versa.

Now let’s switch to a person reference encountered during a transcription. In this case, the subject reference was already present, and the goal of the mark-up is simply to tag it as a person reference, etc. Note, however, that a subject reference is not necessarily the same as a name; phrases such as “my grandmother”, “my dog”, “his regiment”, or “their home” are all valid examples of subject references — for person, animal, group, and place, respectively — but none are names. This may also apply to dates since phrases such as “next year” or “last week” are still date references.

This observation leads to a choice by the transcriber: either a subject reference can be connected to an existing subject entity, or left as a reference to some unidentified or incidental subject. STEMMA terms these options deep semantics when we want to make the association and shallow semantics when we simply want to mark it as a reference to a subject of a given type. This choice also applies to date references where we may not be able to identify the date value beyond reasonable doubt, but we still know that it’s a date.
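One way to picture the deep/shallow distinction is as a reference record whose link to an entity is optional. The sketch below assumes a made-up `SubjectRef` structure, with "P42" as an invented entity key; it shows the idea, not STEMMA's own representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubjectRef:
    """A marked subject reference found in transcribed text (hypothetical)."""
    text: str                          # the words as they appear in the source
    kind: str                          # "person", "animal", "place", "group" or "date"
    entity_key: Optional[str] = None   # set only when the subject is identified

    @property
    def semantics(self) -> str:
        # Deep: connected to a known entity. Shallow: typed but unconnected.
        return "deep" if self.entity_key is not None else "shallow"


refs = [
    SubjectRef("my grandmother", "person", entity_key="P42"),
    SubjectRef("his regiment", "group"),
    SubjectRef("next year", "date"),
]
summary = [(r.text, r.kind, r.semantics) for r in refs]
```

Here "my grandmother" has been identified and linked, while "his regiment" and "next year" are still recorded as a group reference and a date reference even though nothing more is known about them.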

Narrative Reports and Research Reports

A narrative report will probably have a slightly more academic approach than a narrative essay, and one of the most important requirements will be for source reference notes and for general footnotes/endnotes. Since a narrative report, by definition, will involve researching information from a number of sources then it should include traditionally formatted citations for them. STEMMA’s mark-up allows the production of citations, and these may be generated directly in a corresponding footnote/endnote, or inline with your other text so that more complex (possibly multi-source) citations can be placed in a custom footnote/endnote. Illustrations of this may be found in: Cite Seeing.

Research reports are a more emotive issue, but bear with me. I do not have any figures that indicate how many professionals disseminate their research reports via paper or electronic means, but all will undoubtedly have been created using a word-processor. Reactions to the suggestion that a research report could be disseminated in some computerised form, other than from a word-processor, are largely based on the fear that it would somehow limit freedom of expression, or force the use of some database, but these are unfounded.

STEMMA’s rich-text narrative does require a specialised word-processor tool so, for a moment, let’s assume that this tool was freely available. It has the same capabilities for layout, formatting, tables, pictures, and reference notes, as do most word-processors. If you had to use that, and the recipient also had a corresponding reader, then there would be no loss of freedom. However, the STEMMA version would also allow subject references to be flagged, and formatted differently to the surrounding text — no more need to use that horribly non-international approach of uppercasing surnames. Furthermore, those subject references could be clickable, and could take you to pictorial representations of some event or lineage information that was uncovered. That structured information could also be lifted out of the report by a compliant genealogy product since it would understand the same format; there would be no need for the client to mess around trying to cut-and-paste pieces for data entry.

There are many advantages to this approach, but I also know that it’s currently a step too far for some readers. Until such a format becomes as ubiquitous as our word-processor formats then it will remain in the abode of digital dragons and uncharted territory.

Making use of Narrative

So is this just about exploration? What about searching? While it is possible to search word-processor documents, or blogs, you have to know exactly what to search for, and this is fraught with problems for subjects with alternative names. Searching marked-up narrative means that the software can automatically check all the alternative modes of reference without you having to know them. Not convinced? OK, consider this problem that I had several times before my software had matured sufficiently. You come across a surname, or someone asks you about a surname, and you want to search all the persons you have in your data, all their aliases, their maiden names, alternative spellings, and find all the narrative and transcribed references to them. Furthermore, you want to check previously unidentified people or incidental people. And you don’t want to worry about ambiguities such as tailor/Tailor and baker/Baker, or confusing the surname London with the place of the same name. That’s a powerful feature, and with marked-up narrative it becomes quite easy to perform. I would even consider buying a product based on that one capability … had I not already done it.
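The surname search described above can be sketched as follows. It is a simplified illustration under stated assumptions, not the author's software: the `Ref` record, the `PERSON_NAMES` table, the sample data, and the `find_surname` function are all invented, and matching any word of a recorded name is a deliberate shortcut for "has this surname".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ref:
    """A marked-up subject reference in some narrative or transcription."""
    article: str
    text: str
    kind: str
    entity_key: Optional[str] = None   # None means an unidentified subject


# Every name recorded for each person entity: aliases, maiden names, spellings.
PERSON_NAMES = {
    "P1": ["Mary London", "Mary Baker", "Polly Baker"],
    "P2": ["John Tailor", "John Taylor"],
}

REFS = [
    Ref("letter-1", "Mrs Baker", "person", "P1"),
    Ref("letter-1", "my aunt", "person", "P1"),
    Ref("letter-2", "London", "place", "PL1"),
    Ref("letter-2", "J. Tailor", "person", "P2"),
    Ref("diary-3", "Mr London", "person"),   # incidental, not yet identified
]


def find_surname(surname):
    """Return (article, text) for every person reference tied to a surname."""
    wanted = surname.casefold()

    def has_surname(name):
        # Simplification: any word of the name counts as a match.
        return wanted in (word.casefold() for word in name.split())

    # Step 1: persons who carry the surname under any recorded name.
    keys = {key for key, names in PERSON_NAMES.items()
            if any(has_surname(n) for n in names)}

    # Step 2: person references only, so places and plain words are ignored.
    hits = []
    for ref in REFS:
        if ref.kind != "person":
            continue
        identified = ref.entity_key in keys
        unidentified = ref.entity_key is None and has_surname(ref.text)
        if identified or unidentified:
            hits.append((ref.article, ref.text))
    return hits


london_hits = find_surname("London")
taylor_hits = find_surname("Taylor")
```

Searching for "London" finds "Mrs Baker" and "my aunt" (the same woman under other modes of reference) plus the unidentified "Mr London", but not the place called London. Searching for "Taylor" finds "J. Tailor" because the alternative spelling is recorded against the same person.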

There will undoubtedly be a researcher amongst you who will ask ‘what happens if I suspect who a person reference is to, but I do not want to connect it directly to a corresponding person entity; instead wanting to build a case for why it is that person’. Well, full marks to whoever asked that! It was one of the last core pieces to appear in the STEMMA specification.

The subject of connecting a person reference (or other subject reference) to the logic, and to prototype persons[1], before making a concluding link to a person entity, will be the subject for Part III in this series. This will also discuss why reasoning must be expressed using natural language, and not in some wholly formalised computer-speak.

[1] A prototype subject begins as the details of some initial subject reference (effectively a persona in the case of persons), and may be merged with other prototypes before being connected to a subject entity.