Wednesday 25 September 2013

Collaboration With Tears



Is it possible to collaborate with others on a unified online family tree? This is one of those regular emotive topics but it has recently flared up in discussions around the FamilySearch.org Family Tree. I want to highlight some basic issues with the concept, and eventually follow up with a detailed Collaboration Without Tears post.

In his blog-post entitled Is a unified online family tree program possible?, James Tanner states unequivocally that a unified online family tree is possible, based on an analogy with collaborative wiki programs. A problem I have with this analogy is that wiki programs mainly handle text, and text is considerably less structured than family-history data. A consequence is that a wiki can more easily represent conflicting opinions without having to adopt one at the expense of others. This is not to say that an online model couldn’t implement a scheme for representing conflicting opinions but that would be quite complex compared to the existing collaborative models out there.

Although Wikipedia is a valuable resource, I am on the fence regarding the ease of collaboration – even with primarily text – and the rules placed upon reliable sources.  No personal opinions, no original research, no private documents, no rumours, etc., are rules obviously intended to increase the accuracy and verifiability of the contribution, but there are cases where they hinder a valid presentation. Not everything is known for sure. Not everything is in the public domain. Some contributions are necessarily personal recollection because accessible sources may no longer exist. The issue of accessible sources in the public domain has a direct parallel in genealogy and I will give an example below.

In his follow-up post, Genealogical Ownership and Isolationism, James correctly points out that ‘You can't copyright ideas and you can't copyright facts’. However, a work of academic research, which I would say includes a treatise on your family history, can be copyrighted. In fact, it would be automatically copyright by virtue of the Berne Convention, unless you’ve agreed to some waiver or Creative Commons licence. This may be a topic more applicable to collaborative family history than to some type of online lineage but I want to present a number of issues that suggest Isolationism is actually an inherent part of genealogy, and so cuts across the grain with collaboration.

With any type of collaboration, the question of who is most qualified to make a change will always break the utopian ideal. In the case of a wiki, someone may consider themselves to be a learned expert, or a qualified researcher, but if they’re writing about your work/creation, or your family, or you as a person, then who is more qualified? Although I have no personal experience of this, I have heard horror stories about such conflicts. The same happens with collaborative genealogy too. You may have put a considerable amount of effort into establishing the truth of some aspect of a family’s lineage, so how then would you react if some less qualified person ignores your work and changes things? With systems like Family Tree, the change may not even be a direct one – you may be unfortunate enough to be downstream from someone’s change elsewhere. Looking from the other side, though, if you’re closer to the family in question then how do you react when someone with letters after their name changes your contributions? These are fundamental problems with any type of collaboration.

The issue of restricted sharing of certain items, such as photographs, personal documents, and family stories, is one that we can all appreciate – at least if there’s any depth to our family history data. There will always be a point at which we decide something is so personal, or so private, or so sensitive, that we don’t want to share it with the whole world. Again, you might argue that this is only applicable to family history rather than biological lineage but the boundaries are vague. This particular issue was recently discussed at No, You Can't Have My Photos and Stories One World Tree.

Where there is still live research – which is pretty much an eternal endeavour with family history – then collaboration means you’re trying to hit a moving target. It is frustrating to find, when you want to update something, that it’s all changed, or even disappeared. Do you really want to spend a considerable amount of your research time verifying or debating what others have added to your shared data, as opposed to looking at their separate research when time permits? These are two distinct approaches. Irrespective of the approach, when you’re in the throes of some deep research, you really want to flag your data as tentative until you’re sufficiently confident with it. I am not aware of any site that supports this, though, which means you’re left with the options of full visibility or no visibility. One of the reasons that online trees contain so many errors is that people have copied data from someone else before it was ready, and they’ve never bothered to verify it themselves.

The fact that people blatantly copy data from other trees, and with no citations or attribution, is also a reason contributing to the poor quality of online trees generally. One of the justifications for a shared online tree is that such copying is no longer necessary. This is true but it’s not the only alternative. It is possible to devise schemes that accommodate different depictions of the past, and yet don’t require people to copy-and-paste to build their own tree. However, with alternative viewpoints comes the requirement to rate one against another. A simple mechanism such as ‘Like’ would work although it lends itself to abuse. A mechanism based on the number of separate trees that agree-with or join-with (see below) some viewpoint would also work. Of course, these both rely on users assessing each conflicting viewpoint based on the case they make and the supporting evidence they cite.

I’m currently helping a friend with her recent (20th century) family history. This turns out to be one of the most convoluted cat’s cradles that I can recall working on. Although we’re making great progress, that success is primarily due to the cooperation of existing family members, and their recollections or personal documents. A consequence is that some qualified researcher who may be diligently looking at so-called reliable sources would end up with an incorrect picture of the past. This would put those family members in a quandary if that researcher published a tree based on reliable sources. Do they challenge it or ignore it? Some of the evidence is not in the public domain, and may never be, so on what grounds could a public tree be challenged? Does this not imply that a publicly shared tree can never be totally accurate, or agreed upon? The thought-provoking subject of privacy and the right to dig into our ancestors’ lives was recently raised by Thomas MacEntee at Is There A “Right” To Do Genealogy? following a lecture of his entitled Privacy and Our Ancestors.

So if Isolationism is inherent in genealogical research then what criteria would make collaboration practical? Here’s my tentative list:

  • Supporting alternative viewpoints.
  • Controlled sharing (import) and visibility (export).
  • Alternative to copy-and-paste genealogy.
  • Citation and Attribution where necessary.
  • Automated rating of different viewpoints.

Collaborative models could be defined where the trees from different researchers are essentially held separately, although I’m not aware of any. Existing online trees are either individual ones with controlled access, or a single shared one with collaboration. Having separate contributions is obviously good from the point of view of concurrent research, and for controlled visibility, but the missing element is to be able to create a single “tree view” from those contributions. Now I’m not talking about any physical data merge here since those separate contributions should be immutable. I’m thinking of schemes where you voluntarily connect or overlay others’ contributions with your own. There’s a whole range of possibilities dependent upon the unit of sharing. It could be a sub-tree from someone else’s tree, or the site could allow collaboration on public named tree segments that people can voluntarily connect with. These approaches all support alternative viewpoints and implicit rating of those viewpoints. They also substitute the act of joining-to or overlaying-with in place of any copy-and-paste. However, underpinning it all is the goal of a traditional family lineage chart, and that alone is deeply flawed.
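As a rough illustration of the overlay idea, here is a minimal Python sketch. It assumes a deliberately tiny unit of sharing (a single child-to-parent claim), and every name in it (`Claim`, `contributions`, `tree_view`, `support`) is hypothetical rather than taken from any existing site. Each contribution is held separately and never modified, a "tree view" is computed on demand, and the rating is simply the number of separate trees that agree.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One researcher's assertion that `parent` is a parent of `child`."""
    child: str
    parent: str


# Separate, immutable contributions (frozenset cannot be changed in place).
contributions = {
    "alice": frozenset({Claim("William", "Timothy"), Claim("Timothy", "John")}),
    "bob": frozenset({Claim("William", "Timothy"), Claim("Timothy", "James")}),
    "carol": frozenset({Claim("Timothy", "John")}),
}


def tree_view(own, overlays):
    """Build a read-only view: own claims win, overlays only fill the gaps."""
    view = {c.child: c.parent for c in contributions[own]}
    for name in overlays:
        for c in contributions[name]:
            view.setdefault(c.child, c.parent)
    return view


def support(child):
    """Implicit rating: how many separate trees assert each parent of `child`."""
    return Counter(
        c.parent
        for claims in contributions.values()
        for c in claims
        if c.child == child
    )


print(tree_view("carol", ["bob"]))
print(support("Timothy"))
```

Carol's view picks up William's parentage from Bob's tree without copying it, and keeps her own conclusion about Timothy where the two disagree. Nothing is merged, so Bob can carry on researching without affecting anyone who has not chosen to overlay his work.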

I intend to follow up with a novel approach to collaboration which is both simple and practical, but also fundamentally different to the approaches discussed here.

Friday 20 September 2013

Where are the Standards for Historical Data?



We all accept the need for standards – standards of measurement, electrical and mechanical components, information representation, etc. How many people have noticed, though, that there are no standards for historical data? I will explore this with you, and even give a couple of concrete examples of omissions, before considering why historians might be ignored.

A quick computer search for “historical data” in the context of computers and standards leads you into transaction processing systems (TPS) and the records associated with past transactions, which isn’t what we want.

The normally very helpful Cyndi’s List presents a very meagre list, including FHISO (which we’ll come to again shortly), GenContent, historical-data.org, and GenTech.

A glance at the available List of ISO Standards confirms that existing international standards relate to industry, technology (incl. IT), science, consumer products, foodstuffs, and documentation. There are none for history or historical data. The closest example to an historical standard there appears to be ISO 3166-3 which describes codes for old country names. However, this is only those countries that have been deleted since the main ISO 3166-1 country-code standard was first published in 1974. Anything older than that is not included. A look at ANSI’s Web site reveals a page on the history of standards but not standards for history.

You might be about to point out the MARC (MAchine-Readable Cataloging) standards. However, this set of digital formats, developed at the US Library of Congress during the 1960s, is for the description of items catalogued by libraries. Similarly, the METS standard (Metadata Encoding and Transmission Standard) relates to the encoding of descriptive, administrative, and structural metadata for objects within a digital library. These standards are both inward-facing and relate to the cataloguing and organisation within archives and digital libraries. The Open Archives Initiative is in a similar vein as it relates to interoperability between those archives and digital libraries.

OK, so what exactly am I looking for? Well, international standards for the unambiguous representation and exchange of data relating to historical entities by software agents. Not just for items held in a repository.

In all honesty, the only standard along these lines that I’m aware of is the Unicode character set standard. Version 2.0, which was released in 1996, included a multi-word mechanism, otherwise known as surrogate pairs, to remove the restriction of 16-bit character codes (i.e. the limitation of 64k characters). This allowed it to represent characters from historical languages, including Egyptian Hieroglyphs, and this is all now part of ISO/IEC 10646.
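For anyone curious about the mechanics, the following Python sketch shows how a code point beyond the 16-bit range is split into a surrogate pair in UTF-16. The function name `surrogate_pair` is my own, and the sample character is U+13000, which sits in the Egyptian Hieroglyphs block.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


hieroglyph = "\U00013000"
high, low = surrogate_pair(ord(hieroglyph))
print(f"{high:04X} {low:04X}")             # D80C DC00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert hieroglyph.encode("utf-16-be") == bytes([high >> 8, high & 0xFF,
                                                low >> 8, low & 0xFF])
```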

Let me briefly describe two example voids that could do with some help in this direction:

Place types

When we exchange place references, we want to know what type of place they are (see A Place for Everything). ISO 3166-1 only defines codes for present-day countries so how would we describe America before it was the United States? It would be wrong to apply the modern US tag as they’re not synonymous. Also, ISO 3166-2 defines codes for the names of the principal present-day subdivisions of the countries in ISO 3166-1 (e.g. provinces or states). This does not include old subdivisions such as Shires in the UK although they are still historically relevant. There is a similar standard to ISO 3166-2 developed independently by the European Union and called the Nomenclature of Units for Territorial Statistics (NUTS). This has the same issues with historical entities.

Calendars

The ISO 8601 standard was first published by ISO in 1988 and concerns the exchange of dates and times from the Gregorian calendar. It does not support any other calendar system. There are many other calendar systems, though, and it would be wrong to assume that every date in every calendar has a unique and unambiguous representation in the Gregorian system. It’s an issue of preserving the integrity of the evidence, and not being forced to mangle it in order to suit some modern standard. In fact, this particular case is even more important because several of those alternative calendars are used to this day – they’re not all obsolete and archaic. There’s therefore an additional cultural dimension to this.
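To make the point about integrity concrete, here is a small Python sketch that keeps a date exactly as recorded, together with a label for its calendar, and treats any Gregorian equivalent as something derived on request. The class `RecordedDate` and the calendar labels are hypothetical. The conversion uses the standard integer Julian Day Number formulas and handles only the Julian calendar with a 1 January year start, so Old Style years beginning on 25 March, and every other calendar, are deliberately left unconverted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedDate:
    """A date as written in the source; the original is never overwritten."""
    calendar: str   # hypothetical label: "julian" or "gregorian"
    year: int
    month: int
    day: int


def julian_to_jdn(year, month, day):
    """Julian Day Number of a Julian-calendar date (integer arithmetic)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """Gregorian (year, month, day) for a Julian Day Number."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - (146097 * b) // 4
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day


def gregorian_equivalent(date):
    """Derive a Gregorian view where a conversion is defined; else refuse."""
    if date.calendar == "gregorian":
        return date.year, date.month, date.day
    if date.calendar == "julian":
        return jdn_to_gregorian(julian_to_jdn(date.year, date.month, date.day))
    raise ValueError(f"no conversion defined for calendar {date.calendar!r}")


# The last Julian day in Britain before the 1752 changeover.
last_julian_day = RecordedDate("julian", 1752, 9, 2)
print(gregorian_equivalent(last_julian_day))   # (1752, 9, 13)
```

Refusing to guess for an unknown calendar is the design choice that matters here: the recorded form remains the evidence, and a Gregorian date is only offered where the mapping is actually known.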


FHISO (Family History Information Standards Organisation) was created with the goal of looking after (developing, maintaining, collaborating on) digital standards that affect genealogy. They are careful to treat the terms genealogy and family history in an equal way, although I would prefer the wider term of micro-history, especially as they have been involved in discussions with groups from that wider sphere. However, the even bigger sphere of generic history is outside of their remit. So does anyone look after that area for historians? Is there any collaboration for our common good?

There is an International Classification for Standards (ICS) with categories into which new standards can be placed. Not surprisingly, there are none for historical data. This would mean any new standard relating to genealogy, for instance, would have to be placed in a general catch-all category (e.g. 35.240.99). This might be acceptable if genealogy was an isolated case but I’ve already suggested that it is part of a much bigger category – one deserving of its own designation.

The focus on modern technology and business requirements implicitly assumes that historical data is no longer exchanged in a live fashion. It is simply “dead data” consigned to some archive or library. Clearly this is not the case and our genealogical pursuits are a prime example.

Another example is schemes like historical-data.org which employ microdata to attach semantics to historical data in HTML pages. The Semantic Web will involve historical and genealogical data but it must be reliant on international standards to represent historical entities.

We can’t change historical data to fit new standards – relevant standards must represent what we know about history, and as it was. They must also acknowledge that history isn’t as tangible as modern information since evidence is both finite and disjointed, and often supplemented by subjective conclusions. This should make sense to historians and genealogists, and so they have an obligation to educate and collaborate with those who would make standards for us.

Wednesday 11 September 2013

Genealogical Persona Non Grata

You may have heard the term Persona (pl. Personae) being used in a genealogical context, especially by people with a software background. What is a persona, though? Do you ignore it as a software aberration? Does it have any value?

A persona is the term used to describe the reference to some person from one specific source. There are no conclusions in a persona — only information — and so it is sometimes inaccurately referred to as an “evidence person” in order to distinguish it from the traditional “conclusion person” that we might have in a family tree. Hence, a persona is not equated with any actual person. The use of the term evidence is inaccurate here as evidence is an intangible mental construct, unlike source information. It is what we think certain source information means.[1]

In principle — I know we all work in different ways — it is possible to take a number of similar personae and group them together in order to form one or more conclusion persons that we can identify with actual people; more on this process in a moment, though. Some advocates describe a multi-tier process where this grouping occurs at different levels. For instance, personae that are obviously similar being grouped first, and then those groups being tentatively grouped themselves based on less obvious criteria. It’s interesting to note that these persona groups are not personae themselves since they are the result of some inference and conclusion, and ideally need some justification.
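The multi-tier grouping can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the classes `Persona` and `PersonaGroup`, the matching rules, and the abbreviation table are all hypothetical and far cruder than real analysis, and the sample values are borrowed from the STEMMA sample later in this post. The one feature worth noticing is that every group carries a justification, because a group is a conclusion and not a persona.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """A person reference from one source: information only, no conclusions."""
    source: str
    name: str
    age: int
    year: int   # year of the source


@dataclass
class PersonaGroup:
    """A grouping of personae; being a conclusion, it records its reasoning."""
    members: list
    justification: str


ABBREVIATIONS = {"Wm.": "William"}   # hypothetical lookup table


def birth_year(p):
    return p.year - p.age


def normalise(name):
    return " ".join(ABBREVIATIONS.get(part, part) for part in name.split())


def tier_one(personae):
    """Obvious matches: identical name and identical implied birth year."""
    groups = {}
    for p in personae:
        groups.setdefault((p.name, birth_year(p)), []).append(p)
    return [PersonaGroup(m, f"same name, implied birth year {k[1]}")
            for k, m in groups.items()]


def tier_two(groups):
    """Tentative matches: names agree once abbreviations are expanded."""
    merged = {}
    for g in groups:
        first = g.members[0]
        key = (normalise(first.name), birth_year(first))
        merged.setdefault(key, []).extend(g.members)
    return [PersonaGroup(m, f"tentative: names agree after expanding "
                            f"abbreviations, implied birth year {k[1]}")
            for k, m in merged.items()]


personae = [
    Persona("1851 census", "William Elliott", 10, 1851),
    Persona("1861 census", "William Elliott", 20, 1861),
    Persona("1862 marriage", "William Elliott", 21, 1862),
    Persona("1870 newspaper", "Wm. Elliott", 29, 1870),
]

first_pass = tier_one(personae)
second_pass = tier_two(first_pass)
print(len(first_pass), len(second_pass))   # 2 1
```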

 

This brief outline embodies the generally accepted nature of a persona. However, things start to go awry from here and opinions begin to differ. The persona is a much-debated concept in the Evidence & Conclusion model of genealogy, and many threads on the subject can be found on the BetterGEDCOM wiki, such as Do we need persona?

As a representation of the reference to a person from a single source, there cannot be much debate over the concept. However, in practice, that information is usually distilled down to a number of named properties, as in the illustration above. I’m using the STEMMA® terminology of properties here rather than “facts” or “PFACTs”, etc. STEMMA defines a Property as extracted and summarised source information, and acknowledges that they require the same support for uncertain characters, uncertain interpretations, and other anomalies as do transcriptions. Properties are valuable as a window onto the supporting information but they do not replace the raw information since they are only a digested form of it. To do that would lose the contextual parts of the information such as: what the event was, who else was there, what parts they played, and how reliable the source itself is.

The persona concept itself can be traced to a 1959 paper entitled Automatic Linkage of Vital Records[2]. Indeed, there are still those who believe that one of the primary uses of personae is in their automated combination by software. This might yield a first-pass result when many records are involved but I would be very concerned about accepting that result without putting in the real analysis expected of genealogical research. However, this is straying into the field of how software might utilise personae rather than their expressive power.

The origin of the term itself is uncertain but at the meeting that kicked off the GenTech model, in 1994, Tom Wetmore gave a talk entitled "Structured Flexibility in Genealogical Data" in which he stressed the need to record evidence data, and where he used the term persona in that context. The concept of persona exists in several data models, including GenTech and more recently GEDCOM-X.

So, is there any merit in representing personae in our data? STEMMA records Property values for a person reference, such as their name, age, and occupation, but it also wants to retain the source context of that information — the where-and-when. It does this by subdividing its Event entities into a number of SourceLnk elements, each of which is supported by a distinct source. Those SourceLnk elements may contain multiple PersonLnk elements corresponding to person references in that source and these are, therefore, similar to personae.


<Person Key='pWilliamElliott'>
    <Eventlet>
        <!-- Private event (no other persons involved) -->
        <When Value='1870-11-17'/>
        <SourceLnk Key='sEveningPost'>
            <PersonLnk>
                <Property Name='Name'>
                Wm. Elliott
                </Property>
                <Property Name='Age'> 29 </Property>
            </PersonLnk>
        </SourceLnk>
    </Eventlet>
</Person>

<!-- Multi-person events -->

<Event Key='eCensusElliott1851'>
    <SourceLnk Key='sCensusElliott1851'>
        <PersonLnk Key='pWilliamElliott'>
            <Property Name='Name'>
            William Elliott </Property>
            <Property Name='Age'> 10 </Property>
            <Property Name='Occupation'>
            Scholar </Property>
            <Property Name='BirthPlace' Key='wUttoxeter'>
            Staffordshire Uttoxeter </Property>
            <Property Name='Relationship'
            Key='pTimothyElliott'> Son </Property>
            <Property Name='Status'/>
        </PersonLnk>
    </SourceLnk>
</Event>

<Event Key='eCensusElliott1861'>
    <SourceLnk Key='sCensusElliott1861'>
        <PersonLnk Key='pWilliamElliott'>
            <Property Name='Name'>
            William Elliott </Property>
            <Property Name='Age'> 20 </Property>
            <Property Name='Occupation'>
            Labourer </Property>
            <Property Name='BirthPlace' Key='wUttoxeter'>
            Staffordshire Uttoxeter </Property>
            <Property Name='Relationship'
            Key='pTimothyElliott'> Son </Property>
            <Property Name='Status'>
            Unmarried </Property>
        </PersonLnk>
    </SourceLnk>
</Event>

<Event Key='eMarriageElliott1862'>
    <SourceLnk Key='sMarriageElliott1862'>
        <PersonLnk Key='pWilliamElliott'>
            <Property Name='Name'>
            William Elliott </Property>
            <Property Name='Age'> 21 </Property>
            <Property Name='Occupation'>
            Hammersman </Property>
            <Property Name='ResidencePlace'
            Key='wVictoriaStreet'> Victoria Street Derby
            </Property>
            <Property Name='Role'> Groom </Property>
            <Property Name='Status'>
            Unmarried </Property>
        </PersonLnk>
    </SourceLnk>
</Event>


The PersonLnk elements representing the subject references are assembled from the discrete Property values derived from the supporting source. When the Properties describe relationships for the subjects then they can also be represented, and may be inter-person relationships (such as wife-of or wife-of-brother-of), membership of some group, or ones relative to referenced places. Putting this information into Events allows the information to be presented by time (i.e. a timeline), or geography, or both. The Property values for the Event itself, such as the dates or place, may also be specified in the SourceLnk element as Event properties.
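As a hedged illustration of how software might read this structure, the Python sketch below collects the Property values for each person reference, keeping the Event and source they came from. It uses only the standard library, and it is not STEMMA tooling: the function name `properties_by_person` is mine, the fragment is a trimmed copy of the sample above, and the `<Wrapper>` element is a hypothetical addition so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# Trimmed from the sample above; <Wrapper> is NOT a STEMMA element.
FRAGMENT = """
<Wrapper>
  <Event Key='eCensusElliott1851'>
    <SourceLnk Key='sCensusElliott1851'>
      <PersonLnk Key='pWilliamElliott'>
        <Property Name='Age'> 10 </Property>
        <Property Name='Occupation'> Scholar </Property>
      </PersonLnk>
    </SourceLnk>
  </Event>
  <Event Key='eMarriageElliott1862'>
    <SourceLnk Key='sMarriageElliott1862'>
      <PersonLnk Key='pWilliamElliott'>
        <Property Name='Age'> 21 </Property>
        <Property Name='Occupation'> Hammersman </Property>
        <Property Name='Role'> Groom </Property>
      </PersonLnk>
    </SourceLnk>
  </Event>
</Wrapper>
"""


def properties_by_person(xml_text):
    """Map each person Key to a list of (event, source, properties) tuples."""
    root = ET.fromstring(xml_text)
    result = {}
    for event in root.iter("Event"):
        for source in event.iter("SourceLnk"):
            for person in source.iter("PersonLnk"):
                props = {p.get("Name"): (p.text or "").strip()
                         for p in person.iter("Property")}
                result.setdefault(person.get("Key"), []).append(
                    (event.get("Key"), source.get("Key"), props))
    return result


for event_key, source_key, props in properties_by_person(FRAGMENT)["pWilliamElliott"]:
    print(event_key, source_key, props)
```

Each set of values stays attached to its own source, which is the point of the SourceLnk subdivision. Ordering the results as a timeline would additionally need the Event's own date properties, which the trimmed fragment omits.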

OK, so why don’t I describe these sets of Property values as personae and use them as such? For a start, the interpretation and summarisation of these items constitutes a level of inference, and so they are one level removed from the persona concept. STEMMA also generalised the concept so that there are equivalents for all of its subject references, including places, groups, and animals. Furthermore, as of STEMMA V4.0, there is a much closer concept that has true value for research and analysis purposes. Its Source entity allows references to subjects (such as persons), and to dates and other important details or phrases, to be marked, collected, and built into a network for a graphic analyser. This allows those references to be analysed in terms of other context from the source information, and for similar references — in either a single source or across multiple sources — to be assembled into multi-tier persona-like entities.

In summary I believe the concept of personae has merit in micro-history data, but without the contextual information that surrounded those references in their respective sources then they cannot be used for research purposes. Similarly, STEMMA’s sets of Property values are merely an extracted and summarised form of information from a source and are not designed for deep analysis. Conversely, its Source entity embraces references to more subjects than merely persons, and to any information that the researcher feels will be important to their historical analysis. This is not mandating a given research methodology — which is a basic premise of STEMMA — but it does provide support for a genuine approach to handling complex evidence.


** Post updated on 22 Nov 2015 to align with the changes in STEMMA V4.0 **


[1] “QuickLesson 13: Classes of Evidence—Direct, Indirect & Negative“, Evidence Explained: Historical Analysis, Citation & Source Usage (https://www.evidenceexplained.com/content/quicklesson-13-classes-evidence%E2%80%94direct-indirect-negative : accessed 10 Sep 2014).
[2] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, “Automatic Linkage of Vital Records”, Science, Vol. 130, No. 3381 (16 Oct 1959): pp. 954–959.