Tuesday, 1 October 2013

Collaboration Without Tears



In my previous post, Collaboration With Tears, I suggested that some element of isolationism is inherent in genealogy, and I discussed the criteria that would make collaboration practical. I now want to describe a novel approach to collaboration.

Collaboration is important, not just for users but for content providers too. It is often seen as a natural progression from records-based content, one that allows new users to get involved more quickly and provides a social connection. At the time of writing, not all sites provide collaborative tools – findmypast, for instance, does not – but that is the way of the future for these sites. However, there are many ways that collaboration can occur. Current tools focus on the creation of a shared picture of our biological lineage, aka a global family tree, but future tools may support full family history. There are also a number of other subjects under the micro-history umbrella that are currently dealt with by specialist sites, including One-Name Studies and One-Place Studies. I recently commented on a new UK site called myhomespast, which allows people to create a pictorial history of the homes they've lived in. Its creators do not currently view it in terms of micro-history or social networking, but the scope for collaborating on your old streets, and for finding your old neighbours, is huge.

At the end of my previous post, I hinted that focusing on a unified family tree is deeply flawed. The problem is that the unit of collaboration – the ‘person’ – is a conclusion that has been formed from available evidence. Some things can never be determined, though, because there’s simply no surviving evidence. In other cases, the evidence may be scant and circumstantial which necessarily means those conclusions will become more subjective. Establishing, then, that my Person-A is the same as your Person-A becomes more complex. It’s not even a matter of is-it-the-same or is-it-different since some subset of the person’s details may be substantially different.

A number of people have suggested using a different unit; one based on evidence rather than conclusions. The growing concept of a Persona has been the most common unit of these suggestions. Late last year, I imagined a model that I would dearly love to have used. However, it wasn’t focused on a family tree and so it did not exist. The more I considered it, the more I realised how easily this could be implemented, not just by some large content provider but by a small team of independent developers. As a long-standing developer, I went as far as writing myself a specification and even prototyping some code. However, I no longer have enough time to launch a new start-up. I know through experience that they not only need a good product and talented people but also the right mix of personalities – they’re hard!

The essence of the concept was of a tool for identifying people and relationships in a mass of evidence, as opposed to working top-down on a unified family tree. My own family history research incorporates far more than a mere tree and I was happy to continue work on that with some limited isolation. At the time, though, I was looking for a particular person in one of the census returns of England & Wales. They were pretty elusive and I was sure they’d changed their name. I was attempting to identify all their family members, in-laws, neighbours, etc., in order to locate them. It seemed like a task where collaboration would really have helped.

Let’s briefly stop to examine the essential differences of this approach. Rather than working with subjective conclusions, it would be working with something tangible; something whose existence is beyond contention – an entry on a census page. Irrespective of the identification of the person in that entry, and even if the name is not readable, the entry can be referenced unambiguously. That allows collaborators to make the identification of the person together, and of the relationships to other persons. Imagine being able to draw a link between two entries on a census page, or between two entries on side-by-side pages, and attach a relationship type, comments, etc.

Yes, most census returns do have a Role field but even when it is correct, it is intra-household, and sometimes even more localised as in the STEMMA® example at: Census Roles. This collaborative model would be extremely useful in identifying those “strays” (people found in unexpected places on census night), or people with misspelled or uncertain names, and those people who had deliberately tried to obfuscate their true identity. This is far more than simply adding an alternative name, birth year, or place of birth, as currently supported on ancestry.com.

OK, so why did I pick on the census of England & Wales rather than, say, the civil registrations of vital (BMD) events or parish registers? The answer was the goal of ring-fencing the specification to keep it as simple as possible, and as independent of the large data collections as possible. Each page of this census has a unique identifying code of class/piece/folio/page. This, plus the ability to reference a particular line relative to the start of its respective page, gives a convenient way to address each and every entry. A difficulty in using other sources of evidence – and ones for vital events in particular – is that they have no standardised reference codes, and so it would be quite hard to ensure that any given source doesn’t materialise in multiple independent forms. This would be less of a problem for a content provider than it would be for an independent team.
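As a minimal sketch of this addressing scheme (not the prototype's actual code), the class/piece/folio/page/line reference can be treated as a small value type. The name `CensusRef`, the field name `series` (used because `class` is a reserved word in Python), and the assumption that piece, folio, page, and line are always plain integers are all illustrative choices rather than anything taken from the specification.

```python
from typing import NamedTuple


class CensusRef(NamedTuple):
    """Natural key for one census entry: class/piece/folio/page/line."""
    series: str  # the record class, e.g. "RG11" ('class' is a Python keyword)
    piece: int
    folio: int
    page: int
    line: int    # counted from the start of the page

    @classmethod
    def parse(cls, text: str) -> "CensusRef":
        """Build a reference from a string such as 'RG11/3342/3/60/7'."""
        parts = text.strip().split("/")
        if len(parts) != 5:
            raise ValueError(f"expected class/piece/folio/page/line, got {text!r}")
        series, *numbers = parts
        # Assumption: every component after the class is a plain integer.
        return cls(series.upper(), *(int(n) for n in numbers))

    def __str__(self) -> str:
        return "/".join(str(part) for part in self)

    def page_key(self) -> str:
        """Reference to the page alone, shared by every entry on that page."""
        return "/".join(str(part) for part in self[:4])


ref = CensusRef.parse("RG11/3342/3/60/7")
print(ref, ref.page_key(), ref.line)
```

Two entries on the same page share a `page_key()` and differ only in `line`, which is what makes an entry addressable even when its name is unreadable.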

So what about the images of the census pages? This is the one item that the model really needs from an external source, and the answer is delegation. The scheme I toyed with was to summon the images onto the screen using a customisable URL that could be sent to whichever content provider you were subscribed to, thus abstracting that source. The idea is a little like the way FamilySearch delegates to its partner site findmypast for such a census image. In their case, the URL has a private format and appears to use some internal image identification, although a mapping from the public reference code obviously exists somewhere. For instance:


After creating a link between two census persons, I imagined being able to add a relationship type (e.g. Father-of) and some justifying text. I also imagined another user adding a different relationship to the same link, and then people being able to compare the cases we were making and either up-vote or down-vote our interpretations.
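The idea of competing interpretations on one link can be sketched as a small in-memory model. This is only an illustration under assumptions: the class names `Link` and `Interpretation`, the user names, the placeholder entry keys, and the "Uncle-of" type are hypothetical, while the rules it applies (one set of details per user per link, one vote per user, no voting on your own details) are the ones stated in the technical notes further down.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """One user's reading of a link: a relationship type plus justifying text."""
    owner: str
    relationship: str
    justification: str
    votes: dict[str, int] = field(default_factory=dict)  # voter -> +1 or -1

    def vote(self, voter: str, value: int) -> None:
        if value not in (1, -1):
            raise ValueError("a vote must be +1 or -1")
        if voter == self.owner:
            raise PermissionError("users cannot vote on their own interpretation")
        if voter in self.votes:
            raise ValueError("each user may vote only once per interpretation")
        self.votes[voter] = value

    @property
    def score(self) -> int:
        return sum(self.votes.values())


@dataclass
class Link:
    """A link between two census entries, carrying competing interpretations."""
    entry_a: str
    entry_b: str
    interpretations: dict[str, Interpretation] = field(default_factory=dict)

    def interpret(self, owner: str, relationship: str, justification: str) -> Interpretation:
        if owner in self.interpretations:
            raise ValueError("one interpretation per user per link")
        item = Interpretation(owner, relationship, justification)
        self.interpretations[owner] = item
        return item

    def ranked(self) -> list[Interpretation]:
        """Interpretations ordered from highest to lowest net vote."""
        return sorted(self.interpretations.values(), key=lambda i: i.score, reverse=True)


# Placeholder entries and users, purely for illustration.
link = Link("entry-A", "entry-B")
first = link.interpret("user1", "Father-of", "Ages and birthplaces are consistent.")
second = link.interpret("user2", "Uncle-of", "The age gap looks too small for a father.")
first.vote("user2", -1)
second.vote("user3", +1)
print([(i.owner, i.relationship, i.score) for i in link.ranked()])
```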

The following illustration concerns two Nottingham families in the 1881 census of England & Wales. I am identifying the adult members of each household using their natural keys of class/piece/folio/page/line:

John Knowles                      RG11/3342/3/60/6
Eliza Knowles                      RG11/3342/3/60/7
Eliza Barker                         RG11/3342/3/60/10

John Webber                       RG11/3358/139/16/2
Elizabeth A. Webber            RG11/3358/139/16/3
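Keys like these are also enough to drive the customisable image URL described earlier. The sketch below assumes a per-user template string with named placeholders; the template, the `images.example.org` host, and the function name `image_url` are hypothetical and do not correspond to any real provider's URL format.

```python
# Hypothetical template a user might configure for their own content provider.
TEMPLATE = "https://images.example.org/census/{series}/{piece}/{folio}/{page}"


def image_url(template: str, ref: str) -> str:
    """Fill a provider template from a class/piece/folio/page/line key.

    The line number is dropped because an image shows a whole page.
    """
    series, piece, folio, page, _line = ref.split("/")
    return template.format(series=series, piece=piece, folio=folio, page=page)


print(image_url(TEMPLATE, "RG11/3342/3/60/6"))
print(image_url(TEMPLATE, "RG11/3358/139/16/2"))
```

Entries on the same page resolve to the same image, so only the template needs to change when a user switches provider.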

The links, voting, and associated notes in this illustration revolve around the identification of Eliza Knowles. Two links each give justifications for their differing identification, but one receives a down-vote with an explanation as to why it must be wrong.





So, if you’ve gotten this far and you’re still awake then you’re probably thinking ‘OK, that sounds a nice idea but isn’t it a distraction from building your family tree?’ Apart from it being a genuinely workable method of collaboration, and one which could be implemented independently, or by a content provider, or even by a partnership of content providers (there’s a thought!), it also has hidden potential. During my prototyping, it didn’t take long before I realised that I could turn my data inside-out. That is, I could use the links that describe direct biological relationships to generate a family tree for all or part of my complete data. I remember thinking ‘Heh! Wow!’ because that family tree would also include any information on non-biological relationships, any per-user comments, implicit citations to each census, and any explicit citations for other sources provided by the users. More interestingly, it would also support alternative depictions of the tree that could be rated against each other. In other words, a very rich tapestry!
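Turning the data inside-out can be sketched in a few lines, assuming each link has already been reduced to a (person, relationship type, person) triple. The type names, the `P1`…`P5` placeholders, and the functions `build_tree` and `render` are hypothetical; the prototype's own representation is not shown here.

```python
from collections import defaultdict

# Assumed set of direct biological types, already normalised to point parent -> child.
BIOLOGICAL_PARENT_TYPES = {"father-of", "mother-of"}


def build_tree(links):
    """Map each parent to their children, ignoring non-biological links."""
    children = defaultdict(list)
    for parent, relationship, child in links:
        if relationship in BIOLOGICAL_PARENT_TYPES:
            children[parent].append(child)
    return children


def render(children, root, depth=0):
    """Return an indented descendant listing starting at `root`."""
    lines = ["  " * depth + root]
    for child in sorted(children.get(root, [])):
        lines.extend(render(children, child, depth + 1))
    return lines


links = [
    ("P1", "father-of", "P2"),
    ("P1", "father-of", "P3"),
    ("P2", "mother-of", "P4"),
    ("P5", "neighbour-of", "P1"),  # kept in the data, but not part of the tree
]
print("\n".join(render(build_tree(links), "P1")))
```

The non-biological link is skipped when building the tree but remains available to annotate it, along with any notes and citations attached to the underlying entries.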


Technical Notes

I wouldn’t bother reading further unless you’re particularly interested in the technical details of the prototype. This section is just a summary of implementation details that might answer some burning questions.

  • A link between any two given persons is unique in the database. Users can add details to the link such as a relationship type but each link has its own unique ID.
  • A set of biological relationship types must be predefined and used for validation purposes. This can be supplemented by non-biological ones such as adoptive parents or step-siblings.
  • Each person (i.e. entry on a census page) has its own ‘natural key’ that happens to be unique. Users can add details such as notes to a person.
  • Some relationship types, such as father-of and son-of, are inverses of each other: they describe the same link from opposite ends. These would be normalised to reflect a preferred direction and reduce duplication.
  • Biological relationship types would have to be checked, say overnight, to ensure that they constitute a Directed Acyclic Graph (i.e. no loops).
  • A user can only vote once per link details, or per person details, but not for any owned by themselves.
  • The birth name cannot always be identified from a census and so this, and the name as-written, both need a distinct field in the database tables.
  • When incorporating multiple census returns (i.e. from different years), the links can no longer be direct (person-to-person) since a person may not have been identified in every relevant census return. Their per-census identities effectively form a SET and would be handled by different tables.
  • The notes were plain-text in the prototype. Mark-up similar to STEMMA’s structured narrative would be ideal since it could represent citations to other sources, URLs, attachments such as images, and references to other persons on a census page.
  • The prototype did not consider the lifecycle events such as notifying users when voting changes on their link/person details, or notifying a voting user when the associated details have been revised.
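Two of the notes above lend themselves to a short sketch: normalising relationship types that mirror each other to one preferred direction, and checking that the biological links form a Directed Acyclic Graph. The `child-of`/`parent-of` pairing and the function names are assumptions for illustration, and the cycle check uses Kahn's topological-sort algorithm, which is one common technique and not necessarily what the prototype used.

```python
from collections import defaultdict, deque

# Hypothetical mapping from a non-preferred type to its preferred inverse.
INVERSES = {"child-of": "parent-of"}


def normalise(p1, relationship, p2):
    """Rewrite a link so that it always uses the preferred direction."""
    if relationship in INVERSES:
        return (p2, INVERSES[relationship], p1)
    return (p1, relationship, p2)


def is_acyclic(edges):
    """Return True if the (parent, child) edges contain no cycle (Kahn's algorithm)."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        if child not in successors[parent]:
            successors[parent].add(child)
            indegree[child] += 1
    # Start from people with no recorded parent and peel the graph layer by layer.
    ready = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in successors[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Any node never reached must sit on a cycle.
    return visited == len(nodes)


raw = [("B", "child-of", "A"), ("B", "parent-of", "C")]
edges = [(p1, p2) for p1, _rel, p2 in (normalise(*link) for link in raw)]
print(edges, is_acyclic(edges))
```

A link claiming that someone is their own ancestor, directly or through several generations, makes `is_acyclic` return False, which is the condition an overnight check would need to flag.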

Database Schema

This is another technical section presenting a partial database schema in order to explain the entity relationships more clearly. The columns use the following key to their properties:

Auto    Auto-generated ID
PK       Primary key
FK       Foreign Key
NK      Natural Key
U[n]     Uniqueness constraint on one-or-more columns
Null     Column is nullable

LINK
Defines person-to-person links in same census. Only one link exists for any two persons

Column     Properties  Description
LinkId     Auto,PK     ID for this link
P1Id       NK,U1       Key for first person
P2Id       NK,U1       Key for second person

LINK_DETAILS
Provides per-user details (e.g. type, notes) on any given link. Each user can only add one set of details per link.

Column     Properties  Description
LinkDetId  Auto,PK     ID for these link details
LinkId     FK,U1       Reference to a specific link
UserId     FK,U1       User adding details
Notes      FK,Null     ID for any textual notes
Type                   Link type (relationship)

PERSON_DETAILS
Provides per-user details (e.g. notes) on any given person. Each user can only add one set of details per person.

Column     Properties  Description
PerDetId   Auto,PK     ID for these person details
PId        NK,U1       Key for person
UserId     FK,U1       User adding details
Notes      FK          ID for textual notes

LINK_VOTE
Voting on user details for specific link. Each user can only vote once per link details, but not for their own.

Column     Properties  Description
LinkDetId  FK,U1       ID for link-details being voted on
UserId     FK,U1       User doing the voting
Vote                   +1/-1 vote
Notes      FK,Null     ID for any textual notes

PERSON_VOTE
Voting on user details for a specific person. Each user can only vote once per person details, but not for their own.

Column     Properties  Description
PerDetId   FK,U1       ID for the person-details being voted on
UserId     FK,U1       User doing the voting
Vote                   +1/-1 vote
Notes      FK,Null     ID for any textual notes
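To show how the uniqueness constraints behave, here is a hedged sketch of three of these tables in SQLite via Python's standard `sqlite3` module. Several details are assumptions rather than part of the schema above: the column types, the `CHECK (P1Id < P2Id)` rule (one possible way to stop the same pair being stored in both orders), and the omission of the user and notes tables, which the schema references but does not show. The rule against voting for your own details is not expressed here and would need application code or a trigger.

```python
import sqlite3

SCHEMA = """
CREATE TABLE LINK (
    LinkId INTEGER PRIMARY KEY AUTOINCREMENT,
    P1Id   TEXT NOT NULL,              -- natural key of first person
    P2Id   TEXT NOT NULL,              -- natural key of second person
    UNIQUE (P1Id, P2Id),               -- U1: one link per pair of persons
    CHECK  (P1Id < P2Id)               -- assumption: store each pair in one order only
);
CREATE TABLE LINK_DETAILS (
    LinkDetId INTEGER PRIMARY KEY AUTOINCREMENT,
    LinkId    INTEGER NOT NULL REFERENCES LINK(LinkId),
    UserId    INTEGER NOT NULL,        -- user table not shown in the schema
    Notes     INTEGER,                 -- nullable; notes table not shown
    Type      TEXT NOT NULL,
    UNIQUE (LinkId, UserId)            -- U1: one set of details per user per link
);
CREATE TABLE LINK_VOTE (
    LinkDetId INTEGER NOT NULL REFERENCES LINK_DETAILS(LinkDetId),
    UserId    INTEGER NOT NULL,
    Vote      INTEGER NOT NULL CHECK (Vote IN (1, -1)),
    Notes     INTEGER,
    UNIQUE (LinkDetId, UserId)         -- U1: one vote per user per link details
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    db.executescript(SCHEMA)
    return db


db = open_db()
# Placeholder keys, purely for illustration.
link_id = db.execute(
    "INSERT INTO LINK (P1Id, P2Id) VALUES (?, ?)", ("X/1/1/1/1", "X/1/1/1/2")
).lastrowid
db.execute(
    "INSERT INTO LINK_DETAILS (LinkId, UserId, Type) VALUES (?, ?, ?)",
    (link_id, 1, "Father-of"),
)
try:
    # A second set of details from the same user on the same link is rejected.
    db.execute(
        "INSERT INTO LINK_DETAILS (LinkId, UserId, Type) VALUES (?, ?, ?)",
        (link_id, 1, "Uncle-of"),
    )
except sqlite3.IntegrityError as error:
    print("rejected:", error)
```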