Showing posts with label Community. Show all posts

Thursday, 7 April 2022

What a Mess

This will be my last blog post for the foreseeable future, and probably forever. This is not a matter of free time, or of advancing years, or even of competing tasks, but of a complete disillusionment with modern genealogy.

I will continue to research in my spare time, but this will be to produce something of standing for my family and extended family to read; I have lost faith in the public world of genealogy. But let me explain in more detail because this has been on the horizon for a while, and yet previous plans to retreat have all fallen through for different reasons.

I am used to academic research, and how academic research works in other fields. The purpose of research in those other fields is to find answers — truths — and to produce a valuable collective body of work through collaboration. Virtually by definition, it is not a commercial goal.

I have previously pondered over the nature of genealogy (What is Genealogy?), and considered its difference from family history, but there is a more systemic difference that touches on collaboration, software, and commercial forces. Although genealogy has a well-respected academic side, it generally considers the internet, and digital resources in general, as only good for derivative sources such as images and transcriptions, and not for publications. This field has high standards and produces quality work in traditional publications such as books and journals, but the internet is considered inappropriate for publication due to its transient and ephemeral nature.

If we look to commercial genealogy then we see two quite different worlds: that of the generation of derivative sources that people can search, and that of online trees. Other than improving the search tools and options, I have no real criticisms of the many digitisation and transcription projects, but of online trees I have many. In fact I have written so many articles on this subject that I won't even begin to enumerate links to them. Irrespective of whether we are considering "unified trees" or "user-owned trees", there are fundamental issues with their structure and the process by which they are generated.

In terms of structure, a tree is appropriate for representing biological lineage, but dreadful for representing history — can you imagine a family tree attempting to detail, say, the events of WWII? But non-biological lineage, such as fostering and adoption, or even weaker associations between people, break this visualisation and can result in a cat's cradle of complexity and confusion. A tree is also limiting in terms of proof arguments (particularly if they reference multiple individuals, families or generations), citations that refer to actual claims (as opposed to simple hyperlinks saying where you got your information), and linking to external resources (images or document scans) that are not specific to single individuals.

But worse than this is the process by which we are expected to construct such trees. We are all probably aware now of the variable quality of trees — although I still find it vexing when I see 'trees are not a valid source' (it depends on the claim) — and that trees can persist online long after someone may have dabbled for a few months using a free trial or a subscription birthday present. There is no responsibility taken by the respective companies for the accuracy of what their subscribers publish, and they appear to be uninterested in why academia looks down on these published works. It is impractical for these companies to fact-check stuff, and so I am not suggesting that is the solution, but they do not acknowledge, publicly, that the simple paradigm of building trees directly from their raw digitised records is naive (despite their advertising). There are many difficult cases of family reconstruction that require effort — possibly an enormous amount of effort — to get around missing information, ambiguous information, or even deliberately obfuscated information, and so make a case for what really happened in the past.

But two experienced researchers might reach different conclusions, both of which appear to fit available information, and so how should that be dealt with? Well, the red mist and edit wars commonly associated with "unified trees" are not the answer. If left to software people then they might suggest transactional get and commit operations, analogous to those in software source-control systems. If you don't know what these are then it's probably best not to ask; they're complicated, generally with horrible user interfaces, and even get software people into trouble.

Well, why don't these companies look at how collaborative research works everywhere else? I can't believe that they're ignorant of it, and so I can only assume that they fear it would be too complicated for their subscribers, or that it would cost them money, or even that it's just a huge step into the unknown and they don't want to kill their cash cow.

Collaborative research elsewhere is not a linear, one-step 'raw data leading to final conclusions'; it is stepwise, building on prior work by other researchers. Researchers can then look at the work of others and build from it (or refute it). This means real written work, with real citations, is a starting point, as claims have to be justified not just by pointing to data that appears to confirm them, but by explaining why, and why not something else.

OK, so not everyone will be able or willing to produce such written work, but there are people who do, and regularly do so: bloggers. I have already made a case that online genealogy companies could take advantage of this in a way that requires minimal investment, would not run into copyright or attribution problems, and would increase traffic to the respective blogs — surely, a win-win (Blogs as Genealogical Sources). Briefly summarised, the author of a blog article would give permission to the genealogical company to list the corresponding URL in one of their databases, and would provide meta-data to ensure that it showed up in the results of appropriate searches. The genealogy company would store such information in a database of so-called authored works (i.e. the URL, name of author, article title, and meta-data), but would not copy the body of the works. When these works showed up in a genealogical search, the end-user would click on one of them in order to be directed to the original blog article.
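The scheme described above amounts to a very small catalogue: a record per article, holding only the URL, author, title, and meta-data, with searches matched against the meta-data alone. A minimal sketch of this idea in Python follows; the record fields are taken from the description above, but the class and function names, and the sample article, are purely hypothetical.

```python
from dataclasses import dataclass, field

# A hypothetical record in a provider's "authored works" database.
# Only the URL and descriptive meta-data are stored — never the body
# of the article — which avoids copyright and attribution problems.
@dataclass
class AuthoredWork:
    url: str
    author: str
    title: str
    metadata: set = field(default_factory=set)  # e.g. surnames, places, years

def search_authored_works(works, query_terms):
    """Return works whose meta-data contains every supplied search term."""
    terms = {t.lower() for t in query_terms}
    return [w for w in works if terms <= {m.lower() for m in w.metadata}]

# Example: a blogger registers an article; a search then surfaces its URL,
# and the end-user clicks through to the original blog post.
catalogue = [
    AuthoredWork(
        url="https://example.blogspot.com/2021/01/bloggs-of-nottingham.html",
        author="A. Blogger",
        title="The Bloggs Family of Nottingham",
        metadata={"Bloggs", "Nottingham", "1881"},
    ),
]
hits = search_authored_works(catalogue, ["bloggs", "nottingham"])
print([w.url for w in hits])
```

Note that the search result is only ever a pointer back to the blog, keeping the authored work itself under the blogger's control.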

Yes, there would be some smaller issues, such as rating these works, or citing them, and so on, but it's academic as there has been no subsequent engagement — Zero, with a capital Z — by any of the companies, including the ones I approached directly.

Modern genealogists rely on the search functions within these online companies, and possibly on Google (although woe betide anyone who has to research a surname such as 'covid'), but they would be less likely to find relevant printed books or journal articles. This sort of scheme could even be extended to cover non-internet sources, but there is yet another possibility, one that flies in the face of the view that research has to be written up in paper-based journals.

People who have researched in other fields may be aware of sites such as arXiv.org (the 'X' actually represents the Greek letter chi, and so the site name is pronounced "archive"). These contain online articles, submitted online, and viewed online. They are much more accessible and searchable than the old paper-based journals, and it is entirely possible that this could be done for genealogical research, but it would take the initiative away from any forward-looking genealogy company. Does that matter to genealogists? Probably not, as there are many searchable resources that do not fall under their control. Would it contribute to accuracy and a truly collaborative approach in modern genealogy?

I wish I could be optimistic here, but I'm not!

Monday, 2 August 2021

Is Pinterest a Valid Source?


At the end of 2019, I made a case for online trees being a valid source, although with some caveats. I recently thought about a similar case for Pinterest, but the situation is not the same there so I wanted to dig over the main issues.


Like many people, I started a Pinterest account when it first appeared, and then lost interest when I realised that it was all smoke and mirrors (or images thereof), that my feed could not be tailored to deliver what I really wanted to see, and when I got deluged by unwanted advertising. In fact, I have just closed my account as it had no value for me.

Pinterest has been criticised for many reasons, including some content being pornographic or obscene, being overtly political, hosting commercial scams, spreading misinformation (especially medical), or focusing on people's eating disorders or weight problems. But what was it supposed to be?

According to Wikipedia, Pinterest is an "American image sharing and social media service designed to enable saving and discovery of information ... and ideas", but the reality is much more mundane. It is now basically an image sharing site with no obvious purpose. You see, images were supposed to be just a taster that encouraged people to pin them and click on them to get information; an image on its own — with no caption, link, or accompanying information — is a dead-end.

Let me pick a specific case, one that initially encouraged me to look at Pinterest: images of old places. I love to see historical pictures of my home town, but on Pinterest they invariably carry no details, and no caption. If I want to search for an image of a particular place then I cannot — the search bar simply finds boards of that name shared by other users. If I happen upon a rare or interesting image there then I might, if I'm lucky, recognise the place, but what about the date, or the photographer, or the story behind the picture?

If I were doing this as part of a research project then I would have a deeper problem: provenance. Where did the image come from? Who took the original, and is it still in copyright? Pinterest does have a mechanism for a copyright holder to get material taken down, but this is fighting against the tide because the site already makes it so easy for people to share anything and everything that they might find, online or otherwise. At the very least, it should have implemented a mechanism identifying the initial point of entry of an image onto Pinterest, i.e. who first loaded it, and where from.

The situation is more complicated than this, though, and doomed to failure in the hands of people who treat it like stamp collecting. I have several images in my blog posts that I have taken pains to get permission to display from the copyright holders, and I had shared those same posts via Pinterest using those images. And yet I have found these same images in isolation on other people's Pinterest boards. That is, pinned images, divorced from my blog, without the associated information, and without any provenance or attribution. The images had been appropriated to sit in someone's "gallery" of images that they like, but that serves no purpose beyond the private pleasure of such hoarders.

So, if Google turns up some image during your research that resides on Pinterest, what do you do? Would there be any point in citing it at all, in the way you might for other social media? Google does provide a search-by-image mechanism through which you might be able to identify a non-Pinterest copy — ideally being older and with more details — but then a Google search could equally have found that, so what purpose does Pinterest serve? As a means of sharing, it is naively structured and simply exacerbates sharing issues already present on the Internet. But as a source of information that is worth reading and citing then it is a non-starter.

Wednesday, 18 December 2019

Another Tree Can Be a Valid Source


I’m just taking a short break from my work to write about “valid sources”. I was prompted to do this after reading an article on the Family History Daily website entitled “Another Person’s Family Tree is Not a Valid Source”, posted approximately March 2018. The article is anonymous but Melanie Mayo-Laakso is the website’s founder and editor.

The thrust of the article is straightforward, and is not challenged here: that information from someone else’s tree is very likely to be inaccurate, and that you should at least verify the information in more reliable records before adding it to your own tree. This is particularly important since providers of online family trees make it oh-so-easy to copy information into your own tree, whether accurate and relevant, or not. Quoting from that article,

The issue arises from the fact that many people don’t view the information contained in a family tree any differently than they do the data found in a record source. When they are presented with individuals from a tree that appear to match their needs they see the data as existing research and very often copy the information without a thought.

The challenge presented here concerns the nature of a ‘source’, and how online family trees have distorted this in the minds of their users. Furthermore, it is to explain that family trees are “valid sources”, and that the difference is primarily one of reliability.

First, let’s dispel some related myths:

  • A "source" is simply a source of information that you have used in some research, and not specifically information that you've followed blindly, or even that you agree with.
  • Genealogy is not just about discrete bits of information: the so-called “facts”.
  • No source is guaranteed to be factual, and all sources must be assessed with a critical eye, some more than others.
  • Many answers will never be found directly in a single source.

Why are the associated myths relevant? Well, these points suggest that there is more, in real research, than collecting discrete “facts”. Sometimes, you need to make a case that involves looking at multiple sources, and ones that may contain conflicting information. Writing up this type of inferential genealogy is what makes the difference between information (just something a source says) and evidence (something that substantiates, or refutes, a claim you have made). NB: This is not just something that professional or academic genealogists do, but people in other fields of research as well, although their terminology may differ.

Now the problem with online trees is that they circumvent this sequence, and subscribers are led to believe that “sources” yield discrete reliable "facts", and anything that doesn't yield such cannot be a "source". These trees can easily make a connection between such a discrete “fact” and some database entry, but that says nothing more than where the information came from. Very few trees — in fact, I have never seen one — include any type of narrative explaining why a cited database entry (or image) is in any way relevant, let alone analysing multiple sources to derive a considered conclusion when there are no direct answers.

Sources may be original or derivative, where a derivative may be close (e.g. a facsimile or a scan) or distant (e.g. transcribed or translated), and so at best an online tree must be considered a derivative form that compiles information from other sources. They are no less a source than any of the other derivative sources already offered by your genealogy provider, even though their accuracy may well be poorer. But no source is guaranteed to be accurate, whether it’s a database, an online image, or even a stamped birth certificate directly from the relevant government office.

Note that a source may also be an ‘authored work’, which is a form that looks at information from several other sources and, rather than simply compiling it, analyses the information to derive specific conclusions. The nature of these works means that they have to consider all types of source, whether original or derivative, whether reliable or sloppy, whether agreeing or conflicting, whether primary or secondary information, whether official or private information, and even including authored works by other writers. To date, none of the genealogy providers have got their heads around this concept, and how it works in the rest of the research world (cf. “Research in Online Trees”), but the principles stand.

So, to summarise: in writing up your research, you can utilise whatever sources of information are relevant to your argument, as long as you evaluate them with the appropriate critical eye.

Wednesday, 15 November 2017

Thither FHISO



I want to temporarily break my blogging hiatus to summarise the progress of everyone’s genealogical friend: FHISO.

There can’t seriously be anyone in the community who hasn’t heard of FHISO (Family History Information Standards Organisation), but how many of you might have written it off? If so then look again! I want to raise awareness of its recent substantial progress, and to challenge pundits to evaluate its relevance.


History

To put this article in context, let’s just wind things back to 2010. Pat Richley-Erickson (alias DearMYRTLE), Greg Lamberson, and Russ Worthington had become so fed up with the problems of sharing basic genealogical data that they created the BetterGEDCOM wiki: its goal being to produce an internationally applicable standard for the sharing and long-term storage of genealogical data.

Although this wiki garnered huge support — 134 members within its first year, contributing over 3,000 pages and 8,500 discussion posts — no actual standard emerged. The reasons for this were manifold: there was no real structure or assigned responsibilities in the membership, the goal was too poorly defined (what genealogical scope? what level of backwards compatibility?), and there was no technical strategy (what technologies? what file formats?). As a result, discussions — valuable as they were — became mired in minutiae, no consensus was reached, and nothing was formally written up.

Early in 2012, a small group of BetterGEDCOM members formed FHISO with the goal of overcoming these failings. They spent considerable effort designing an organisation that would accommodate a large and diverse number of contributors, and in planning for consensus-building and digital organisation.

In April 2012, it received a grant from genealogist Megan Smolenyak to help get the organisation started, and during the remainder of 2012 it pulled off an incredible coup by getting industry support from the following high-profile Founding Members (in chronological order of announcement):


During the summer of 2012, there were a number of blog-posts related to GEDCOM-X and to FHISO, including those of Louis Kessler (Whither GEDCOM-X?, 7 Jun 2012), Randy Seaver (Whither FHISO and GEDCOM X? Observations and Commentary, 18 Jul 2012), Tamura Jones (FHISO and GEDCOM X, 18 Jul 2012), and Pat Richley-Erickson (Whose sandbox is it anyway?, 19 Jul 2012). Randy’s subsequent Follow-Up Friday (20 Jul 2012) provided a more complete summary.

These posts were mostly concerned with proprietary versus community standards. GEDCOM-X was new and people believed that it would be a competing de facto standard — exacerbated by the fact that FamilySearch weren’t in the list of Founding Members, above. This turned out to be ill-founded paranoia (more on this in a moment), but even FHISO was defensive and quietly concerned, as can be inferred from GeneJ’s contribution to Randy’s summary.

During March 2013, in order to try and keep membership attention in the absence of technical work, FHISO began its Call For Papers initiative. The idea was that people could send in their ideas and proposals in preparation for more consensus-based work. Although quite a few papers were submitted, and on a wide range of topics, the number of distinct submitters was small — possibly an unrecognised warning to FHISO that few people would find the time or inclination if the effort was onerous.

The investigations into software tools for the burgeoning organisation showed that good ones were too expensive for FHISO and the cheap (or free) ones didn’t deliver what was needed, and this work dragged on for too long. Over the years, Board members have had to reach deep into their personal pockets to help move the organisation to the point where it would be fair for members to subscribe for another year (original memberships have been continually extended, for free, since August 2014).

During 2013–14, there were a number of team changes, and it would be a fair criticism to say that FHISO dropped the ball during these changes. There was little visible activity to people outside of FHISO (as explained by Tamura: Genealogy 2013: events & trends, 31 Dec 2013) and so it was to be expected that the community would lose interest in it.

During 2014, FHISO finally established its TSC (Technical Standing Committee), and so began the real technical work. This included the creation of the TSC-Public mailing list, and the creation of several exploratory groups, each of which had its own mailing list. However, the mailing lists demonstrated the same issues that were previously experienced in BetterGEDCOM: topics digressed and discussions meandered without formal conclusions. Such discussions must necessarily resort to a bewildering and ever-evolving technical vocabulary, and software people generally find it hard to explain their concepts in familiar terms without losing technical accuracy. It was truly amazing, therefore, that some well-known non-software genealogists participated, and I genuinely take my hat off to those that succeeded in balancing the discussions with real-world genealogical issues.

Quandary

When the flurry of posts on these mailing lists began to fizzle out, and the exploratory groups all floundered, FHISO took a deep breath and a long look at the reality of standards development. If it was going to achieve its goals then it needed to better-understand why other initiatives had failed, and it clearly needed to adopt a quite different approach.

It became evident that the hardest part of standards development was not the technical side but the commercial and/or political side. Creating a new data representation has some technical challenges, but it’s doable; there were several examples out there, ranging from the old GEDCOM to more recent data models, file formats, database schemas, and APIs, all coming from a range of commercial products and private research projects. But despite its age and deprecated status, GEDCOM was still the most widely-used way of exchanging data. There were those who believed that the industry could stay with GEDCOM, and that its problems and equivocality could be fixed in a new version. Then there were those who believed that this would restrict the evolution of genealogy, and that we must leapfrog lineage-linked data to include non-person subjects of history, or to integrate real research-based narrative. In reality, none of these viewpoints were entirely correct, … or incorrect.

If a new data representation were to be produced — even just an updated GEDCOM — then it would be unlikely that the industry would immediately embrace it for the simple fact that commercial stakeholders would have a financial commitment to their current internal data models, and to their supported modes of import/export. Companies and their products would have evolved along with their internal data model; any new data model with larger scope — no matter how powerful or modern — would have no impact if it required companies to abandon their existing products and to start again. For instance, taking advantage of the powerful analogy between persons and places as historical subjects would not help companies whose import/export was entirely via GEDCOM, or whose internal data models had not recognised the analogy. Data models such as GEDCOM-X were designed around the specific requirements of the parent organisation, and not as future-proofed models to be shared by all genealogical software.

Another issue was that there were many stakeholders out there (including every user who simply wanted to exchange data without error or loss), but fewer people prepared to openly contribute on mailing lists, and fewer still who had the time and skills to produce formal written material. FHISO’s impressive Founding Members seemed content to sit on the sidelines, and there was little (if any) engagement with them following the original announcements.

This was a tough problem. There was no doubt that the industry needed not just an open standard but an evolutionary path: one that would permit ‘software genealogy’ to mature, and to become part of the modern digital world. However, there was clearly some apathy to doing the heavy lifting, and there could be later resistance to anything too radical.

If ever the term catch-22 found its true mark then it was in the field of software genealogy.

The first thing to be done was to modify the organisational structure to one more appropriate to the reach of genealogical standards — FHISO was not a general-purpose ISO or ANSI. FHISO already had the concept of an Extended Organisational Period (EOP), now embodied in article 24 of its by-laws, during which the membership would be populated (including the Founding Members), new officers appointed and roles filled, and the TSC established. The EOP also allowed the Board to amend the by-laws as necessary, and without the need for an Annual General Meeting (AGM). It may have been envisaged that the EOP would only last for maybe six months, but it was still in effect at this time of change.

From a technical perspective, FHISO needed a focus: something concrete that could be debated, cited, and built upon. It would therefore be necessary for a core of dedicated people to form a Technical Project Team (as allowed by the TSC Charter) to establish a technical strategy and to publish a selection of draft component standards in order to kick-start subsequent work.

These processes could all be done within the remit of the EOP, but it would have to keep the membership (and the public) informed; certain organisations would not acknowledge any third-party standard unless it was developed through a proper transparent process by an incorporated organisation. FHISO does publish regular Board minutes and TSC minutes, but it would only be allowed to publish draft standards for comment during this period (not official standards) since there would be no voting mechanism. Until this work had reached an acceptable level, and elections and AGMs could be resumed, then it was deemed inappropriate to require existing members to pay for each year.

During this phase, FHISO produced a technical strategy paper that expanded on these points, and a policy document on the preferred nature of software vocabularies. The vocabularies document was updated during Feb/Mar 2016 to incorporate public feedback.

Wheel Hubs

Part of FHISO’s technical strategy was to focus on what might be called component standards: standards related to specific parts of genealogical data (e.g. personal names, citation elements, place references, dates), and allow these to be integrated into existing data models. This would not preclude the future publishing of a single FHISO data model that embraced all of these components, but for the shorter term it would allow existing data models to incorporate them more quickly, while minimising the impact on their core software. The basis for this was that, within certain limits, it should be possible to have distinct proprietary data models cooperating if they shared a common currency.

FHISO had previously used the analogy of car design to explain this strategy; rather than standardise what cars we can drive, there would be benefit in standardising the parts from which all cars are made. Well, there’s a real instance of this that can be cited: all the cars around the world (with a few exotic exceptions that can be ignored) share the same wheel hub sizes. There is a standard set of accepted sizes, and they’re all measured in Imperial units — even in countries that use the metric system. This means that the same range of tyres can be used for all our cars, no matter which model or where it was manufactured.

The plan, therefore, was to work on a number of these component standards, each of which would include details of how it should be integrated into existing data models — the so-called “bindings”. But there was a problem here: GEDCOM-X could never supplant GEDCOM because there were probably millions of GEDCOM files still out there, and software products that were tied to the GEDCOM model. These problems for FamilySearch effectively mirrored those of FHISO’s standardisation effort. If those companies bound to GEDCOM were not to be left behind, there had to be a new version of it that could be taken forwards, one for which bindings could be defined for FHISO’s new component standards. The two initial data models of interest, therefore, would be GEDCOM and GEDCOM-X.

The following diagram illustrates how these component standards would be assimilated by the various data models, including a supported GEDCOM continuation (shown here as “ELF”).

FHISO Component Standards
Figure 1 – FHISO Component Standards.

FHISO ELF

GEDCOM hasn’t been updated in decades, and there are acknowledged weaknesses and ambiguities in its specification. Furthermore, the name is still the property of FamilySearch.

FHISO would, therefore, define a fully compatible format called Extended Legacy Format, or ELF for short. ELF v1.0 would be compatible with GEDCOM 5.5(.1), such that ELF could be loaded by a GEDCOM processor, and vice versa. This means not only that stakeholders could declare support for ELF v1.0 with relatively little effort, but also that there would be no reasonable excuse for not doing so.

Of all FHISO’s draft standards, ELF is probably the most important since it presents a future for GEDCOM data and software, a future that would support enhanced movement of data both between compliant products and between differing proprietary data models.

Figure 2 – FHISO ELF.[1]

As well as being a supported and more tightly-specified version of GEDCOM, ELF would include an extension mechanism that would be employed in later versions to embrace the FHISO component standards, and any third-party extensions by using proper namespaces.

During the preparation of the first ELF draft, FHISO engaged with members of the German group GEDCOM-L, which represents over twenty genealogical programs in Germany. Their goal since 2009 has been to reach agreement on the interpretation of the GEDCOM 5.5.1 specification, and to extend it to include a number of “user-defined tags”. For instance, high among their priorities was support for the German Rufname, or “appellation name”, which is an everyday form of personal name.

FHISO intends to utilise the knowledge and experience of the GEDCOM-L group in making a better GEDCOM.

Milestones and Signposts

Industry contacts had identified a citation-element vocabulary — a representation of the discrete elements of data within citations — as filling an important niche in today's standards, and so this became the focus of the first component standard.

In 2016, the Technical Project Team began to draft possible standards text, releasing an early draft micro-format for a citation-element ‘creator name’ for comment in the spring. During June 2017, FHISO was able to publish a number of high-quality draft standards for public comment, and during September it incorporated public feedback from its TSC-Public mailing list.

This milestone puts FHISO ahead of all previous standards initiatives! The level of detail and accuracy in these drafts, combined with the choice of technologies, establishes a future-proof model that could take genealogical data as far as is needed, and so it sets the bar for all future FHISO work.

The next phase will involve releasing draft citation-element bindings for both GEDCOM-X and GEDCOM. Already released is a draft bindings document for RDFa. Rather than being a genealogical data model, RDFa defines a set of attribute-level extensions to HTML. This is especially interesting as it allows pre-formatted citations to have their embedded elements marked-up. The industry norm is to first define the individual citation elements as discrete items, and then rely on some citation template system to build them into a formatted citation. Traditional genealogists, and anyone who prefers to hand-craft their own citations (including me), should welcome this inverted alternative as it recognises the power and flexibility of citations as sentences rather than formulae.
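To make the "inverted alternative" concrete, the following Python sketch wraps the discrete elements of a hand-crafted citation sentence in RDFa-style `property` attributes. The `cev:` prefix and the property names used here are purely illustrative placeholders, not FHISO's actual citation-element vocabulary.

```python
# Sketch: marking-up elements inside a pre-formatted citation sentence
# with RDFa-style property attributes. The citation remains a sentence
# written by the researcher; the mark-up merely identifies the discrete
# elements embedded within it.
def rdfa_span(prop, text):
    """Wrap one citation element in a span carrying an RDFa property."""
    return f'<span property="cev:{prop}">{text}</span>'

# A hand-crafted citation with four embedded, machine-readable elements.
citation = (
    f'{rdfa_span("creator-name", "Jane Smith")}, '
    f'"{rdfa_span("title", "Parish Registers of St Mary")}" '
    f'({rdfa_span("date", "1898")}), p. {rdfa_span("page", "42")}.'
)
print(citation)
```

The point of the design is that no citation template is needed: the sentence reads naturally to a human, while software can still extract the creator, title, date, and page from the attributes.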

Affiliation

Although currently acting chairman on the FHISO Board, I write this article as someone who has always believed that standardisation absolutely must happen in our field. It borders on hypocrisy that users are expected to collaborate on unified trees, and to play fair with each other, when the large organisations have been unable to set a precedent with their data sharing.


[Backup copy of the BetterGEDCOM wiki has recently been restored at: BetterGEDCOM Archive. The original site has been locked by wikispaces.com]




[1] Original base image used with kind permission of SuperColoring (http://www.supercoloring.com/drawing-tutorials/how-to-draw-a-christmas-elf : accessed 26 Jun 2017).