Sunday 20 October 2013

Do Genealogists Really Need a Database?

If we have a software product on our computer that looks after our family history data then we expect it to have some type of database too. Why is that? The question is not as silly as it sounds. The perceived requirement is largely a hangover from the past that introduces incompatible proprietary representations and limits data longevity.

This is a bold statement but I will guide you through the rationale for it. This will require me to explain some technical details of databases and computers which I hope you’ll find useful. Those people who are already aware of these details can just skip over those paragraphs (but hopefully not the whole blog!).

A database is an organised collection of data, which basically means it is indexed to support efficient access to it. Databases are mostly disk-resident (more on this later) but there are various types, including relational, object-orientated, and multi-dimensional (or OLAP). The majority of modern databases are relational ones that use the SQL query language, but there is growing trend to use other forms called NoSQL such as ones based on key-value stores.

The reason for NoSQL is that SQL databases were designed to support high levels of consistency and reliability in their data operations. For instance, ensuring that complex financial transactions either complete fully or not at all — the last thing you want is for your money to disappear down a hole, or end up in two different accounts. The term for this integrity feature is ACID and it imposes a large performance penalty. This all makes sense for large commercial databases that process concurrent transactions in real-time but it is extraneous for your personal family history database.

If and when we get a modern standard for representing family history data then it will primarily describe a data model — a logical representation of the data. The actual physical representation in a file depends on which syntax is adopted, and there may be several of those. This is presented in more detail at: Commercial Realities of Data Standards. However, no standard can, or should, mandate a particular database or a particular database schema. For a start, no two SQL databases behave the same — despite their being a SQL standard. The choice of how a database schema is organised and indexed depends as much on the design choices of the vendor and the capabilities of their product as it does on a data-model specification. Effectively, a database cannot form a definitive copy of your data since you cannot transport the content directly to the database of another product, and if your product becomes defunct then you could be left with an unreadable representation. This is important if you want to bequeath the fruits of your research to an archive, or to your surviving family.

The goal of organisations like FHISO, and BetterGEDCOM before it, is to define a data model for the exchange and long-term preservation of genealogical and family-history data. This means that whatever the vendor’s database looks like, they must support an additional format for sharing and preservation. Unfortunately, the old GEDCOM format is woefully inadequate for this purpose and there is no accepted replacement that fits this bill. If such a representation existed then why couldn’t we work directly from it? The idea isn’t that new but there are technical arguments that are always levelled against it, and I will come to these soon. If the representation were covered by an accepted international standard then the problems of sharing and longevity disappear. It also opens up your data for access by other, niche software components — say for a special type of charting or analysis. This isn’t possible if it’s all hidden in some opaque proprietary database. It also means there’s less chance of an unrecoverable corruption because the internal complexities of the database do not apply.

Let’s pick on a controversial representation: XML. This is an international standard that is here to stay[1], and that means that any standards organisation must at least define an XML representation of their data in order to prevent multiple incompatible representations being defined elsewhere. There are some people, including software vendors, who vehemently hate XML, and would refuse to process it. The main reason appears to be a perceived inefficiency in the format and so it is ideal for this illustration.

Yes, XML appears verbose. Repetitive use of the same key words (element and attribute names) means it can take up more disk space than other, custom representations. However, this also means that a compressed (or zipped) version is incredibly reduced. Disk space is cheap, though. Even a humble memory stick now has thousands of times more storage than the hard disks that I grew up with. I have seen many XML designers try to use very short key words as though this helps make it more efficient. Apart from making it more cryptic, it doesn’t help at all.

But what about its memory usage? Right, now we’re getting to the more serious issues. Let me first bring the non-software people up-to-speed about memory. A computer has a limited amount of physical memory (or RAM). Secondary storage, such as hard disks and memory sticks, is typically much larger in size than the available RAM. However, data has to be transferred from secondary storage to RAM before it can be processed by a program, and code (the program instructions) has to be similarly transferred before it can be executed. The operating system (O/S) creates an abstract view of memory for each program called a virtual address space, and this is made up of both RAM and secondary storage. This means that the O/S has to do a lot of work behind the scenes to support this abstraction by making code and data available in RAM when it’s needed, and this process is called paging. It keeps track of pages (fixed-sized chunks of code/data), and when they were last used, so that it can push older ones back out to disk in order to make room for newer requirements. If a program tries to randomly (rather than sequentially) access a large amount of data in its virtual address space then it can result in disastrous performance, and an effect known as thrashing.

The size of each program’s virtual address space is constrained by the computer’s address size. This means that on a 32-bit machine, a program can only address 0 to 4GB, irrespective of how much RAM is available. The situation is often worse than this because some of the virtual address space is used to give each program access to shared components.

Under Microsoft Windows, for instance, this is usually a 50/50 split so that each program can only address up to 2GB[2]. Hence, bringing large amounts of data into virtual memory unnecessarily can be inefficient on these systems. It is possible to create software that can manipulate massive amounts of data within these limitations[3] but the effort is huge.

So does XML require a lot of memory? Well, not in terms of the key words. These are compiled into a dictionary in memory, and references to these dictionary entries are small and of fixed-size. Hence, it basically doesn’t matter how big the words are. There are two approaches to processing an XML file, though, and the results are very different:

  • Tree-based. These load an XML file into an internal tree structure and allow a program to navigate around it. The World-Wide Web Consortium’s (W3C) Document Object Model (DOM) is a commonly used version. These are very easy to use but they can be memory hungry. Also, a program may only want access to part of the data, or it may want to convert the DOM into a different tree that’s more amenable to what it is doing. In the latter case, it means there will be two trees in memory before one is discarded in favour of the other.
  • Event-based. These perform a lexical scan of the XML file on disk and let the program know of parsing events, such as the start and end of elements, through a call-back mechanism. The program can then decide what it wants to listen for and what it wants to do with it. This is possibly less common because it requires more configuration to be provided in the code. SAX is the best known example.

XOM is an interesting open-source alternative in the Java world. Although based on SAX, it creates a tree representation in memory. However, the event-driven core allows it to be tailored to only load the XML parts of interest to the program. In effect, there is no reason why XML cannot be processed efficiently.

But… but … but … I have 26 million people in my tree and…. Yes, and you want to see them all at once, right? Many new computers now have 64-bit addressing which means their virtual address space is effectively unlimited (about 1.8 x 1019 bytes) and they’re fully capable of allowing a program to use as much of the cheap RAM as it wants. Database vendors have known this for some time and found they could achieve massive performance boosts using in-memory databases. Unfortunately, there are still many 32-bit computers out there, and also many programs whose 32-bit addressing is set in concrete, even in a 64-bit environment.

STEMMA® is a logical data model but predominantly uses XML as its file representation. It is tackling these issues by allowing data to be split across separate documents (i.e. files), and each document to be comprised of one-or-more self-contained datasets. This means that there is a lot of choice for how you want to divide up your data on disk. For instance, separating out shared places or events, or dividing people based on their surname or geography. When a document, or dataset, is loaded into memory then it is indexed at that point. One dataset can be loaded and indexed in less time than it takes to double-click. Memory-based indexes are far more efficient and flexible than database ones since they can be designed specifically to suit the program’s requirements.

This post is rather more technical than my others but I wanted to give a clear picture of the issues involved. There are many advantages to using so-called flat files in conjunction with memory-based indexing, including better sharing, greater longevity, stability and reliability, and supporting multiple applications on the same data. I hope that this post will encourage developers to think laterally and consider these advantages. During my own work, for instance, I found that a STEMMA document — which implicitly includes trees, narrative, timelines, and more — effectively became a “genealogical document”, in the sense of a word-processor document. It could be received by email and immediately loaded into a viewer (analogous to something like Word or Acrobat) to view the document contents, and to navigate around them in multiple ways.

[1] XML has a lot going for it, including schema versioning and namespaces. However, I personally draw the line at XSLT which is difficult to write, difficult to read/diagnose, very inefficient when applied to large XML files, and impossible to optimise.
[2] 32-bit Windows systems do have a special boot option to change this split to 3GB/1GB but it is rarely used. The situation can be worse on some other O/S types, such as the old DEC VMS systems which only allowed 1Gb addressability for each program’s P0-space.
[3] I was once software architect for a company that implemented a multi-dimensional database that ran on all the popular 32-bit machines of the time. This handled many Terabytes (thousands of GB) of data by performing the paging itself.

No comments:

Post a Comment