If we have a software product on our computer that looks
after our family history data then we expect it to have some type of database
too. Why is that? The question is not as silly as it sounds. The perceived requirement
is largely a hangover from the past that introduces incompatible proprietary
representations and limits data longevity.
This is a bold statement but I will guide you through the
rationale for it. This will require me to explain some technical details of
databases and computers, which I hope you’ll find useful. Those people who are
already aware of these details can just skip over those paragraphs (but
hopefully not the whole blog!).
A database
is an organised collection of data, which basically means it is indexed to
support efficient access to it. Databases are mostly disk-resident (more on
this later) but there are various types, including relational,
object-oriented, and multi-dimensional (or OLAP). The majority of modern databases are relational ones that use the SQL query language, but there is a growing trend towards other forms, collectively known as NoSQL, such as those based on key-value stores.
The rise of NoSQL is partly a reaction to the fact that SQL databases were designed to support high levels of consistency and reliability in their data operations: for instance, ensuring that complex financial transactions either complete fully or not at all (the last thing you want is for your money to disappear down a hole, or end up in two different accounts). The term for this set of integrity guarantees is ACID (Atomicity, Consistency, Isolation, Durability), and supporting it imposes a significant performance penalty. This all makes sense for large commercial databases that process concurrent transactions in real time, but it is unnecessary for your personal family history database.
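To make that concrete, here is a minimal Java/JDBC sketch of such a transfer. Everything in it is made up for illustration (the accounts table, the amounts, and the in-memory H2 database, which would need the H2 driver on the classpath); the point is simply that the debit and the credit either both happen or neither does.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Illustrative sketch only: the table, the amounts, and the in-memory H2
    // database are all made up. The point is that the debit and the credit
    // either both happen, or neither does.
    public class TransferDemo {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:bank")) {
                try (Statement setup = conn.createStatement()) {
                    setup.execute("CREATE TABLE accounts (id INT PRIMARY KEY, balance DECIMAL(12,2))");
                    setup.execute("INSERT INTO accounts VALUES (1, 500.00), (2, 0.00)");
                }

                conn.setAutoCommit(false);            // start a transaction
                try (PreparedStatement debit = conn.prepareStatement(
                             "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                             "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {

                    debit.setBigDecimal(1, new BigDecimal("100.00"));
                    debit.setInt(2, 1);
                    debit.executeUpdate();

                    credit.setBigDecimal(1, new BigDecimal("100.00"));
                    credit.setInt(2, 2);
                    credit.executeUpdate();

                    conn.commit();                    // both updates become visible together
                } catch (SQLException e) {
                    conn.rollback();                  // neither update takes effect
                    throw e;
                }
            }
        }
    }

Providing that guarantee for many concurrent users requires locking, logging, and recovery machinery behind the scenes, and that is precisely the overhead a single-user family history program rarely needs.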
If and when we get a modern standard for representing family
history data then it will primarily describe a data model — a logical
representation of the data. The actual physical representation in a file depends
on which syntax is adopted, and there may be several of those. This is
presented in more detail at: Commercial
Realities of Data Standards. However, no standard can, or should, mandate a
particular database or a particular database schema. For a start, no two SQL
databases behave the same, despite there being a SQL standard. The choice of
how a database schema is organised and indexed depends as much on the design
choices of the vendor and the capabilities of their product as it does on a
data-model specification. Effectively, a database cannot form a definitive copy
of your data since you cannot transport the content directly to the database of
another product, and if your product becomes defunct then you could be left
with an unreadable representation. This is important if you want to bequeath
the fruits of your research to an archive, or to your surviving family.
The goal of organisations like FHISO,
and BetterGEDCOM before it,
is to define a data model for the exchange and long-term preservation of
genealogical and family-history data. This means that whatever the vendor’s
database looks like, they must support an additional format for sharing and
preservation. Unfortunately, the old GEDCOM format is woefully inadequate for
this purpose, and there is no accepted replacement that fits the bill. If such
a representation existed then why couldn’t we work directly from it? The idea
isn’t that new but there are technical arguments that are always levelled
against it, and I will come to these soon. If the representation were covered
by an accepted international standard then the problems of sharing and
longevity would disappear. It also opens up your data for access by other, niche
software components — say for a special type of charting or analysis. This
isn’t possible if it’s all hidden in some opaque proprietary database. It also
means there’s less chance of an unrecoverable corruption because the internal
complexities of the database do not apply.
Let’s pick on a controversial representation: XML. This is an international
standard that is here to stay[1],
and that means that any standards organisation must at least define an XML
representation of their data in order to prevent multiple incompatible
representations being defined elsewhere. There are some people, including
software vendors, who vehemently hate XML, and would refuse to process it. The
main reason appears to be a perceived inefficiency in the format, which makes it ideal for this illustration.
Yes, XML appears verbose. Repetitive use of the same key
words (element and attribute names) means it can take up more disk space than
other, custom representations. However, that very repetition also means that a compressed (or zipped) version shrinks dramatically. Disk space is cheap, though. Even a
humble memory stick now has thousands of times more storage than the hard disks
that I grew up with. I have seen many XML designers try to use very short key
words as though this helps make it more efficient. Apart from making it more
cryptic, it doesn’t help at all.
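If you want to see the effect for yourself, this little Java sketch (with made-up element names) builds a deliberately repetitive XML document and gzips it; the compressed output is typically a small fraction of the original size precisely because the same key words appear over and over.

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Rough illustration (element names are made up): repeated element names
    // compress extremely well, so verbose XML shrinks dramatically when zipped.
    public class XmlCompressionDemo {
        public static void main(String[] args) throws Exception {
            StringBuilder xml = new StringBuilder("<Persons>");
            for (int i = 0; i < 10_000; i++) {
                xml.append("<Person><GivenName>Name").append(i)
                   .append("</GivenName><Surname>Proctor</Surname></Person>");
            }
            xml.append("</Persons>");

            byte[] raw = xml.toString().getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                gzip.write(raw);
            }

            System.out.printf("Uncompressed: %,d bytes%n", raw.length);
            System.out.printf("Gzipped:      %,d bytes%n", buffer.size());
        }
    }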
But what about its memory usage? Right, now we’re getting to
the more serious issues. Let me first bring the non-software people up-to-speed
about memory. A computer has a limited amount of physical memory (or RAM). Secondary storage,
such as hard disks and memory sticks, is typically much larger in size than the
available RAM. However, data has to be transferred from secondary storage to
RAM before it can be processed by a program, and code (the program
instructions) has to be similarly transferred before it can be executed. The operating
system (O/S) creates an abstract view of memory for each program called a virtual address space,
and this is made up of both RAM and secondary storage. This means that the O/S
has to do a lot of work behind the scenes to support this abstraction by making
code and data available in RAM when it’s needed, and this process is called paging. It keeps track of pages
(fixed-size chunks of code/data), and when they were last used, so that it can
push older ones back out to disk in order to make room for newer requirements.
If a program tries to access a large amount of data in its virtual address space randomly (rather than sequentially) then the result can be disastrous performance, an effect known as thrashing.
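This toy Java sketch hints at why locality matters. With an array that still fits in RAM the slowdown comes from the CPU caches rather than from paging, but it is the same principle the O/S relies on: touching data in order is cheap, while jumping around it at random is not.

    import java.util.Random;

    // Toy sketch: sequential vs. random access over a large array. With data
    // that still fits in RAM the slowdown comes from CPU caches rather than
    // paging, but the locality principle is the same one the O/S depends on.
    public class LocalityDemo {
        public static void main(String[] args) {
            int n = 64 * 1024 * 1024;          // 64M ints, roughly 256 MB
            int[] data = new int[n];

            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < n; i++) {      // sequential: touches memory in order
                sum += data[i];
            }
            long sequentialMs = (System.nanoTime() - t0) / 1_000_000;

            Random rng = new Random(42);
            t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {      // random: jumps all over the array
                sum += data[rng.nextInt(n)];
            }
            long randomMs = (System.nanoTime() - t0) / 1_000_000;

            System.out.println("Sequential: " + sequentialMs
                    + " ms, random: " + randomMs + " ms (sum=" + sum + ")");
        }
    }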
The size of each program’s virtual address space is
constrained by the computer’s address size. This means that on a 32-bit
machine, a program can only address 0 to 4GB, irrespective of how much RAM is
available. The situation is often worse than this because some of the virtual
address space is used to give each program access to shared components.
Under Microsoft Windows, for instance, this is usually a
50/50 split so that each program can only address up to 2GB[2].
Hence, bringing large amounts of data into virtual memory unnecessarily can be
inefficient on these systems. It is possible to create software that can
manipulate massive amounts of data within these limitations[3]
but the effort is huge.
So does XML require a lot of memory? Well, not in terms of
the key words. These are compiled into a dictionary in memory, and references
to these dictionary entries are small and of fixed size. Hence, it basically
doesn’t matter how big the words are. There are two approaches to processing an
XML file, though, and the results are very different:
- Tree-based. These load an XML file into an internal tree structure and allow a program to navigate around it. The World-Wide Web Consortium’s (W3C) Document Object Model (DOM) is a commonly used version. These are very easy to use but they can be memory-hungry. Also, a program may only want access to part of the data, or it may want to convert the DOM into a different tree that’s more amenable to what it is doing. In the latter case, there will be two trees in memory before one is discarded in favour of the other.
- Event-based. These perform a lexical scan of the XML file on disk and notify the program of parsing events, such as the start and end of elements, through a call-back mechanism. The program can then decide what it wants to listen for and what it wants to do with it. This is possibly less common because it requires more configuration to be provided in the code. SAX is the best-known example (see the sketch after this list).
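As a rough illustration of the event-based style, here is a minimal Java SAX sketch (the element names are placeholders, and the XML is inlined where a real program would read a file): it counts people without ever building the whole document in memory.

    import java.io.StringReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal event-based sketch (hypothetical element names): the parser
    // announces each start-tag as it scans, and we simply count <Person>
    // elements. The document is never held in memory as a tree.
    public class PersonCounter {
        public static void main(String[] args) throws Exception {
            String xml = "<Family><Person/><Person/><Person/></Family>";  // stands in for a file on disk
            final int[] count = {0};

            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if ("Person".equals(qName)) {
                        count[0]++;                // react to the event; retain nothing
                    }
                }
            };

            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            System.out.println("Persons found: " + count[0]);

            // For contrast, the tree-based (DOM) route loads everything up front:
            //   Document doc = DocumentBuilderFactory.newInstance()
            //                      .newDocumentBuilder().parse(...);
        }
    }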
XOM is an interesting
open-source alternative in the Java world. Although based on SAX, it creates a
tree representation in memory. However, the event-driven core allows it to be
tailored to only load the XML parts of interest to the program. In effect, there
is no reason why XML cannot be processed efficiently.
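To show what that tailoring might look like, here is a sketch using XOM’s NodeFactory hook (the element names are hypothetical, and this is illustrative rather than production code): subtrees we have no interest in are discarded as soon as their end-tags are seen, so they never take up room in the finished tree.

    import java.io.StringReader;
    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Element;
    import nu.xom.NodeFactory;
    import nu.xom.Nodes;

    // Sketch of selective loading via XOM's NodeFactory (hypothetical element
    // names): bulky elements are dropped as their end-tags are seen, so they
    // never occupy space in the finished in-memory tree.
    public class SelectiveLoader {
        public static void main(String[] args) throws Exception {
            NodeFactory skipBulkyParts = new NodeFactory() {
                @Override
                public Nodes finishMakingElement(Element element) {
                    String name = element.getLocalName();
                    if ("Narrative".equals(name) || "Media".equals(name)) {
                        return new Nodes();                     // empty: drop this subtree
                    }
                    return super.finishMakingElement(element);  // keep everything else
                }
            };

            String xml = "<Dataset><Person/><Narrative>lots of text...</Narrative><Person/></Dataset>";
            Document doc = new Builder(skipBulkyParts).build(new StringReader(xml));
            System.out.println(doc.toXML());                    // only the parts we kept remain
        }
    }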
But… but… but… I have 26 million people in my tree and… Yes, and you want to see them all at once, right? Many new computers now have 64-bit addressing which means their virtual address space is effectively unlimited (about 1.8 × 10¹⁹ bytes)
and they’re fully capable of allowing a program to use as much of the cheap RAM
as it wants. Database vendors have known this for some time and found they
could achieve massive performance boosts using in-memory databases.
Unfortunately, there are still many 32-bit computers out there, and also many
programs whose 32-bit addressing is set in concrete, even in a 64-bit
environment.
STEMMA®
is a logical data model but predominantly uses XML as its file representation.
It is tackling these issues by allowing data to be split across separate documents (i.e. files), and each
document to comprise one or more self-contained datasets. This means that there is a lot of choice for how you want
to divide up your data on disk. For instance, separating out shared places or
events, or dividing people based on their surname or geography. When a document,
or dataset, is loaded into memory, it is indexed at that point. One dataset
can be loaded and indexed in less time than it takes to double-click. Memory-based
indexes are far more efficient and flexible than database ones since they can
be designed specifically to suit the program’s requirements.
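To illustrate what I mean by a purpose-built, memory-based index (a hypothetical sketch, not STEMMA’s actual code), a single pass over a loaded dataset can populate a plain Java map keyed on exactly what the program needs to look up:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: once a dataset is in memory, purpose-built indexes
    // are just ordinary maps, built in a single pass and tailored to whatever
    // look-ups the program actually needs.
    public class SurnameIndex {
        record Person(String id, String givenName, String surname) {}

        private final Map<String, List<Person>> bySurname = new HashMap<>();

        // Build the index in one pass over the loaded dataset.
        public SurnameIndex(List<Person> dataset) {
            for (Person p : dataset) {
                bySurname.computeIfAbsent(p.surname(), k -> new ArrayList<>()).add(p);
            }
        }

        public List<Person> lookup(String surname) {
            return bySurname.getOrDefault(surname, List.of());
        }

        public static void main(String[] args) {
            List<Person> loaded = List.of(
                    new Person("p1", "Mary", "Proctor"),
                    new Person("p2", "John", "Smith"),
                    new Person("p3", "Anne", "Proctor"));
            SurnameIndex index = new SurnameIndex(loaded);
            System.out.println("Proctors: " + index.lookup("Proctor").size());
        }
    }

Because the index lives only in memory and is rebuilt each time a dataset is loaded, it can be reshaped freely as the program’s requirements change, with no database schema to migrate.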
This post is rather more technical than my others but I wanted
to give a clear picture of the issues involved. There are many advantages to
using so-called flat files in conjunction with memory-based indexing, including
better sharing, greater longevity, stability and reliability, and supporting
multiple applications on the same data. I hope that this post will encourage
developers to think laterally and consider these advantages. During my own
work, for instance, I found that a STEMMA document — which implicitly includes
trees, narrative, timelines, and more — effectively became a “genealogical
document”, in the sense of a word-processor document. It could be received by
email and immediately loaded into a viewer (analogous to something like Word or
Acrobat) to view the document contents, and to navigate around them in multiple
ways.
[2] 32-bit Windows
systems do have a special boot option to change this split to 3GB/1GB but it is
rarely used. The situation can be worse on some other O/S types, such as the
old DEC VMS systems, which only allowed 1GB of addressability for each program’s
P0-space.
[3] I was once software
architect for a company that implemented a multi-dimensional database that ran
on all the popular 32-bit machines of the time. This handled many Terabytes
(thousands of GB) of data by performing the paging itself.