Sunday, 19 January 2014

You’re Probably Right



Whenever anyone mentions applying statistics or probability to genealogical research, there is usually a sharp reaction. There are some valid questions that would benefit from thoughtful discussion but, unfortunately, many of the knee-jerk reactions are for all the wrong reasons.


It’s hard to find a single reason why this topic gets such an adverse reaction since the arguments made against it are rarely put together very carefully. I have seen some reactions based purely on the fear that any application of numbers means that assessments will be estimated to an inappropriate level of precision, such as 12.8732%. That’s just ludicrous, of course!

In this post, I won’t actually be making a case for the use of statistics since I am still experimenting with an implementation of it myself and it isn’t straightforward. What I will try to do is identify what is, and is not, open to debate, and ideally add some degree of clarity. Although I have a mathematical background, it only briefly touched on statistics. It is a specialist field, and many folks will have a skewed picture of it, whether they’re mathematically inclined or not. It is also a technical field, and so a few symbols and numbers are inevitable, but I will try to balance things with real-life illustrations.

Statistics is generally about the collection and analysis of data. Despite what politicians might have us believe, statistics proves nothing, and this is important for the purposes of this article. Statistical analysis can demonstrate a correlation between two sets of data but it cannot indicate whether either is a consequence of the other, or whether they both depend on something else. The classic example is data that shows a correlation between the sales of sunglasses and ice-cream — it doesn’t imply that the wearing of sunglasses is necessary for the eating of ice-cream.

Mathematical statistics is about the mathematical treatment of probability, but there is more than one interpretation of probability. The standard interpretation, called frequentist probability, uses it as a measure of the frequency or chance of something happening. Taking the roll of a die as a simple example, we can calculate the number of ways that it can fall and so attribute a probability to each face (1/6, or roughly 16.7%). Alternatively, we could look at past performance of the die and use that to determine the probabilities; a method that works better in the case where a die is weighted. When dealing with the individual events (e.g. each roll of the die), they may be independent of one another, or dependent on previous events. A real-life demonstration of independent events would be the roulette wheel. If the ball had fallen on red 20 times then we’d all instinctively bet on black next, even though the red/black probability is unchanged. Conversely, if you’d selected 20 red cards from a deck of playing cards then the probability of a black being next has increased.
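As a throwaway illustration of the difference, here is a short Python sketch of the arithmetic behind those two cases; the single-zero roulette wheel and the standard 52-card deck are my own assumptions, purely to give some numbers.

from fractions import Fraction

# Independent events: on a fair single-zero roulette wheel (an assumption for
# this sketch) there are 18 red, 18 black and 1 green pocket, so the chance of
# black is the same on every spin, no matter how many reds have just come up.
p_black_roulette = Fraction(18, 37)

# Dependent events: a standard deck has 26 red and 26 black cards. After
# removing 20 red cards, 32 cards remain, all 26 black ones among them.
p_black_before = Fraction(26, 52)        # 50.0%
p_black_after = Fraction(26, 52 - 20)    # 26/32 = 81.25%

print(f"Roulette, black on any spin:   {float(p_black_roulette):.1%}")
print(f"Cards, black before any draws: {float(p_black_before):.1%}")
print(f"Cards, black after 20 reds:    {float(p_black_after):.1%}")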

The other major interpretation of probability is called Bayesian probability after the Rev. Thomas Bayes (1701–1761), a mathematician and theologian who first provided a theorem to express how a subjective degree of belief should change to account for new evidence. His work was later developed further by the famous French mathematician and astronomer Pierre-Simon, marquis de Laplace (1749–1827). It is this view of probability, rather than anything to do with frequency or chance, which is relevant to inferential disciplines such as genealogy. Essentially, a Bayesian probability represents a state of knowledge about something, such as a degree of confidence. This is where it gets philosophically interesting because some people (the objectivists) consider it to be a natural extension of traditional Boolean logic to handle concepts that cannot be represented by pairs of values with such exactitude as true/false, definite/impossible, or 1/0. Other people (the subjectivists) consider it to be simply an attempt to quantify personal belief in something.

In actuarial fields, such as insurance, a person is categorised according to their demographics, and the previous record of those demographics is used to attribute a numerical risk factor (and an associated insurance premium) to that person. This is therefore a frequentist application. Consider now a bookmaker who is giving odds on a horse race. You might think he’s simply basing his numbers on the past performance of the horses but you’d be wrong. A good bookmaker watches the horses in the paddock area, and sees how they look, move and behave. He may also talk to trainers. His odds are based on experience and knowledge of his field and so this is more of a Bayesian application.

Accepted genealogy certainly accommodates qualitative assessments such as primary/secondary information, original/derivative sources, impartial/subjective viewpoint, etc. When we consider the likelihood of a given scenario, we might use terms such as possible, very likely, or extremely improbable, and Elizabeth Shown Mills offers a recommended list of such terms[1]. Although there is no standard list, we all accept that our preferred terms are ordered, with each lying between the likelihoods of the adjacent terms. These lists are not linear, meaning that the relative likelihoods are not evenly spaced; they form a non-linear[2] scale, with more terms the closer we get to the delimiting ‘impossible’ and ‘definite’. In effect, our assessments asymptotically approach these ideal endpoints but never actually get there.

As part of my work on STEMMA®, I experimented with putting a numerical ‘Surety’ value against items of evidence when used to support/refute a conjecture, and also against the likelihood of competing explanations of something. This turned out to be more cumbersome than I’d imagined, although a better user interface in the software could have helped. The STEMMA rationale for using percentages in the Surety attribute, rather than simple integers, was partly that it allowed some basic arithmetic to assess reasoning. For instance, if A => B, and B => C, then the surety of C is surety(A) * surety(B). Another goal, though, was that of ‘collective assessment’. Given three alternatives, X, Y, & Z, simple integers might allow an assessment of X against Y, or X against Z, but not of X against all the remaining alternatives (i.e. Y+Z) since they wouldn’t add up to 100%.
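To make those two ideas concrete, here is a minimal Python sketch; it is purely illustrative and not the actual STEMMA implementation, and the function names and sample figures are invented for the example.

def chained_surety(*sureties):
    """Surety of a conclusion reached through a chain of inferences.

    If A => B and B => C, then the surety of C is taken to be
    surety(A) * surety(B), with each surety expressed as 0..1.
    """
    result = 1.0
    for s in sureties:
        result *= s
    return result

def collective_assessment(weights):
    """Normalise raw weights for competing alternatives so they total 100%.

    This is the 'collective assessment' idea: X can then be weighed against
    all of the remaining alternatives (Y + Z), not just one at a time.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Chained reasoning: surety(A) = 80%, surety(B) = 90% gives surety(C) = 72%.
print(f"surety(C) = {chained_surety(0.80, 0.90):.0%}")

# Three competing explanations with invented raw weights that don't sum to 100.
print(collective_assessment({"X": 60, "Y": 30, "Z": 20}))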

Although I didn’t know it, my concept of ‘collective assessment’ was getting vaguely close to something called conditional probabilities in Bayes’ work. A conditional probability is the probability of an event (A) given that some other event (B) is true. Mathematicians write this as P(A | B) but don’t get too worried about this; just treat it as a form of shorthand. Bayes’ theorem can be summarised as[3]:

P(A | B) = P(B | A) x P(A) / P(B)

It helps you to invert a conditional probability so that you can look at it the other way around. A classic example that’s often used to demonstrate this involves a hypothetical criminal case. Suppose an accused man is considered to have a one-in-a-hundred chance of being guilty of a murder (i.e. 1%). This is known as the prior probability and we’ll refer to it as P(G), i.e. the probability that he’s Guilty. Then some new Evidence (E) comes along; say a bloodied murder weapon found in his house, or some DNA evidence. We might say that the probability of finding that evidence if he was guilty (i.e. P(E | G)) is 95%, but the probability of finding it if he was NOT guilty (i.e. P(E | ¬ G)[4]) is just 10%[5]. What we want is the new probability of him being guilty given that this evidence has now been found, i.e. P(G | E). This is known as the posterior probability (yeah, yeah, no jokes please!). The calculation itself is not too difficult, although the result is not at all obvious.

P(G | E) = P(E | G) x P(G) / P(E)
         = (95% x 1%) / ((95% x 1%) + (10% x 99%))
         = 8.8%

This may just look like a bunch of numbers to many readers, but the mention of finding new evidence must be ringing bells for everyone. If you had estimated the likelihood of an explanation at such-and-such, but a new item of evidence came along, then you should be able to adjust that likelihood appropriately with this theorem.
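For readers who prefer code to fractions, here is a minimal Python sketch of that update step; the function name bayes_posterior is my own invention, but the arithmetic is exactly the calculation shown above, using the 1%, 95% and 10% figures from the murder example.

def bayes_posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a hypothesis once the evidence has been found.

    prior               : P(G), the prior probability
    p_evidence_if_true  : P(E | G)
    p_evidence_if_false : P(E | not G)
    Returns P(G | E), where P(E) = P(E|G)*P(G) + P(E|not G)*(1 - P(G)).
    """
    p_evidence = (p_evidence_if_true * prior
                  + p_evidence_if_false * (1.0 - prior))
    return p_evidence_if_true * prior / p_evidence

# Murder example: prior guilt 1%, P(E | G) = 95%, P(E | not G) = 10%.
posterior = bayes_posterior(0.01, 0.95, 0.10)
print(f"P(G | E) = {posterior:.1%}")   # roughly 8.8%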

So what about a genealogical example? Well, here’s a real one that I briefly toyed with myself. An ancestor called Susanna Kindle Richmond was born illegitimately in 1827. I estimated that there was a 15% chance that her middle name was the surname of the biological father. If we call this event K, for Kindle, then it means P(K) is 15%. This figure could be debated but it’s the difference between the prior and posterior versions of this probability that is more significant. In other words, even if this was a wild guess, it’s the change that any new evidence makes that I should take notice of. It turns out that ‘Kindle’ is quite a rare surname. FreeBMD counted fewer than 100 instances of Kindle/Kindel in the civil registrations of vital events for England and Wales. In the baptism records, I later found that there was a Kindle family living on the same street during the same year as Susanna’s baptism. Let’s call this event (finding a Neighbour with the surname Kindle) N. I estimated the chance of finding a neighbour with this surname if it was also the surname of her father at 1%, and the probability of finding one if it wasn’t the surname of her father at 0.01%. What I wanted was the new estimation of K, i.e. P(K | N). Well, following the method in the murder example:

P(K | N) = P(N | K) x P(K) / P(N)
         = (1% x 15%) / ((1% x 15%) + (0.01% x 85%))
         = 94.6%
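The same arithmetic again as a self-contained Python snippet (only a sketch; the variable names are mine and the figures are the estimates discussed above):

p_k = 0.15                # P(K): her middle name is the father's surname
p_n_given_k = 0.01        # P(N | K): Kindle neighbour found, if K is true
p_n_given_not_k = 0.0001  # P(N | not K), i.e. 0.01%

p_n = p_n_given_k * p_k + p_n_given_not_k * (1 - p_k)
posterior = p_n_given_k * p_k / p_n
print(f"P(K | N) = {posterior:.1%}")   # roughly 94.6%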

This is a rather stark result, given the low probabilities being used. I’m not claiming that this is a perfect example, or that my estimates are spot on, but it was designed to illustrate two points. Firstly, it demonstrates that the results from Bayes’ theorem can run counter to our intuition. Secondly, though, it demonstrates the difficulty of using the theorem correctly, because this example is actually flawed. The value of 1% for P(N | K) is fair enough, as it represents the probability of finding a neighbour with the surname Kindle if her middle name was her father’s surname. However, the figure of 0.01% for P(N | ¬ K) really represented the random chance of finding such a neighbour if her middle name wasn’t Kindle at all. What it should have represented was the probability of finding such a neighbour if her middle name was Kindle but it wasn’t the surname of her father, and a figure that low fails to consider that the two families may simply have been close friends.

There is no room for debate on the mathematics of probability, including Bayesian probability and Bayes’ theorem. The application of this mathematics is accepted in an enormous number of real-life fields, and genealogy is not fundamentally different to them. From my professional experience, I know that many companies use Bayesian forecasting to good effect in the analytical field known as business intelligence. The only controversial point presented here is the determination of those subjective assessments. All of the fields where Bayes’ theorem is applied involve people who are quantifying assessments based on experience and expertise. We already know that genealogists make qualitative assessments, but would it be a natural step to put numerical equivalents on their ordered scales of terms? We wouldn’t argue that ‘definite’ means 100%, or that ‘impossible’ means 0%, but employing numbers in between is more controversial even though we may use a phrase like “50 : 50” in normal speech.

I believe there are two issues that would benefit from rational debate: where those estimations come from, and whether it would be practical for genealogists to specify them and make use of them through their software. Although businesses proactively use Bayesian forecasting, the only examples I’ve seen in fields such as law and medicine have been ex post facto (after the event). For my part, I find it very easy to put approximate numbers against perceived real-life risks, and against the likelihood of possible scenarios. I have no idea where these come from, and I can’t pretend that someone else would conjure the same values. Maybe it’s a simple familiarity with numbers, or maybe people are just wired differently – I really don’t know!

Even if this works for some of us, it is unlikely to work for all of us. By itself, though, this is not a reason for dismissing it out-of-hand, or lashing out at the mathematically-inspired amongst the community. A potential reaction such as ‘We happen to be qualified genealogists, and not bookmakers’ would say more about misplaced pride than considered analysis. Genealogists and bookmakers are both experts in their own fields. When they say they’re sure of something, they don’t mean absolutely, 100% sure, but to what extent are they sure?



[1] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore, Maryland: Genealogical Pub. Co., 2009), p.19.
[2] If you’re thinking “logarithmic” then you would be wrong. The range is symmetrically asymptotic at both ends and so is hyperbolic.
[3] This simple form applies where each event has just two outcomes: a result happening or not happening. There is a more complicated form that applies where each event may have an arbitrary number of outcomes.
[4] I’m using the logical NOT sign (¬) here to indicate the inverse of an event’s outcome. The convention is to use a macron (bar over the letter) but that requires a specialist typeface.
[5] Yes, that’s right, 10% and 95% do not add up to 100%. The misunderstanding that they should plagues a number of examples that I’ve seen. The probability of finding the evidence if he was guilty, P(E | G), and the probability of finding the evidence if he was not guilty, P(E | ¬ G), are like “apples and oranges” because they cover different situations, and so they will not add up to 100%. However, the probability of not finding the evidence if he was guilty, P(¬ E | G), is the inverse of P(E | G) and so they would total 100%.
