When anyone mentions statistics or probability being applied
to genealogical research, there's usually a sharp reaction. There are
some valid questions that would benefit from thoughtful discussion but,
unfortunately, the many knee-jerk reactions tend to be for all the wrong
reasons.
It’s hard to find a single reason why this topic gets such an
adverse reaction since the arguments made against it are rarely put together very
carefully. I have seen some reactions based purely on the fear that any
application of numbers means that assessments will be estimated to an
inappropriate level of precision, such as 12.8732%. That’s just ludicrous, of
course!
In this post, I won’t actually be making a case for
the use of statistics since I am still experimenting with an implementation of this
myself and it isn’t straightforward. What I will try to do is identify what is
and is not open to debate, and ideally to add some degree of clarity. Although
I have a mathematical background, it only briefly touched on statistics. It
is a specialist field, and many folks will have a skewed picture of it, whether
they’re mathematically inclined or not. It is also a technical field, and so a
few symbols and numbers are inevitable, but I will try to balance things with
real-life illustrations.
Statistics
is generally about the collection and analysis of data. Despite what
politicians might have us believe, statistics proves nothing, and this is
important for the purposes of this article. Statistical analysis can
demonstrate a correlation between two sets of data but it cannot indicate
whether either is a consequence of the other, or whether they both depend on
something else. The classic example is data that shows a correlation between
the sales of sunglasses and ice-cream — it doesn’t imply that the wearing of
sunglasses is necessary for the eating of ice-cream.
Mathematical
statistics is about the mathematical treatment of
probability, but there is more than one interpretation of
probability. The standard interpretation, called
frequentist
probability, uses it as a measure of the frequency or chance of something
happening. Taking the roll of a die as a simple example, we can calculate the
number of ways that it can fall and so attribute a probability to each face
(1/6, or roughly 16.7%). Alternatively, we could look at past performance of
the die and use that to determine the probabilities; a method that works better
in the case where a die is weighted. When dealing with the individual
events (e.g. each roll of the die), they
may be independent of one another, or dependent on previous events. A real-life
demonstration of independent events would be the roulette wheel. If the ball
had fallen on red 20 times then we’d all instinctively bet on black next, even
though the red/black probability is unchanged. Conversely, if you’d selected 20
red cards from a deck of playing cards then the probability of a black being
next has increased.
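The contrast between independent and dependent events can be checked with a little arithmetic. Here is a minimal Python sketch of the two examples above; the variable names are my own, and the roulette figure ignores the green zero for simplicity:

```python
from fractions import Fraction

# Independent events: on a fair wheel, previous spins do not change
# the red/black probability, however many reds have come up.
p_black_roulette = Fraction(1, 2)

# Dependent events: a standard deck has 26 red and 26 black cards.
# After removing 20 red cards, 6 red and 26 black remain, so the
# probability that the next card is black has risen.
p_black_next = Fraction(26, 6 + 26)

print(p_black_roulette)  # 1/2
print(p_black_next)      # 13/16, i.e. 81.25%
```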
The other major interpretation of probability is called
Bayesian probability
after the Rev. Thomas Bayes (1701–1761), a mathematician and theologian who
first provided a theorem to express how a subjective degree of belief should
change to account for new evidence. His work was later developed further by the
famous French mathematician and astronomer Pierre-Simon, marquis de Laplace
(1749–1827). It is this view of probability, rather than anything to do with
frequency or chance, which is relevant to inferential disciplines such as
genealogy. Essentially, a Bayesian probability represents a state of knowledge
about something, such as a degree of confidence. This is where it gets philosophically
interesting because some people (the
objectivists)
consider it to be a natural extension of traditional Boolean logic to handle concepts
that cannot be represented by pairs of values with such exactitude as
true/false, definite/impossible, or 1/0. Other people (the
subjectivists) consider it to be simply an attempt to quantify
personal belief in something.
In actuarial fields, such as insurance, a person is
categorised according to their demographics, and the previous record of those
demographics is used to attribute a numerical risk factor (and an associated
insurance premium) to that person. This is therefore a frequentist application.
Consider now a
bookmaker
who is giving odds on a horse race. You might think he’s simply basing his
numbers on the past performance of the horses but you’d be wrong. A good
bookmaker watches the horses in the paddock area, and sees how they look, move
and behave. He may also talk to trainers. His odds are based on experience and
knowledge of his field and so this is more of a Bayesian application.
Accepted genealogy certainly accommodates qualitative
assessments such as primary/secondary information, original/derivative sources,
impartial/subjective viewpoint, etc. When we consider the likelihood of a given
scenario then we might use terms such as possible, very likely, or extremely
improbable, and Elizabeth Shown Mills offers a recommended list of such terms
[1].
Although there is no standard list, we all accept that our preferred terms are
ordered, with each being between the likelihoods of the adjacent terms. These
lists are not linear; meaning that the relative likelihoods are not evenly
spaced. They actually form a non-linear
[2]
scale since we have more terms the closer we get to the delimiting ‘impossible’
and ‘definite’. In effect, our assessments asymptotically approach these
idealistic terms, but never actually get there.
As part of my work on STEMMA®, I experimented with putting a
numerical ‘Surety’ value against items of evidence when used to support/refute
a conjecture, and also on the likelihood of competing explanations of something.
This turned out to be more cumbersome than I’d imagined, although a better user
interface in the software could have helped. The STEMMA rationale for using
percentages in the Surety attribute rather than simple integers was partly so
that it allowed some basic arithmetic to assess reasoning. For instance, if A
=> B, and B => C, then the surety of C is surety(A) * surety(B). Another goal,
though, was that of ‘collective assessment’. Given three alternatives, X, Y,
& Z, simple integers might allow an assessment of X against Y, or X against
Z, but not X against all the remaining alternatives (i.e. Y+Z) since they
wouldn’t add up to 100%.
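The arithmetic described above can be sketched in a few lines of Python. All the surety values here are made-up illustrations, and the variable names are my own rather than anything defined in STEMMA:

```python
# Chained reasoning: if A => B and B => C, combine sureties by
# multiplication, so surety(C) = surety(A) * surety(B).
surety_a = 0.90
surety_b = 0.80
surety_c = surety_a * surety_b  # roughly 0.72

# Collective assessment: competing alternatives X, Y and Z expressed
# as percentages can be compared singly or in combination precisely
# because they must sum to 100%.
surety = {"X": 0.50, "Y": 0.30, "Z": 0.20}
p_x = surety["X"]                    # X against everything else
p_rest = surety["Y"] + surety["Z"]   # i.e. Y+Z
assert abs(sum(surety.values()) - 1.0) < 1e-9
```

With simple integers instead of percentages, the final assertion would have no meaning, which was the point of using percentages in the Surety attribute.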
Although I didn’t know it, my concept of ‘collective
assessment’ was getting vaguely close to something called
conditional
probabilities in Bayes’ work. A conditional probability is the probability
of an event (A) given that some other event (B) is true. Mathematicians write
this as P(A | B) but don’t get too worried about this; just treat it as a form
of shorthand. Bayes’ theorem can be summarised as
[3]:
P(A | B) = P(B | A) × P(A) / P(B)
It helps you to invert a conditional probability so that you
can look at it the other way around. A classic example that’s often used to
demonstrate this involves a hypothetical criminal case. Suppose an accused man
is considered one-chance-in-a-hundred to be guilty of a murder (i.e. 1%). This
is known as the
prior probability and
we’ll refer to it as P(G), i.e. the probability that he’s
Guilty. Then
some new
Evidence (E) comes along; say a bloodied murder weapon found in
his house, or some DNA evidence. We might say that the probability of finding
that evidence if he was guilty (i.e. P(E | G)) is 95%, but the probability of
finding it if he was NOT guilty (i.e. P(E | ¬ G))
[4] is
just 10%
[5].
What we want is the new probability of him being guilty given that this
evidence has now been found, i.e. P(G | E). This is known as the
posterior probability (yeah, yeah, no
jokes please!). The calculation itself is not too difficult, although the
result is not at all obvious.
P(G | E) = P(E | G) × P(G) / P(E) = (95% × 1%) / ((95% × 1%) + (10% × 99%)) = 8.8%
This may just look like a bunch of numbers to many readers,
but the mention of finding new evidence must be ringing bells for everyone. If
you had estimated the likelihood of an explanation at such-and-such, but a new
item of evidence came along, then you should be able to adjust that likelihood
appropriately with this theorem.
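For readers who prefer code to symbols, the update in the murder example can be sketched in Python. The `posterior` function here is my own naming for the two-outcome form of the theorem, not part of any particular library:

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem for a hypothesis H with two outcomes:
    P(H | E) = P(E | H) * P(H) / P(E), where the total probability
    P(E) = P(E | H) * P(H) + P(E | not H) * (1 - P(H))."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# The murder example: P(G) = 1%, P(E | G) = 95%, P(E | not G) = 10%.
p_g_given_e = posterior(0.01, 0.95, 0.10)
print(f"{p_g_given_e:.1%}")  # 8.8%
```

Even with strong evidence, the low prior keeps the posterior well below 95%, which is exactly the counter-intuitive behaviour discussed above.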
So what about a genealogical example? Well, here’s a real
one that I briefly toyed with myself. An ancestor called Susanna Kindle
Richmond was born illegitimately in 1827. I estimated that there was a 15%
chance that her middle name was the surname of the biological father. If we
call this event K, for
Kindle, then it means P(K) is 15%. This figure
could be debated, but it’s the difference between the prior and posterior
versions of this probability that is more significant. In other words, even if
this was a wild guess, it’s the change that any new evidence makes that I
should take notice of. It turns out that the name ‘Kindle’ is quite a rare
surname.
FreeBMD
counted fewer than 100 instances of Kindle/Kindel in the civil registrations of
vital events for England and Wales. In the baptism records, I later found that
there was a Kindle family living on the same street during the same year as
Susanna’s baptism. Let’s call this event, of finding a
Neighbour with
the surname Kindle, N. I estimated the chance of finding a neighbour with this
surname if it was also the surname of her father at 1%, and the probability of
finding one if it wasn’t the surname of her father at 0.01%. What I wanted was
the new estimation of K, i.e. K | N. Well, following the method in the murder
example:
P(K | N) = P(N | K) × P(K) / P(N) = (1% × 15%) / ((1% × 15%) + (0.01% × 85%)) = 94.6%
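Using the same two-outcome form of the theorem as the murder example, this works out as follows (again a sketch, with my own function name):

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a two-outcome hypothesis H, via the total
    probability P(E) = P(E|H)*P(H) + P(E|not H)*(1 - P(H))."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# P(K) = 15%, P(N | K) = 1%, P(N | not K) = 0.01%.
p_k_given_n = posterior(0.15, 0.01, 0.0001)
print(f"{p_k_given_n:.1%}")  # 94.6%
```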
This is a rather stark result from the low probabilities
being used. I’m not claiming that this is a perfect example, or that my
estimates are spot on, but it was designed to illustrate the following two points.
Firstly, it demonstrates that the results from Bayes’ theorem can run counter
to our intuition. Secondly, though, it demonstrates the difficulty in using the
theorem correctly because this example is actually flawed. The value of 1% for P(N | K) is fair enough as
it represents the probability of finding a neighbour with the surname Kindle if
her middle name was her father’s surname. However, the figure of 0.01% for P(N
| ¬ K) was really representing the random chance of finding such a neighbour if
her middle name wasn’t Kindle at all. What it should have represented was the
probability of finding such a neighbour if her middle name was Kindle but
it wasn’t the surname of her father. In particular, my figure failed to consider that
the two families may simply have been close friends.
There is no room for debate on the mathematics of
probability, including Bayesian probability and Bayes’ theorem. The application
of this mathematics is accepted in an enormous number of real-life fields, and
genealogy is not fundamentally different to them. From my professional
experience, I know that many companies use Bayesian forecasting to good effect
in the analytical field known as business
intelligence. The only controversial point presented here is the
determination of those subjective assessments. All of the fields where Bayes’
theorem is applied involve people who are quantifying assessments that are
based on experience and expertise. We already know that genealogists make
qualitative assessments, but would it be a natural step to put numerical equivalents
on their ordered scales of terms? We wouldn’t argue that ‘definite’ means 100%,
or that ‘impossible’ means 0%, but employing numbers in between is more
controversial, even though we may use a phrase like “50 : 50” in normal speech.
I believe there are two issues that would benefit from
rational debate: where those estimations come from, and whether it would be
practical for genealogists to specify them and make use of them through their
software. Although businesses proactively use Bayesian forecasting, the only examples
I’ve seen in fields such as law and medicine have been ex post facto (after the event). For my part, I find it very easy
to put approximate numbers against real-life perceived risks, and the
likelihood of possible scenarios. I have no idea where these come from, and I
can’t pretend that someone else would conjure the same values. Maybe it’s a
simple familiarity with numbers, or maybe people are just wired differently – I
really don’t know!
Even if this works for some of us, it is unlikely to work
for all of us. By itself, though, this is not a reason for dismissing it
out-of-hand, or lashing out at the mathematically-inspired amongst the
community. A potential reaction such as ‘We happen to be qualified
genealogists, and not bookmakers’ would say more about misplaced pride than
considered analysis. Genealogists and bookmakers are both experts in their own
fields. When they say they’re sure of something, they don’t mean absolutely,
100% sure, but to what extent are they sure?
[1] Elizabeth Shown Mills, Evidence
Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore,
Maryland: Genealogical Pub. Co., 2009), p.19.
[2] If you’re thinking
“logarithmic” then you would be wrong. The range is symmetrically asymptotic at
both ends and so is hyperbolic.
[3] This simple form
applies where each event has just two outcomes: a result happening or not
happening. There is a more complicated form that applies where each event may
have an arbitrary number of outcomes.
[4] I’m using the logical
NOT sign (¬)
here to indicate the inverse of an event’s outcome. The convention is to use a
macron (bar over the letter) but that requires a specialist typeface.
[5] Yes, that’s right,
10% and 95% do not add up to 100%. The misunderstanding that they should
plagues a number of examples that I’ve seen. The probability of finding the
evidence if he was guilty, P(E | G), and the probability of finding the
evidence if he was not guilty, P(E | ¬ G), are like “apples and oranges” because they
cover different situations, and so they will not add up to 100%. However, the
probability of not finding the evidence if he was guilty, P(¬ E | G), is the
inverse of P(E | G) and so they would total 100%.