Genealogy as a Form of Data Analysis

This paper is loosely based on the Wikipedia article entitled “Data Analysis” and the book Mastering Genealogical Proof.

Genealogists use raw data to accumulate and analyze patterns and trends toward establishing a Genealogical Proof. Evidence in the genealogical community is generally understood as pieces of data that have been collected, sifted, and arranged. Evidence, positive or negative, is acquired by examining and modeling data using generally accepted processes. One such process is to use the computer application Evidentia. Others that enable the development of evidence are those used in the legal and forensics professions (e.g., DNA analysis).

Each point of data genealogists use is inspected, cleansed, transformed, and modelled. Most serious genealogists use the Genealogical Proof Standard. While this standard is more qualitative than quantitative, the result is the same: actionable information used to formulate decisions.

Put simply, the Genealogical Proof Standard proceeds as follows: formulating a research question, gathering data sources, considering the information in those sources, formulating evidence from that information, and finally constructing a proof statement. The process is generally iterative, since there is no such thing as a final statement of proof in genealogy.

While traditional data analysis is generally thought to be quantitative, there is much similarity to the genealogical research process. The steps in data analysis are analogous to the process used by genealogy professionals. Data analysis begins with a research question, followed by compiling source information, and finally, generating actionable conclusions.

Research Question

Sometimes thought of as a hypothesis, the research question is the beginning of both genealogical research and data analysis. Genealogists formulate a question by asking something such as “Who was Joan Jones’ mother?” Data analysts ask, “How is product A better than product B?” The answers come in basically the same way for both.

Data Collection, Processing, and Cleaning

To answer the research question, both genealogists and data analysts collect, process, and classify data relevant to the issue. Analysts treat almost any data that bears on the question as relevant, but genealogists often go further, collecting source material relevant not only to the issue itself but also to everything surrounding it. Data analysts, on the other hand, stay more focused on the question, locating only data relevant to products A and B.

The difference between traditional and genealogical data analysis is that genealogists have much fuzzier information to deal with. Items like local and regional history books may include data about their question. Such items are generally not relevant to a data analyst focused on a product research project, unless it involves cultural appropriation, e.g., the Korean carmaker Hyundai’s Tucson vehicle. 😊

Exploratory Analysis

Genealogists often explore different sets of data to glean information and evidence relevant to their questions. A traditional data analyst does the same, but focuses more on specific items than on general ones.

Modelling and Algorithms

There are no “real” algorithms for genealogists to apply to their data findings. There is, however, a Genealogical Data Model, which was constructed to help genealogists apply their data to real-world projects. The Genealogical Data Model was originally constructed to be a basis for software, but since it was completed, no software has used the GDM (except for The Master Genealogist, which used large parts of it).

Data Products and Communications

Genealogists use a proof model to present data and their formulation of the evidence they’ve compiled. A traditional data analyst uses a tool such as business intelligence software to present their findings. The only real difference between the two is the form the presentation takes.


Working Wednesday: Gigging at Fiverr

I just put up two gigs on Fiverr (see below) and so far, so good.

The vibe is like how GenLighten operated five years ago before it changed to a subscription model and shut down. There you were able to put up an offering for all to view and buy. At Fiverr you do the same in a comparable way.

The experience at Fiverr is much better, though, as you can see more stats about how well your offers are being received. You can also create links for marketing, which you could not easily do at GenLighten.

Right now, the competition in the Genealogy category seems all right. Most entries in the genealogy / family history section are good, and some not so much. I can’t comment on the quality of the deliverables, though, since I’m a seller, not a buyer.

Here are my two current gig listings:
Full-scale genealogy research
Obituary search

I am thinking about adding a third gig, for proofreading, editing, and writing family histories.



Alaska Genealogical Resources

Here’s a link for the Alaska State Library’s genealogy resources. They even have a WorldCat link so you can search for other libraries with the same materials. Some of these materials are also available at Seattle area archives where I’m a researcher available for hire.


Sanity Checking with Multiple Genealogy Programs

There are some people, like me, who are not quite satisfied with just one program for genealogical purposes. I use several, and keep an eye on the others for features that might suit me well.



LifeLines is one of the programs I use on a semi-regular basis. It is an old program (console window, anyone?) and has a long history of strong development by the maintainers. Thomas Wetmore originally wrote it back in the olden days of Unix and DOS, but it’s still around.

One of the things I like about LifeLines is its powerful scripting language. This language takes a bit of getting used to, but once you know it, it seems intuitive. The program comes bundled with a lot of scripts, some better than others, and some near-duplicates of others. The verify script is one of the most powerful sanity-checkers on the market (did I mention that LifeLines is free?). Some of the things it reports on are age boundaries (birth, marriage, and death), multiple marriages, kids out of order, and so on. Several scripts check for people who might be in the Social Security Death Master Index. Another script called, weirdly enough, zombies, looks for people who don’t have death items (death, probate, burial, and so on).

I ran verify recently on a 5500+ person database and it came up with nearly 1600 items that it thought were interesting and that fell outside of user-programmable boundaries. This report is not for the faint of heart, as it can be a lot to digest. The nice thing about it is that when I go through it, item by item, I can tighten up the quality of the data on a semi-regular basis and gradually bring the entire database to a consistent state. It might take years to finally go through the entire list and complete each item, but knowing about these items is the important thing.
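To make the idea of boundary checks concrete, here is a minimal sketch in Python. It is not the LifeLines verify script and is not written in the LifeLines report language; the Person record, the field names, and the boundary values are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical record layout, not the LifeLines data model.
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    marriage_years: List[int] = field(default_factory=list)
    child_birth_years: List[int] = field(default_factory=list)  # in recorded order


# Assumed, user-adjustable boundaries.
MAX_AGE_AT_DEATH = 100
MIN_AGE_AT_MARRIAGE = 15


def check_person(p: Person) -> List[str]:
    """Return a list of items that fall outside the boundaries above."""
    issues = []
    if p.birth_year is not None and p.death_year is not None:
        age = p.death_year - p.birth_year
        if age < 0:
            issues.append(f"{p.name}: death before birth")
        elif age > MAX_AGE_AT_DEATH:
            issues.append(f"{p.name}: age at death {age} is over {MAX_AGE_AT_DEATH}")
    if p.birth_year is not None:
        for year in p.marriage_years:
            if year - p.birth_year < MIN_AGE_AT_MARRIAGE:
                issues.append(f"{p.name}: married at age {year - p.birth_year}")
    if len(p.marriage_years) > 1:
        issues.append(f"{p.name}: multiple marriages ({len(p.marriage_years)})")
    if p.child_birth_years != sorted(p.child_birth_years):
        issues.append(f"{p.name}: children out of birth order")
    return issues


# Example run on one made-up person.
sample = Person("Example Person", 1800, 1910, [1810, 1830], [1835, 1832])
for item in check_person(sample):
    print(item)
```

Each flagged item is only a prompt to go back to the sources; a long-lived ancestor or a second marriage may be perfectly correct.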

Like verify, the zombies script reads through the database, but it plucks out those people who have no death items. This report is much simpler, and sortable so you can find the people by year instead of in database order. The great thing about this report is that you find out who in the database is not marked as dead, dead, dead, as in dead. The script doesn’t consider the deceased flag, if there is one on the person, so it makes you think about getting the details, and you’ll want to go out and get them right away.
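The same kind of check can be sketched in a few lines of Python. Again, this is not the actual zombies script; the record layout, the event names, and the choice to sort by birth year are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Event types counted as "death items" (assumed names).
DEATH_EVENT_TYPES = {"death", "probate", "burial"}


@dataclass
class Person:
    # Hypothetical record layout, not the LifeLines data model.
    name: str
    birth_year: Optional[int] = None
    events: List[str] = field(default_factory=list)
    deceased_flag: bool = False  # deliberately ignored by the check below


def find_zombies(people: List[Person]) -> List[Person]:
    """Return people with no death-related event, sorted by birth year.

    People with an unknown birth year are listed last.
    """
    zombies = [p for p in people if not DEATH_EVENT_TYPES.intersection(p.events)]
    return sorted(zombies, key=lambda p: (p.birth_year is None, p.birth_year or 0))


# Example run on a made-up database.
database = [
    Person("Person A", 1900, ["birth"]),
    Person("Person B", 1850, ["birth", "burial"]),
    Person("Person C", None, [], deceased_flag=True),
]
for person in find_zombies(database):
    print(person.name, person.birth_year)
```

Ignoring the deceased flag is the point of the design: a flag says someone is dead, but only a death, probate, or burial item records the details.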

If you’ve added a lot of what I call “the moderns” you’ll want to run one of the SSDI check scripts and follow up with a visit to the Death Master Index on your favorite online site that has one. I used to use the one at Rootsweb, but its owners moved it to their own site for some reason. Shucks, the Rootsweb version was better, IMHO.

Enough about the great LifeLines scripts. Multiple programs for genealogical data analysis are a must if you are serious about the pastime. Knowing which data is good and which is bad is a good idea, as well as ethically correct. My other genealogy programs include an old version of Legacy and a current version of The Master Genealogist.

TMG is the one I use on a regular basis, as it is almost as powerful as LifeLines in the analysis and reporting facets. The only drawback to TMG’s reporting is that it’s not as flexible and programmable as LifeLines. Legacy, on the other hand, even though my copy is quite dated, is pretty good at picking out bad data, too. Even though I haven’t used Legacy for a while, I keep it around, like LifeLines, as a variant-finding tool.