Genealogy as a Form of Data Analysis

This paper is loosely based on the Wikipedia article entitled “Data Analysis” and the book Mastering Genealogical Proof.

Genealogists accumulate raw data and analyze it for patterns and trends, working toward a Genealogical Proof. Evidence in the genealogical community is generally understood as pieces of data that have been collected, sifted, and arranged. Evidence, positive or negative, is acquired by examining and modelling data using generally accepted processes. One such process uses the computer application Evidentia. Other processes that help develop evidence are those used in the legal and forensics professions (e.g., DNA analysis).

Each point of data genealogists use is inspected, cleansed, transformed, and modelled. Most serious genealogists use the Genealogical Proof Standard. While this standard is more qualitative than quantitative, the result is the same: actionable information used to formulate decisions.

Put simply, the Genealogical Proof Standard process runs as follows: formulating a research question, gathering data sources, considering the information in those sources, formulating evidence from that information, and finally constructing a proof statement. The process is generally iterative, since there is no such thing as a final statement of proof in genealogy.

While traditional data analysis is generally thought to be quantitative, there is much similarity to the genealogical research process. The steps in data analysis are analogous to the process used by genealogy professionals. Data analysis begins with a research question, followed by compiling source information, and finally, generating actionable conclusions.

Research Question

Sometimes thought of as a hypothesis, the research question is the beginning of both genealogical research and data analysis. Genealogists formulate a question by asking something such as “Who was Joan Jones’ mother?” Data analysts ask, “How is product A better than product B?” The answers come in basically the same way for both.

Data Collection, Processing, and Cleaning

To answer the research question, both genealogists and data analysts collect, process, and classify data relevant to the issue. Analysts treat almost all data bearing on the issue as relevant, but genealogists often go further, collecting source material relevant not only to the issue but also to its surroundings. Data analysts, on the other hand, stay focused on the question itself, locating only data relevant to products A and B.

The difference between traditional and genealogical data analysis is that genealogists have much fuzzier information to deal with. Items like local and regional history books may include data bearing on their question. Such items are generally not relevant to a data analyst focused on a product research project, unless it involves cultural appropriation, e.g., the Korean carmaker Hyundai’s Tucson vehicle. 😊

Exploratory Analysis

Genealogists often explore different sets of data to glean information and evidence relevant to their questions. A traditional data analyst does the same, focusing more on specific items than on general ones.

Modelling and Algorithms

There are no “real” algorithms for genealogists to apply to their data findings. There is, however, a Genealogical Data Model, which was constructed to help genealogists apply their data to real-world projects. The Genealogical Data Model was originally constructed to be a basis for software, but since it was completed, no software has used the GDM (except for The Master Genealogist, which used large parts of it).

Data Products and Communications

Genealogists use a proof model to present data and their formulation of the evidence they’ve compiled. A traditional data analyst uses a tool such as business intelligence software to present their findings. The only real difference between the two is that they present findings in a different way.

NPM

Fascinating New Find About Richard Mellen

Just found a newish website whose author is apparently unwilling to commit to scholarly diligence and credibility. Doug Sinclair wrote a paper about Richard Mellen of Massachusetts, alleged father of Simon Mellen of Sherborne and Framingham.

Mr. Sinclair states that his website doesn’t claim to meet the standard of scholarly journals, and it certainly doesn’t.

Further, he ignores the clear, cogent, and concise statements at the beginning of my paper on Richard Mellen, and of the genealogy of Simon Mellen, that they are each “extended literature review[s].” The next paragraph in each of those publications states just as plainly that the focus was on “published record sets.” What is not credible or clear about that?

Mr. Sinclair fails to cite his sources for the statements concerning my work, yet appears to cite everything else he writes about. Apart from pictures, nothing new was presented that hadn’t already been written about by me or others. One picture in particular shows he read my blog article and paper about Richard “Maling” aka “Waling.” The picture he posted clearly shows a “W”.

Any diligent genealogical researcher will use both of our works as clues only. Any diligent genealogical researcher will also follow the BCG Code of Ethics in using our works.

NPM

Friday Funnies – A Mousic Obituary

OBITUARY

The following mousic obituary is taken from the Portsmouth Evening Times:

In this city, Dec. 1st, “James D.” mouse, owned by Mr. James D. Potter, (colored) of this city, formerly of Port[l]and, at the age of 4½ years of old age and paralysis of the heart. This was a common gray mouse which Mr. Potter had trained and exhibited in many cities in this country and Canada. The mouse was forwarded to Boston by express this morning to be stuffed and when returned will be placed in the little cage which has been his home for 4 years. The mouse funeral will be held in City Hall. A special invitation has been extended to Chandler’s band and Neal Dow to be present. The mouse was insured in Chicago for one hundred dollars and Mr. Potter says he would not have taken $500 for it and will wear mourning all over his face as long as he lives. [Montreal, Chicago and Portland papers please copy.

Daily Eastern Argus, Portland, Maine, Wednesday, 6 December 1882, page 1, column 7.

ICYMI: Type for Genealogists : Linux Libertine and Biolinum

Linux Libertine Typeface Sample

One of my favorite typefaces these days is the Linux Libertine/Biolinum family. Together, the serif Linux Libertine and the sans serif Linux Biolinum form a set of fonts with a more complete array than one normally gets in a free package. In addition to the typical roman (yes, it is lower case), bold, and italic, you get

  • Capitals
  • Slanted (or Oblique)
  • Display
  • Initials

The John¹ Burbank Descendants family sketch I posted some years ago uses the Linux Libertine roman, Capitals, and Slanted fonts. The Display and Initials fonts are for uses other than those usually found in genealogical text. The Display font can be used for titling, for instance, and the Initials font for decorative touches.

Despite their name, Linux …, they are universally usable Unicode typefaces. You can use them as defaults on a Windows 10/11 machine. For more information, see the Wikipedia page about them.

NPM

© 2021 N. P. Maling

Crafting a research question

Many good genealogy programs can help you get started crafting a good research question with their to-do features. RootsMagic has a good one, illustrated below.

The questions to ask before adding a new task are:

  • Who are you going to research?
  • What do you want to learn about the person?
  • Where was the person you are researching?
  • When was the person there?

Additionally, you might ask: Why was the person there at that time? This might seem like an existential question, but it is a good idea to add context to your family history.

These five questions get you started on the way to learning more about your ancestor.

The who is simple enough. The what can include any number of items, such as where and when they were born, when they immigrated, where they emigrated from, whom they lived with, married, or divorced, and so on. Where and when are a bit more complex due to the possible lack of information.

For instance, Lydia Peirce Gorton was born on 28 January 1822. I’ve got her birth date but no birthplace. I want to know where she was born, so I ask, “Where were Lydia Peirce Gorton’s parents, Daniel and Lydia (Peirce) Gorton, when Lydia was born in 1822?” The who, what, and when parts of this question are answered, but the best part is still unanswered: “where”?
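For readers who like to see this as data, here is a minimal Python sketch of the same idea. It is only an illustration, not how RootsMagic or any other program stores its to-do items; the `ResearchQuestion` class and its `unanswered` method are names I made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResearchQuestion:
    """Hypothetical to-do item framed by who/what/where/when, plus an optional why."""
    who: Optional[str] = None
    what: Optional[str] = None
    where: Optional[str] = None
    when: Optional[str] = None
    why: Optional[str] = None

    def unanswered(self) -> list[str]:
        """Return the core prompts that are still blank."""
        core = ("who", "what", "where", "when")
        return [name for name in core if getattr(self, name) is None]


# The Lydia Peirce Gorton example: everything is known except the place.
question = ResearchQuestion(
    who="Lydia Peirce Gorton",
    what="birthplace",
    when="28 January 1822",
)
print(question.unanswered())
```

Whatever the list comes back with is the part of the question the research still has to answer.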

The records I’ve got so far say different things: that she was born in Massachusetts, in New York, or in Vermont. Most likely she was born in Vermont, though. I can make this hypothesis because her older brother was born there, and a few original records say so. This leads me to focus my question even more on Vermont records. Massachusetts records are very complete for the time, and there is no indication her siblings were born there. New York state records, on the other hand, are problematic, so they will have to wait for a while.
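The reasoning in the paragraph above can be roughed out as a simple tally. This is a sketch under loud assumptions: the record list, the "original" and "derivative" labels, and the weights are all invented for illustration and are not my actual sources. A tally like this only suggests where to look first; it does not replace correlating the evidence.

```python
from collections import defaultdict

# Hypothetical claims: (stated birthplace, kind of record). Illustrative only.
claims = [
    ("Vermont", "original"),
    ("Vermont", "original"),
    ("Massachusetts", "derivative"),
    ("New York", "derivative"),
]

# Assumed weights: original records count for more than derivative ones.
WEIGHTS = {"original": 2, "derivative": 1}


def rank_places(claims):
    """Sum the weights per place and return (place, score) pairs, highest first."""
    scores = defaultdict(int)
    for place, kind in claims:
        scores[place] += WEIGHTS[kind]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_places(claims)
for place, score in ranking:
    print(place, score)
```

The top-ranked place is the working hypothesis, which is why the Vermont records get searched first.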

In this particular case, I ask: why weren’t the parents in the records for Lydia’s potential birthplace? Were they there, just not recorded anywhere? These questions lead me to ask about the area where they may have been, to find out more about possible record sources. I also learn about the culture in that area, why the records may not exist, and what the economic conditions were during that period.

The process of crafting a specific question to be answered is key to great research. Answering the question is done during the research phase of the project. I’ll write more about the research project later this month.

Thoughts?

NPM