Genealogy as a Form of Data Analysis

This paper is loosely based on the Wikipedia article entitled “Data Analysis” and the book Mastering Genealogical Proof.

Genealogists use raw data to accumulate and analyze patterns and trends toward establishing a Genealogical Proof. Evidence in the genealogical community is generally understood as pieces of data that have been collected, sifted, and arranged. Evidence, positive or negative, is acquired through examining and modeling data using generally accepted processes. One such process is to use the computer application Evidentia. Other processes enabling the development of evidence are those used in the legal and forensics professions (e.g., DNA analysis).

Each point of data genealogists use is inspected, cleansed, transformed, and modelled. Most serious genealogists use the Genealogical Proof Standard. While this standard is more qualitative than quantitative, the result is the same: actionable information used to formulate decisions.

Put simply, the Genealogical Proof Standard process runs as follows: formulating a research question, gathering data sources, considering the information in those sources, formulating evidence from that information, and finally constructing a proof statement. The process is generally iterative since there is no such thing as a final statement of proof in genealogy.

While traditional data analysis is generally thought to be quantitative, there is much similarity to the genealogical research process. The steps in data analysis are analogous to the process used by genealogy professionals. Data analysis begins with a research question, followed by compiling source information, and finally, generating actionable conclusions.

Research Question

Sometimes thought of as a hypothesis, the research question is the beginning of both genealogical research and data analysis. Genealogists formulate a question by asking something such as “Who was Joan Jones’ mother?” Data analysts ask, “How is product A better than product B?” The answers come in basically the same way for both.

Data Collection, Processing, and Cleaning

To answer the research question, both genealogists and data analysts collect, process, and classify data relevant to the issue. Analysts see almost all data bearing on the issue as relevant, but genealogists often go further, collecting source material relevant not only to the issue but also to its surroundings. Data analysts, on the other hand, stay more focused on the question itself, locating only data relevant to products A and B.

The difference between traditional and genealogical data analysis is that genealogists have much fuzzier information to deal with. Items like local and regional history books may include data about their question. Such items are generally not relevant to a data analyst focused on a product research project, unless it involves cultural appropriation, e.g., the Korean carmaker Hyundai’s Tucson vehicle. 😊

Exploratory Analysis

Genealogists often explore different sets of data to glean information and evidence relevant to their questions. A traditional data analyst does the same, focusing more on specific items than on general ones.

Modelling and Algorithms

There are no “real” algorithms for genealogists to apply to their data findings. There is, however, a Genealogical Data Model (GDM), which was constructed to help genealogists apply their data to real-world projects. The GDM was originally intended as a basis for software, but since it was completed no software has adopted it in full; The Master Genealogist, which used large parts of it, is the one exception.

Data Products and Communications

Genealogists use a proof model to present data and their formulation of the evidence they’ve compiled. A traditional data analyst uses a tool such as business intelligence software to present their findings. The only real difference between the two is that they present findings in a different way.

NPM

ICYMI: Type for Genealogists: Linux Libertine and Biolinum

Linux Libertine Typeface Sample

One of my favorite typefaces these days is the Linux Libertine/Biolinum family. The serif Linux Libertine and the sans-serif Linux Biolinum together form a more complete set of fonts than one normally gets in a free package. In addition to the typical roman (yes, it is lower case), bold, and italic, you get:

  • Capitals
  • Slanted (or Oblique)
  • Display
  • Initials

The John¹ Burbank Descendants family sketch I posted some years ago uses the Linux Libertine roman, Capitals, and Slanted fonts. The Display and Initials fonts are for uses other than those usually found in genealogical text. The Display can be used for titling, for instance, and the Initials for decorative touches.

Despite their name, Linux …, they are universally usable Unicode typefaces. You can use them as defaults on a Windows 10/11 machine. For more information, see the Wikipedia page about them.

NPM

© 2021 N. P. Maling

Review: RootsMagic 8

I just purchased RootsMagic, version 8 (RM8). The initial purchase, setup, and registration went smoothly. I was able to upgrade from a registered version 6 with no problem.

The interface is reminiscent of Evidentia, with no menu bar or toolbar across the top as there was in previous versions. It is taking a bit of getting used to. There is a search field in the Command Palette in the upper right part of the main screen that will lead you to what you need to find. I also like the font scaling feature in the Settings menu.

I was able to convert a RootsMagic Essentials 7 database with no problems. Importing a GEDCOM 5.5.1-compliant file also went smoothly. I expected some errors and omissions during the conversions, but I saw no problems reported by RootsMagic. Looking at the database folder on my computer, I found the .lst file containing a bunch of “unknown info” from the GEDCOM import. None of it was significant, though. So far, so good.
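For anyone who wants to peek inside a GEDCOM file before importing it, here is a rough sketch in Python. It relies only on the basic GEDCOM 5.5.1 line layout (a level number, an optional @XREF@ identifier, a tag, and an optional value) and on the rule that user-defined tags begin with an underscore. It is not RootsMagic's import logic, and it makes no claim about what RootsMagic writes to its .lst file; the function names and the sample records are made up for illustration.

```python
import re
from collections import Counter

# One GEDCOM line: level, optional @XREF@ identifier, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


def tally_tags(lines):
    """Count how often each tag occurs, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[parse_line(line)[2]] += 1
    return counts


def custom_tags(counts):
    """Return the user-defined tags (those starting with an underscore)."""
    return sorted(tag for tag in counts if tag.startswith("_"))


# Hypothetical sample records, not taken from a real database.
SAMPLE = """0 HEAD
1 GEDC
2 VERS 5.5.1
0 @I1@ INDI
1 NAME Joan /Jones/
1 _UID ABC123
0 TRLR""".splitlines()

if __name__ == "__main__":
    counts = tally_tags(SAMPLE)
    print("Tags seen:", dict(counts))
    print("User-defined tags:", custom_tags(counts))
```

User-defined tags are one place where programs may differ in what they understand, so a quick tally like this can hint at what an import might set aside.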

The Sources listing screen in RM8 is quite different. It is, however, somewhat like what was in The Master Genealogist. One annoyance is that I cannot select whole words using the Control-Shift-arrow keys. I have to select text using just the Shift-arrow keys, character by character.

Although my source list got messed up during the transition from The Master Genealogist version 8 to RootsMagic 6 some years ago, and I have since edited it by hand in the raw GEDCOM file, the sources appear well enough. In the narrative report, anyway, they are fine.

Running a Narrative Report is easy, and there are a few choices for details. To test-drive the report on my main family line, I chose the NEHGS Register style, which I am familiar with. I would recommend that the developers read a couple of the current style guides that detail the Register style. If you were to use this style without a care for the looks, you’d be fine, but me, I have a lot of clean-ups to do.

I will stick with LifeLines and Ancestris for their detailed and in-depth problem reports, and make any resulting changes in RootsMagic. These two other programs are free, open-source software projects which are quite good themselves. I’ve also coded a rough draft of a Register-style report in the LifeLines report language, which is quite useful.

I have not tried the connections to FamilySearch or Ancestry yet. I do not have a paid Ancestry account, so will not be able to use that connector. I maintain a small tree on FamilySearch but all of it is already hand-entered in my database off-line. I like the idea that there are tutorials available for using RootsMagic and FamilySearch; I will get to those someday.

Overall, this update to RootsMagic is great. I will continue using it as one tool in my toolkit.

NPM

One Way to Cite U.S. Census Images with Zotero

Using the Manuscript item type, enter the Author as Last, First, and then enter the City, County, State, and other location information in the Archive field. Following that, enter the Census title in the Location in Archive field.

For example:

Author Last: Mellen

Author First: John

Date: 1752

Archive: Cambridge, Massachusetts, page 269, NARA micropublication M252, roll 20

Location in Archive: 1800 U.S. Census

The first citation will look like:

John Mellen (n.d.), 1800 U.S. Census, Cambridge, Massachusetts, page 269, NARA micropublication M252, roll 20.

The subsequent citations will be simply the surname, unfortunately. You can work around this by copying and pasting more of the citation to differentiate between different persons of the same name in the document. In the above example the result would look like

Mellen, 1800 U.S. Census, Cambridge.

Also, the “(n.d.)” field in the first citation is not necessary and can be removed, at the risk of breaking the link to the main data in Zotero. An alternative is to enter the person’s birth year in the Date field, as with 1752 in the example above, to differentiate similarly named persons.
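To show how the fields combine, here is a small Python sketch that builds the two citation strings described above. It only mimics the strings shown in this post and is not how Zotero formats anything internally; the CensusItem class and its field names are hypothetical, and the idea that a filled-in Date simply takes the place of “n.d.” is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CensusItem:
    """Hypothetical stand-in for a Manuscript item holding a census entry."""
    last: str
    first: str
    archive: str          # place, page, and microfilm details
    loc_in_archive: str   # census title
    date: str = ""        # blank, or a birth year

    def first_citation(self):
        # Assumption: a blank Date shows as "n.d."; a year would replace it.
        date = self.date or "n.d."
        return (f"{self.first} {self.last} ({date}), "
                f"{self.loc_in_archive}, {self.archive}.")

    def subsequent_citation(self):
        # The hand-extended short form: surname, census title, and town.
        town = self.archive.split(",")[0].strip()
        return f"{self.last}, {self.loc_in_archive}, {town}."


mellen = CensusItem(
    last="Mellen",
    first="John",
    archive="Cambridge, Massachusetts, page 269, "
            "NARA micropublication M252, roll 20",
    loc_in_archive="1800 U.S. Census",
)

if __name__ == "__main__":
    print(mellen.first_citation())
    print(mellen.subsequent_citation())
```

With the Date left blank, the two printed lines match the first and subsequent citations shown in the example.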

NPM

Centurial – Evidence-based Genealogy Software

I looked at a piece of software called Centurial in 2019. It was quite interesting in its approach to doing genealogical analysis. This review is very dated in its comments, so proceed with caution. [13 May 2022]

The design of the application is quite different from anything I’ve seen in my 25+ years of playing with genealogical software.

The main feature is the correlation of source materials to instances of persons in such a way that there is little doubt that they refer only to each other. Centurial has a scrollable and zoomable visual “network” view so you can see the relations of persons to each other. Sources are entered in a very nice way, according to the E. S. Mills Evidence Explained format.

There is a space in the analysis pane to enter a proof argument, but there is currently no way to output that information other than copy and paste, not even to a basic HTML document. You can, however, export a GEDCOM file with the tree you’ve built for transfer to a GEDCOM-based program such as Family Tree Maker or Brother’s Keeper.

The Centurial author discusses the differences between his data model and the GEDCOM model on the website referenced below. His website has some small amount of documentation but otherwise you are on your own to figure out how to use the program.

One of the few drawbacks I noticed is that it takes some time to import and convert an average GEDCOM file. For instance, my regular-use GEDCOM is only about 2.5 megabytes and the converted project file is about 25 megabytes. That is a serious size difference. I haven’t looked at the insides of the project file to see what’s what, but I suspect that there is a heckuva lot of XML markup in there.

Centurial is written in English with a European flavor. You may want to view the three tutorials on YouTube before you download and start working with Centurial. They will explain quite a bit about how the author uses it and the potential use cases you may have for it.

Overall, I think Centurial is quite an achievement software-wise. It is not really intuitive but then genealogy itself isn’t all that easy. As it has only been around for a couple of years, I doubt many people have heard of it, though. I recommend it for intermediate or advanced genealogists.

Centurial is available at https://www.centurial.net/ and was free in 2019. It does require registration and some data collection by the author. It is also a Windows-only product (7, 8, or 10) and requires a recent version of the .NET Framework. Personally, I’d like to see a Linux version as well because that’s what I use most of the time.