Are You About to Bump into a Data Iceberg?

A lot of unmeasured experience may be hiding from view.

  • Helping you think about what could be missing

Three Easily Remembered Questions

Having a complete set of data doesn’t just mean having the entire file. It means having the entire measurable experience that matters. That measurable experience can be simplified down to three basic questions: what are the things, times, and conditions?

  • Thing. A thing is the who or the what that we are measuring.
  • Time. Time is needed to measure change. We want to know about a who or a what at two different points in time, so that we can make comparisons.
  • Condition. Condition is what we are comparing between two different points in time: for example, an amount or a location.
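One way to make the three questions concrete is to treat every measurement as a (thing, time, condition) combination and compare what you wanted with what you got. The sketch below is only an illustration: the function names and the toy data are made up for this example.

```python
from itertools import product

def expected_measurements(things, times, conditions):
    """Every (thing, time, condition) combination we would like to have."""
    return set(product(things, times, conditions))

def missing_measurements(expected, observed):
    """Combinations we wanted but never recorded: the underwater part."""
    return expected - set(observed)

# Hypothetical toy data: 2 students, 2 days, 2 conditions.
things = ["student_a", "student_b"]
times = ["day_1", "day_2"]
conditions = ["absent", "tardy"]

expected = expected_measurements(things, times, conditions)

# Suppose the file only holds absence records, and one of those is missing.
observed = [
    ("student_a", "day_1", "absent"),
    ("student_a", "day_2", "absent"),
    ("student_b", "day_1", "absent"),
]

missing = missing_measurements(expected, observed)
coverage = len(expected & set(observed)) / len(expected)
print(f"{len(missing)} of {len(expected)} measurements are missing")
print(f"coverage: {coverage:.1%}")
```

The useful part is the order of operations: you write down the full grid first, and only then look at the file.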

An Example: A Troubled School Principal

Let’s put this into a real-world example. Imagine you are working for a school principal. She is concerned about student attendance, and wants the latest data to support a new attendance policy. You might go to the main office and ask for a report on absences. Later that day, you get back the following absence report:

A list of 100 students, showing that five were absent for more than two days between May 1 and May 31.

Breaking this down:

  • Thing: 100 students
  • Time: Month of May
  • Condition: Absence of more than two days

What Did We Miss?

We missed out on a lot of experience that wasn’t measured. We are only seeing the above-water part of the iceberg.

Students: Does the school in fact have 100 students?

Hang on, we have 110 students. Ten of them are from the district we annexed in January. That district uses 5-digit IDs, and we never reassigned them 6-digit IDs, so the report missed them.
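A quick check like the following could catch this kind of gap early. It is a rough sketch with made-up IDs and function names: it compares the enrollment roster against the report, and counts IDs by length to expose a mixed ID format.

```python
def missing_from_report(roster_ids, report_ids):
    """IDs that are enrolled but never appear in the report."""
    return sorted(set(roster_ids) - set(report_ids))

def count_by_id_length(ids):
    """Count IDs by number of digits, to expose mixed ID formats."""
    counts = {}
    for student_id in ids:
        counts[len(student_id)] = counts.get(len(student_id), 0) + 1
    return counts

# Hypothetical IDs, kept as strings so leading zeros would survive.
roster_ids = ["100001", "100002", "100003", "20001", "20002"]
report_ids = ["100001", "100002", "100003"]

missing_ids = missing_from_report(roster_ids, report_ids)
print("Enrolled but not in report:", missing_ids)
print("Roster IDs by length:", count_by_id_length(roster_ids))
```

The comparison only works if you have an independent list of who should be there. The report alone cannot tell you who it left out.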

Days: May has 31 days. Why does the file only have 18 days?

Well, weekends aren’t included. Memorial Day wasn’t counted. Also, the attendance system was down for 2 days.
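The same kind of check works for time. The sketch below builds the list of days you would expect and compares it with the days in the file. The article doesn't name a year or the outage dates, so the year 2021 (when Memorial Day fell on May 31) and the two outage dates are assumptions picked for illustration.

```python
from datetime import date, timedelta

def weekdays_in_month(year, month):
    """All Monday-to-Friday dates in the given month."""
    day = date(year, month, 1)
    found = []
    while day.month == month:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            found.append(day)
        day += timedelta(days=1)
    return found

# Assumptions: the year, and which two days the system was down.
holidays = {date(2021, 5, 31)}  # Memorial Day 2021
outage_days = {date(2021, 5, 12), date(2021, 5, 13)}  # hypothetical dates

weekdays = weekdays_in_month(2021, 5)
expected_days = [d for d in weekdays if d not in holidays]

# Stand-in for the distinct dates that actually appear in the report.
days_in_file = [d for d in expected_days if d not in outage_days]

missing_days = sorted(set(expected_days) - set(days_in_file))
print(len(weekdays), len(expected_days), len(days_in_file))
print("Missing school days:", missing_days)
```

Under these assumptions the counts come out to 21 weekdays, 20 expected school days, and 18 days in the file, so the two outage days show up as the gap.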

Absences: Wait a minute. What do we mean by attendance?

Oops. The principal considers an attendance problem to be either more than 2 days of absence or 1 tardiness mark. We are missing tardiness marks.
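Written as a rule, the principal's definition might look like the sketch below. The function name is made up, and reading "1 tardiness mark" as "at least one" is an assumption. The point is the third outcome: when a measurement is missing, the honest answer is "we can't tell", not "no problem".

```python
def attendance_flag(days_absent, tardy_marks):
    """True if flagged, False if cleared, None if a missing value leaves it open.

    Pass None for a value that was never measured.
    """
    if days_absent is not None and days_absent > 2:
        return True
    if tardy_marks is not None and tardy_marks >= 1:  # assumption: one or more
        return True
    if days_absent is None or tardy_marks is None:
        return None  # not enough information to clear the student
    return False

print(attendance_flag(3, None))  # flagged on absences alone
print(attendance_flag(0, None))  # unknown: tardiness was never measured
print(attendance_flag(0, 0))     # cleared: both conditions were measured
```

With the original report, most of the 95 students who looked fine would land in the "unknown" bucket, because their tardiness was never measured.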

Going Further

But, as it turns out, you don’t even have 1,800 measurements (100 students × 18 days). When you look at the report more closely and do some quick calculations, you find that you only have 1,272, which is only 29% of what you want. Why? We missed even more measurable experience, stemming from a combination of second-order questions:

10 students are missing two days of timesheets.

It turned out their regular homeroom teacher was out sick, and the substitute forgot to turn in the timesheets.

8 students’ timesheets are missing tardiness marks.

They were on a work-study month-long assignment; their work sponsor only marked down absences. The students’ persistent tardiness came out in the negative written reviews.

There are 2 days in the report when we have null values for absences.

The system also had a two-day glitch, during which it recorded tardy marks but not absences.
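The arithmetic behind the headline numbers can be sketched as follows. The 1,800 and 1,272 figures come from the text above. The denominator is not spelled out in the text, so the target below (110 students × 20 school days × 2 conditions) is an assumption: it is one reading that reproduces the 29% figure, not necessarily the original calculation.

```python
# Figures stated in the article.
students_in_report = 100
days_in_report = 18
report_cells = students_in_report * days_in_report  # 1,800 measurements
usable_cells = 1272  # what is left after the second-order gaps

# Assumption: what "what you want" might mean.
students_enrolled = 110  # includes the 10 annexed-district students
school_days = 20         # the 18 days in the file plus the 2 outage days
conditions = 2           # absences and tardiness
wanted_cells = students_enrolled * school_days * conditions

coverage = usable_cells / wanted_cells
print(f"report: {report_cells}, wanted: {wanted_cells}, coverage: {coverage:.0%}")
```

Whatever the exact denominator, the habit is the same: multiply out things × times × conditions for what you want before you count what you have.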

Seeing The Whole Picture, and The Problem

We’re using an iceberg analogy and showing it with a Venn diagram. A Venn diagram like this can’t be perfectly calibrated to the proportion of information that is missing. However, it organizes how we think about the problem, and it helps us visualize what’s there and what’s missing.

In Conclusion

When you start your data project, you now have a way, at the beginning (!) rather than at the end, to:

  • Ask questions about what might be missing
  • Visualize what you have, and what’s missing
