How many times have you accidentally stumbled across a massive data quality problem that has gone undetected for months or until the worst possible time? As a professional big data engineer who has built analytics for some well known government and industry customers, I don’t think I can publicly admit how many times I have.
I’ll admit this — having to clean up a database really sucks. If you’re the poor guy stuck holding the keyboard, you have to reverse engineer your whole pipeline in your head. “How in the heck did the data get like that?” “What happens if I just… fix it? Oh yeah… that wouldn’t work without completely ruining this other thing over there.” “If I take all weekend and fix all of this while the customer isn’t using the system, is the problem just going to come back on Monday?” “Why didn’t our implementation team catch this?” “Did the customer even read our data spec?” With enough time and effort, and humbling oneself to your customers for the shameful mistakes you’ve made, you can almost always get the situation under control, but it sure ain’t fun.
DATA DISASTER 1 : THE COMPLETENESS CALAMITY
One of the most cute and clever ideas you might have as a software engineer is to create a single unique ID for an event based on a hash of other fields. Can’t trust a timestamp to be unique? Hash that sucker with the username. If the user might somehow do two things at once — add in a source attribute. So now your ID is something like
HASH(timestamp + username + source). That ought to do it.
You had better check to make sure the
source have 100% completeness (aka no
null data). Otherwise, you’ll find that every system account action (which happens to have
nullusernames and may occur many times per second) or may hash to the same ID, causing overwrites and loss of traceability. If the data doesn’t have timestamps, you’re really done for now — good luck untangling that mess.
DATA DISASTER 2 : THE COVERAGE UMBRAGE
Have you ever typed a query into a search box that didn’t give you quite what you were looking for? Whether it’s a Google search that returned you an uncouth result from urban dictionary, or maybe a giphy query that returns something truly bizarre (see below), you’ve interacted directly or through an API to a database you can’t always predict.
Random queries aside, a typical database query you might write is “get me all the data from yesterday” which you will then run through your daily analytics. This query might look pretty simple, and it is, but if you don’t fully understand how that data is populated, you might miss a few things.
select * from LOG_TABLE where LOG_TIME >=`2021–09–09` AND LOG_TIME < `2021–09–10`
For instance, imagine that this system is aggregating logs from a distributed system. In that system, you have some devices that are always connected and delivering data, and others that are sometimes disconnected or have some kind of delay in reporting (e.g. a mobile phone that doesn’t share data until it’s on wifi). You may not realize that LOG_TIME is when the source event happened, not when the data is loaded into the database.
This leads to the sequence of: (1) log event happens (2) database query happens (3) log event is loaded into the database. After you read the docs, you may realize the field you’re supposed to query to get all results is LOG_DX_EVENT_TIME.
Hopefully the three year backfill query you’re about to write and deliver to your customer doesn’t prompt too many questions about what you might have missed. But, that’s the price you must now pay for 100% data coverage.
DATA DISASTER 3 : THE CONFORMITY ABNORMITY
I entered my career at a fun time, the web services revolution, where all data exchange would be facilitated by self-describing XML that adhere to the lofty standards defined by coalitions of ontologists. Not only did XML give rise to fun names like AJAX and SAX, but it also meant we never had to deal with loosely defined comma separated text files and all the headache that comes with choosing delimiters and escaping quoted strings. Luckily, the only people who use CSV now are… absolutely freaking everyone.
Okay, so we’re stuck with CSV for at least the next eternity, and now we need to figure out how to cope. Parents, once you feel you children are ready for chapter books, please read them RFC4180. It’s great nighttime reading sure to induce sleep, while helping your child navigate the scary world of flat-file output.
For those not following, I’m talking about this. Let’s imagine you have a table export from your favorite database. (MSSQL Server might be my personal favorite for creating janky CSV output.)
Record ID,Name,Role,Home Address 1,Dan,Software Engineering Director,Qualytics Field Office 3 2,Sam,Asst. to the "Dir, SW Engineering",321 Cowboy Commit Ln. Baltimore MD
Now you’ve done it. Because of the extra comma and lack of proper quoting, record 2 will either blow up your system or give you five columns, depending on your error handling. In either case, Sam now resides somewhere in the Software Engineering abyss and far away from his sweet home on Cowboy Commit Ln.
Honestly, XML had plenty of conformity problems also and JSON isn’t perfect either, but I’ll bet you’ll find all sorts of them in your CSV data if you know where to look.
DATA DISASTER 4: THE CONSISTENCY OF MUD
Pop quiz, can you find three differences between these two dates?
2021–09–10T6:10:54 and 2021–09–10T10:10:54.0Z
Guess what, they represent exactly the same time 12 hours before I wrote this article. However, let’s say your customer has recently onboarded a new team member who hasn’t spent days or weeks of their life learning the nuances of the Java time format expressions. They might miss that 6am Eastern (same as 10am UTC) should be represented as 06. Or, they might not know that Z represents Zulu, the aviation term for the Prime Meridian, which happens to be local time for our friends at Royal Greenwich Observatory. Don’t even get me started about .0 and the pain you can endure due to 0 tenths of a second.
Why are we talking at such length about date formats? I’ve had customers just put a ‘Z’ at the end of their local time because our spec said it should have one. I’ve had customers deliver data with absolutely no difference between timestamps at 6am and 6pm. I’ve seen systems that threw out all data between 1am and 10am, because of that “6 should be 06” problem.
By the way, unless you’re in Hawaii, Arizona, or Venezuela, there is this thing called daylight savings time. Sometime around Pi Day, you may realize you’ve had your system misconfigured this whole time. Time to get out your old time-shift-reprocessing script and put in some scheduled downtime on the company calendar.
Wherever you find date/time formatting or configuration issues, you’ll find a lack of data consistency between each system. Events in the analytic system may be represented hours earlier from when they actually happened. Good luck explaining that in your support calls with end users.
DATA DISASTER 5 : THE DUPLICATION DUPLICATION
Last Monday, things were absolutely awesome. Your team had just hit its new annual recurring revenue (ARR) goal and accomplished a significant new milestone. Your bar chart column stood tall and proud, eclipsing the dotted target line that separates the winners from the losers. You proudly proclaimed the team’s excellence at the weekly all-hands meeting and congratulated everyone and their dear mothers for the joy they have brought into the world. You were first to pour the glass of whiskey to toast the company founders and thank everyone who gave you the opportunity to lead this amazing team.
Now it’s 20 minutes before this Monday’s meeting and you just realized… something about the sales contract terms for a huge customer screwed up your import and included the full revenue number in both last quarter and this quarter. Your cumulative sum is way off — you’re now well under goal and things aren’t looking good for hitting your numbers this quarter at all.
What are you going to do now? Fess up to the mistake? Throw the sales associate under the bus? Fix it and hope nobody notices? Blame a glitch in the reporting tool? Fix it tomorrow and take off next Monday? These are all fun choices (except for fessing up) but they each come with consequences that your replacement will have to deal with. You should probably check your data for duplicates next time.
DATA DISASTER 6 : THE TIMELINESS UNGODLINESS
It was a beautiful morning for your end user Martha, no traffic on her morning commute, the automatic coffee maker was full of water and clear of coffee ground sludge, and her computer booted up in under 10 minutes. She is ready to go, and possibly, have the most productive day of her entire career.
Martha navigates to your web application.
“No new alerts.”
Welp, I guess it’s time to TikTok.
What had happened was… the data flowing into your system missed the cutoff window. Due to a software upgrade on the customer’s application database, the automated extract was delayed until the morning. The transformation and load logic diligently waits for the extract. However, your analytic ran in the middle of the night, as it always does, to avoid heavy database use. No new data, no new alerts. No work for Martha.
The on-call staff got the page at 6am about no alerts, but they’re off dealing with a CSV format problem. Data is delivered at 9am when the customer DBA remembers to perform the extract manually… By the time on-call reviews the page, the data was there — depending on who is on-call, the issue either escalates all the way to the CEO or gets ignored.
Hopefully your system is smart enough to trigger when new data arrives or look back a couple of days to handle these untimely data problems. Martha is going to be busy tomorrow.
DATA DISASTER 7 : THE VOLUMETRICS PARADOX
Have you ever written software where you really, really, can’t miss a message? Let’s say it’s an access control system where staff must absolutely scan their badge in the card reader every time they enter or leave a secured lab to use a computer. If they don’t, your software will review the logs at the end of the day and determine that there was a security violation. The next day, identified staff-members will get visited by a security guard and a member of their friendly “people problems” team for… a chat.
What if one day the “people problems” people calls you, the system owner, and asks if you’re absolutely sure that 10 times more people snuck into that lab, while nearly as many checked-in users are nowhere to be found. They say they’ve already interviewed three people and they swear they badged-in correctly. You look at your reports — it sure does seem like a lot of violators yesterday — maybe they drank too much at the group BBQ. Or wait… why are there no badge-ins and outs from 5:30–6.
Well, it turns out, there was a system redeployment around that time. Messages in queue must have been lost. Nobody would have noticed as it was within the threshold you’ve been slowly loosening to account for the regular Friday truants. But… it is pretty light traffic for a Wednesday.
This is one all too real example of how unsophisticated volumetric checking can allow a data quality problem to seep into your system and cascade into a series of problems that put your users, and ultimately your team, in the hot seat.
DATA DISASTER 8 : THE ACCURACY CATASTROPHE
You may have heard this in the news lately, and it’s true, the earth is not round. I’m serious — not round. Not even an oblate spheroid. Sometimes it’s very much flat, and there are a lot of turtles.
Ok, actually the earth a geoid, an irregularly shaped mass of bumps. This raises the question, “How high are you right now?” No, this is not a question about how you’ve almost made it to the end of blog about data quality. It’s legit. What’s your altitude, your elevation, your height? Let me see your telemetry.
Imagine that you’ve got a sweet new camera drone, and you want to pinpoint exactly where you are on the side of the mountain where you took a picture. To do that, you take the direction and angle of the camera, plus vehicle telemetry, plus math, to get the exact coordinates of that air-selfie. You are going to need some pretty tight accuracy on not only the drone’s elevation vs ground but also elevation of the ground vs sea level and how that compares to the World Geodetic System. Otherwise, the search and rescue team is never going to find you.
I know that’s all pretty confusing, it took me a minute too. Don’t worry, here’s a cool video of a camera drone.
All you need to know is that a clean end result is only possible with highly accurate data. Find the best data you possibly can. Check it twice. Anticipate inaccuracy and find solutions to account for the error.
If you’re a pure data geek, there are actually some fun problems to solve once the data is totally flummoxed. For the rest of us, who just want to trust our data, invest up front in data quality checks. If putting in sufficient data checks in your system is really hard to get right (and it is), then check out the data quality analytics we’re building at Qualytics.
Qualytics is the complete solution to instill trust and confidence in your enterprise data ecosystem. It seamlessly connects to your databases, warehouses, and source systems, proactively improving data quality through anomaly detection, signaling and workflow. Check out our website to learn more about how you can start trusting your data.