The State of Data Quality

Data quality is a problem for many. As company owners and operators – anywhere from the C-suite to the mailroom – we make thousands of decisions every day by looking at data that may live in our home-grown or SaaS products, in databases or data warehouses, raw or aggregated into KPIs. As we grow more dependent on data, there is a growing need to ensure that the data we look at is of sufficient quality. In this article, we take a 5W1H approach to data quality monitoring.

Why Data Quality?

Have you ever looked at a data point and felt in your gut that it was wrong? How often have you been in a boardroom looking at a KPI that seemed too good to be true? Have you ever thought about how many wrong decisions you have made based on poor-quality data? Unfortunately, you’re not alone.

Checking data for quality is no easy feat – did you know that data scientists spend 60-80% of their time cleaning data? With the ever-changing nature of data, conditional logic built from basic rules is simply not enough; we need machine learning methods that continually learn from data as it changes over time. Enterprises can’t be expected to invest millions of dollars in one-off, in-house systems for data quality monitoring, so the appetite for robust solutions that automate data quality checks is growing.

Gartner estimates that bad data costs the average US business $15M per year. And that is just the per-company average – IBM estimated in 2016 that the real cost of bad data to the US economy is $3.1T. With surging SaaS adoption, cloud adoption and cloud data warehousing – where almost 50% of the $100B DBMS spend is now in the cloud – the case for growth in data quality monitoring is self-evident. Gartner put the data quality market at $1.77B in 2019, growing 8% per year; we believe this grossly underestimates the true size of the problem, simply because there are not enough data quality tools out there yet.

While not every dollar lost to bad data can be recovered, we can certainly do better than the status quo.

Who needs Data Quality?

The data quality problem applies to any company that uses any kind of data to make any decision. Naturally, the scale of the problem changes with the company’s size, the nature of its operations and its industry. We have seen health systems where a large percentage of patient records are missing gender, where male patients are recorded as pregnant, and where patients are listed as over 130 years old. We have seen advertising aggregation systems collating data from 500 systems and finding missing, duplicate or mismatched data from the largest players in the advertising space. We have seen financial services companies gathering data from external lead sources and facing similar data quality issues. We have seen private equity firms facing data quality issues in the internal analytical systems they use to assess the performance of their investments.

The problem is everywhere and it’s not industry-specific. 

What is Data Quality?

We spent a lot of time dealing with data quality issues before we decided to take a stand and build Qualytics (more on this later). Drawing on our own experiences and those of our network, we ultimately defined what we call the Qualytics 8. We believe at our core that any data quality program needs to assess data against the eight categories below.

Completeness: availability of required data attributes
Coverage: availability of required data records
Conformity: alignment of content with required standards and schemas
Consistency: how well the data complies with required formats and definitions
Duplication: redundancy of records and/or attributes
Timeliness: currency of the content, and whether data is available and usable when needed
Volumetrics: volume of data per row / column / client / entity over time
Accuracy: relationship of the content to its original intent

While some players in the market are aligned with our thinking here, there is a tendency to pick and choose which of the eight to monitor. We believe you can’t do some without the others.

The recipe from here is somewhat straightforward: come up with data quality “rules” in each category and apply them to catch anomalies.
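
To make that concrete, here is a minimal sketch in Python – illustrative only, not the Qualytics implementation – of rules tagged with a few of the eight categories and applied to a single record. The field names and thresholds are assumptions chosen for the example.

```python
# Illustrative only: hand-written rules, each tagged with a Qualytics 8 category
# and expressed as a predicate that a healthy record should satisfy.
from datetime import date

RULES = [
    ("Completeness", "gender is populated",
     lambda r: r.get("gender") not in (None, "")),
    ("Conformity", "gender uses an allowed code",
     lambda r: r.get("gender") in ("M", "F", "U")),
    ("Accuracy", "age is plausible",
     lambda r: 0 <= r.get("age", 0) <= 130),
    ("Timeliness", "record was updated in the last 30 days",
     lambda r: (date.today() - r["updated"]).days <= 30),
]

def check(record):
    """Return the (category, description) of every rule the record violates."""
    return [(cat, desc) for cat, desc, ok in RULES if not ok(record)]

# A record echoing the anomalies described earlier: blank gender, age over 130.
suspect = {"gender": "", "age": 142, "updated": date(2021, 1, 4)}
print(check(suspect))
```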

When should we be thinking about the quality of our data?

This is one of the biggest differentiators between the players in the space. Some believe that data quality issues should be caught when data is in-flight, ultimately applying rules in a near-real-time fashion in pipelines. Others believe in applying the rules directly to a data warehouse through SQL and other methods after data has been “parked”.

At Qualytics, we strongly believe that data quality issues need to be detected in-flight, before the data is stored. Cleaning up bad data after the fact is a fundamental problem, and alerting that a data point is “bad” is not enough – we believe we need to stop bad data points from ever entering downstream systems.
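
As a sketch of what that could look like in practice – again assumed code reusing the check() helper from the previous example, not the Qualytics product – an in-flight screen routes violating records into a quarantine instead of letting them reach downstream systems:

```python
# Illustrative in-flight screening: clean records continue downstream,
# violating records are quarantined together with the rules they broke.
def screen(records, downstream, quarantine):
    for record in records:
        violations = check(record)   # rule check from the earlier sketch
        if violations:
            quarantine.append({"record": record, "violations": violations})
        else:
            downstream.append(record)

clean_records, quarantined = [], []
screen(
    [{"gender": "M", "age": 42, "updated": date.today()}, suspect],
    clean_records, quarantined,
)
print(len(clean_records), len(quarantined))   # 1 clean, 1 quarantined
```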

Where should we be reviewing our data?

This is the second big differentiator between players in the space. Some believe this is a data warehouse issue and only support Snowflake, Google BigQuery or AWS Redshift. We know that this problem is bigger than data warehousing; we need to be able to apply the same methods of data quality monitoring to product databases (SQL or NoSQL), SaaS products (think Salesforce), and even files.
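
One way to picture that breadth – a purely hypothetical sketch, not how Qualytics connects to sources – is a thin adapter layer that turns every source into plain records, so the same rules can run no matter where the data lives:

```python
# Hypothetical adapters: each source yields plain dict records, so identical
# quality rules apply whether data comes from a file, a database or a SaaS tool.
import csv
import json

def records_from_csv(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def records_from_jsonl(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# A real system would add comparable adapters for warehouses (Snowflake,
# BigQuery, Redshift), operational SQL/NoSQL databases, and SaaS APIs such
# as Salesforce.
```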

How do we review our data quality?

This is where the real differences between the players lie. We believe a data quality solution needs to satisfy a few basic principles (a toy sketch of the first one follows the list):

  1. Can we infer rules in an automated fashion?
  2. Can we apply these rules to data in a streamlined fashion?
  3. Can we do this in near-real-time?
  4. Can we bridge the chasm between humans and technology to learn and advance rules?
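
To illustrate the first principle with a toy example – assumed data and an assumed approach, not how Qualytics infers its rules – a numeric range rule can be learned from historical values and then applied to incoming records:

```python
# Toy rule inference: learn a plausible band (mean +/- k standard deviations)
# for a numeric column from history, then use it as a generated quality rule.
import statistics

def infer_range_rule(values, k=4.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    lo, hi = mean - k * stdev, mean + k * stdev
    return lambda x: lo <= x <= hi

historical_ages = [34, 51, 28, 63, 47, 39, 72, 55]
age_ok = infer_range_rule(historical_ages)
print(age_ok(45), age_ok(140))   # True False
```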

As technology advances, our dependency on data grows – and so does the risk. Everyone comes across bad data, but the time and money lost to it can be mitigated. At Qualytics, we offer a solution to the data quality problem so you can make good decisions based on trustworthy data.

Qualytics is the complete solution for instilling trust and confidence in your enterprise data ecosystem. It seamlessly connects to your databases, warehouses, and source systems, proactively improving data quality through anomaly detection, signaling and workflow. Check out the other blogs in this series to learn more about how you can start trusting your data. Let’s talk about the 5W1H of your data quality today. Contact us at hello@qualytics.co.

Check out these other interesting blogs:

The Qualytics Origin Story – https://qualytics.co/the-qualytics-origin-story/

The Qualytics Data Firewall – Explained – https://qualytics.co/the-qualytics-data-firewall-explained/

Bad Data: Big Problem – https://qualytics.co/bad-data-big-problem/