A century ago, the most valuable resource was oil. Companies rushed to extract the oil, process it, sell it and influence the dependencies on it, ultimately growing the macroeconomy and other industries through the additional mobility gained by consumers. The oil of the 21st century is data. From behemoths like Google, Amazon and Apple to traditionally brick-and-mortar companies like Walmart and Target, data is the key to unlocking insights into customer interest and behavior, and to driving more volume and value to a business. Like oil, data in its crude form is not easily usable; it needs to be extracted and refined in order to become consumable. And while oil collection and processing is regulated to a T with strict guidelines, data collection and processing is not regulated and barely standardized.
Companies of all sizes see the value in this resource and spend considerable capital to capture their own data and purchase external data, with the goal of analyzing it to unlock insights. Enterprises use hundreds, if not thousands, of systems (SaaS or homegrown); even small companies use 20-30 systems simply to run their business operations. With data as the primary driving force, doing analytics of any sort means investing in a data warehouse or data lake, business intelligence tools and data operations tools, and hiring a data team to ask the right questions of our data: determining KPIs and metrics, building predictive models and taking our companies to scale. A fundamental problem with these new digital business models, which affects most businesses, is exponential data growth. Data models and ecosystems are increasingly complex – and when that complexity is unmanaged, it becomes a barrier to achieving the true power of analytics for automation and efficiency. Advanced analytics without enterprise data governance and an operational strategy can lead companies down mistaken paths, making wrong decisions based on poor-quality data.
The quality of the data we analyze raises big question marks for many. When building Facet Wealth, we made a bet that we could make Certified Financial Planners (CFPs) more efficient by taking a technology-enabled approach to a traditionally face-to-face service. We had success with various metrics and analyses – which CFPs are best suited to an unusual approach to financial planning / wealth management; how do we measure their productivity; how do we match them with the right customers and sales folks for higher success rates, among many other questions. This took a ton of effort – we brought data from 25 different source systems into our data warehouse, and used great data operations tools like dbt, Stitch, BigQuery and Tableau to manipulate, transform and marry the data to unlock insights.
But what could tell us that our manipulation of the data is correct? How can we know that our formulas are right, and that the data we have rolled up into a KPI is even directionally correct?
Other than gut checks and eyeballing the data against historical comparisons, running queries for volumetric analysis, and maybe – just maybe – implementing some basic high-level checks of data in motion, we are SOL. On top of this, the variability of the technology choices we make in our product and data stacks (data warehouse, ETL/ELT, data operations tools) adds complexity, so an off-the-shelf product can give us, at best, a limited sense of normalcy.
I first encountered this with a small, simple, stupid problem I was trying to solve. We had just finished building our new, scalable platform at Facet, going from an MVP to a product built for scale. We wrote a bunch of migration scripts to land our data in our shiny new platform. That’s great; we have data, but how can we do integrity checks and ensure the migration scripts are working as expected, while understanding not only the schemas but also the context of the data at hand? I called bullshit on doing 1000s of hours of QA, sampling data and hoping for the best. I wanted 100% coverage of the data. So I rolled up my sleeves and built a simple tool: first denormalizing the data models at source and target with about 8,000 lines of SQL, then consolidating the data into a unified data model where we could determine context, and finally building a simple UI that let us interact with all of our data in the source and target platforms. We estimate that this saved us ~1000 hrs of QA – we were able to rerun scripts and do a holistic review of data integrity in an iterative fashion. It was wonderful, and our go-live was one of the smoothest my colleagues and I had ever seen.
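To make the idea of 100% coverage concrete, here is a minimal sketch of a full reconciliation between a source and a target. It is an illustration in Python, not the SQL tool described above: the function name `reconcile`, the `id` key and the sample rows are all hypothetical, and it assumes both sides have already been flattened into the same unified shape.

```python
def reconcile(source_rows, target_rows, key):
    """Compare every row on both sides instead of sampling a few.

    Assumes each row is a dict already mapped into one unified model,
    and that `key` uniquely identifies a row on both sides.
    """
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}

    # Rows the migration dropped, and rows that appeared from nowhere.
    missing = sorted(source.keys() - target.keys())
    unexpected = sorted(target.keys() - source.keys())

    # For rows present on both sides, record every field that differs
    # as a (source_value, target_value) pair.
    mismatched = {}
    for k in source.keys() & target.keys():
        fields = source[k].keys() | target[k].keys()
        diffs = {
            f: (source[k].get(f), target[k].get(f))
            for f in fields
            if source[k].get(f) != target[k].get(f)
        }
        if diffs:
            mismatched[k] = diffs

    return {
        "missing_in_target": missing,
        "unexpected_in_target": unexpected,
        "mismatched": mismatched,
    }


# Hypothetical sample data, for illustration only.
source_rows = [
    {"id": 1, "name": "Ann", "balance": 100},
    {"id": 2, "name": "Bob", "balance": 200},
    {"id": 3, "name": "Cy", "balance": 300},
]
target_rows = [
    {"id": 1, "name": "Ann", "balance": 100},
    {"id": 2, "name": "Bob", "balance": 250},
    {"id": 4, "name": "Di", "balance": 400},
]
report = reconcile(source_rows, target_rows, key="id")
```

Because the comparison is cheap to rerun, a report like this can be regenerated after every change to a migration script, which is the iterative loop the paragraph above describes.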
This got me thinking and brainstorming with my now co-founders – why can’t we take a similar approach to data operations in data warehouses and with real-time data? There are so many use cases for the metadata we would generate just by knowing the behavioral shapes and patterns of data, which would unlock a level of data intelligence that we didn’t have before.
To us, it starts with anomaly detection as the baseline. We need to define what “normal” means for each data point, and analyze its historical patterns and volumetrics to determine when data is out of whack. So, anomaly detection is a great starting point – tell me what’s wrong with my data and what anomalies may be happening in the data pipelines and operations, so I can pinpoint the source of each anomaly. Adding ML algorithms to this gives us the ability to set contextual thresholds, where normalcy is tethered to an entity (such as a patient, a household, an individual, a business) and tracked over time with external dependencies.
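As a rough illustration of what a contextual threshold can look like, the sketch below judges each new value against the history of its own entity rather than against a single global limit. It uses a plain z-score as a stand-in for the ML approach mentioned above; the function name, the threshold of 3 standard deviations, the minimum history length and the sample households are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_anomalies(history, new_points, z_threshold=3.0, min_history=5):
    """Flag new values that sit far from their own entity's historical norm.

    `history` and `new_points` are iterables of (entity_id, value) pairs.
    """
    per_entity = defaultdict(list)
    for entity_id, value in history:
        per_entity[entity_id].append(value)

    anomalies = []
    for entity_id, value in new_points:
        past = per_entity[entity_id]
        if len(past) < min_history:
            continue  # too little history to say what "normal" is
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # A perfectly constant history: any change is a deviation.
            is_anomaly = value != mu
        else:
            is_anomaly = abs(value - mu) / sigma > z_threshold
        if is_anomaly:
            anomalies.append((entity_id, value))
    return anomalies


# Hypothetical monthly deposits for two households.
history = (
    [("household_a", v) for v in [100, 102, 98, 101, 99, 100]]
    + [("household_b", v) for v in [5000, 5200, 4900, 5100, 5050, 4950]]
)
new_points = [("household_a", 5000), ("household_b", 5000)]
flagged = find_anomalies(history, new_points)
```

The same value of 5000 is flagged for `household_a` and accepted for `household_b`, which is the point of tethering normalcy to an entity: a single global threshold could not get both cases right.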
What happens when we see poor-quality data and alert somebody that there is an anomaly? It’s great; we have detected something that went wrong. However, bad data is costly to clean up from systems (if you can truly clean it up at all). What if we expose our newfound intelligence down the stack to stop such data in its tracks before it enters downstream systems? Especially for systems with external integrations and entanglements – which almost all systems have – operationalizing this level of anomaly detection for real-time data operations is going to be crucial to increasing the quality of data. We are calling these runtime constraints for real-time data, and we believe they will be crucial for proactive management of data ecosystems.
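One way to picture a runtime constraint is as a gate that sits in front of downstream systems and quarantines records that fail a check instead of letting them through. The sketch below is a simplified assumption of how such a gate could work, not a description of any product: `make_range_constraint`, `gate`, the `age` field and its bounds are hypothetical names and values chosen for the example.

```python
def make_range_constraint(field, low, high):
    """Build a check that passes only when `field` is a number in [low, high]."""
    def check(record):
        value = record.get(field)
        return isinstance(value, (int, float)) and low <= value <= high
    return check


def gate(records, constraints):
    """Split records into those safe to pass downstream and those to quarantine.

    `constraints` maps a constraint name to a function record -> bool.
    Quarantined records are returned with the names of the checks they failed.
    """
    accepted, quarantined = [], []
    for record in records:
        failed = [name for name, check in constraints.items() if not check(record)]
        if failed:
            quarantined.append((record, failed))
        else:
            accepted.append(record)
    return accepted, quarantined


# Hypothetical constraint and incoming records.
constraints = {"age_in_range": make_range_constraint("age", 0, 120)}
incoming = [
    {"id": 1, "age": 42},
    {"id": 2, "age": -7},  # out of range
    {"id": 3},             # field missing entirely
]
accepted, quarantined = gate(incoming, constraints)
```

In practice the bounds would not be hard-coded; they could be derived from the observed history of each field, so that what anomaly detection learns is what the gate enforces.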
Simply doing anomaly detection, runtime constraints and alerting based on the anomalies is a great start, but it’s not enough. This is why we are thinking about data quality as part of data intelligence. Doing surveillance with our approach helps us generate 2-3 orders of magnitude of metadata for each data point. Can we put a confidence score next to each KPI for a business, so we actually know at the rollup level whether the data we’re basing decisions on is directionally correct? Can we monitor our data for its sovereignty and governance / policy considerations? Can we apply governance / policy rules down to our streaming and raw data?
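To show what a confidence score next to a KPI might mean in practice, here is a deliberately simple sketch: the KPI is a sum over records, and the confidence is the share of contributing records that pass every quality check. The scoring rule is an assumption made purely for illustration, as are the function name, the `revenue` and `currency` fields and the two checks.

```python
def kpi_with_confidence(records, value_field, checks):
    """Return (kpi_total, confidence) for a sum-style KPI.

    confidence is the fraction of input records that pass every check,
    so 1.0 means all of the underlying data looked healthy.
    """
    total = 0.0
    passing = 0
    for record in records:
        value = record.get(value_field)
        if isinstance(value, (int, float)):
            total += value  # the KPI itself still includes suspect rows
        if all(check(record) for check in checks):
            passing += 1
    confidence = passing / len(records) if records else 0.0
    return total, confidence


# Hypothetical order records and quality checks.
orders = [
    {"revenue": 100.0, "currency": "USD"},
    {"revenue": 250.0, "currency": "USD"},
    {"revenue": -40.0, "currency": "USD"},  # negative revenue is suspect
    {"revenue": 90.0, "currency": None},    # missing currency
]
checks = [
    lambda r: isinstance(r.get("revenue"), (int, float)) and r["revenue"] >= 0,
    lambda r: r.get("currency") is not None,
]
total, confidence = kpi_with_confidence(orders, "revenue", checks)
```

Here the KPI comes out at 400.0 with a confidence of 0.5, which tells the reader of the dashboard that half of the rows behind the number failed at least one check. A more realistic score would likely weight records by their contribution to the KPI rather than counting them equally.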
This is why we’re calling this space data quality surveillance – it’s a whole lot more than data monitoring and anomaly detection. The intelligence we’ll derive out of the metadata around the raw data is going to be the key for data operations going forward. This is why we started Qualytics, and we are building products that enable this level of intelligence for companies of all sizes.
Qualytics is the complete solution to instill trust and confidence in your enterprise data ecosystem. It seamlessly connects to your databases, warehouses, and source systems, proactively improving data quality through anomaly detection, signaling and workflow. Check out our other blogs to learn more about how you can start trusting your data. Let’s talk about your data quality today. Contact us at firstname.lastname@example.org.
Check out these other interesting blogs:
The State of Data Quality – https://qualytics.co/the-state-of-data-qualty/
The Qualytics Data Firewall – Explained – https://qualytics.co/the-qualytics-data-firewall-explained/
Bad Data: Big Problem – https://qualytics.co/bad-data-big-problem/