Embrace Your Migration to the Data Cloud

The Red Pill Journey

As the CEO of Red Pill Analytics, I led our company through a journey similar to the one we now lead customers through. We founded the company in 2014 with a focus on building on-prem analytics stacks, still all the rage back then, assembled primarily from Oracle products. Although our name was inspired by the revolutionary Matrix film (and exactly one of the sequels) and the metaphor that data can free our minds and offer us the truth, with a nod and a wink we were also acknowledging the color most associated with Oracle. In 2016, collaborating with my good friend Kent Graziano, Red Pill partnered with Snowflake, and soon after came Looker, Fivetran, dbt, Google Cloud Platform, and the rest of the modern data stack. We were reborn as a data cloud migration company. Or perhaps I should say… Reloaded?

We took some bruises during our first few migrations to Snowflake and BigQuery, with ETL code conversion occupying most of our time. So we did what we always do at Red Pill: we automated the process. Our ETL Converter tool was born, which allows us to migrate code from legacy ETL tools like Oracle Data Integrator to more cloud-friendly data engineering solutions like Data Build Tool (dbt) from dbt Labs. If you’ve ever attempted to build a code conversion tool, you know it will never be perfect. There are too many edge cases that you can’t imagine ahead of time, and can only address as you see them. But that’s really the point: it was essential that we see them! We needed a new breed of QA tool that could address the lifecycle specific to data cloud migrations:

  1. Reduce the amount of time developers spend unit testing their individual pipelines.
  2. Provide an effortless way for the QA team to identify and visualize data differences when running full regression tests.
  3. Do all of this while abstracting away the implementation differences between differing database technologies. One database’s STRING is another database’s VARCHAR… and don’t get me started on PRECISION.
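The third requirement above is the subtle one. To make the abstraction concrete, here is a minimal sketch (not Qualytics code; the mapping table and function names are invented for illustration) of normalizing vendor-specific column types to a canonical form so that schemas can be compared across database technologies:

```python
# Illustrative sketch only: map vendor-specific type names to canonical
# types so an Oracle VARCHAR2 and a BigQuery STRING compare as equal.
# The mapping below is a hypothetical, abbreviated example.
CANONICAL_TYPES = {
    "VARCHAR": "string",
    "VARCHAR2": "string",
    "STRING": "string",
    "TEXT": "string",
    "NUMBER": "decimal",
    "NUMERIC": "decimal",
    "DECIMAL": "decimal",
}

def normalize(column_type: str) -> str:
    """Strip precision/scale (e.g. 'VARCHAR(255)') and map to a canonical type."""
    base = column_type.split("(")[0].strip().upper()
    return CANONICAL_TYPES.get(base, base.lower())

# An Oracle VARCHAR2(4000) and a BigQuery STRING now agree:
assert normalize("VARCHAR2(4000)") == normalize("STRING")
```

A real implementation also has to reconcile precision and scale, nullability, and collation, which is exactly why this belongs in a tool rather than in hand-written scripts.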

Our Prototype

We built the first version of this comparison tool on a customer project, using that customer’s existing infrastructure investments. We wrote comparison scripts in SQL, executed them using Jenkins and database CLIs, loaded the results of those SQL statements into Snowflake (our target migration platform), and visualized everything through Tableau. This sounds simple enough, but don’t take the bait. Writing the SQL scripts required functional intelligence: they were aggregation queries using meaningful GROUP BY statements to identify incorrect counts across multiple dimensions.
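The approach can be sketched in a few lines. This is an illustration only, using sqlite3 with made-up table names and a single invented dimension, not the actual project scripts: aggregate each side by a dimension and diff the counts.

```python
# Minimal sketch of the prototype's GROUP BY comparison (illustrative
# tables and data; the real scripts targeted the customer's warehouses).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_orders (region TEXT, amount REAL);
    CREATE TABLE cloud_orders  (region TEXT, amount REAL);
    INSERT INTO legacy_orders VALUES ('east', 10), ('east', 20), ('west', 5);
    INSERT INTO cloud_orders  VALUES ('east', 10), ('west', 5);  -- one row lost in migration
""")

def counts_by(table, dimension):
    """Aggregate row counts for one table by one dimension."""
    sql = f"SELECT {dimension}, COUNT(*) FROM {table} GROUP BY {dimension}"
    return dict(conn.execute(sql).fetchall())

baseline = counts_by("legacy_orders", "region")
revision = counts_by("cloud_orders", "region")

# Report only the dimension values whose counts disagree.
diffs = {k: (baseline.get(k, 0), revision.get(k, 0))
         for k in baseline.keys() | revision.keys()
         if baseline.get(k, 0) != revision.get(k, 0)}
print(diffs)  # {'east': (2, 1)} — one 'east' row missing from the revision
```

Notice the weakness: a dropped row and an added row under the same dimension value cancel out, which is why counts by dimension are only a surrogate for equivalence, as the next section explains.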

This SQL-based approach was not ideal. These GROUP BY statements were really a surrogate for detecting true equivalence between two tables, a capability we didn’t know how to build when comparing databases implemented in disparate technologies. Concordantly (that’s a deep cut), the developers didn’t know which dimensions the QA team would target as a way to test equivalence, and the QA team didn’t have the SQL skills to write those comparisons. It was a joint, manual effort to create thousands of SQL scripts for all the tables being migrated. And these SQL scripts were not portable; the most time-consuming aspect of building this framework was hopelessly specific to the subject area, and therefore not reusable anywhere else except as a template.

Finally, we didn’t have a single pane of glass that collated all the executions with comparison results. Executing comparisons or viewing job execution results required the Jenkins UI, while viewing the comparison results took us to the Tableau UI. I started to imagine what the perfect UI would look like: a project-focused experience that enabled configuration, execution, and visual discovery of everything involved with a finite data cloud migration project. The UI would also have executive summary capabilities: how much has been migrated, how much hasn’t, and when we can expect to finish.

The Qualytics Compare Journey

I invested in Qualytics with their earliest seed round, and soon after joined their board of advisors along with Jake Stein and others. Their focus on data confidence instead of mere data observability differentiated them from others in the space. This is manifest most clearly in the enrichment feature that Protect provides: making anomalies and all the context around them available in downstream data stores, enabling exception-based corrective actions.

A project-focused experience.

As I learned more about the architecture underneath Protect, old wheels started spinning, and I returned to that data cloud migration lifecycle tool. Ensuring the successful and correct migration to the data cloud is absolutely a data quality concern and a fantastic starting point for data confidence at scale. I saw a foundation for finally building my white whale. I discussed this idea with Gorkem Sevinc, and we were completely aligned, as he had stitched something similar together at a prior startup. About a month later, Qualytics had an MVP of Compare.

Qualytics Compare for the Migration Lifecycle

Qualytics Compare is the first product built for the data cloud migration lifecycle. It meets all the requirements on my wishlist above. Because ultimately it’s built using Spark on Kubernetes, it already provides the right abstractions across disparate technologies: DataFrames and Datasets. These abstractions allow us to test true equivalence between Datasets, and Datasets can be declared from any technology that has a Spark connector: tables, files in buckets, topics, etc.
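True equivalence here means comparing the two Datasets as multisets of rows, which is what Spark’s `DataFrame.exceptAll` computes. The same idea can be shown in plain Python with `collections.Counter` (an illustration of the concept, not Qualytics code; the sample rows are invented):

```python
# Multiset comparison of two row sets, mirroring what Spark's exceptAll
# does across any pair of connectors. Sample data is invented.
from collections import Counter

baseline = [("a", 1), ("a", 1), ("b", 2)]  # rows from the source data store
revision = [("a", 1), ("b", 2), ("b", 2)]  # rows from the migration target

# Counter subtraction keeps only positive counts, so each direction
# reports the rows the other side is missing, duplicates included.
missing    = Counter(baseline) - Counter(revision)  # in baseline, not revision
unexpected = Counter(revision) - Counter(baseline)  # in revision, not baseline

equivalent = not missing and not unexpected
print(dict(missing))     # {('a', 1): 1}
print(dict(unexpected))  # {('b', 2): 1}
```

Unlike the GROUP BY surrogate, this catches a dropped row and an added row even when aggregate counts happen to match, and it works identically whether the rows came from tables, files in buckets, or topics.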

Using Compare is as easy as One, Two, Three:

  1. Using the built-in Spark connectors for any supported data store, configure a connection to the baseline (the source data store for your migration) as well as one for the revision (your target data store platform). 
  2. Select all the tables from the baseline that you want to compare, and establish the schedule (if desired) for executing the job. Comparisons can also be executed via the API based on pull requests, which frees developers from running endless queries to unit test their code.
  3. Enjoy the single pane of glass for job execution results, comparison results, and the executive summary that stakeholders crave for the project.

Compare identifies differences between the baseline and revision so businesses can clearly see which tables are passing, which tables are failing, and which tables are missing… meaning they exist in the baseline but not the revision. With all the historical data captured for each run over time, it provides a burndown chart for the project, and predicts when it will finish. It takes into consideration not only schema differences but also the underlying data. Comparison results include data quality fundamentals such as completeness, conformity, consistency, accuracy, and redundancy, aligning it with the Qualytics 8.
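The burndown prediction is, at its core, an extrapolation over the historical run data. A hedged sketch of the idea (the numbers and function are invented for illustration, not Compare’s actual model): fit a line through the failing-table counts over time and solve for the day the count reaches zero.

```python
# Illustrative burndown projection: least-squares fit through
# (day, failing_table_count) points, extrapolated to zero failures.
# History values below are invented.
def predict_finish(history):
    """history: list of (day, failing_count). Returns projected finish day,
    or None if the failing count is not trending downward."""
    n = len(history)
    mean_x = sum(d for d, _ in history) / n
    mean_y = sum(f for _, f in history) / n
    slope = (sum((d - mean_x) * (f - mean_y) for d, f in history)
             / sum((d - mean_x) ** 2 for d, _ in history))
    if slope >= 0:
        return None  # not burning down
    return mean_x + (0 - mean_y) / slope  # x where the fitted line hits y = 0

history = [(0, 100), (7, 80), (14, 60), (21, 40)]  # weekly regression runs
print(predict_finish(history))  # ~35.0: projected finish around day 35
```

A production model would weight recent runs more heavily and account for tables not yet attempted, but even this simple projection gives stakeholders the answer they actually want: when will we be done?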

Your legacy data warehouse is not too big to migrate: the value realized in cost, administration, and performance from data cloud platforms is well worth the pain of the migration. Qualytics Compare provides solid rails around the entire project, leading your data team to successful project completion.

Embrace Qualytics Protect

[Screenshot: the “Checks” tab shows all the data quality checks that Protect inferred.]

The data cloud journey is not complete when the migration project ends. When customers finally arrive in the data cloud, they understand just how easy it is to build data solutions in a fraction of the time traditionally required on-prem. Together, the migrated workloads and the net-new ones require organizations to maintain continuous confidence in their data. Both Compare and Protect share the same foundation and exist together in the same product, so once the migration is over, you can enable Protect with only a few clicks. Check out Protect to learn more.

Qualytics is the complete solution to instill trust and confidence in your enterprise data ecosystem. It seamlessly connects to your databases, warehouses, and source systems, proactively improving data quality through anomaly detection, signaling, and workflow. Let’s talk about your data quality today. Contact us at hello@qualytics.co.
