We get it: establishing a Data Quality Plan can be overwhelming. What do we mean when we say “this,” and why is “that” important? Check out our Qualytics Glossary for the terms we use to help create Confidence in your data.
Are we missing something that needs further explanation? Let us know!
Anomaly: Something that deviates from the standard, normal, or expected. This can be in the form of a single data point, a record, or a batch of data.
Accuracy: The data represents the real-world values they are expected to model.
Burndown Chart: A graphical representation of completed and remaining work versus time.
Comparison: An evaluation to determine if the structure and content of the source and target data stores match.
Comparison Runs: An action to perform a comparison.
Completeness: Required fields are fully populated.
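As a rough illustration of the Completeness definition above, the sketch below computes, for each required field, the share of records in which that field is populated. The function name `completeness`, the sample records, and the choice to treat `None` and empty strings as "not populated" are assumptions made for this example, not part of the Qualytics product.

```python
from typing import Any, Dict, Iterable, Mapping, Sequence


def completeness(
    records: Iterable[Mapping[str, Any]],
    required_fields: Sequence[str],
) -> Dict[str, float]:
    """Return the fraction of records in which each required field is populated.

    Assumption for this sketch: a field counts as populated when it is present
    and its value is neither None nor an empty string.
    """
    rows = list(records)
    if not rows:
        # With no records there is nothing to measure; report 0.0 for every field.
        return {field: 0.0 for field in required_fields}

    result: Dict[str, float] = {}
    for field in required_fields:
        populated = sum(1 for row in rows if row.get(field) not in (None, ""))
        result[field] = populated / len(rows)
    return result


# Hypothetical sample data: the second record has an empty email,
# and the third record has no email field at all.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3},
]
scores = completeness(customers, ["id", "email"])
```

A score of 1.0 means the field is fully populated; anything lower points at records that need attention.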
Conformity: Alignment of the content to the required standards, schemas, and formats.
Connectors: Components that can be easily connected to and used to integrate with other applications and databases. Common uses include sending and receiving data.
- We can connect to anything within the Spark and Kafka ecosystems, and we have major integrations with leading ELT/ETL providers. If you have a data store we don’t yet support, talk to us!
- We currently support: Files on Object Storage (CSV, JSON, XLSX, Parquet); ETL/ELT Providers (Fivetran, Stitch, Airbyte, Matillion – and any of their connectors!); Data Warehouses (BigQuery, Snowflake, Redshift); Data Pipelining (Airflow, DBT, Prefect); Databases (MySQL, PostgreSQL, MSSQL, SQLite, etc.); and any other JDBC source.
Consistency: The value is the same across all datastores within the organization.
Data-at-rest: Data that is stored in a database, warehouse, file system, data lake, or other data store.
Data Confidence: Treating data quality with a full-circle solution that not only identifies anomalies but also takes action against the anomalous data.
Data Drift: Changes in a data set’s properties or characteristics over time.
Data-in-flight: Data that is on the move from one location to another, such as through a message queue, API, or other pipeline.
Data Quality: Ensuring data is free from errors, including duplicates, inaccuracies, inappropriate fields, irrelevant data, missing elements, non-conforming data, and poor data entry.
Data Stack: A combination of technologies that extracts, transforms, cleans, and loads data through a variety of steps until it is ready for production use.
Data Store: Where data is persisted in a database, file system, or other connected retrieval systems. (*)
Data Warehouse: A system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning. (*)
ELT: Extract, load, and transform process for data.
Enrichment: Additional properties that are added to a data set to enhance its meaning. Qualytics enrichment includes whether a record is anomalous, what caused it to be an anomaly, what characteristics it was expected to have, and flags that allow other systems to act upon the data.
ETL: Extract, transform, and load process for data.
Firewall: An application that protects a system from contamination due to inputs, reducing the likelihood of contamination from an outside source. A firewall will quarantine data that is problematic, allowing the user to act upon quarantined items.
Metadata: Data about other data, including descriptions and additional information.
Object Storage: A type of data storage used for handling large amounts of unstructured data managed as objects.
Pipeline: A workflow that processes and moves data between systems.
Precision: The data has the resolution that is expected. How tightly can you define your data?
Profiling: The process of collecting statistics on the characteristics of a dataset involving examining, analyzing, and reviewing the data.
Proprietary Algorithms: A procedure utilizing a combination of processes, tools, or systems of interrelated connections that are the property of a business or individual in order to solve a problem. (**)
Quarantine: The ability to prevent anomalous data from moving from a source to a target and entering other downstream systems. Anomalies are then retained for further remediation.
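To make the Quarantine idea concrete, here is a minimal sketch that splits a batch into records allowed to continue to the target and records held back for remediation. The function `route_records`, the predicate `is_anomalous`, and the sample rule (a negative amount is anomalous) are hypothetical and only illustrate the concept; they do not describe how Qualytics implements quarantine.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def route_records(
    records: Iterable[Record],
    is_anomalous: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Split records into (passed, quarantined) using the given predicate."""
    passed: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        if is_anomalous(record):
            # Held back so it never reaches the target or downstream systems.
            quarantined.append(record)
        else:
            passed.append(record)
    return passed, quarantined


# Hypothetical rule for this example: a negative amount is treated as anomalous.
def negative_amount(record: Record) -> bool:
    return record.get("amount", 0) < 0


batch = [
    {"order": "A-1", "amount": 25.0},
    {"order": "A-2", "amount": -4.0},
    {"order": "A-3", "amount": 10.5},
]
passed, quarantined = route_records(batch, negative_amount)
```

Only `passed` would be loaded into the target; `quarantined` is retained so someone can review and remediate it.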
Schema: The organization of data in a data store. This could be the columns of a table, the header of a CSV file, the fields in a JSON file, or other structural constraints.
Schema Differences: Differences in the organization of information between two data stores that are supposed to hold the same content.
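One simple way to picture Schema Differences is to compare two schemas expressed as mappings from field name to type name. The sketch below assumes that representation; the function `schema_differences`, the result keys, and the example schemas are invented for illustration.

```python
from typing import Any, Dict, Mapping


def schema_differences(
    source: Mapping[str, str],
    target: Mapping[str, str],
) -> Dict[str, Any]:
    """Compare two schemas given as {field name: type name} mappings."""
    shared = source.keys() & target.keys()
    return {
        # Fields present in the source but absent from the target.
        "missing_in_target": sorted(source.keys() - target.keys()),
        # Fields present in the target but absent from the source.
        "missing_in_source": sorted(target.keys() - source.keys()),
        # Fields present in both whose declared types disagree: (source, target).
        "type_mismatches": {
            name: (source[name], target[name])
            for name in sorted(shared)
            if source[name] != target[name]
        },
    }


# Hypothetical schemas for two data stores that should hold the same content.
source_schema = {"id": "int", "email": "string", "created_at": "timestamp"}
target_schema = {"id": "string", "email": "string", "signup_date": "date"}
diff = schema_differences(source_schema, target_schema)
```

An empty result in all three entries would indicate that the two schemas match on field names and types.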
Source: The origin of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets extracted.
Target: The destination of data in a pipeline, migration, or other ELT/ETL process. It’s where data gets loaded.
Third-party data: Data acquired from a source outside your company that may not be controlled by the same data quality processes. You may not have the same level of confidence in the data, and it may not be as trustworthy as internally vetted datasets.
Timeliness: Data is available when it is expected. It can be calculated as the time between when information should be available and when it is actually available.
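The timeliness calculation can be sketched as the gap between an expected and an actual availability time. The function name `timeliness_delay`, the decision to report early arrivals as zero delay, and the sample timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta


def timeliness_delay(expected: datetime, actual: datetime) -> timedelta:
    """Return how late the data arrived relative to when it was expected.

    Assumption for this sketch: data that arrives early or on time has zero delay.
    """
    return max(actual - expected, timedelta(0))


# Hypothetical daily feed expected at 06:00 that actually landed at 06:45.
expected_at = datetime(2024, 1, 15, 6, 0)
arrived_at = datetime(2024, 1, 15, 6, 45)
delay = timeliness_delay(expected_at, arrived_at)
is_timely = delay == timedelta(0)
```

In practice a team might allow a tolerance window instead of requiring zero delay; that threshold is a policy choice rather than part of the definition.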
Volumetrics: Data has the same size and shape across similar cycles. It includes statistics about the size of a data set including calculations or predictions on the rate of change over time.