In a previous post, I talked about Data Programming as a way to free data scientists and data engineers from the manual work of orchestrating jobs and tasks in data pipelines. Data Programming can be used to specify an end-to-end data flow (without needing to know the underlying infrastructure) which involves:
- Collecting or receiving data from different sources on a continual basis and centralizing that data into both a recoverable archive in an AWS S3 data lake and an analytic data warehouse. The original post described our data connectors which are built as serverless microservices (full list can be found here).
- Transforming that centralized data in one or many ways and keeping the transformed data up to date whenever the input data changes.
- Publishing the up-to-date transformed data to different destinations, including operational systems and data science experiments.
In this post, we get into more detail about the data model and table types we use in Data Programming, as well as its data flow model. In future posts we will explain how we set up our batching model, the extended-SQL syntax we use in our declarative Data Programming Language (DPL), and how we orchestrate the overall flow of pipelines.
Data Programming takes advantage of common data management practices used within data warehouses. For example, warehouses support multiple modes of updating the content of a table, these include:
- REPLACE mode replaces all the rows of a table with a new set of rows.
- APPEND mode adds additional rows to a table. Once a row has been inserted, it is never updated or deleted.
- MERGE mode updates, inserts, and deletes rows in a table based on changes captured via change data capture (CDC) from the source systems.
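The three update modes can be sketched with plain SQL. Below is a minimal illustration using SQLite as a stand-in for a warehouse (SQLite's `ON CONFLICT ... DO UPDATE` plays the role of MERGE); the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# APPEND mode: only ever insert new rows; existing rows are never touched.
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# MERGE mode: upsert rows changed at the source (CDC-style).
cur.execute(
    "INSERT INTO users VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
    (2, "bobby"),
)

# REPLACE mode: swap out all rows for a fresh extract.
cur.execute("DELETE FROM users")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (3, "carol")])

rows = cur.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'ada'), (3, 'carol')]
```

Real warehouses express MERGE differently (e.g. a dedicated `MERGE INTO` statement), but the semantics are the same: reconcile a batch of changes against the existing rows.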
In a Datacoral data flow, data is populated into tables of two different types, Regular or Partitioned, within data warehouses like Redshift, Snowflake, Athena, Hive, or Databricks Delta. Each table type is explained below.
Regular tables are what most users are familiar with in a warehouse or a query engine. Regular tables work well for small and medium-sized tables: they support efficient REPLACE and MERGE semantics in addition to APPEND, and they are a good fit when the data does not need periodic pruning of old rows via retention rules. Most data warehouses support regular tables; however, some query engines like Hive are better suited to append-only modes of change. In those cases, partitioned tables are preferred.
Partitioned Tables are more interesting and less standardized than regular tables. They are good for large tables, tables that follow APPEND semantics, and tables that require efficient, periodic pruning of old rows via retention rules. A single partitioned table contains multiple partitions, each of which corresponds to a particular column value shared by the rows in that partition. Unfortunately, data warehouses do not treat partitioned tables consistently from product to product; for example, some warehouses like Redshift and Snowflake don't support partitioned tables natively. In those cases, Datacoral makes it easy to have partitioned tables by using an individual table for each partition and creating a UNION ALL view on top. Queries on partitioned tables are rewritten to only query the specific partition tables they need. Data warehouses like Athena, Hive, and Databricks Delta, which have filesystem-oriented storage, are much more amenable to partitioned tables. Partitioned tables are especially useful when we need to maintain an audit trail of where and when data entered or exited the pipeline.
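The per-partition-table-plus-view trick can be sketched concretely. The snippet below (again using SQLite as a stand-in, with invented table names) creates one physical table per day-valued partition and a UNION ALL view that presents them as one logical table; this is an illustration of the pattern, not Datacoral's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical table per partition, keyed by a date-valued partition column.
for day in ("20200101", "20200102"):
    cur.execute(f"CREATE TABLE events_{day} (user_id INTEGER, action TEXT)")

cur.executemany("INSERT INTO events_20200101 VALUES (?, ?)",
                [(1, "login"), (2, "click")])
cur.executemany("INSERT INTO events_20200102 VALUES (?, ?)",
                [(1, "logout")])

# A UNION ALL view presents the partitions as a single logical table,
# with the partition value surfaced as a column.
cur.execute("""
    CREATE VIEW events AS
    SELECT '20200101' AS day, user_id, action FROM events_20200101
    UNION ALL
    SELECT '20200102' AS day, user_id, action FROM events_20200102
""")

total = cur.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(total)  # 3
```

Note how retention becomes cheap under this layout: pruning a day of data is just dropping one partition table and recreating the view without it, rather than running a large DELETE.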
Data Flow Model
A Datacoral end-to-end data flow consists of several steps. Given that Datacoral supports micro-batching in its data flow, each of the steps involves moving or processing a small batch of the data (how the batch itself is defined is discussed in a separate blog post).
Data Flow Stages
Here we will review each stage in the process and explain what is happening. I’ve also included a table reference glossary in the next section to help explain some of the process terms used.
- Extract or Receive & Stage – This is typically the first step of any data flow, implemented as a Collect Data Function that:
- either extracts or receives data from external data Source Loadunits
- groups the data into batches based on the Batching Model described in a later post
- stores the batches into partitions in Collect Stage Tables in a staging area, which is typically file-system-based storage like AWS S3.
Datacoral mandates that these collect stage tables are partitioned in order to have a clear audit trail of all the batches of data that entered the data flow across time.
- Load – this step loads the staged batches into Source Tables in a query engine/data warehouse. The source tables can be either regular or partitioned tables in the warehouse. They are called Source Tables because they match the data in staging and in the source itself. APPEND/MERGE/REPLACE modes are supported for regular tables, while only APPEND mode is supported for partitioned tables.
- Transform – this step is typically a set of transformations performed within different warehouses by Organize Data Functions. Transformations start from the Source Tables (as the Upstream Tables); each transformation results in a Downstream Table, and downstream tables can in turn serve as upstream tables for further transformations in the flow. The transformations happen in batches as well. Their results are represented as Transformed Tables in the data warehouses, which can be either regular or partitioned tables based on the user's choice.
- Stage for Publish – Harness Data Functions publish data from some of the Source Tables or the Transformed Tables in the warehouses to external applications and databases. Each batch of data that is published is first staged in Publish Stage Tables. The publish stage tables are also mandated to be partitioned (just like the collect stage tables).
- Publish – Each staged partition is then published to applications via API calls or to databases.
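The stages above can be sketched end to end in a few lines. Everything in this sketch is illustrative: the function names, the in-memory dictionaries standing in for S3-backed stage tables and warehouse tables, and the batch ids are invented, not Datacoral's actual API.

```python
from collections import defaultdict

collect_stage = defaultdict(list)   # stands in for S3-backed collect stage tables
source_table = []                   # APPEND-mode source table
transformed_table = {}              # REPLACE-mode transformed table
publish_stage = defaultdict(list)   # partitioned publish stage tables
published = []                      # what the destination eventually receives

def extract_and_stage(batch_id, records):
    # Collect: stage each batch under its own partition for auditability.
    collect_stage[batch_id].extend(records)

def load(batch_id):
    # Load: append the staged partition into the source table.
    source_table.extend(collect_stage[batch_id])

def transform():
    # Transform: recompute a per-user event count (REPLACE semantics).
    transformed_table.clear()
    for rec in source_table:
        transformed_table[rec["user"]] = transformed_table.get(rec["user"], 0) + 1

def stage_and_publish(batch_id):
    # Publish: stage a partitioned snapshot, then push it downstream.
    publish_stage[batch_id] = sorted(transformed_table.items())
    published.extend(publish_stage[batch_id])

extract_and_stage("b1", [{"user": "ada"}, {"user": "bob"}, {"user": "ada"}])
load("b1")
transform()
stage_and_publish("b1")
print(published)  # [('ada', 2), ('bob', 1)]
```

Even in this toy version, the two stage layers leave an audit trail: every batch that entered or left the flow remains visible under its batch id.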
Below is the list of data units and their corresponding table types and update modes.
- Source Loadunits – these are objects/tables/streams that need to be centralized. They exist in Datasource systems that are external to Datacoral.
- API – APIs define ways of pulling different objects that are relevant to that application. Each such object becomes a source loadunit.
- Database – tables within a database, or extracts of data from those tables, form loadunits.
- Stream – a single stream with a given schema becomes one loadunit.
- Collect stage tables – These tables are for bookkeeping. Datacoral mandates that these tables be APPEND-only and are always PARTITIONED.
- Source tables – Users can decide whether they want the source tables to be REGULAR or PARTITIONED and can be updated in any of APPEND/REPLACE/MERGE modes.
- Transformed tables – Users can decide whether the transformed tables are REGULAR or PARTITIONED and can be updated in any of APPEND/REPLACE/MERGE modes.
- Publish stage tables – These tables are for bookkeeping. Datacoral mandates that these tables be APPEND-only and are always PARTITIONED.
- Destination Publishunits – these are objects/tables/streams that the user wants to see published to target destinations as “publish data.”
- API – REPLACE/APPEND/MERGE publish modes allowed depending on the destination
- Database – REPLACE/APPEND/MERGE publish modes allowed depending on the destination
- Stream – APPEND-only publish mode allowed
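The glossary above boils down to a small table of allowed update modes per table kind, which could be captured as a simple lookup. The mapping below restates the rules from the list; the names and the `check_mode` helper are illustrative, not Datacoral's real configuration.

```python
# Allowed update modes per table kind, as described in the glossary above.
ALLOWED_MODES = {
    "collect_stage": {"APPEND"},                    # always partitioned, append-only
    "source": {"APPEND", "REPLACE", "MERGE"},       # regular or partitioned
    "transformed": {"APPEND", "REPLACE", "MERGE"},  # regular or partitioned
    "publish_stage": {"APPEND"},                    # always partitioned, append-only
}

def check_mode(kind: str, mode: str) -> bool:
    """Return True if `mode` is a valid update mode for a table of `kind`."""
    return mode in ALLOWED_MODES.get(kind, set())

ok = check_mode("source", "MERGE")          # True
bad = check_mode("collect_stage", "MERGE")  # False: stage tables are append-only
print(ok, bad)
```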
In this post, we discussed the data model and data flow model of Data Programming, and highlighted how and where we use regular and partitioned tables within each stage of the flow. Partitioned tables scale well and ensure that all of your data is persisted as it moves through the flow, while regular tables are typically used during the transformation stages of the pipeline.
In the next post, we will get into more detail about how the batch processing and timing model fits into, and helps coordinate, the data flow model. In the following weeks we will discuss the syntax of the SQL-extended Data Programming Language we use to declare the operations in each stage of a data pipeline. Finally, we'll talk about how we knit it all together in AWS and keep the pipeline consistent end to end, while watching for data and schema changes and detecting errors in the process.