In my last post, we talked about the data model and stages through which data flows in Data Programming. This post will dive a little deeper into how we process the data in query batches, and introduce the concept of timelabels in order to keep the data flowing through the pipeline without overflowing. Some familiarity with the previous post’s diagram will be helpful, so we will use a modified version of it again here.
Data Batches and Time Windows
Datacoral’s data pipeline flows by follow a micro-batch processing model. In the diagram above, each square representing a Regular or Partitioned Table, is a batch of data. The batches of data first get defined and created when being received/retrieved from the different data sources by the collect data functions. Collect Data Functions create each batch by building up rows of data from each source as “loadunits” based on time windows we specify. Once the time window duration size is determined for a given source loadunit, that duration length becomes the same size for all batches of that loadunit. The batch of rows is given a label which is based on the time window during which it was executed. This batch label is called a Timelabel.
The timelabel could be based on either one of the following measurements:
- data time window – is a range of timestamps found in each row of the data, if available, and usually originates from the original source system.
- arrival time window – is a wall-clock time window that marks when the row was collected (received or extracted) during processing.
Data Time Window (DTW)
In this model, batches are deemed complete when all data for a given timelabel window has been received. If data is being extracted from the source, and it is possible for the extraction logic to have a predicate that is based on data time, like when a customer pressed the login button, data time window based batching can be chosen. DTW batching allows for more easily understandable analytics. If there can be a guarantee that all data for a given time window can be contained within a given batch, that guarantee allows for the analyst to know exactly what data they are querying by seeking a given batch.
The picture below shows the data time window collection interval and details for the 6AM timelabel (2019-03-20 06:00), whose query process begins at 7AM and ends two minutes later.
Arrival Time Window (ATW)
In this model, rows that are received are collected into batches are deemed completed at a certain wall-clock time. Batches can contain rows with timestamps that might be outside of the arrival time window. Arrival time window based batching is chosen for sources from which data is being received (like event streams, change logs, etc.) rather than extracted. In the case of ATW batching, there is no guarantee that all rows for a given time window are in the same batch, so an analyst may have has to query multiple batches in order to be guaranteed that all data for a given data time window are being queried. In order to prevent full-table scans all the time to handle out-of-order arriving data, specific processing steps are needed to ‘close the books’ for a given data time window. More on closing books (data bookkeeping) later. Examples of ATW- based batching include events coming in from applications, or change logs from production databases.
Time Window Durations are Uniform
Mapping this back to how partition tables are divided (which we covered in our last post), the size of the time window is the same for each partition within a Collect Stage Table. This size is also the frequency at which new data is populated in batches into the Source Table. In order to maintain full auditability, Datacoral mandates that the Stage Table is partitioned, and each new data batch is populated into a new, separate partition. Once the partition is written, it is immutable. In a partitioned table, there is a one-to-one mapping between a timelabel and its partition, which you can see in the diagram below.
In the picture, you can see that data arrives from Apple and Android applications every five minutes. Collections from PostgreSQL occur every fifteen minutes and Salesforce every half hour. New analytic updates are run in Redshift every hour, and updates are published to Salesforce each our, too.
Conveniently, a timelabel is also a version number that can be used to line up updates on tables that must be consistent with one another. The fact that a single timelabel value can be used to represent the entire state of a set of tables at a point in time, means that the timelabel itself could be deemed to be a clock tick interval of the entire set of tables. Similarly, the timelabel forms a clock tick for an entire data flow, if you imagine it tick by tick by tick. And in general, timelabels can be treated as clock ticks for the whole system with all the data flows. Timelabels are a very compact way of completely representing all valid versions of tables in and throughout the entire system. Cool, huh?
Timelabels Create Consistency
Timelabels become very useful when you consider what they will help you do in terms of keeping your data pipeline consistent as it flows, and as its flow changes. Timelabels automatically help with versioning, consistency, orchestration and auditability.
Knowing the version (timelabel) for one table, one can find all the versions (timelabels) of the corresponding upstream and downstream tables without any additional bookkeeping.
When MERGE or REPLACE semantics are used, the timelabel is also a version number for the table. With each new batched update, a new version of the table is deemed to be created. And, when a new version is created for a particular table, the downstream tables in its data flow also need to be recomputed to create versions that make the downstream tables consistent with the upstream tables. Timelabels, in this case, enforce downstream data consistency through the entire pipeline.
Orchestrate End-to-End Data Flow
The batching model and the notion of a Timelabel are intimately tied to provide a consistent batch processing model where the semantics of the updates are clear and there is fully visibility into the end-to-end data flow. This is how we orchestrate pipeline flow such that updates to a specific downstream table are synchronized automatically after the updates to each of its upstream tables occur.
Timelabels are also Audit Trails
Finally, The table/timelabel statuses essentially form the audit log for how the data inside of the database was created and subsequently updated. By knowing this information, we are able to look upstream at the data as well, which is how we can determine its lineage as well as flow and provide full data provenance visibility.
This post described how we produce, manage, regulate and audit flow of data in data pipelines, simply by how we build batches out of two types of time windows. We described both data and arrival time windows (DTWs, and ATWs) and how we use the Timelabels in our system as a means to manage batch sizes, provide reliability, and audit the flow all at the same time. In the next edition, we will describe the syntax of Data Programming instructions and how we use SQL as its fundamental interface.
For more information on how to automate your AWS data pipelines, and find out how much time you can save your data engineering team (hint: 2 to 3 man years, annually), please reach out to us at email@example.com or sign-up for a demo.