In my last post, I discussed the batching model that makes data programming work and introduced the concept of “Timelabels” as a means to keep the data flowing consistently while also giving us version control and a way to partition data reliably.
This post will show how Data Programming uses SQL as the interface’s language, and that we simply add extensions as comments to the SQL file to instruct the data flow.
The Data Programming Interface is SQL
The data model, the data flow model, and the batching model work really well together to allow for the composition of sophisticated end-to-end data flows with really flexible semantics. A data scientist can just use the knowledge of SQL to specify such workflows. Our Data Programming Interface allows a data scientist to only think about the shape and semantics of the data in a data flow. The interface between each of the steps in a data flow is actually the table schema and update model. So, each step of the data flow defines the input tables and the output tables – essentially through a SQL statement with comment extensions like:
/** * @datacoral O-1.0.0 * @matview-schema analytics * @matview-name engagement * @matview-type partitioned — output table is partitioned * @matview-update-mode append * @interval-in-minutes 60— update frequency * @output-warehouse myredshift — datastore where the table resides */ — the actual business logic of the data transformation SELECT … FROM events.mobile — input table – data dependency that is automatically inferred JOIN pg.objects — input table – data dependency that is automatically inferred JOIN sfdc.accounts — input table – data dependency that is automatically inferreds
For each of the tables that are produced in each step of a data flow, the following attributes are specified/inferred by parsing the comments statement like the one above:
- Name of the data store where the table resides
- Name of the schema
- Name of the table
- Type of the table – regular or partitioned
- Update frequency – essentially indicates the set of valid timelabels for a table
- Update semantics of one of REPLACE/APPEND/MERGE – indicates whether
- all rows of a table have the same timelabel (replace)
- once rows are written with a given timelabel, the timelabel of the rows doesn’t change (append), or
- the timelabel of a row is the same as when it was last updated (merge)
With the above information available for each table in a data flow, the end-to-end data pipeline can be automatically generated. Our Data Programming Interface is SQL with comments! It really is that easy.
The way the pipeline gets generated (to be fully serverless) and gets executed (based on a data-update-event-loop) are topics for future is a topic for our next data programming blog post.
Datacoral customers are using the full power of data programming to build out sophisticated data flows that involve ingesting data from different sources, transforming that data for analyses/model building, and finally harness that data by publishing or serving the data to external applications.
In future posts, we will talk about how Datacoral implements all of this as Serverless microservices, that:
- run natively in AWS using more than a dozen AWS services & products,
- are aware of data and schema changes during collection,
- assess the impact of those changes both upstream and downstream within each stage of the pipeline by maintaining metadata,
- orchestrate the execution and timing of each activity given these dependencies,
- assess failures and trigger notifications,
- offer external model computation using Python,
- publish to internal and external systems which completes the end-to-end narrative, and
- are installed securely within the customers’ VPC where we never see the data itself.