The modern analytics approach consists of centralising all the raw data from a company onto a single data warehouse. Once the raw data is there, it needs to be transformed, aggregated, normalized, joined and filtered before being usable in BI tools and other analytics projects. Dataform helps data teams transform raw data into well defined, reliable, tested and documented data tables that will power your company’s analytics.
The traditional ETL, which stands for Extraction, Transformation, and Loading, has now evolved into ELT:
You use Dataform to manage that last part: data transformation in your data warehouse.
Once raw data is loaded into your warehouse, your team transforms it into a single source of truth of data across your organisation. Dataform enables your team to follow industry best practices:
SQL is the de facto language for processing data in cloud data warehouses and SQL has many advantages.
Current SQL workflows don’t necessarily follow engineering best practices. Several key features of writing code are missing in current SQL implementations.
SQLX is an open source extension of SQL. As it is an extension, every SQL file is also a valid SQLX file. SQLX brings additional features to SQL to make development faster, more reliable, and scalable. It includes many functions including dependencies management, automated data quality testing, and data documentation.
In practice, SQLX is mostly composed of SQL in the dialect of your data warehouse (Standard SQL if you are using BigQuery, SnowSQL if you are using Snowflake…).
In SQLX, you only write SELECT statements. You specify what you want the output of the script to be in the config block, like a view
or a table
as well as other types available.
Dataform takes care of adding boilerplate statements like CREATE OR REPLACE
or INSERT
.
The ref
function is a critical concept in Dataform. Instead of hard coding the schema and table names of your data tables, the ref
function enables you to reference tables and views defined in your dataform project.
Dataform uses that ref
function to build a dependency tree of all the tables to be created or updated. When Dataform runs your project in your warehouse, that ensures that tables are processed in the right order.
The following images illustrate a simple Dataform project and its dependency tree. In practice, a Dataform project can have dependency trees with hundreds of tables.
Managing dependencies with the ref
function has numerous advantages.
One of the powerful attributes of SQLX is that you can define the transformation logic, data quality testing rules, and your table documentation all within a single file.
You can add a description of your table and its fields directly in the config block of your SQLX file. Description of your tables is available in the Dataform data catalog.
Defining description within the same file makes it easy to maintain data documentation which is always up to date.
You can define data quality tests, called assertions, directly from the config block of your SQLX file. Assertions can be used to check for uniqueness, null values or any custom row condition.
Assertions defined in the config block get added onto your project’s dependency tree after the table creation.
For more advanced use cases, assertions can also be defined in separate SQLX files. See the assertion page on documentation.
SQLX has numerous additional features to help you manage data in your warehouse and build more reliable data pipelines faster.
Check the docs to learn more.
Step 1. You develop your pipelines in SQLX files, locally or using the Dataform web editor.
Step 2. Dataform compiles your entire project into native SQL that can be run in your warehouse. That process happens in real time, resolving dependencies, checking for errors and alerting you if any issue occurs
Step 3. Dataform then connects to your data warehouse and executes those SQL commands. That happens manually, on a schedule or via an API call.
Step 4. When this is done, you get a list of data tables that are tested and documented that you can use for your analytics.
Dataform web is a web application made for data teams. It packages a rich Integrated Development Environment (IDE), a pipeline scheduler, a logs viewer and a data catalog.
Dataform web enables data teams to collaborate in a single environment and brings the following benefits:
To learn more about Dataform web, you can check dataform.co/product.