SQLX and Dataform in 5 minutes

Learn how Dataform and SQLX can help your team manage data in your warehouse.

Introduction

The modern analytics approach consists of centralising all the raw data from a company onto a single data warehouse. Once the raw data is there, it needs to be transformed, aggregated, normalized, joined and filtered before being usable in BI tools and other analytics projects. Dataform helps data teams transform raw data into well defined, reliable, tested and documented data tables that will power your company’s analytics.

From ETL to ELT

The traditional ETL, which stands for Extraction, Transformation, and Loading, has now evolved into ELT:

  1. Raw data is Extracted from source systems and Loaded into the data warehouse.
  2. Raw data is Transformed within the data warehouse.

You use Dataform to manage that last part: data transformation in your data warehouse.

Learn more about ELT with our page on where Dataform fits in the modern data stack.

Build a single source of truth of data

Once raw data is loaded into your warehouse, your team transforms it into a single source of truth of data across your organisation. Dataform enables your team to follow industry best practices:

  • Manage data with code (with SQL)
  • Set standardized development processes like version control
  • Test your data quality
  • Document your data tables

A note of Graphical User Interfaces

Graphical User Interfaces (GUI) are often easy to get started with and can help less technical users build data pipelines. In practice, we find that past 10 or 20 data tables, pipelines become extremely difficult to manage, search and reason about. SQL is one of the best abstraction to express complex data logic.

Introducing SQLX

SQL is the de facto language for processing data in cloud data warehouses and SQL has many advantages.

  • Scalable processing in the data warehouse.
  • It’s usually much simpler and easier to express your pipeline in SQL.
  • It’s a common language across teams and systems.
  • It’s easy to introspect when something goes wrong.
  • It enables faster development thanks to fast feedback loops.

The few limitations of SQL

Current SQL workflows don’t necessarily follow engineering best practices. Several key features of writing code are missing in current SQL implementations.

  • You can’t reuse code easily across different scripts.
  • There’s no way to write tests to ensure data consistency.
  • Managing dependencies is hard because it requires separate systems. In practice many teams write 1000 lines long queries to ensure data processing happens in the right order.
  • Data is often not documented because documentation is needs to be managed outside of the code, in a separate system. It makes it hard for teams to keep it updated.

What is SQLX

SQLX is an open source extension of SQL. As it is an extension, every SQL file is also a valid SQLX file. SQLX brings additional features to SQL to make development faster, more reliable, and scalable. It includes many functions including dependencies management, automated data quality testing, and data documentation.

What does SQLX look like?

In practice, SQLX is mostly composed of SQL in the dialect of your data warehouse (Standard SQL if you are using BigQuery, SnowSQL if you are using Snowflake…).

SQLX example
This illustration uses BigQuery Standard SQL. SQLX works the same way with all SQL dialects.

1 Config block

In SQLX, you only write SELECT statements. You specify what you want the output of the script to be in the config block, like a view or a table as well as other types available.

Dataform takes care of adding boilerplate statements like CREATE OR REPLACE or INSERT .

2 The Ref function and dependency management

The ref function is a critical concept in Dataform. Instead of hard coding the schema and table names of your data tables, the ref function enables you to reference tables and views defined in your dataform project.

Dataform uses that ref function to build a dependency tree of all the tables to be created or updated. When Dataform runs your project in your warehouse, that ensures that tables are processed in the right order.

The following images illustrate a simple Dataform project and its dependency tree. In practice, a Dataform project can have dependency trees with hundreds of tables.

Ref function illustration
Scripts in a Dataform project
dependency tree
Dependency tree in a Dataform project

Managing dependencies with the ref function has numerous advantages.

  • The dependency tree complexity is abstracted away. Developers simply need to use the ref function and list dependencies.
  • It enables us to write smaller, more reusable and more modular queries instead of thousand lines long queries. That makes pipelines easier to debug.
  • You get alerted in real time about issues like missing or circular dependencies

SQLX = transformation logic + data quality testing + documentation

One of the powerful attributes of SQLX is that you can define the transformation logic, data quality testing rules, and your table documentation all within a single file.

All SQLX features can be adopted incrementally. We often see teams starting with simple scripts and progressively adopting more and more SQLX features as their pipeline complexity grows.
SQLX second example
This example illustrates a SQLX file using data documentation and data quality testing features.

1 Data documentation

You can add a description of your table and its fields directly in the config block of your SQLX file. Description of your tables is available in the Dataform data catalog.

Defining description within the same file makes it easy to maintain data documentation which is always up to date.

The documentation you add to Dataform is machine readable. This allows you to parse this documentation, and push it out to other tools.

2 Data quality tests

You can define data quality tests, called assertions, directly from the config block of your SQLX file. Assertions can be used to check for uniqueness, null values or any custom row condition.

Assertions defined in the config block get added onto your project’s dependency tree after the table creation.

DAG with assertions
Assertions defined in the config block are added to the dependency tree of your project. They will run after the table creation / update.

For more advanced use cases, assertions can also be defined in separate SQLX files. See the assertion page on documentation.

Other SQLX features

SQLX has numerous additional features to help you manage data in your warehouse and build more reliable data pipelines faster.

  • Incremental tables and snapshots
  • Reusable functions and variables
  • Declaring source data
  • And many other features

Check the docs to learn more.

How do you develop in SQLX

Step 1. You develop your pipelines in SQLX files, locally or using the Dataform web editor.

Step 2. Dataform compiles your entire project into native SQL that can be run in your warehouse. That process happens in real time, resolving dependencies, checking for errors and alerting you if any issue occurs

Step 3. Dataform then connects to your data warehouse and executes those SQL commands. That happens manually, on a schedule or via an API call.

Step 4. When this is done, you get a list of data tables that are tested and documented that you can use for your analytics.

SQLX is open source

The SQLX compiler and runner is open source, you can run it locally or on your own servers.

Lean how Dataform works in more details.

Enable all your team to adopt best practices and be more productive with Dataform web

Dataform web is a web application made for data teams. It packages a rich Integrated Development Environment (IDE), a pipeline scheduler, a logs viewer and a data catalog.

Dataform web enables data teams to collaborate in a single environment and brings the following benefits:

  • Managed infrastructure to run data pipelines in the data warehouse.
  • An alerting system and detailed logs to minimize the time spent on pipeline maintenance.
  • An intuitive UX that lowers the barrier to entry of engineering best practices like version control and development environments.
  • Instant feedback while developing their project for more productivity
  • A data catalog to explore data tables and existing pipelines quickly.

To learn more about Dataform web, you can check dataform.co/product.

What's next

ELT and the modern data stack

An introduction to ELT and where Dataform fits in.

How Dataform works

Learn how Dataform compiles your project and runs it in your warehouse.

Sitemap