First steps with your Dataform project

Principles to start your project on the right foot

Organise your project with folders

Managing your project in folders makes it easier to maintain your project. One simple option is to follow the following folder structure:

Example of a Dataform project

Sources

Where you define transformations on your data sources. This is where you can create datasets that transform data from your different sources into a consistent format, using the same structure and naming conventions. Some examples of transformations you may want to include at this stage are:

  • Normalising sources (ex: ensure email have the same field name in all tables)
  • Aligning data types (ex: ensure timestamps are in a unique time zone and money fields in dollars)

All subsequent datasets should be built on top of these models, reducing the amount of duplicated code in your project.

Staging

Where you can define intermediary datasets that are only used within your Dataform project. The objectives of these tables are to make your project easier to maintain, easier to reason about, and make your pipelines more efficient.

Analytics

Where you define datasets that represent entities and processes relevant to your business and will be consumed by downstream users (BI tools, reports…).

As your project gets bigger and more complex, you can leverage subdirectories within those folders to separate groups of sources or analytics tables.

Manage dependencies with ref()

The ref() function ensures data transformation pipelines run in the correct sequence. We recommend you always use the ref() function when selecting or joining from another dataset rather than using the direct relation reference.

definitions/table_without_ref.sqlx
1config { type: "table" }
2
3select * from "schema"."my_table"

definitions/table_with_ref.sqlx
1config { type: "table" }
2
3select * from ${ref("my_table")}

The ref function is how you tell Dataform about the dependencies in your project. With this information, Dataform builds a dependency tree of your project in real time, and achieves the following things:

  • Alerts you in in real time about dependency errors
  • Ensures you are in the correct environment
  • Run your transformation pipelines in the correct sequence

Define your source data with declarations

Your Dataform project will depend on raw data stored in your warehouse, created by processes external to Dataform. These external processes can change the structure of your tables over time (column names, column types…).

We recommend defining raw data as declarations to build your projects without any direct relation reference to tables in your warehouse.

Using declarations enables you to reference your raw data in a unique place that can be updated in your data sources changes.

There are two ways to define declarations: in SQLX files and in JS files.

In SQLX files, with one declaration per file.

definitions/sources/customer.sqlx
1config {
2  type: "declaration",
3  database: "SNOWFLAKE_SAMPLE_DATA",
4  schema: "TPCH_SF1",
5  name: "CUSTOMER",
6}

In JS files, where you can define multiple declarations in one file. You can use this to define all raw data from the same source for example.

definitions/sources/stripe_dependencies.js
1declare({
2  schema: "stripe",
3  name: "charges"
4});
5
6declare({
7  schema: "stripe",
8  name: "accounts"
9});
10
11declare({
12  schema: "stripe",
13  name: "..."
14});

Once a declaration has been defined, it can be referenced by the ref() function like any other Dataform dataset.

definitions/staging/charges.sqlx
1config { type: "table" }
2
3select * from ${ref("charges")}

Use tags to manage your schedules

Tags can be added to SQLX config blocks to create collection of actions and datasets. When starting your project, it can useful to start using the following two tags:

  • daily to tag actions that need to be updated once a day
  • hourly to tag actions that need to be updated once per hour

You can create two schedules, one that will run all actions tagged daily every day and one that will run all actions tagged hourly every hour

With this setup, when creating a dataset, simply add the right tag to ensure it gets updated at the correct frequency.

definitions/analytics/customers.sqlx
1config { type: "table", tags: ["daily"] }
2
3select * from ${ref("crm_data")}

What's next

Best practices for managing dataform projects

Follow best practices to build a project that will scale and maximise productivity

Sitemap