Best practices for managing dataform projects

Follow best practices to build a project that will scale and maximise productivity

Choose the right action type

SQLX supports three types of materializations depending on your needs: tables, views and incremental tables. As general rules:

  • Use views by default (they’re cheap and quick)
  • Use tables for datasets you expect to have downstream users for performance reasons. For example datasets queries by BI tools or datasets that have multiple descendents.
  • If the input data is very large, use incremental tables. They transform only new data (but are more complex to manage)

Organise data for your users using custom schemas

Dataform has a default schema defined in your dataform.json file. You can override it in the config block of your SQLX files.

1config {
2  type: "view",
3  schema: "staging",
4  tags: ["staging", "daily"],
5  description: "Cleaned version of the raw customer table."
9  c_custkey as customer_key,
10  c_name as full_name,
11  ...

Test the quality of your data with assertions

Assertions are data quality checks, used to ensure data meets expectations.

Assertions are run as part of your schedule: if an assertion fails, you will be notified (if notifications are configured)

Assertions can be added to the config block in your .sqlx files. There are three types:

  • uniqueKey : ensure there is only one row per value of the supplied column(s)
  • nonNull : ensure the field(s) are not null
  • rowConditions : takes a custom SQL expression. If this expression is FALSE for any rows, the assertion fails.
2config {
3  type: "table",
4  assertions: {
5    nonNull: ["order_date", "order_key", "customer_key"],
6    uniqueKey: ["order_key"],
7    rowConditions: [
8      "total_parts >= 0"
9    ]
10  }
13select ...

Note: you can also create custom assertions using SQLX. Read the page on assertions for more.

Document your data

Help keep collaboration on your project frictionless by adding documentation to your Dataform code.

You can add table and columns descriptions in the config block of SQLX files.

2config {
3  type: "table",
4  description: "This table contains summary stats by date aggregated by country",
5  columns: {
6    order_date: "Date of the order",
7    order_id: "ID of the order",
8    customer_id: "ID of the customer in the CRM",
9    order_status: "Status of the order, from Shopify",
10    payment_status: "Status of payment, from Stripe",
11    payment_method: "Credit card of ACH",
12    item_count: "Number of items in that order",
13    amount: "Amount charged for that order, in US dollars using a floating FX rate"
14  }
17select ...

Documentation is automatically added to the Data Catalog within Dataform and can later be exported to other tools.

Keep your code DRY using reusable includes macros

DRY stands for Don’t Repeat Yourself. JavaScript files can be added to the includes/ folder to define simple scripts, constants or macros that can be reused across your project.

Defining macros within includes allows you to keep your transformation logic DRY: write a function or variable once, and use it across the rest of your project

Example: country group mapping The function country_group defines a mapping from country to region. It can be defined once, and then reused throughout your project. If the grouping changes, you only need to change code in one place

1module.exports = (country) => {
2  return `
3  case
4    when ${country} in ('US', 'CA') then 'NA'
5    when ${country} in ('GB', 'FR', 'DE', 'IT', 'PL', 'SE') then 'EU'
6    when ${country} in ('AU') then ${country}
7    else 'Other'
8  end`;

1config { type: "table"}
4  country as country,
5  ${country_group("country")} as country_group,

2  country as country,
3  case
4    when country in ('US', 'CA') then 'NA'
5    when country in ('GB', 'FR', 'DE', 'IT', 'PL', 'SE') then 'EU'
6    when country in ('AU') then country
7    else 'Other'
8  end as country_group,

What's next

First steps with your Dataform project

Principles to start your project on the right foot