By default, Dataform runs all of your project code from your project's `master` Git branch. Configuring environments allows you to control this behaviour, enabling you to run multiple different versions of your project code.
An environment is effectively a wrapper around a version of your project code. Once you have defined an environment, and added some schedules to that environment, Dataform runs all of those schedules using the environment's version of the project code.
A common use case for environments is to run a staged release process. After testing code in a `staging` environment, the code is promoted to a stable `production` environment.
Environments are only available if the version of `@dataform/core` that your project uses is at least 1.4.9.
Environments are configured in your Dataform project's `environments.json` file.
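An environment consists of:
- a name
- a Git reference (`gitRef`) identifying the version of the project code to run
- configuration overrides (`configOverride`) for the project's settings (as defined in `dataform.json`)
- a list of schedules

A simple example of an `environments.json` file is: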
environments.json
```json
{
  "environments": [
    {
      "name": "staging",
      "gitRef": "67bed6bd4205ce97fa0284086ed70e5bc7f6dd75",
      "configOverride": {
        "defaultDatabase": "dataform_staging"
      },
      "schedules": [
        {
          "name": "run_everything_once_per_day",
          "cron": "0 10 * * *"
        }
      ]
    }
  ]
}
```
This `staging` environment runs the `run_everything_once_per_day` schedule at the `67bed6bd...` Git commit SHA. It also overrides the value of `defaultDatabase` to isolate `staging` schedule runs from those in other environments.
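To make the effect of an override concrete, the following is a minimal sketch of how the settings seen by a schedule could be derived. It assumes overrides are applied as a shallow merge on top of the project defaults; the helper `effectiveConfig` and the `dataform_development` default are hypothetical and are not part of Dataform's API.

```javascript
// Sketch only: mimics the documented effect of configOverride.
// This is not Dataform's own implementation.

// Project defaults, as they might appear in dataform.json (illustrative values).
const projectConfig = {
  warehouse: "bigquery",
  defaultSchema: "dataform_data",
  defaultDatabase: "dataform_development",
};

// The staging environment from the example above, as a parsed object.
const environments = {
  environments: [
    {
      name: "staging",
      gitRef: "67bed6bd4205ce97fa0284086ed70e5bc7f6dd75",
      configOverride: { defaultDatabase: "dataform_staging" },
      schedules: [{ name: "run_everything_once_per_day", cron: "0 10 * * *" }],
    },
  ],
};

// Hypothetical helper: returns the settings a schedule in `environmentName` would see.
function effectiveConfig(projectConfig, environments, environmentName) {
  const env = environments.environments.find((e) => e.name === environmentName);
  if (!env) {
    throw new Error(`Unknown environment: ${environmentName}`);
  }
  // Assumption: overridden keys win; all other keys fall through from the defaults.
  return { ...projectConfig, ...(env.configOverride || {}) };
}

console.log(effectiveConfig(projectConfig, environments, "staging"));
```

Under this assumption only `defaultDatabase` changes for `staging` runs, while `warehouse` and `defaultSchema` keep their project-wide values.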
Note that Dataform uses the `environments.json` file on your `master` branch to determine your project's environments. Any changes to your environments must be pushed to `master` before Dataform will take note of those changes.
Dataform injects two special variables when schedules are executed: `environmentName` and `scheduleName`. You can use these in your code by referencing `dataform.projectConfig.vars`. For example, to select 10% of data in a `staging` environment:
definitions/my_view.sqlx
```sql
config { type: "view" }

select
  *
from ${ref("data")}
${when(
  dataform.projectConfig.vars.environmentName === "staging",
  "where farm_fingerprint(id) % 10 = 0",
)}
```
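If several datasets need the same sampling rule, the condition could be factored into a small JavaScript helper. The sketch below is illustrative only: the file name `includes/sampling.js`, the function `stagingSample`, and its parameters are hypothetical, and it assumes the variables object may be missing when code is not run by a schedule.

```javascript
// includes/sampling.js (hypothetical file name)

// Returns a WHERE clause that keeps roughly 1/buckets of the rows when running
// in the "staging" environment, and an empty string in every other case.
function stagingSample(vars, idColumn, buckets = 10) {
  // Assumption: vars may be undefined when no schedule injected them.
  if (!vars || vars.environmentName !== "staging") {
    return "";
  }
  return `where farm_fingerprint(${idColumn}) % ${buckets} = 0`;
}

module.exports = { stagingSample };
```

Assuming the project exposes files in `includes/` by file name, a SQLX file could then call `${sampling.stagingSample(dataform.projectConfig.vars, "id")}` in place of the inline `when(...)` expression.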
More advanced Dataform projects typically have a tightly controlled `production` environment. All changes to project code go into a `staging` environment which is intentionally kept separate from `production`. Once the code in `staging` has been verified to be sufficiently stable, that version of the code is then promoted to the `production` environment.
Note that a `staging` environment is typically not useful for code development. Usually during development you would simply run all code using the project's default settings (as defined in `dataform.json`). Thus, unless you want Dataform to run schedules in a `staging` environment, it may not be useful to define one.
A clean and simple way to separate staging and production data is to use a different database for each environment. However, this is only supported for BigQuery and Snowflake, so we recommend using per-environment schema suffixes for other warehouse types. The examples below show how to do both by overriding project configuration settings.
In the example below:
- the `production` environment is locked to a specific Git commit
- the `staging` environment runs the project's schedules at the latest version of the project's code (as exists on the `master` branch)
A nice property of this configuration is that any change to the `production` environment leaves an audit trail (by being recorded in your project's Git history).
environments.json
```json
{
  "environments": [
    {
      "name": "production",
      "gitRef": "PRODUCTION_GIT_COMMIT_SHA_GOES_HERE",
      "configOverride": {
        "defaultDatabase": "dataform_production"
      },
      "schedules": [ ... ]
    },
    {
      "name": "staging",
      "gitRef": "master",
      "configOverride": {
        "defaultDatabase": "dataform_staging"
      },
      "schedules": [ ... ]
    }
  ]
}
```
To update the version of the project running in `production`, change the value of that environment's `gitRef`, and then push that change to your `master` branch. On GitHub, Git commit SHAs can be found by opening the project page and clicking 'commits'.
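The promotion step amounts to a one-field edit of `environments.json`. The sketch below shows one way to script it; the helper `promote` is hypothetical, and it assumes commit SHAs are full 40-character hexadecimal SHA-1 hashes.

```javascript
// Hypothetical helper: returns a copy of a parsed environments.json in which
// the named environment's gitRef points at a new commit. The input is not mutated.
function promote(config, environmentName, commitSha) {
  // Assumption: full 40-character lowercase SHA-1 hashes only.
  if (!/^[0-9a-f]{40}$/.test(commitSha)) {
    throw new Error(`Expected a full 40-character commit SHA, got: ${commitSha}`);
  }
  if (!config.environments.some((env) => env.name === environmentName)) {
    throw new Error(`Unknown environment: ${environmentName}`);
  }
  return {
    ...config,
    environments: config.environments.map((env) =>
      env.name === environmentName ? { ...env, gitRef: commitSha } : env
    ),
  };
}

// Illustrative input: production pinned to a placeholder SHA, staging tracking master.
const before = {
  environments: [
    { name: "production", gitRef: "0".repeat(40), schedules: [] },
    { name: "staging", gitRef: "master", schedules: [] },
  ],
};

const after = promote(before, "production", "67bed6bd4205ce97fa0284086ed70e5bc7f6dd75");
console.log(JSON.stringify(after, null, 2));
```

The result would still need to be written back to `environments.json`, committed, and pushed to `master` before it takes effect, which is also what records the change in the project's Git history.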
A Git branch or tag can be specified instead of a commit SHA:
```json
"gitRef": "GIT_BRANCH_OR_TAG_NAME_GOES_HERE"
```
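However, we do not recommend using Git branches to manage a `production` environment: a branch moves whenever new commits are pushed to it, so the environment is no longer locked to a specific version of the code.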
An alternative approach to separating production and staging data is to append a suffix to schemas in each environment. For example:
environments.json
```json
{
  "environments": [
    {
      "name": "production",
      "gitRef": "PRODUCTION_GIT_COMMIT_SHA_GOES_HERE",
      "configOverride": {
        "schemaSuffix": "_production"
      },
      "schedules": [ ... ]
    },
    {
      "name": "staging",
      "gitRef": "master",
      "configOverride": {
        "schemaSuffix": "_staging"
      },
      "schedules": [ ... ]
    }
  ]
}
```
Some teams may not be at the stage where they require a `staging` environment, but would still like to keep development and production data separated. This can be done using a `configOverride` in the `production` environment.
In the example below:
- code run during development uses the `defaultDatabase` from the `dataform.json` file
- schedules run in the `production` environment, and so use the `defaultDatabase` from that environment's `configOverride`
- the `production` environment specifies the `master` Git branch, so all of its schedules will run using the latest version of the code

dataform.json
```json
{
  "warehouse": "bigquery",
  "defaultSchema": "dataform_data",
  "defaultDatabase": "analytics-development"
}
```
environments.json
```json
{
  "environments": [
    {
      "name": "production",
      "gitRef": "master",
      "configOverride": { "defaultDatabase": "analytics-production" },
      "schedules": [ ... ]
    }
  ]
}
```
Note that the `defaultDatabase` setting is only supported for BigQuery and Snowflake. For other warehouses, we recommend overriding schema suffixes (as described above).
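The choice between the two isolation strategies can be summarised in a few lines. This is a sketch under stated assumptions: the helper `isolationOverride` and the `databasePrefix` naming convention are hypothetical, and the warehouse strings are assumed to match the `warehouse` values used in `dataform.json`.

```javascript
// Warehouses for which the text above says defaultDatabase overrides are supported.
const DATABASE_OVERRIDE_WAREHOUSES = new Set(["bigquery", "snowflake"]);

// Hypothetical helper: builds a configOverride that isolates one environment's data.
// `databasePrefix` is an assumed naming convention, not something Dataform requires.
function isolationOverride(warehouse, environmentName, databasePrefix = "dataform") {
  if (DATABASE_OVERRIDE_WAREHOUSES.has(warehouse)) {
    // Separate database per environment.
    return { defaultDatabase: `${databasePrefix}_${environmentName}` };
  }
  // Fall back to a per-environment schema suffix for other warehouse types.
  return { schemaSuffix: `_${environmentName}` };
}

console.log(isolationOverride("bigquery", "staging"));
console.log(isolationOverride("redshift", "staging"));
```

The returned object has the same shape as the `configOverride` entries in the examples above, so it could be used when generating an `environments.json` file for projects on different warehouses.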