Create a new example.sqlx
file in your project under the definitions/
folder:
example.sqlx1config { type: "view" } 2SELECT 1 AS TEST
Upon running this query a view called schema.example
will be created (where schema
is the Dataform schema defined in your dataform.json
file).
Note that non-SQL statements are stripped during query compilation, so the final executable query in this case would be:
1SELECT 1 AS TEST
For a list of all configuration options for datasets, view the ITableConfig
reference documentation.
To create a table, i.e. a full independent copy of the query result, set the type
of the query to table
.
1config { type: "table" } 2SELECT 1 AS TEST
For a list of all configuration options for datasets, view the ITableConfig
reference documentation.
Dataform provides methods that enable you to easily reference another dataset in your project without having to provide the full SQL dataset name.
definitions/source.sqlx1SELECT 1 AS sourcedata
definitions/ref_example.sqlx1SELECT * FROM ${ref("source")}
In order to reference the dataset created by a source.sqlx
file, the value that should be passed to the ref()
function is "source"
, i.e. the name of the file defining the dataset (without the file extension).
The query will be compiled into the following SQL before it is run:
1SELECT * FROM "dataform_schema"."source"
Any dataset that is referenced by a query will automatically be added to that query's dependencies. Dependency queries are always executed before dependent queries to ensure pipeline correctness.
If you want to manually add a dependency to a query - one that is not already explicitly referenced with ref()
- you should configure the file's dependencies
:
1config { dependencies: [ "some_table" ] } 2SELECT * FROM ...
Multiple dependencies may be provided in a single invocation:
1config { dependencies: [ "some_table", "some_other_table" ] } 2SELECT * FROM ...
To stop a query being run, you can disable it. This will keep the dataset as part of your graph, but stop it from executing when you run your project. This can be useful for example if the relevant query breaks for some reason and you don't want your pipeline to fail while it's being fixed.
1config { disabled: true} 2SELECT * FROM ...
You can specify pre_operations { ... }
and post_operations { ... }
to configure Dataform to execute one or more SQL statements before or after creating a dataset.
post_operations
The following example defines a post-query operation to configure dataset access permissions. It makes use of the built-in self()
method which returns the fully-qualified name of the current dataset.
1SELECT * FROM ... 2 3post_operations { 4 GRANT SELECT ON ${self()} TO GROUP "allusers@dataform.co" 5}
To specify multiple operations, separate them with ---
(using ---
to separate statements in pre- and post-operations ensures compatibility across all DB engines, while ;
only works with engines that support multiple statements in a single query):
1post_operations { 2 GRANT SELECT ON ${self()} TO GROUP "allusers@dataform.co" 3 --- 4 GRANT SELECT ON ${self()} TO GROUP "otherusers@dataform.co" 5}
By default, a dataset's schema is set to the default schema chosen when initializing a project, usually dataform
, and a dataset's name is set to the name of the corresponding file.
This can be overridden on a per-file basis by specifying the schema
and/or name
options. For example, to create a dataset called example
in the schema other_schema
:
1config { 2 schema: "other_schema", 3 name: "example" 4} 5SELECT * FROM ...
To reference datasets with overridden schemas, use the fully qualified name of the dataset:
1SELECT * FROM ${ref("example")}
For more information on configuring BigQuery datasets, such as enabling dataset partitioning, check out the BigQuery guide.
For more information on configuring Redshift datasets, such as sort keys and dist keys, check out the Redshift guide.
For more information on configuring SQL Data Warehouse datasets, such as distribution settings, check out the SQL Data Warehouse guide.