Dataform will connect to BigQuery using Application Default Credentials or using a service account.

If using Application Default Credentials, ensure that the service account or user has the BigQuery Admin role or equivalent. (Dataform requires access to create queries and list tables.) Read this if you need help.

If using a service account, you'll need to create one from your Google Cloud Console and assign it permissions to access BigQuery, specifically the BigQuery Admin role. (Admin access is required by Dataform so that it can create datasets and list tables.) Read this if you need help.

BigQuery specific options can be applied to tables using the `bigquery` configuration parameter.
BigQuery supports partitioned tables. These can be useful when you have data spread across many different dates but usually query the table on only a small range of dates. In these circumstances, partitioning will increase query performance and reduce cost.
BigQuery partitions can be configured in Dataform using the `partitionBy` option:
```
config {
  type: "table",
  bigquery: {
    partitionBy: "DATE(ts)"
  }
}
SELECT CURRENT_TIMESTAMP() AS ts
```
This query compiles to the following statement which takes advantage of BigQuery's DDL to configure partitioning:
```
CREATE OR REPLACE TABLE dataform.example
PARTITION BY DATE(ts)
AS (SELECT CURRENT_TIMESTAMP() AS ts)
```
Tables can also be partitioned by hour,
```
config {
  type: "table",
  bigquery: {
    partitionBy: "DATETIME_TRUNC(<timestamp_column>, HOUR)"
  }
}
```
... month,
```
config {
  type: "table",
  bigquery: {
    partitionBy: "DATE_TRUNC(<date_column>, MONTH)"
  }
}
```
... or an integer value.
```
config {
  type: "table",
  bigquery: {
    partitionBy: "RANGE_BUCKET(<integer_column>, GENERATE_ARRAY(0, 1000000, 1000))"
  }
}
```
If desired, tables can also be clustered using the `clusterBy` option, for example:
```
config {
  type: "table",
  bigquery: {
    partitionBy: "DATE(ts)",
    clusterBy: ["name", "revenue"]
  }
}
SELECT CURRENT_TIMESTAMP() as ts, name, revenue
```
If needed, you can require that queries over the table include a partition filter. To do this, set the `requirePartitionFilter` option to `true`. If you want to control the retention of all partitions in a partitioned table, set `partitionExpirationDays` according to your needs.
For example:
```
config {
  type: "table",
  bigquery: {
    partitionBy: "DATE(ts)",
    partitionExpirationDays: 14,
    requirePartitionFilter: true
  }
}
SELECT CURRENT_TIMESTAMP() AS ts
```
Only newly created models will have `partitionExpirationDays` and `requirePartitionFilter` set. If you change those values after the model has been created, the table will need to be altered manually.
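As a minimal sketch of that manual change (the `dataform.example` table name is just the one from the compiled example above, so substitute your own dataset and table), BigQuery's `ALTER TABLE ... SET OPTIONS` DDL can update both settings on an existing table:

```
-- Update partition options on an existing table, run directly in BigQuery.
-- "dataform.example" is illustrative; use your own dataset and table name.
ALTER TABLE dataform.example
SET OPTIONS (
  partition_expiration_days = 14,
  require_partition_filter = true
);
```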
In order to be able to query Google Sheets tables via BigQuery, you'll need to share the sheet with the service account that is used by Dataform, i.e. the service account listed in your .df-credentials.json file.

BigQuery labels are key-value pairs that help you organize your Google Cloud BigQuery resources. To use them in Dataform, add them to the config block:
```
config {
  type: "table",
  bigquery: {
    labels: {
      label1: "val1",
      /* If the label name contains special characters, e.g. hyphens, then quote its name. */
      "label-2": "val2"
    }
  }
}

select "test" as column1
```
You can both read from and publish to two separate GCP project_ids within a single Dataform project. For example, you may have a project_id called `raw` that contains raw data loaded in your warehouse and a project_id called `analytics` in which you create data tables you use for analytics and reporting.
Your default project id is defined in the `defaultDatabase` field in your `dataform.json` file.
1{ 2 "warehouse": "bigquery", 3 "defaultSchema": "dataform", 4 "assertionSchema": "dataform_assertions", 5 "defaultDatabase": "raw" 6}
You can then override the default GCP project id in the `database` field in the config block of your SQLX files:
```
config {
  type: "table",
  database: "analytics"
}
```
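For illustration only (the dataset and table names below are hypothetical), the two-project setup can be sketched by declaring a source table that lives in the `raw` project and referencing it from a model that publishes to the `analytics` project:

```
-- definitions/events.sqlx: declare a source table that lives in the "raw" project
config {
  type: "declaration",
  database: "raw",
  schema: "source_data",
  name: "events"
}

-- definitions/daily_events.sqlx: publish an aggregate table to the "analytics" project
config {
  type: "table",
  database: "analytics"
}

select
  date(event_timestamp) as event_date,
  count(*) as event_count
from ${ref("events")}
group by 1
```

Because the declaration carries its own `database`, `${ref("events")}` resolves to the table in `raw`, while the model itself is created in `analytics`.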
You can configure separate project-ids for development and production in your environment.json file. The process is described on this page.
When creating incremental tables from partitioned tables in BigQuery, some extra care needs to be taken to avoid full table scans, as BigQuery can't optimize `where` statements that are computed from inline select statements. To work around this, values used in the `where` clause should be moved to a `pre_operations` block and saved into a variable using BigQuery scripting.
Here's a simple incremental example for BigQuery that follows this pattern, where the source table `raw_events` is already partitioned by `event_timestamp`.
```
config {
  type: "incremental",
}

pre_operations {
  declare event_timestamp_checkpoint default (
    ${when(incremental(),
    `select max(event_timestamp) from ${self()}`,
    `select timestamp("2000-01-01")`)}
  )
}

select
  *
from
  ${ref("raw_events")}
where event_timestamp > event_timestamp_checkpoint
```
This will avoid a full table scan on the `raw_events` table when inserting new rows; only the most recent partitions that are needed will be scanned.
If you use policy tags for managing column-level security in BigQuery, then you can set policy tags on columns in tables via the Dataform config block. Note that any policy tags created directly in BigQuery will get overwritten when Dataform updates the table, so make sure that you configure all policy tags within Dataform. The credentials Dataform uses to connect to BigQuery will also need the Policy Tag Admin permission in order to set policy tags.
Here's an example of setting a policy tag on a column. The full tag identifier must be used, as in the example below. It can easily be copied to your clipboard from the taxonomies and tags page inside Google Data Catalog.
my_table.sqlx
```
config {
  type: "table",
  columns: {
    column1: {
      description: "Some description",
      bigqueryPolicyTags: ["projects/dataform-integration-tests/locations/us/taxonomies/800183280162998443/policyTags/494923997126550963"]
    }
  }
}

select "test" as column1
```
Dataform validates the compiled script you are editing against BigQuery in real time. It will let you know whether the query is valid (or won't run) before you have to run it.
Dataform displays Bytes processed and Bytes billed for every run you do in Dataform on the run logs page. You can then estimate the cost of those queries by multiplying the bytes billed by your company's price per byte.
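For example, assuming (purely for illustration) an on-demand price of $5 per TiB, a run that bills 250 GiB would cost roughly 250 / 1024 × $5 ≈ $1.22; substitute the rate from your own billing agreement.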
We prepared the following sample project using the Stack Overflow public dataset.
We published a package that helps with the analysis of BigQuery usage logs. You can find more information in the related blog post or on the package page.
We published the following blog posts specifically for BigQuery users. One offers a walkthrough for using Google Cloud Functions to push data from tables and views in BigQuery to third-party services like Intercom. The other offers a walkthrough for building a simple end-to-end machine learning pipeline using BigQuery ML in Dataform.
If you are using Dataform web and are having trouble connecting to BigQuery, please reach out to us by using the intercom messenger icon at the bottom right of the app.
If you have other questions related to BigQuery, you can join our Slack community and ask questions in the #bigquery channel.