The traditional ETL, which stands for Extraction, Transformation, and Loading, has now evolved onto ELT. Data is Extracted from source systems, Loaded into the data warehouse and then Transformed within the data warehouse. Dataform helps you manage that last part, the transformation in your data warehouse.
Imagine you want to build the data stack of an ecommerce shop. This business has three data sources: Shopify for the web store, Stripe to process payments, and Salesforce as their CRM. You want to use all that data to build reports to track KPIs, create dashboards and conduct ad hocs analysis to understand the business. One of the first tasks would be to create a dashboard in a BI tool with all your customers information for everyone in the company to know about your customers. This dashboard will have all the data you have about your customers.
In this example, you want to use all the data you have (coming from Shopify, Stripe, and Salesforce) to create a unified dashboard.
The data warehouse is the epicenter of modern stacks. Raw data from across the company is centralized in the warehouse. Data is transformed in the warehouse. BI and analytics tools read data from the warehouse.
Most businesses today will use a cloud data warehouses like Google BigQuery.
The first step in building a data stack is to Extract raw data from all sources and load it in the data warehouse. You can achieve this with third party tools, or by writing custom scripts.
The data loaded in your warehouse is raw and unprocessed. Each of those sources will generate dozens of tables in your data warehouse. At this stage, the data is not really usable for analytics. Answering simple questions like “Which customers order the most products?" would probably take several hours and writing complex queries.
The next step is to transform the data. You want to turn the hundreds of tables of raw data loaded in your warehouse into a single source of truth that will represent your business.
For our example of creating a customer dashboard, you will want to join the data from the different sources together, normalize the fields and filter bad data to create a unique customers table. That table will contain all the information you have about your customers and will let you answer questions like “Which customers order the most products?” very quickly.
This customers table will be the table you will use for your dashboards. That means that everyone in the company will see that data and rely on it to make decisions. As a result this data needs to be tested so that it can be trusted. It needs to be refreshed frequently so your dashboards have the latest information. It needs to be documented for everyone to know what the different fields mean.
After your data is transformed, you can use BI and other analytics tools to build dashboards and conduct analysis.