Getting Started with dbt
What is dbt?
dbt is an abbreviation of data build tool. It is primarily a SQL-based transformation workflow, supported by YAML, that allows teams to collaborate on analytics code whilst applying software engineering best practices like modularity, portability, CI/CD, testing, and documentation.
dbt is available as a CLI in the form of dbt Core, or as a paid-for SaaS product in the form of dbt Cloud.
dbt Core gives you a CLI for creating analytical models and tests, whereas dbt Cloud adds features such as auto-generated documentation at run-time and scheduled workflows.
dbt materializes the results of queries as physical tables or views, taking away much of the table maintenance – managing schemas, transactions and the other activities that occupy data teams. You can decompose data models into reusable, modular components to be shared across different models, as well as leverage metadata for optimisation.
SQL in dbt can utilise Jinja, which helps to reduce repetitive work through macros and packages. The Directed Acyclic Graph (DAG) that dbt builds not only determines the order in which objects are executed, but also propagates changes in a model to its dependents. As the files dbt uses are largely SQL and YAML, they are easy to source control. The power of dbt comes from the build engine itself, which creates the DAG, runs data quality tests on the data, hosts embedded documentation of models and fields, and makes it easy to define snapshots and maintain seed values and files for repeatable testing.
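As a small, hypothetical illustration (not part of the project we build below), a macro like the following could append a row limit only when running against a dev target, so that logic doesn’t have to be repeated in every model:
{% macro limit_in_dev(row_count=1000) %}
{# Only limit rows when the active target is called 'dev' #}
{% if target.name == 'dev' %} LIMIT {{ row_count }} {% endif %}
{% endmacro %}
A model’s SELECT could then end with {{ limit_in_dev() }} and behave differently per environment.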
It can connect to and use data from multiple systems such as Postgres, Snowflake, Databricks and many others. For the purposes of this blog post, we’ll use Databricks as our source and dbt Core as our dbt instance.
We’re assuming you’ve already got a Databricks workspace available, and that you’re familiar with Git and tools like VS Code.
Installing dbt locally
For local development, dbt Core is installed and configured using Python pip commands:
pip install dbt-core
will install dbt Core without any connection configuration, whereas something like
pip install dbt-databricks
will install dbt Core as well as the connection configuration (adapter) for Databricks; adapters for other platforms are also available.
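To check the install worked, running:
dbt --version
lists the installed dbt Core version along with any adapter plugins it found.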
Initialise dbt
Create a new repo, or re-use an existing one, and clone it locally using VS Code.
Within your project repo, open a new terminal window and use the following command to initialise the repo with the base dbt structure for a given project:
dbt init {project_name}
There’ll be a series of prompts in the terminal asking for the Databricks connection details. You’ll need to provide the host, the HTTP path, a personal access token and the schema that objects will be deployed to.
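These details don’t live in the project itself; dbt writes them to a profiles.yml file, by default under ~/.dbt/. As a rough sketch – the host, http_path, token and schema values below are placeholders – the resulting Databricks profile looks something like:
{project_name}:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: your_schema
      host: your-workspace.cloud.databricks.com
      http_path: /sql/1.0/warehouses/your_warehouse_id
      token: your_personal_access_token
      threads: 1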
You’ll then be greeted with a folder structure that, for recent dbt versions, looks roughly like this:
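{project_name}/
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
├── tests/
├── dbt_project.yml
└── README.md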
In the terminal, navigate to your new project folder using
cd {project_name}
And now the real fun begins – creating data models!
Creating your first model
For this, we’re going to be using some sample data that comes with Databricks: Flights. In a Databricks notebook, bring the data through and register it as Delta tables:
# Read the departure delays CSV from the Databricks sample datasets and register it as a table
delaysDF = (spark.read.option("header", True).csv('dbfs:/databricks-datasets/flights/departuredelays.csv'))
delaysDF.write.saveAsTable("Flight_Delays")

# Read the tab-delimited airport codes file and register it as a table
airportCodes = (spark.read.option("header", True).option("delimiter", "\t").csv('dbfs:/databricks-datasets/flights/airport-codes-na.txt'))
airportCodes.write.saveAsTable("Airport_Codes")
Within the models folder of the project, we’re going to add a subfolder called flights and three sub-folders underneath that called bronze, silver and gold – referencing the medallion architecture.
We’re then going to add the following files to the bronze folder:
bronze_airports.sql
SELECT * FROM Airport_Codes
bronze_delays.sql
SELECT * FROM Flight_Delays
The silver folder:
silver_airports.sql
SELECT
IATA AS AirportCode,
City,
State,
Country
FROM {{ ref('bronze_airports') }}
silver_delays.sql
SELECT
CAST(date AS int) AS TripId,
CAST(concat('2014-', LEFT(date,2),'-', RIGHT(LEFT(date,4),2), ' ', LEFT(RIGHT(date,4),2), ':', RIGHT(date,2)) AS timestamp) AS LocaleDate,
CAST(delay AS int) AS Delay,
CAST(distance AS int) AS Distance,
origin AS OriginAirport,
destination AS DestinationAirport
FROM {{ ref('bronze_delays') }}
And the gold folder:
gold_airports.sql
SELECT
monotonically_increasing_id() AS AirportKey,
AirportCode,
City,
State,
Country
FROM {{ ref('silver_airports') }}
gold_delays.sql
SELECT
Delays.TripId,
Delays.LocaleDate,
Delays.Delay,
Delays.Distance,
OriginAirport.AirportKey AS OriginAirportKey,
DestinationAirport.AirportKey AS DestinationAirportKey
FROM {{ ref('silver_delays') }} AS Delays
JOIN {{ ref('gold_airports') }} AS OriginAirport ON OriginAirport.AirportCode = Delays.OriginAirport
JOIN {{ ref('gold_airports') }} AS DestinationAirport ON DestinationAirport.AirportCode = Delays.DestinationAirport
In the terminal, if we execute:
dbt run
It’s now going to compile the scripts and, by default, execute them as views in the correct order. But how does it determine the order? Through the DAG. In the scripts we’ve got {{ ref('another_fileName') }} references, which tell dbt that the referenced model is a dependency of the current script.
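Under the hood, dbt swaps each ref() out for the fully qualified relation name when it compiles the project (you can see the results in the target/compiled folder after a run). Assuming the schema supplied during dbt init was called my_schema, the compiled version of silver_airports.sql would look something along the lines of:
SELECT
IATA AS AirportCode,
City,
State,
Country
FROM my_schema.bronze_airports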
To get the queries materialized as tables instead, we can add the materialization type to dbt_project.yml, under the project name (project_x in this example):
models:
  project_x:
    +materialized: table
Or it can be controlled for each script using a config block at the top of the file:
{{ config(materialized='table') }}
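The two approaches can also be combined at folder level. As a sketch – assuming, as above, the project is called project_x and uses the flights/bronze/silver/gold sub-folders created earlier – the bronze layer could stay as views while gold is materialized as tables:
models:
  project_x:
    flights:
      bronze:
        +materialized: view
      gold:
        +materialized: table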
Documenting your model
Documentation is typically something that data professionals are not very good at, as it usually lives in a separate system from the code, and therefore away from where we work. dbt makes creating and maintaining documentation easier by including it as a code artefact.
This sits alongside the technical code in a schema.yml file, which is also where tests are authored (more on that below).
version: 2

models:
  - name: gold_delays
    description: Contains the average delay for each flight and airport in 2014
    columns:
      - name: TripId
        description: Primary key
      - name: LocaleDate
        description: Local date of the flight
      - name: Delay
        description: Delay of the flight
      - name: Distance
        description: Distance of the flight
      - name: OriginAirportKey
        description: Foreign Key for the Origin airport
      - name: DestinationAirportKey
        description: Foreign Key for the Destination airport
As we build out our model, we can easily add descriptions for each component of the model alongside the code.
If we run dbt docs generate in the terminal, dbt generates the documentation for your project: a JSON file with rich documentation built from the contents of the schema file as well as the model code itself.
Running dbt docs serve will then launch the documentation as a local website.
Testing your model
For those of you familiar with Great Expectations, or even table constraints: the tests that dbt runs are run against the data itself rather than against the processes that produce it.
Tests are added to the schema file and are associated with a column.
version: 2

models:
  - name: gold_delays
    description: Contains the average delay for each flight and airport in 2014
    columns:
      - name: TripId
        description: Primary key
        tests:
          - unique
          - not_null
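unique and not_null are two of the generic tests dbt ships with; accepted_values and relationships are also built in. As a sketch, a relationships test added under the same columns list could check that every OriginAirportKey in gold_delays exists in gold_airports:
      - name: OriginAirportKey
        description: Foreign Key for the Origin airport
        tests:
          - relationships:
              to: ref('gold_airports')
              field: AirportKey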
Using the schema file like this is the generic way of applying tests. You can also create more specific (singular) tests by writing SQL and storing those test scripts in the tests folder; a test passes when the query returns no rows:
SELECT
TripId,
SUM(Distance) AS TotalDistance
FROM {{ ref('gold_delays' )}}
GROUP BY 1
HAVING NOT(TotalDistance >= 0)
Executing dbt test will run those tests and produce an output in the terminal.
Our TripId test failed, which means we’ll need to inspect the logic for producing the column and make sure that it is unique, so that the test passes.
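One way to do that – a sketch rather than the only option, and not necessarily what ends up in the final repo – is to generate TripId in silver_delays with a window function instead of casting the date column:
SELECT
ROW_NUMBER() OVER (ORDER BY date, origin, destination) AS TripId,
CAST(concat('2014-', LEFT(date,2),'-', RIGHT(LEFT(date,4),2), ' ', LEFT(RIGHT(date,4),2), ':', RIGHT(date,2)) AS timestamp) AS LocaleDate,
CAST(delay AS int) AS Delay,
CAST(distance AS int) AS Distance,
origin AS OriginAirport,
destination AS DestinationAirport
FROM {{ ref('bronze_delays') }}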
Summary
Using the Flights data, we’ve created a basic model, documented it and run some tests against it.
What we haven’t done is explore some of dbt’s more complex features such as Snapshots, Incremental loading, deployment, scheduling, and seed files.
For what we’ve demonstrated in this blog post, dbt has made the process of creating, documenting, and testing a data model really easy. We’ll soon be exploring how well it handles more complex use cases – namely Incremental loading and Snapshots – in a follow-up series of posts.
Hope you’ve enjoyed reading and following along – the full code for this post can be found on GitHub.