Introduction

Key things to know

What is column-level lineage (CLL)?

Column-level lineage is a more nuanced visualization of your DAG that accounts for the connections between columns in each table in your data warehouse, not just tables.

How can I use CLL right now to improve my dbt workflow?

Debugging: CLL can dramatically accelerate the debugging process by giving you visibility into how a column is calculated without having to go line-by-line through each model.
Knowledge sharing & discussion: CLL can make it easier for data team members to get up to speed on a new data table. We’ve found it makes a great visual for bug bashes, demos, and a variety of internal meeting use cases.

How is CLL calculated?

Most vendors who offer CLL use batch metadata syncs from your cloud warehouse provider to generate it. This is effective but quite slow. As a result, lineage is often only recalculated once per day.

Turntable takes a different approach. We convert your dbt code into sql abstract syntax (using sqlglot) and leverage our own algorithm to recurse the syntax tree and build CLL.

What connection types are possible between columns?

With table lineage, tables are either connected or they are not. With CLL, connections are more nuanced. Here are the main types, and how they are represented in the extension:

Category	Edge type	Description	Common sql keywords
Calculation logic	Select as is	Column is selected exactly as is	SELECT
Calculation logic	Select with modifications	Column is selected but with transformations, such as changing, type, changing alias, or aggregating	SELECT, CAST, AS, SUM, COUNT
Values included	Filter	Rows of a given column are excluded via a predicated statement	WHERE, HAVING, QUALIFY
Values included	Join Key	Rows of a given column are excluded based on join keys	ON, USING
Values included	Group by	Rows of a given column are excluded based on column grouping	GROUP BY

caution

One important caveat: two columns can — and often are — connected by multiple edges. This primarily occurs primarily when a query includes multiple CTEs. In these cases, we take the union of all edge types involved in our visualization. More on this on the next page.

What are the limitations of your approach?

Limitation	Impact	Mitigation
We depend on the `catalog.json` file to help with resolve column references.	We may show missing/incomplete columns in some situations.	- Make sure you have the credentials to run `dbt docs generate` for your project. - Make sure you materialize models (i.e. `dbt run`) and their ancestors if you make code updates.
If we can’t parse your sql, we can’t generate column lineage.	We may only have table-level lineage for some tables.	- Qualify your columns (e.g. `table.id` instead of `id` when there are multiple joins). - Report any issues to us. We can work to make library changes upstream.
We currently cannot correctly parse situations where your table reference is a function (e.g. UNNEST clauses).	We may show incomplete lineage in these cases.	- Use SELECT statements where possible. - Report any issues to us. We will prioritize this once we have enough demand.
We only support Bigquery and Snowflake.	You may not be able to use this feature if your company uses another CDW.	- Drop us a line in our company slack or simply click the `chat with us` button to let us know. We’d love to your help in prioritizing which dbs to support next!

What improvements do you plan to make to this over time?

Better dialect coverage: we plan to expand our solution to Databricks, Redshift, Starburst/Trino, and others going forward.
Better syntax coverage: our goal is 99+% coverage of sql syntax in our supported dialects
New features based on CLL: we are actively working on new features that leverage CLL for a better experience, including breaking change detection and column auto-renaming.

Worked example

note

Below is the code for an example dbt model called current_pipeline. See the toggle below for its column-level lineage. We’ve hidden it in case you want to try to work through the answer yourself first!

models/marts/sales/current_pipeline.sql
select 
  r.pipeline_id,
  r.stage_id,
  s.order_nr as stage_order,
  coalesce(r.funnel_size,0) as funnel_size
from {{ ref('base_pipedrive__stage') }} s
left join {{ ref('ranked_deals') }} r on r.stage_id = s.id
where r.pipeline_id = 1
order by s.order_nr desc

Answer

From column	Connection type(s)	To column
ranked_deals.pipeline_id	Select as is, Filter	current_pipeline.pipeline_id
ranked_deals.pipeline_id	Filter	current_pipeline.stage_id
ranked_deals.pipeline_id	Filter	current_pipeline.stage_order
ranked_deals.pipeline_id	Filter	current_pipeline.funnel_size
ranked_deals.stage_id	Select as is, Join key	current_pipeline.stage_id
ranked_deals.stage_id	Join key	current_pipeline.pipeline_id
ranked_deals.stage_id	Join key	current_pipeline.stage_order
ranked_deals.stage_id	Join key	current_pipeline.funnel_size
ranked_deals.funnel_size	Modify	current_pipeline.funnel_size
base_pipedrive.order_nr	Modify	current_pipeline.stage_order
base_pipedrive__stage.id	Join key	current_pipeline.pipeline_id
base_pipedrive__stage.id	Join key	current_pipeline.stage_id
base_pipedrive__stage.id	Join key	current_pipeline.stage_order
base_pipedrive__stage.id	Join key	current_pipeline.funnel_size

Key things to know​

Worked example​

Key things to know

Worked example