Skip to main content

Introduction

Key things to know

What is column-level lineage (CLL)?

Column-level lineage is a more nuanced visualization of your DAG that accounts for the connections between columns in each table in your data warehouse, not just tables.

How can I use CLL right now to improve my dbt workflow?
  • Debugging: CLL can dramatically accelerate the debugging process by giving you visibility into how a column is calculated without having to go line-by-line through each model.
  • Knowledge sharing & discussion: CLL can make it easier for data team members to get up to speed on a new data table. We’ve found it makes a great visual for bug bashes, demos, and a variety of internal meeting use cases.
How is CLL calculated?

Most vendors who offer CLL use batch metadata syncs from your cloud warehouse provider to generate it. This is effective but quite slow. As a result, lineage is often only recalculated once per day.

Turntable takes a different approach. We convert your dbt code into sql abstract syntax (using sqlglot) and leverage our own algorithm to recurse the syntax tree and build CLL.

What connection types are possible between columns?

With table lineage, tables are either connected or they are not. With CLL, connections are more nuanced. Here are the main types, and how they are represented in the extension:

CategoryEdge typeDescriptionCommon sql keywords
Calculation logicSelect as isColumn is selected exactly as isSELECT
Calculation logicSelect with modificationsColumn is selected but with transformations, such as changing, type, changing alias, or aggregatingSELECT, CAST, AS, SUM, COUNT
Values includedFilterRows of a given column are excluded via a predicated statementWHERE, HAVING, QUALIFY
Values includedJoin KeyRows of a given column are excluded based on join keysON, USING
Values includedGroup byRows of a given column are excluded based on column groupingGROUP BY
caution

One important caveat: two columns can — and often are — connected by multiple edges. This primarily occurs primarily when a query includes multiple CTEs. In these cases, we take the union of all edge types involved in our visualization. More on this on the next page.

What are the limitations of your approach?
LimitationImpactMitigation
We depend on the catalog.json file to help with resolve column references.We may show missing/incomplete columns in some situations.- Make sure you have the credentials to run dbt docs generate for your project.
- Make sure you materialize models (i.e. dbt run) and their ancestors if you make code updates.
If we can’t parse your sql, we can’t generate column lineage.We may only have table-level lineage for some tables.- Qualify your columns (e.g. table.id instead of id when there are multiple joins).
- Report any issues to us. We can work to make library changes upstream.
We currently cannot correctly parse situations where your table reference is a function (e.g. UNNEST clauses).We may show incomplete lineage in these cases.- Use SELECT statements where possible.
- Report any issues to us. We will prioritize this once we have enough demand.
We only support Bigquery and Snowflake.You may not be able to use this feature if your company uses another CDW.- Drop us a line in our company slack or simply click the chat with us button to let us know. We’d love to your help in prioritizing which dbs to support next!
What improvements do you plan to make to this over time?
  • Better dialect coverage: we plan to expand our solution to Databricks, Redshift, Starburst/Trino, and others going forward.
  • Better syntax coverage: our goal is 99+% coverage of sql syntax in our supported dialects
  • New features based on CLL: we are actively working on new features that leverage CLL for a better experience, including breaking change detection and column auto-renaming.

Worked example

note

Below is the code for an example dbt model called current_pipeline. See the toggle below for its column-level lineage. We’ve hidden it in case you want to try to work through the answer yourself first!

models/marts/sales/current_pipeline.sql
select 
r.pipeline_id,
r.stage_id,
s.order_nr as stage_order,
coalesce(r.funnel_size,0) as funnel_size
from {{ ref('base_pipedrive__stage') }} s
left join {{ ref('ranked_deals') }} r on r.stage_id = s.id
where r.pipeline_id = 1
order by s.order_nr desc
Answer
From columnConnection type(s)To column
ranked_deals.pipeline_idSelect as is, Filtercurrent_pipeline.pipeline_id
ranked_deals.pipeline_idFiltercurrent_pipeline.stage_id
ranked_deals.pipeline_idFiltercurrent_pipeline.stage_order
ranked_deals.pipeline_idFiltercurrent_pipeline.funnel_size
ranked_deals.stage_idSelect as is, Join keycurrent_pipeline.stage_id
ranked_deals.stage_idJoin keycurrent_pipeline.pipeline_id
ranked_deals.stage_idJoin keycurrent_pipeline.stage_order
ranked_deals.stage_idJoin keycurrent_pipeline.funnel_size
ranked_deals.funnel_sizeModifycurrent_pipeline.funnel_size
base_pipedrive.order_nrModifycurrent_pipeline.stage_order
base_pipedrive__stage.idJoin keycurrent_pipeline.pipeline_id
base_pipedrive__stage.idJoin keycurrent_pipeline.stage_id
base_pipedrive__stage.idJoin keycurrent_pipeline.stage_order
base_pipedrive__stage.idJoin keycurrent_pipeline.funnel_size