Official Databricks documentation: Bring your own data lineage - Azure Databricks | Microsoft Learn

I don’t know about you, but I’ve lost count of how many times I’ve been frustrated by missing data lineage. There’s nothing worse than having no clue where your data comes from. Or who to even ask about it... It sounds like a simple problem, but in reality it can be such a nightmare, especially in large organizations with multiple teams and data sources. That’s why I was so excited when Databricks announced the ability to create full data lineage by bringing your own data lineage! The possibilities here are neat, and with REST API support it’s possible to automate so much of the process. No more chasing people down or opening endless tickets; now you can actually pull lineage information from external systems automatically. Of course, not every external system supports this out of the box (missing integration endpoints), but applying the Pareto principle, even covering the majority makes a big impact.
Let’s look at how to get started together and analyze the new opportunities in this space. The goal is to enable business end-users who access Unity Catalog in Databricks to easily view data assets, see their lineage, and gain a clearer understanding of the bigger picture. For data developers, it becomes straightforward to trace where the data originates and how it flows from source systems all the way to dashboards or operational systems. When there are odd values or data quality issues (I’d love to see a company that has solved data quality issues once and for all), this approach makes it possible to pinpoint the exact tables the problematic data came from. From there, you can methodically check, step by step, moving from downstream towards upstream, to identify where the data is being corrupted. It’s a truly efficient way to maintain data integrity.
A quick glance is all it takes

We have now evolved from medallion architecture tables to full-scale, end-to-end data flow! As you can see, it’s possible to include source systems like SAP and Kafka (or use "OTHER"), all the way to the endpoints, which are usually dashboards or operational systems. Or Databricks' own Lakebase (why look further than what you have nearby). This makes modeling truly enjoyable and much easier to understand. Entire data flows and projects can be modeled with full lineage tracking. And yes, you can even create external metadata -> external metadata lineage. So there’s really no excuse not to have complete lineage available in one place: Databricks.
But like everything in life, there are caveats. Yes, you can add lineage at the column level and across multiple objects simultaneously. But the more you add, the messier the picture becomes. Once you have many-to-many lineages, it can get quite difficult to understand what’s really happening. Remember the golden balance here - don’t try to add too much, but focus smartly on the most important data assets. By default, lineage details are hidden behind plus signs to provide a clean user experience. But when you have 50 lineages in one table and opening one lineage reveals another 50, you get the picture. Avoid overusing this feature - moderation in all things, just like eating hamburgers at DAIS 😅
Enough talk - show me something concrete!
The official documentation offers user-friendly instructions on how to handle this process manually. But if you know me, you know everything needs to be automated. Luckily, this can be done quite easily since REST API support is there. I noticed that even when AI/BI dashboards are logged in system tables, their lineages don’t appear in the UC visualization (bug or feature?). Inspired by this, I quickly created a demo notebook to automate the entire process. You can apply the same logic to other external systems to crawl metadata and then insert it into Databricks. Practically, it involves three steps (sketched in code just below the list):
Fetch metadata
Create external metadata in Unity Catalog
Create lineages
*Remember to check that you have the necessary permissions in Unity Catalog.
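To make steps 2 and 3 concrete, here’s a minimal Python sketch using plain REST calls with a personal access token. Treat everything here as an assumption: the endpoint paths, payload fields, and system/entity type values are my reading of the Public Preview docs and may change, so verify them against the External Metadata and External Lineage API reference before relying on this.

```python
import requests

# Assumptions: the workspace URL, token, and endpoint paths below are placeholders -
# check them against the External Metadata / External Lineage API reference (Public Preview).
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi..."  # personal access token with the required Unity Catalog privileges
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def create_external_metadata(name: str, system_type: str, entity_type: str, **extra) -> dict:
    """Step 2: register an external object (e.g. a Kafka topic or an SAP table) in Unity Catalog."""
    payload = {"name": name, "system_type": system_type, "entity_type": entity_type, **extra}
    resp = requests.post(
        f"{HOST}/api/2.0/lineage-tracking/external-metadata",  # assumed path - verify in the docs
        headers=HEADERS, json=payload,
    )
    resp.raise_for_status()
    return resp.json()


def create_external_lineage(source: dict, target: dict) -> dict:
    """Step 3: connect an external object to a UC table, or to another external object."""
    payload = {"source": source, "target": target}
    resp = requests.post(
        f"{HOST}/api/2.0/lineage-tracking/external-lineage",  # assumed path - verify in the docs
        headers=HEADERS, json=payload,
    )
    resp.raise_for_status()
    return resp.json()


# Example: a Kafka topic feeding a bronze table
topic = create_external_metadata("orders-topic", system_type="KAFKA", entity_type="topic")
create_external_lineage(
    source={"external_metadata": {"name": topic["name"]}},
    target={"table": {"name": "main.bronze.orders"}},
)
```

Step 1, fetching the metadata, depends entirely on the source system, which is exactly why it’s the part worth automating per integration.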
I haven’t found Terraform support for this yet, but it’s pretty straightforward to set up on your own, using logic similar to what Terraform uses. Simply create a YAML or JSON file that keeps track of the currently deployed metadata state. Then, based on that file, insert, update, or delete external metadata after each crawl job. It’s really important to store all that metadata somewhere instead of relying on click-ops. Don’t get me wrong, I have nothing against click-ops as long as the data is stored properly. Everyone has their own style!
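Here’s a minimal sketch of that state-file idea, assuming both the crawled and the previously deployed metadata are plain dictionaries keyed by object name. The file format and field names are my own convention, not anything Databricks prescribes.

```python
import yaml  # pip install pyyaml


def plan_changes(state_path: str, crawled: dict) -> dict:
    """Compare the last deployed state (stored as YAML) with freshly crawled metadata,
    Terraform-style, and return what should be created, updated, and deleted."""
    try:
        with open(state_path) as f:
            deployed = yaml.safe_load(f) or {}
    except FileNotFoundError:
        deployed = {}  # first run: everything is a create

    return {
        "create": [name for name in crawled if name not in deployed],
        "update": [name for name in crawled
                   if name in deployed and crawled[name] != deployed[name]],
        "delete": [name for name in deployed if name not in crawled],
    }


def save_state(state_path: str, crawled: dict) -> None:
    """Persist the new state once the REST calls have been applied successfully."""
    with open(state_path, "w") as f:
        yaml.safe_dump(crawled, f, sort_keys=True)
```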
I’ve put together a demo notebook that automates the entire process:
Fetches all AI/BI dashboards you have access to
Checks all tables (and columns) each AI/BI dashboard is using
Creates external metadata
Automatically adds lineage for all those tables
The YAML/JSON file idea isn’t implemented here, but you get the picture. Here I’m dynamically automating the entire insertion process. This would also make a great automation task for agents. Since the Databricks side is standardized, an agent would just need to fetch metadata from the external system, make some adjustments and create external metadata with the correct lineage.
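To give a rough idea of the crawl itself, the sketch below mirrors what the notebook does: list AI/BI (Lakeview) dashboards with the Databricks SDK, look up the tables they touch in the lineage system table, then reuse the create_external_metadata / create_external_lineage helpers from the earlier sketch. It assumes it runs inside a Databricks notebook (so spark is available); the entity_type filter value is an assumption worth verifying against system.access.table_lineage in your workspace.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # inside a notebook this picks up the ambient authentication

# 1. Fetch all AI/BI (Lakeview) dashboards the caller has access to
dashboards = {d.dashboard_id: d.display_name for d in w.lakeview.list()}

# 2. Find the UC tables each dashboard reads, from the lineage system table.
#    'DASHBOARD_V3' as the entity_type for AI/BI dashboards is an assumption -
#    verify it in your workspace before trusting the results.
rows = spark.sql("""
    SELECT DISTINCT entity_id, source_table_full_name
    FROM system.access.table_lineage
    WHERE entity_type = 'DASHBOARD_V3'
      AND source_table_full_name IS NOT NULL
""").collect()

dashboard_tables = {}
for r in rows:
    if r.entity_id in dashboards:
        dashboard_tables.setdefault(r.entity_id, set()).add(r.source_table_full_name)

# 3 + 4. Create external metadata for each dashboard and one lineage edge per table,
#        using the helper functions defined in the earlier REST sketch.
for dash_id, tables in dashboard_tables.items():
    dash = create_external_metadata(dashboards[dash_id], system_type="OTHER", entity_type="dashboard")
    for table in tables:
        create_external_lineage(
            source={"table": {"name": table}},
            target={"external_metadata": {"name": dash["name"]}},
        )
```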
📦 Repo can be found here:
Ensuring the information stays up to date
Another important aspect is maintaining external lineages. As we know, source and target systems (and the objects within them) are constantly changing. However, crawling all external sources every day doesn’t really make sense. So, how do we ensure that lineage stays accurate and up to date? The answer often depends on the nature of the external data item. Some are more static and may only require a one-time crawl, while others need weekly or monthly updates. This leans into process management territory. Finding the right schedule for each data item is key. Time to start communicating with other teams - that's the fun part! It’s also essential to verify that integration options exist to fetch metadata programmatically, making automation smooth and reliable.
When it comes to crawlers, it's usually about reading metadata either through a database or via REST API. Once you have that metadata, you can either create a new object or update an existing one if it has changed. I recommend storing this metadata somewhere, like in a YAML file. Then, updates can be fetched from that YAML file and applied using the REST API.
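Tying the sketches together, one scheduled run per source could look roughly like this. The create path reuses the helpers above; I’m deliberately not guessing the update/delete verbs of the Public Preview API here, so take those from the API reference.

```python
def run_crawl_job(crawl_fn, state_path: str) -> None:
    """One scheduled run for a single external source: crawl, diff against the stored
    YAML state, apply the creations, and persist the new state.
    Reuses plan_changes/save_state and create_external_metadata from the sketches above.
    crawl_fn returns {name: {"system_type": ..., "entity_type": ..., ...}}."""
    crawled = crawl_fn()  # e.g. reads the source system's catalog via its own REST API
    plan = plan_changes(state_path, crawled)

    for name in plan["create"]:
        create_external_metadata(name=name, **crawled[name])

    # Updates and deletes follow the same pattern, but the exact REST verbs/paths
    # belong to the Public Preview API - take them from the API reference, not this sketch.
    if plan["update"] or plan["delete"]:
        print(f"Follow up needed - update: {plan['update']}, delete: {plan['delete']}")

    save_state(state_path, crawled)
```

Scheduled as a Databricks job, each source can then get its own cadence - weekly, monthly, or one-off - matching the discussion above.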
Having external data lineage on Databricks is a must
If there’s one thing I’ve learned in life, it’s that people tend to overcomplicate things - especially consultants. The simpler you make things, the better off you are.
The same principle applies here. Having full data lineage in one place is an absolute must. It makes your life so much easier in the long run. Sure, setting it up takes some effort, and yes, the technical side is a lot easier than updating company processes. But trust me, you’ll thank yourself down the road. If done right, you might even eliminate the need for third-party data catalog providers. That means smoother development and a better experience for end users.
P.S. Keep in mind this feature is currently in Public Preview, so REST API parameters might still evolve. If something breaks, make sure to check the latest documentation here: Get an external metadata object | External Metadata API | REST API reference | Azure Databricks

Written by Aarni Sillanpää
Data lineage traces your data like a treasure map