Now that DAIS 2025 is behind us, it's time to get back to work. You know the old saying "go big or go home"? Well, after eating 5 burgers in 6 days, I think I took that a bit too literally. But enough chit-chat. Time to switch back to Finnish mode - talk only when you have something to say. And this time, it's time to deep-dive into the world of agent authentication.
Managing access rights has always been a challenge
Providing proper access rights management for data has always been critical, and the same goes for agents. You want robust and transparent access control to your data. Previously, this was handled via a dedicated service principal, which is a solid approach. The problem is scalability and users' differing permission requirements. For example, the Nordic sales team should see only their region's data, the Americas team theirs, and so on. So you can imagine that creating thousands of agents with slightly different permissions isn't a viable solution: it generates significant costs, and the ROI would be hard to justify. Imagine thousands of employees having fully identical agents, with the only difference being their access rights. Doesn't sound like a convenient setup, right? But if they could all use the same agent, which automatically and securely enforces the end user's access rights, the problem would be solved. And that's what on-behalf-of-user authentication is all about.
On-behalf-of-user authentication on Databricks
Databricks has enabled on-behalf-of-user authentication in beta, finally! This means agents can now access Databricks resources using the identity of the Databricks end user who invoked the agent. It opens so many new doors, as previously this was done using a system service principal, which was quite limited from an access rights perspective. All deployed agents in Mosaic AI have been working under a system service principal, and it's only possible to grant that principal permissions on dedicated resources during deployment. Unfortunately, you couldn't grant any permissions to that SP on your own (using the App ID), which sometimes made advanced agent permission management problematic. It worked neatly for many use cases, but now that Genie has been harnessed as your own personal data analyst, the authentication process needs to be optimized as well. Luckily, there are now more resources available for deployment, and thanks to on-behalf-of-user authorization, those can be utilized dynamically.
User-level authentication is a fairly straightforward process, but there are a couple of really important things to keep in mind. First, this feature is still in beta, so there might be bugs or issues. That means no YOLO production solutions or commits, sorry! Especially when dealing with sensitive data, be extra careful. Another key point is where authentication happens. Previously, you might have handled SDK authentication in __init__, but here it has to be done inside the predict function to ensure secure usage and prevent one user's session credentials from spilling into another's. So, some code refactoring might be needed. Let's take a look at how it works, including the whole deployment process with an example agent solution.
Example code to provide clarity
I created a simple function-calling agent with one Genie space and a random weather tool to get you started. The repo can be found here: ikidata/agent-authentication-demo: A demo agent solution on Databricks behalf-of-user authentication. Remember to change your Genie space ID and description in the config.py file. If you're wondering why it's not using YAML, there are some hiccups with path management during deployment, and this was my preferred solution to avoid extra manual work or latency due to multipath checking. There are a couple of steps where you have to add the auth code, so let's get started.
Here you can see example code for setting up the authentication process.
from databricks.sdk import WorkspaceClient
from databricks.sdk.credentials_provider import ModelServingUserCredentials

def get_workspace_client(auth_type: str) -> WorkspaceClient:
    """Create a WorkspaceClient using either system SP or end-user credentials."""
    if auth_type == 'system':
        # Default strategy: the agent's system service principal
        w = WorkspaceClient()
    elif auth_type == 'user':
        # On-behalf-of-user strategy: the identity of the invoking end user
        w = WorkspaceClient(credentials_strategy=ModelServingUserCredentials())
    else:
        raise ValueError("auth_type must be 'system' or 'user'")
    return w
In the functions.py file, you can find an easy-to-use function for both system- and user-level SDK authentication. Here you can see the new credentials strategy that needs to be used for user-level authentication.
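For illustration, here's how the helper would be called (a hypothetical usage, not copied from the repo):

w_sys = get_workspace_client('system')   # shared system SP permissions; fine at startup
w_user = get_workspace_client('user')    # invoking user's permissions; predict/predict_stream only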
In genie_functions.py, you can see I'm creating new SDK authentication on a user level. The next part is IMPORTANT. You can see system service principal authentication is done in __init__ (all users can use those permissions, like accessing the LLM model) and user-level authentication in the predict/predict_stream function. That's how you can ensure user-level permissions aren't leaked to other users, while common access rights remain easily available to all agent users. So once again, whenever using on-behalf-of-user authentication, ALWAYS use it inside the predict/predict_stream function. And yes, predict_stream was kept as a really simplified version for this solution, just "streaming" messages instead of chunks.
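To make the split concrete, here's a minimal sketch of the pattern (the class name and wiring are illustrative, not copied from the repo), assuming an MLflow pyfunc-style agent:

import mlflow
from databricks.sdk import WorkspaceClient
from databricks.sdk.credentials_provider import ModelServingUserCredentials

class DemoAgent(mlflow.pyfunc.PythonModel):
    def __init__(self):
        # System SP auth: shared resources every agent user may access,
        # such as the LLM serving endpoint. Safe to create once.
        self.w_system = WorkspaceClient()

    def predict(self, context, model_input, params=None):
        # On-behalf-of-user auth: created per request, INSIDE predict,
        # so one user's permissions never spill into another's session.
        w_user = WorkspaceClient(
            credentials_strategy=ModelServingUserCredentials()
        )
        # ... query Genie / SQL / files with w_user, call the LLM with self.w_system ...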
Demonstration of authentication logic

Here you can see the logic. It's easiest to use system SP auth for the "common" components, ensuring easy management & monitoring. You want to monitor how much the agent is using SQL warehouses and LLM models, and how actively users are interacting with the agent. This approach streamlines permission management significantly, especially with external connections via Unity Catalog (OAuth). Access to the data occurs at the user level using on-behalf-of-user authorization, ensuring users can only see the data they are permitted to access. Using Genie as a "subagent" is a powerful solution in this context. For example, when asking something from Genie, monitoring will correctly display the user (rather than an unknown SP ID), and Genie respects user access rights and row/column-level security. Quite neat, right?
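As a sketch of what such a Genie "subagent" call can look like with a user-scoped client (assuming the Genie conversations API in the Python SDK; the space ID is a placeholder):

from databricks.sdk import WorkspaceClient
from databricks.sdk.credentials_provider import ModelServingUserCredentials

def ask_genie(space_id: str, question: str):
    # User-level client: Genie runs the query with the invoking user's
    # permissions, so row/column-level security holds and monitoring
    # attributes the request to the real user.
    w_user = WorkspaceClient(credentials_strategy=ModelServingUserCredentials())
    # Start a conversation and block until Genie has answered; the returned
    # message carries the answer in its attachments (text and/or SQL results).
    return w_user.genie.start_conversation_and_wait(space_id, question)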
Regarding Vector Search indexes (RAG solutions), Mosaic AI doesn't support row- and column-level permissions directly. However, it is possible to implement these permissions using custom solutions (one sketch below), which is why it has been positioned in the middle.
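One possible custom approach is filtering on index metadata with an entitlement resolved for the invoking user. A minimal sketch, assuming a "region" metadata column exists in the index; all names here are placeholders:

from databricks.vector_search.client import VectorSearchClient

def search_for_user(endpoint_name: str, index_name: str,
                    query: str, user_region: str):
    vsc = VectorSearchClient()
    index = vsc.get_index(endpoint_name=endpoint_name, index_name=index_name)
    # Emulate row-level security by restricting results to the
    # region the invoking user is entitled to see.
    return index.similarity_search(
        query_text=query,
        columns=["id", "content", "region"],
        filters={"region": user_region},
        num_results=5,
    )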
Unfortunately, Databricks Apps don't yet support on-behalf-of-user authorization for model serving endpoints; "only" SQL, files, and Genie spaces are currently supported. I'm eagerly waiting for this to be added, as it would enable the development of even cooler and more sophisticated end-to-end agent solutions on Databricks. That said, it's already possible to build some neat stuff inside the app. And yes, once model deployment support is added, it's time to upgrade my old multi-Genie solution as well. (ikidata/multi-genie-agent: This is a showcase repository for the multi-genie agent solution).
Databricks now supports Model Context Protocol (MCP) servers as well, but that’s a topic for the next article. You can't have all the treats at the same time 😉 MCP essentially standardizes REST APIs, which is why I prefer to keep SDK Genie functions visible. It's important to understand the core logic of how things work before deciding to use wrappers on top of them.
Here you can see which resources and API scopes are currently supported:
| Databricks Resource | API Scopes Required |
| --- | --- |
| Vector Search Index | serving.serving-endpoints, vectorsearch.vector-search-endpoints, vectorsearch.vector-search-indexes |
| Model Serving Endpoint | serving.serving-endpoints |
| SQL Warehouse | sql.statement-execution, sql.warehouses |
| UC Connections | catalog.connections |
| Genie Space | dashboards.genie |
Deployment process
And then it's time to deploy the agent to Mosaic AI Model Serving and start using it! In the example repo, you can find the code in the "driver" notebook. Below you can see the deployment code with the new authentication features. Remember to set up both system and user authorization policies.
import mlflow
from mlflow.models.auth_policy import SystemAuthPolicy, UserAuthPolicy, AuthPolicy

# Here you can see which resources are supported currently
from mlflow.models.resources import (
    DatabricksVectorSearchIndex,
    DatabricksServingEndpoint,
    DatabricksSQLWarehouse,
    DatabricksFunction,
    DatabricksGenieSpace,
    DatabricksTable,
    DatabricksUCConnection
)

###############
# System Service Principal Auth
###############
resources = [DatabricksServingEndpoint(endpoint_name="skynet")]
system_auth_policy = SystemAuthPolicy(resources=resources)

###############
# On-behalf-of-user authentication.
# Remember to specify the API scopes needed for the agent to access Databricks resources
###############
user_auth_policy = UserAuthPolicy(
    api_scopes=[
        "dashboards.genie"
    ]
)

with mlflow.start_run():
    logged_agent_info = mlflow.pyfunc.log_model(
        name="agent",
        ...
        auth_policy=AuthPolicy(
            system_auth_policy=system_auth_policy,
            user_auth_policy=user_auth_policy
        )
    )
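After logging the model, the usual next step is registering it to Unity Catalog and deploying it. Here's a sketch assuming the databricks-agents package, with a placeholder three-level model name:

import mlflow
from databricks import agents

# Agent models live in Unity Catalog
mlflow.set_registry_uri("databricks-uc")

UC_MODEL_NAME = "main.default.demo_agent"  # placeholder: catalog.schema.model

# Register the logged agent version to Unity Catalog
uc_model_info = mlflow.register_model(
    model_uri=logged_agent_info.model_uri,
    name=UC_MODEL_NAME,
)

# Deploy to a Mosaic AI Model Serving endpoint; the auth policies
# logged above travel with the model version.
agents.deploy(UC_MODEL_NAME, uc_model_info.version)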
Once you're done, remember to deactivate the agent. Keep in mind that activating agent inference tables automatically creates a workflow to handle history data management. This workflow runs every 15 minutes, so you may need to adjust the update cycle or deactivate it if necessary; otherwise it can generate surprise costs quickly. Note that this workflow won't be deactivated when the agent is.
And then it's time to use agent authentication on Databricks
It's important to keep the overall system straightforward, understandable, and secure. For agents, it's easiest to keep authentication at the agent service principal level as much as possible. For data, user-specific permissions are desirable to make the agent more versatile and efficient. It's beneficial to utilize Genie as an independent analyst, making the whole system very effective. Genie respects user access rights and runs queries in real time, ensuring that the data used is always up to date. This makes it a significantly more flexible solution compared to vector search indexes, where you need to balance data update cycles. As you can see, on-behalf-of-user authorization is really simple and effective to implement in your agent solutions. It's like refactoring code while waiting for a flight at Munich Airport 😄 But remember that this is in beta, so be extra careful with new agentic solutions. Using personal access rights is always a risky business, so rigorous testing, or sticking to the more secure service principal authentication, is advisable.
Here's a link to the GitHub repository, which contains example code for the entire deployment process:
You can always read a bit more about Multi-Genie agent solution possibilities here (code included):
Official Databricks documentation

Written by Aarni Sillanpää
Be safe, be secure