💻 GitHub Repo link to the code
I have to humbly say right from the start that it was a nice surprise how much attention the Multi-Genie-Agent Solution has picked up. While I was on holiday in Asia, I learned that the Genie conversation API had finally entered public preview. A moment I had been eagerly waiting for! At first I thought the window of opportunity would have passed by the time I got back to Finland, and that everything would already be set up and fully developed. Well, turns out that was not exactly the case. So it was time to publish a PoC version right away to inspire people and get the conversation going.
Now that the solution has gained some attention and quite a few questions have come in, it felt like the perfect moment to revisit the PoC. Yes, I finally had an evening to properly sit down and make some upgrades.
Let’s start with the most interesting questions. If you're interested in all the shiny updates, the list can be found at the end.
"How does the authentication logic truly work and could it be streamlined?"
When I put together the PoC solution, most of the features were brand new and still had some limitations. This led to, well... let's just say some unoptimized choices. But since everything evolves so quickly, most of those issues have now been addressed, making it possible to create a more robust solution.
The first major change is the transition to a dedicated Service Principal (SP) approach. I’m really fond of this approach because it opens up great possibilities for monitoring, access rights, and lifecycle management. Databricks Apps creates its own dedicated SP, which I consider the "agent" SP. This SP is used for authentication and for executing SDK commands, which now work smoothly. Authentication across the different components is handled via the SP, so it's important to ensure that the SP has the necessary access rights. Additionally, it's now possible to implement row- and column-level security on the data, and it works smoothly when Genie reads the data. Pretty cool indeed!
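To make the SP idea concrete, here's a minimal sketch of what the authentication boils down to inside the App. It assumes the Apps runtime injects the SP's OAuth credentials as environment variables (DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET, DATABRICKS_HOST), so the SDK client picks them up without any secrets in code:

```python
# Minimal sketch: inside a Databricks App, the Apps-managed SP credentials are
# available in the environment, so WorkspaceClient() authenticates as the
# dedicated "agent" SP without any PAT or host secrets.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up DATABRICKS_CLIENT_ID / _SECRET / _HOST from the App environment

# Everything executed through this client (Genie calls, model serving, SQL)
# runs as the dedicated SP, so grant that SP the needed permissions up front.
me = w.current_user.me()
print(f"Running as: {me.user_name}")  # should resolve to the App's dedicated SP
```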
At the moment, I haven't found a way to pass user-level access all the way through to Genie dynamically using Apps. There is currently a user authorization feature in public preview, but its API scope is limited to SQL (yep, I’ve tried out a bunch of things with the token). I hope this gets extended in the near future. Or if there is already a way to do this, please let me know. That said, I still prefer the dedicated SP approach, as the App functions as a standalone solution, making it a much more secure option. But there are caveats: chat history behavior tracking/optimization doesn't work properly, because Genie assumes all requests are coming from the same user, and permissions need to be added directly to the dataset.
Here is the new approach:

As you can see, Unity Catalog connections have been removed, well, except for the DevOps connection. This change is due to the static authentication method they use, which is tied to the connection creator. For instance, if I create a UC connection with my PAT and use it to run Genie, the Genie logs will always show my usage. To address this, I updated the setup to use the Genie SDK, eliminating the need for PAT and host secrets, and deprecated the REST API solution. While REST APIs are great for experimenting with new features, I prefer transitioning to the SDK once the updates become available there.
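For reference, the SDK-based Genie call roughly replaces the old REST polling like this. This is a sketch assuming the Genie API surface of the Python databricks-sdk (w.genie.*); the exact attachment attribute names may differ slightly between SDK versions:

```python
# Sketch of the SDK-based Genie call that replaces hand-rolled REST polling.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

def ask_genie(space_id: str, question: str) -> dict:
    # start_conversation_and_wait blocks until Genie has processed the message,
    # so no separate polling loop against the REST endpoints is needed.
    message = w.genie.start_conversation_and_wait(space_id, question)

    answer = {"text": None, "sql": None}
    for attachment in message.attachments or []:
        if attachment.text:
            answer["text"] = attachment.text.content
        if attachment.query:
            answer["sql"] = attachment.query.query  # the SQL Genie generated, handy for logging
    return answer
```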
"Can you improve Dash functionalities and monitoring capabilities?"
Yep yep, I know, the first iteration was whipped up pretty quickly. The entire PoC version was put together in about a day, so cut me some slack 😉 But yes, it's time to spice things up a bit. The App now monitors usage at the individual user level, significantly improving governance and robustness.
I added a blocker to prevent new questions from being asked while the agent is processing a previous request, and I replaced the print statements with logging, which now tracks activity at the user level. This makes it easy to see what's happening when there are multiple users. I considered updating the batch output to streaming, but that would have required more code changes, so I decided to leave it for next time (ok, let's be honest, I prefer batch output over streaming). Oh, almost forgot: the SQL queries used by Genie are now also included in the logging.
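As a rough illustration of the user-level logging idea, something along these lines works inside a Dash callback. I'm assuming here that Databricks Apps forwards the signed-in user's identity in an X-Forwarded-Email header (the header name is an assumption, check what your App actually receives):

```python
# Sketch: per-user logging in the Dash app, reading the forwarded identity header.
import logging
from flask import request  # Dash runs on Flask, so forwarded headers are on flask.request

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("multi_genie_app")

def current_user() -> str:
    # Assumed header name; only available inside a request context (e.g. a Dash callback).
    return request.headers.get("X-Forwarded-Email", "unknown-user")

def log_question(question: str, sql: str | None = None) -> None:
    user = current_user()
    logger.info("user=%s question=%s", user, question)
    if sql:
        logger.info("user=%s genie_sql=%s", user, sql)
```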
"How about agent performance evaluation and monitoring?"
The agent-based solution here is a simple function-calling agent that primarily depends on the quality of different Genie spaces. It's essentially integrated on top of Databricks Apps, functioning as a single solution. Therefore, I don't see the need for another abstraction layer (saving MLflow and agent deployment for more suitable use cases), as it would only complicate things unnecessarily. In the Apps section, you can find the Python library versions and the LLM model used with the agent logic.
Regarding evaluation, it relies heavily on the Genie spaces themselves; besides the LLM model, the system prompt is the only influential factor in determining how the different Genie spaces are triggered (tool descriptions are vital!!). But let's keep things simple and not overcomplicate them 😁 At first I considered chain-of-thought or ReAct approaches but quickly abandoned those ideas. The use case is so straightforward that adding more potential for hallucinations would just reduce the overall quality. It's possible to add an LLM-as-a-judge to validate results before presenting them to end users and to build a dynamic feedback loop, but that would add even more latency and hurt the user experience. Validation should be focused at the Genie space level.
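To show why the tool descriptions matter so much, here's an illustrative sketch of how two Genie spaces might be exposed as tools to a Databricks-hosted LLM through the OpenAI-compatible client. The space names, descriptions, and endpoint name are made up for the example, not the ones in the repo:

```python
# Illustrative only: descriptive tool definitions are what steer the routing.
from databricks.sdk import WorkspaceClient

client = WorkspaceClient().serving_endpoints.get_open_ai_client()

tools = [
    {
        "type": "function",
        "function": {
            "name": "sales_genie",  # hypothetical Genie space
            "description": "Answers questions about sales orders, revenue and customers.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "hr_genie",  # hypothetical Genie space
            "description": "Answers questions about employees, recruitment and human 'jobs'. "
                           "Do NOT use for data pipeline jobs.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",  # pay-per-token endpoint; check the name in your workspace
    messages=[
        {"role": "system", "content": "Route the user's question to the right Genie space."},
        {"role": "user", "content": "How many open jobs do we have?"},
    ],
    tools=tools,
)
```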
There is a case for agent optimization once there are multiple Genie spaces with closely related domains. Then the agent's thought process needs to be optimized, requiring even better context awareness. For example, when talking about "jobs", is it about human jobs or data pipelines? And once the message history gets long, it might confuse the agent because it doesn't know where to focus. Currently, messages are stored per Apps session, with temporary extended memory used during the tool call process. To tidy up the agent's memory, just click the "clear" button (it takes care of both operations), roughly as sketched below.
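A minimal sketch of that "clear" idea in Dash, with one callback wiping both the session chat history and the temporary tool-call memory. The store ids and layout are illustrative, not the exact ones in the repo:

```python
# Sketch: one button resets both the visible chat history and the tool-call memory.
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.Store(id="chat-history", data=[]),       # messages shown in the UI (per session)
    dcc.Store(id="tool-call-memory", data=[]),   # temp extended memory for the tool-call loop
    html.Button("Clear", id="clear-button"),
])

@app.callback(
    Output("chat-history", "data"),
    Output("tool-call-memory", "data"),
    Input("clear-button", "n_clicks"),
    prevent_initial_call=True,
)
def clear_memory(_):
    # Resetting both stores keeps the agent from being confused by stale context.
    return [], []
```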
Using MLflow effectively here is a bit more complex, but luckily inference tables can be activated even with the default Databricks-hosted LLM models. Since a dedicated SP is being used, it's easy to monitor and create the necessary AI/BI dashboards (or use out-of-the-box options). And yes, I'm also currently working on an agentic flow evaluation solution; I'll probably share more about it eventually.
"What kind of performance testing have you tried?"
Unfortunately my agent army isn't quite large enough to conquer the world yet, so let's call it limited. I haven't come across the limits yet, so if you manage to find them, remember to share the information! Oh and yes, you can have more than two Genie spaces here.
"What's next? Have any fun ideas for future development?"
The original idea was to create a quick PoC to inspire fellow Databricks developers with new possibilities. However, once it gained some attention, I had to revisit and upgrade it to a more robust version. Now that it’s done, I'm quite satisfied with it. There are many other exciting agentic projects I'm currently working on, so stay tuned 😉 But I'll probably come back with new updates once dynamic and secure user-level access can be handled.
Here's the list of updates that have been implemented:
A lot of breaking changes have been made, and the older version won't be supported (it wasn't good enough for that).
Deprecated input parameters in deployment notebook
databricks_token_secret_value
databricks_host_secret_value
genie_connection
Unity Catalog Genie connection has been deprecated (using Apps SP for authentication)
Genie polling logic updated to SDK and REST API code has been deprecated
Mosaic Model serving authentication updated to use get_open_ai_client() (query still lacks tool support)
Updated all tokens to SP auth via SDK
Updated Apps Dash to support user level usage and monitoring
Updated print to logging
Added enhanced logging capabilities on Apps
Removed documentation tool & example
Default LLM model updated to Claude 3.7 Sonnet pay-per-token on Databricks
And then some more minor stuff

Written by Aarni Sillanpää
Only progress is certain