Unveiling the Human Element in Improving AI Judges: Insights from Databricks Research

Databricks research reveals that building better AI judges isn’t just a technical concern; it’s a people problem.

The Role of AI Judges in Enterprise AI Deployments

For enterprise AI deployments, the primary hurdle is not the intelligence of the models themselves but the difficulty of defining and measuring quality. That is where AI judges come in, and they are playing an increasingly vital role in the evaluation process.

Databricks, a prominent player in the AI space, has introduced Judge Builder, a framework designed to facilitate the creation of judges. Initially integrated into the company’s Agent Bricks technology, this framework has undergone significant evolution based on direct user feedback and real-world deployments.

While earlier versions of Judge Builder focused primarily on technical aspects, customer input highlighted a crucial bottleneck in organizational alignment. Consequently, Databricks now offers a structured workshop process to assist teams in overcoming three core challenges: establishing consensus among stakeholders on quality criteria, harnessing domain expertise from limited subject matter experts, and deploying evaluation systems at scale.

Jonathan Frankle, Databricks’ chief AI scientist, emphasized that the intelligence of AI models is typically not the limiting factor. Rather, the key questions revolve around how to ensure models behave as desired and how to verify their performance against predefined criteria.

The ‘Ouroboros Problem’ in AI Evaluation

One of the key issues addressed by Judge Builder is what Pallavi Koppol, a Databricks research scientist, refers to as the “Ouroboros problem.” This concept draws on the ancient symbol of a snake consuming its own tail, reflecting the circular validation challenge inherent in using AI systems to evaluate other AI systems.

Koppol explained that the dilemma arises when organizations validate their AI systems with AI judges: how can those judges be trusted when they are themselves AI systems? The answer is to use “distance to human expert ground truth” as the primary scoring function, so that an AI judge’s evaluations align closely with how domain experts would assess the same outputs.
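As a rough sketch of that scoring function, the snippet below measures a judge’s distance to human expert ground truth as the mean absolute gap between the judge’s scores and the experts’ consensus score, along with a simple agreement rate. The data layout, the use of the median as the consensus, and the one-point tolerance are illustrative assumptions, not Databricks’ published methodology.

```python
from statistics import mean, median

# Illustrative data: each item pairs a judge's score with expert scores
# for the same model output, on a shared 1-5 scale.
labeled_outputs = [
    {"judge_score": 4, "expert_scores": [4, 5, 4]},
    {"judge_score": 2, "expert_scores": [4, 4, 5]},
    {"judge_score": 5, "expert_scores": [5, 5, 4]},
]

def distance_to_ground_truth(items):
    """Mean absolute gap between the judge and the expert consensus (median)."""
    gaps = [abs(i["judge_score"] - median(i["expert_scores"])) for i in items]
    return mean(gaps)

def agreement_rate(items, tolerance=1):
    """Share of outputs where the judge lands within `tolerance` of the consensus."""
    return mean(
        abs(i["judge_score"] - median(i["expert_scores"])) <= tolerance for i in items
    )

print(f"distance to ground truth: {distance_to_ground_truth(labeled_outputs):.2f}")
print(f"agreement rate: {agreement_rate(labeled_outputs):.0%}")
```

A lower distance and higher agreement rate would indicate a judge that mirrors the experts more closely, which is the alignment target described above.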

This methodology diverges from traditional guardrail systems or single-metric evaluations by tailoring evaluation criteria specific to each organization’s domain expertise and business needs. Judge Builder’s technical implementation, integrating with Databricks’ MLflow and prompt optimization tools, distinguishes it further by enabling version control of judges, performance tracking, and simultaneous deployment across diverse quality dimensions.
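To make that integration concrete, here is a minimal, hedged sketch of a custom judge defined with MLflow’s GenAI metrics API (make_genai_metric). The rubric text, the expert-graded example, and the judge model URI are placeholders chosen for illustration; they are not Judge Builder’s actual internals.

```python
# A minimal sketch of a custom LLM judge defined with MLflow's GenAI metrics
# API. The rubric, example, and model URI are illustrative placeholders.
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="Summarize the customer's refund policy question.",
    output="The customer asks whether refunds are available after 30 days.",
    score=4,
    justification="Captures the core question accurately but omits the order date.",
)

faithfulness_judge = make_genai_metric(
    name="summary_faithfulness",       # one judge per quality dimension
    definition="Whether the summary reflects only facts present in the source text.",
    grading_prompt=(
        "Score 1-5. Give 5 if every claim in the summary is supported by the "
        "source text, and 1 if the summary invents or contradicts facts."
    ),
    examples=[example],                # expert-graded anchor for the judge
    model="openai:/gpt-4",             # placeholder judge model URI
    greater_is_better=True,
)
```

A metric defined this way can be passed to mlflow.evaluate via extra_metrics and re-registered under a new name as its rubric evolves, which is one plausible way to realize the version control and performance tracking described above.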

Lessons Learned: Building Effective AI Judges

Through its work with enterprise clients, Databricks has gleaned insights that offer universal lessons for anyone building AI judges.

Lesson One: Expert Disagreement

Organizations often discover that even their subject matter experts exhibit varying opinions on what constitutes acceptable output, particularly in subjective quality assessments. To address this, batched annotation with inter-rater reliability checks can help identify and rectify misalignments early on, leading to higher judge performance and reduced noise in training data.
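As one hedged illustration, the sketch below runs a simple inter-rater agreement check over a batch of annotations and flags examples where experts disagree so they can be reconciled before judge training. The data layout, the pairwise-agreement measure, and the one-point tolerance are assumptions made for the example, not part of Databricks’ workshop.

```python
from itertools import combinations
from statistics import mean

# Illustrative layout: annotations[example_id][rater] = score on a 1-5 scale.
annotations = {
    "ex-01": {"expert_a": 5, "expert_b": 5, "expert_c": 4},
    "ex-02": {"expert_a": 2, "expert_b": 5, "expert_c": 3},
}

def pairwise_agreement(scores, tolerance=1):
    """Fraction of rater pairs whose scores differ by at most `tolerance`."""
    pairs = list(combinations(scores, 2))
    return mean(abs(a - b) <= tolerance for a, b in pairs)

for example_id, by_rater in annotations.items():
    agreement = pairwise_agreement(list(by_rater.values()))
    if agreement < 1.0:  # assumed threshold: any pair more than one point apart
        print(f"{example_id}: agreement={agreement:.2f} -> reconcile before training")
```

Flagged examples become discussion prompts for the experts, which is how misalignment gets surfaced early rather than baked into the judge.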

Lesson Two: Specificity in Criteria

Breaking down vague quality criteria into specific judges targeting distinct quality aspects yields more actionable insights than a generic overall quality assessment. By combining top-down requirements with bottom-up discovery, organizations can create production-friendly judges aligned with both regulatory constraints and observed failure patterns.
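As a hedged sketch of what “specific judges” might look like in practice, the snippet below splits a vague “overall quality” criterion into separate, narrowly scoped rubrics. The dimension names and rubric text are illustrative assumptions rather than Judge Builder’s schema.

```python
from dataclasses import dataclass

@dataclass
class JudgeSpec:
    """Illustrative spec for one narrowly scoped judge (not Judge Builder's schema)."""
    name: str
    rubric: str
    scale: tuple = (1, 5)

# Instead of one generic "overall quality" judge, split the criteria into
# targeted judges, each of which can pass or fail independently.
judges = [
    JudgeSpec("faithfulness", "Every claim is supported by the retrieved documents."),
    JudgeSpec("policy_compliance", "The answer respects regulatory and internal policy constraints."),
    JudgeSpec("tone", "The response matches the organization's customer-facing style guide."),
]
```

In this framing, a dimension like policy_compliance would come from top-down regulatory requirements, while a dimension like tone might emerge bottom-up from failure patterns observed in production.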

Lesson Three: Efficiency in Example Selection

Teams can develop robust judges with as few as 20-30 carefully chosen examples, focusing on edge cases that expose disagreement rather than universally accepted instances. This streamlined approach enables swift judge creation without compromising quality.
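One hedged way to operationalize this is to rank candidate examples by how much the annotators disagree on them and keep only the most contentious ones within the 20-30 example budget. The variance-based ranking below is an assumption for illustration, not Databricks’ stated selection method.

```python
from statistics import pvariance

# Illustrative data: candidate_pool maps example_id -> list of expert scores.
candidate_pool = {
    "ex-07": [5, 5, 5],   # universally accepted: adds little signal for a judge
    "ex-12": [1, 4, 5],   # high disagreement: likely an informative edge case
    "ex-31": [2, 2, 3],
}

def select_edge_cases(pool, budget=25):
    """Keep the `budget` examples with the highest score variance across experts."""
    ranked = sorted(pool, key=lambda ex: pvariance(pool[ex]), reverse=True)
    return ranked[:budget]

training_examples = select_edge_cases(candidate_pool)
print(training_examples)  # most contentious examples first
```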

Production Results and Recommendations

Frankle shared the metrics Databricks uses to gauge Judge Builder’s success, pointing to customer satisfaction, increased AI spending, and more advanced use of AI as the key indicators.

Notable success stories include customers creating multiple judges post-workshop, leading to enhanced measurement capabilities. Additionally, several clients have transitioned to seven-figure investments in Databricks’ GenAI platform following their engagements with Judge Builder.

For enterprises seeking to move from AI pilots to full-scale production, Databricks recommends focusing on high-impact judges, establishing lightweight workflows with subject matter experts, and scheduling regular judge reviews using production data. By treating judges as assets that evolve alongside the systems they evaluate, organizations can use them for continuous improvement and optimization.

Ultimately, a well-crafted judge serves not only as an evaluation tool but also as a means of setting guardrails, facilitating metric-driven optimizations, and enabling advanced techniques like reinforcement learning. By harnessing the power of AI judges, organizations can enhance their AI capabilities and drive significant business outcomes.
