At Databricks, we use reinforcement learning (RL) to develop reasoning models for problems that our customers face as well as for our products, such as the Databricks Assistant and AI/BI Genie. These tasks include generating code, analyzing data, integrating organizational knowledge, domain-specific research, and information extraction (IE) from documents. Tasks like coding or information extraction often have verifiable rewards: correctness can be checked directly (e.g., passing tests, matching labels). This enables reinforcement learning without a learned reward model, a setting known as RLVR (reinforcement learning with verifiable rewards). In other domains, a custom reward model may be required, which Databricks also supports. In this post, we focus on the RLVR setting.
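To make the idea concrete, here is a minimal sketch of a verifiable reward for a coding task (this is illustrative only, not our production reward code; the `solve` entry point and the test format are assumptions made for the example):

```python
# Illustrative verifiable reward for code generation: run the model's
# output against known test cases and return a binary score.
def verifiable_reward(candidate_code: str, test_cases: list[tuple[int, int]]) -> float:
    """Return 1.0 if the generated code passes every test, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        solve = namespace["solve"]       # assumed entry-point name
        for inp, expected in test_cases:
            if solve(inp) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0  # crashing or malformed code counts as a failure
```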

To illustrate the power of RLVR, we applied our training stack to a popular academic benchmark in data science called BIRD. This benchmark studies the task of transforming a natural language query into SQL code that runs on a database. This is an important problem for Databricks users, enabling non-SQL experts to talk to their data. It is also a challenging task where even the best proprietary LLMs don't work well out of the box. While BIRD neither fully captures the real-world complexity of this task nor the full breadth of real products like Databricks AI/BI Genie (Figure 1), its popularity allows us to measure the efficacy of RLVR for data science on a well-understood benchmark.
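BIRD scores a prediction by execution accuracy: the predicted SQL and the reference SQL are both executed against the database, and the prediction counts as correct if the two result sets match. This is exactly the kind of directly checkable signal that makes the task RLVR-friendly. A simplified sketch using Python's built-in sqlite3 (BIRD databases are distributed as SQLite files; the official evaluator additionally handles timeouts and other edge cases omitted here):

```python
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """Run both queries and compare result sets, ignoring row order.

    A simplified sketch of BIRD-style execution accuracy.
    """
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute is simply wrong
    finally:
        conn.close()
    return set(pred_rows) == set(gold_rows)
```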

We focus on improving a base SQL coding model using RLVR, isolating these gains from improvements driven by agentic designs. Progress is measured on the single-model, single-generation track of the BIRD leaderboard (i.e., no self-consistency), which evaluates on a private test set.
We set a new state-of-the-art test accuracy of 73.5% on this benchmark. We did so using our standard RLVR stack and training only on the BIRD training set. The previous best score on this track was 71.8% [1], achieved by augmenting the BIRD training set with additional data and using a proprietary LLM (GPT-4o). Our score is significantly better than both the original base model and proprietary LLMs (see Figure 2). This result showcases the simplicity and generality of RLVR: we reached this score with off-the-shelf data and the standard RL components we're rolling out in Agent Bricks, and we did so on our first submission to BIRD. RLVR is a strong baseline that AI developers should consider whenever enough training data is available.
We built our submission based on the BIRD dev set. We found that Qwen 2.5 32B Coder Instruct was the best starting point. We fine-tuned this model using both Databricks TAO, an offline RL method, and our RLVR stack. This approach, along with careful prompt and model selection, was sufficient to get us to the top of the BIRD benchmark. This result is a public demonstration of the same techniques we are using to improve popular Databricks products like AI/BI Genie and Assistant, and to help our customers build agents using Agent Bricks.
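Schematically, an online RLVR step couples a sampler with a verifier like the one above: for each training question, sample a group of SQL candidates, score each with the execution-match reward, and up-weight candidates that beat their group's average. The sketch below shows a group-relative advantage computation in the style of GRPO; this post does not specify which RL algorithm our stack uses, so treat this as one plausible instantiation rather than a description of our implementation:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage = (reward - group mean) / group std, GRPO-style.

    Candidates that outperform their sampled siblings receive positive
    advantages and are reinforced in the policy-gradient update.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: 4 sampled SQL candidates; reward 1.0 means the candidate's
# result set matched the gold query's result set when executed.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[1.0, -1.0, -1.0, 1.0]
```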
Our results highlight the power of RLVR and the efficacy of our training stack. Databricks customers have also reported great results using our stack in their reasoning domains. We think this recipe is powerful, composable, and widely applicable to a range of tasks. If you'd like to preview RLVR on Databricks, contact us here.
[1] See Table 1 in https://arxiv.org/pdf/2505.20315
Authors: Alnur Ali, Ashutosh Baheti, Jonathan Chang, Ta-Chung Chi, Brandon Cui, Andrew Drozdov, Jonathan Frankle, Abhay Gupta, Pallavi Koppol, Sean Kulinski, Jonathan Li, Dipendra Kumar Misra, Jose Javier Gonzalez Ortiz, Krista Opsahl-Ong