Enhancing Small Models’ Ability to Tackle Complex Reasoning with Google’s AI Training Method
Revolutionizing AI Reasoning with Supervised Reinforcement Learning
A collaboration between Google Cloud and UCLA has introduced Supervised Reinforcement Learning (SRL), a reinforcement learning framework that improves language models' performance on complex multi-step reasoning tasks. SRL reframes problem-solving as a sequence of logical actions and supplies dense learning signals throughout training, rather than rewarding only the final answer.
This allows smaller models to tackle problems that were previously out of reach for conventional training methods. In experiments, SRL-trained models excelled on math reasoning benchmarks and generalized effectively to agentic software engineering tasks.
As a versatile training framework, SRL lifts smaller, more cost-effective models to higher levels of reasoning capability.
The Challenges of Current LLM Reasoning Training
Recent advancements in training large language models (LLMs) for reasoning have predominantly relied on reinforcement learning with verifiable rewards (RLVR). This method rewards models based on the accuracy of their final answers, fostering effective problem-solving strategies through iterative attempts and feedback mechanisms.
However, this outcome-driven approach faces limitations when models struggle to find correct solutions within a limited number of attempts, also known as “rollouts.” As each rollout incurs computational costs, models encounter barriers in solving exceedingly difficult problems within their allocated resources.
This bottleneck in learning becomes apparent in multi-step reasoning scenarios where models may solve several steps accurately but falter due to a single error, resulting in an incorrect overall answer. Under RLVR, such partial successes receive negative feedback, hindering the model’s learning process. This binary feedback system fails to offer detailed insights and only provides sparse rewards.
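The sparsity problem described above can be made concrete with a toy sketch (the function name and values are illustrative, not from the paper): an outcome-only reward gives a rollout that slips on one final step exactly the same score as a rollout that was wrong from the start.

```python
def rlvr_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-only (RLVR-style) reward: 1.0 for a correct final answer,
    0.0 otherwise -- no credit for correct intermediate steps."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

# A rollout that slips on the last arithmetic step ("41" vs. gold "42")
# scores exactly the same as one that is wrong from the start ("7").
print(rlvr_reward("41", "42"))  # 0.0
print(rlvr_reward("7", "42"))   # 0.0
```

Because both rollouts receive zero reward, the model gets no signal that the first one was almost right, which is precisely the bottleneck SRL targets.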
Alternatively, supervised fine-tuning (SFT) involves models learning from expert-provided examples illustrating the full reasoning process. While SFT can cultivate reasoning skills, it often leads to overfitting, where models merely mimic the trajectories in training data without generalizing to unseen problems. Moreover, the scarcity and high cost of creating high-quality human-generated training data exacerbate this issue.
These constraints leave a notable gap: small open-source models still cannot be trained to tackle genuinely hard problems effectively.
Understanding Supervised Reinforcement Learning
SRL introduces a novel framework that reimagines problem-solving as a sequential decision-making process, striking a balance between outcome-centric reinforcement learning and imitation learning. Rather than focusing solely on final answers or mimicking an expert’s entire reasoning process, SRL instructs models to replicate a sequence of key actions pivotal to expert reasoning. This approach enables models to emulate expert actions while developing their unique internal reasoning styles.
Within the SRL framework, expert demonstrations are deconstructed into a series of concrete intermediate actions, each representing a significant step. For instance, in a math problem, an action could entail an algebraic manipulation, whereas in software engineering, it might involve executing a command in a code repository. To generate training data, SRL leverages a robust teacher model to create solution trajectories, which subsequently train smaller models.
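The decomposition step can be sketched as follows (a minimal illustration; splitting on newlines is a stand-in for the paper's task-specific parsing, and all names here are hypothetical):

```python
def split_into_actions(expert_solution: str) -> list[str]:
    """Decompose an expert solution trajectory into intermediate actions.
    Here each non-empty line counts as one action; real decomposition
    depends on the task (algebra steps, shell commands, etc.)."""
    return [step.strip() for step in expert_solution.splitlines() if step.strip()]

demo = "x + 3 = 7\nsubtract 3 from both sides\nx = 4"
actions = split_into_actions(demo)

# Each prefix of the action sequence becomes one training example:
# given the problem and the actions so far, predict the next action.
examples = [{"context": actions[:i], "target": actions[i]} for i in range(len(actions))]
print(len(examples))  # 3
```

Turning one expert trajectory into several per-step examples is what lets SRL supply a learning signal at every step instead of only at the end.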
I-Hung Hsu, a research scientist at Google and co-author of the paper, emphasizes the efficacy of this middle-ground approach in real-world scenarios. According to Hsu, SRL encapsulates the structured flexibility inherent in real-world problem-solving, where multiple valid strategies coexist alongside clear standards for ‘good reasoning’ at each step. This adaptability makes SRL well-suited for domains like data science automation or supply chain optimization, rewarding sound intermediate reasoning rather than mere final outcomes.
During training, the model first generates an "inner monologue" – an internal reasoning process encapsulated within special formatting tags – before committing to an action. At each step, the reward is based on how closely the model's action matches the expert's corresponding action, so the model receives useful feedback even when its final answer would have been wrong.
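A per-step reward of this kind can be computed as a string similarity between the model's action and the expert's action at the same position. A minimal sketch using Python's `difflib` (treating sequence matching as the similarity metric is an assumption here, not a claim about the paper's exact implementation):

```python
from difflib import SequenceMatcher

def step_reward(model_action: str, expert_action: str) -> float:
    """Dense per-step reward: similarity between the model's action and
    the expert's action at the same step, in [0.0, 1.0]."""
    return SequenceMatcher(None, model_action, expert_action).ratio()

exact = step_reward("x = 4", "x = 4")
partial = step_reward("subtract 3 from both sides",
                      "subtract 3 on both sides")
print(exact)                  # 1.0
print(0.0 < partial < 1.0)    # True
```

Unlike the binary outcome reward, this gives partial credit for a nearly-correct step, which is what makes the training signal dense.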
Implementing SRL in Practice
Experimental results demonstrate SRL’s remarkable superiority over strong baselines in both demanding mathematical reasoning and agentic software engineering benchmarks. The research team observed that SRL fosters more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, enhancing solution quality without unnecessary verbosity.
For organizational leaders, performance enhancements hold value when achieved without escalating costs. Hsu clarifies that SRL-trained models exhibit enhanced reasoning efficiency. “The performance gains stem from improved reasoning quality and structure rather than verbosity,” he explained. “In terms of efficiency, SRL-trained models maintain token usage levels comparable to base models… while SRL doesn’t aim to reduce inference costs, it enhances reasoning performance without inflating costs.”
For mathematical assessments, the team fine-tuned Qwen2.5-7B-Instruct using a dataset featuring 1,000 challenging math questions. They compared its performance against models trained with SFT and RLVR (employing the GRPO algorithm prevalent in models like DeepSeek-R1) across four competition-level math benchmarks. The SRL-trained model achieved a notable 3.0% average performance increase over alternative methods.
Expanding the scope of SRL to agentic software engineering, a pivotal domain for enterprise automation, the team trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories detailing agent interactions within a coding environment. Benchmarking the SRL-trained model against the base model and SWE-Gym-7B, a robust baseline fine-tuned with SFT, revealed that SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This underscores SRL’s efficacy in training proficient AI agents for complex real-world programming tasks.
Setting a New Standard for High-Stakes AI?
The research’s most compelling outcomes emerged from a blend of methodologies: leveraging SRL to instill foundational reasoning skills and subsequently refining those skills with RLVR. In their experiments, combining SRL as a pre-training phase followed by RLVR post-training led to a 3.7% average performance boost, showcasing a potent curriculum learning strategy.
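The two-phase curriculum can be summarized as scaffolding (stub functions only, to show the ordering of the phases; none of these names come from the paper):

```python
def srl_train(model: dict, expert_actions: list) -> dict:
    """Phase 1 stub: step-wise training against expert actions (SRL)."""
    return {**model, "phase": "srl", "steps_seen": len(expert_actions)}

def rlvr_train(model: dict, problems: list) -> dict:
    """Phase 2 stub: outcome-verified RL on full problems (RLVR)."""
    return {**model, "phase": "rlvr", "problems_seen": len(problems)}

def train_curriculum(model: dict, expert_actions: list, problems: list) -> dict:
    # SRL first instills structured step-wise reasoning; RLVR then
    # sharpens that behavior against verifiable final answers.
    return rlvr_train(srl_train(model, expert_actions), problems)

final = train_curriculum({}, ["step1", "step2"], ["p1", "p2", "p3"])
print(final["steps_seen"], final["problems_seen"])  # 2 3
```

The ordering matters: running SRL first gives the model a structured reasoning habit that stabilizes the subsequent outcome-driven RL phase, per the results reported above.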
This approach prompts speculation on whether it could serve as a blueprint for developing specialized AI solutions.
“SRL serves as a robust foundation,” Hsu emphasized. “In essence, SRL provides a curriculum – teaching models to think and act methodically – before refining these behaviors with outcome-driven reinforcement learning. This SRL-first strategy not only stabilizes subsequent RL phases but also enhances reasoning interpretability and generalizability, crucial for high-stakes applications.”
Looking ahead, Hsu acknowledges the challenges in scaling this pipeline, particularly the complexity and cost associated with end-to-end RLVR for agentic tasks. Nevertheless, he remains optimistic about future prospects. “While high-quality expert trajectories remain indispensable,” he concluded, “we anticipate significant advancements by automating their generation and filtration – leveraging robust teacher models or self-improving student models to bootstrap new data.”