AI Acceleration: How the ATLAS Adaptive Speculator Achieves a 400% Inference Speedup Through Real-Time Workload Learning

Together AI's ATLAS adaptive speculator delivers up to a 400% inference speedup by learning from workloads in real time

Overcoming the Invisible Performance Wall in AI Inference with ATLAS

Enterprises scaling their AI deployments are running into a hidden performance wall: static speculators that cannot keep up with shifting workloads.

Speculators, which are smaller AI models, complement large language models during inference. They anticipate multiple tokens ahead, which the main model then validates in parallel. This approach, known as speculative decoding, has become crucial for enterprises aiming to reduce inference costs and latency. Rather than generating tokens sequentially, the system can process multiple tokens simultaneously, significantly enhancing throughput.
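To make the mechanics concrete, here is a minimal sketch of a speculative decoding loop in Python. The two "models" are toy stand-ins (a real deployment pairs a small draft LLM with a large target LLM), and every function name here is illustrative rather than any vendor's API:

```python
import random

random.seed(0)

# Toy stand-ins: a real system pairs a small draft LLM with a large target LLM.
def draft_next_token(context):
    """Small, fast 'speculator': a cheap guess at the next token."""
    return (sum(context) * 31 + 7) % 50

def target_next_token(context):
    """Large, slow 'target model': the authoritative next token."""
    # Agrees with the draft most of the time, mimicking a decent acceptance rate.
    return draft_next_token(context) if random.random() < 0.8 else random.randrange(50)

def speculative_decode(prompt, num_tokens, lookahead=4):
    """Generate num_tokens after prompt, drafting `lookahead` tokens per round."""
    output = list(prompt)
    while len(output) - len(prompt) < num_tokens:
        # 1) Draft several tokens sequentially with the cheap model.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_next_token(output + draft))

        # 2) Verify the draft with the target model. In a real engine this is one
        #    batched forward pass rather than a loop of separate calls.
        accepted = []
        for tok in draft:
            target_tok = target_next_token(output + accepted)
            if target_tok == tok:
                accepted.append(tok)          # draft accepted, keep going
            else:
                accepted.append(target_tok)   # first mismatch: take the target's token
                break
        output.extend(accepted)
    return output[len(prompt):len(prompt) + num_tokens]

print(speculative_decode([1, 2, 3], num_tokens=12))
```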

Together AI recently introduced ATLAS (AdapTive-LeArning Speculator System) to help enterprises tackle the problem of static speculators. The system provides self-learning inference optimization that can deliver up to 400% faster inference than baseline performance on existing inference technologies like vLLM. It addresses a critical issue: as AI workloads evolve, inference slows down, even with specialized speculators in place.

The company, founded in 2023, has focused on improving inference on its enterprise AI platform. Earlier this year, it raised $305 million in funding as customer adoption and demand surged.

“As companies scale up, they often encounter shifting workloads, resulting in reduced speedup from speculative execution,” explained Tri Dao, chief scientist at Together AI, in an exclusive interview with VentureBeat. “These speculators tend to underperform when the workload domain undergoes changes.”

The Unspoken Workload Drift Challenge

Most speculators currently in production are static models. They are trained once on a fixed dataset representing anticipated workloads and then deployed without the ability to adapt. Companies like Meta and Mistral deploy pre-trained speculators alongside their main models. Inference platforms such as vLLM leverage these static speculators to enhance throughput without compromising output quality.

However, there’s a catch. When an enterprise’s AI usage evolves, the accuracy of static speculators declines.

“If a company develops coding agents, and most of their developers have been coding in Python, but suddenly some switch to Rust or C, the speed starts to decline,” explained Dao. “The speculator faces a mismatch between its training data and the actual workload.”

This workload drift poses a hidden challenge for scaling AI. Enterprises must either accept degraded performance or invest in retraining custom speculators, which provides only a temporary fix.

How Adaptive Speculators Operate: A Dual-Model Strategy

ATLAS employs a dual-speculator architecture that combines stability with adaptability, as illustrated in the sketch after this list:

The static speculator – A heavyweight model trained on extensive data offers consistent baseline performance, serving as a “speed floor.”

The adaptive speculator – A lightweight model continuously learns from real-time data, specializing on-the-fly to emerging domains and usage patterns.

The confidence-aware controller – An orchestration layer dynamically selects which speculator to utilize, adjusting the speculation “lookahead” based on confidence scores.
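Together AI has not published the routing logic, but a confidence-aware controller could plausibly look like the sketch below, where the class, its threshold, and the lookahead schedule are all hypothetical illustrations of the idea rather than ATLAS internals:

```python
from dataclasses import dataclass

@dataclass
class SpeculatorChoice:
    name: str        # which draft model to use for the next round
    lookahead: int   # how many tokens to draft before verification

class ConfidenceAwareController:
    """Hypothetical controller that routes between a static and an adaptive speculator."""

    def __init__(self, min_confidence: float = 0.6):
        self.min_confidence = min_confidence

    def choose(self, adaptive_confidence: float) -> SpeculatorChoice:
        # Early on, the adaptive speculator has seen little traffic, so its
        # confidence is low and the broadly trained static model is the safer bet.
        if adaptive_confidence < self.min_confidence:
            return SpeculatorChoice(name="static", lookahead=3)
        # As the adaptive model learns the live workload, lean on it more heavily
        # and speculate further ahead per verification step.
        lookahead = 3 + int((adaptive_confidence - self.min_confidence) * 10)
        return SpeculatorChoice(name="adaptive", lookahead=min(lookahead, 8))

controller = ConfidenceAwareController()
print(controller.choose(adaptive_confidence=0.3))   # static speculator, short lookahead
print(controller.choose(adaptive_confidence=0.9))   # adaptive speculator, longer lookahead
```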

“Initially, before the adaptive speculator acquires knowledge, we rely on the static speculator to provide the speed boost,” explained Ben Athiwaratkun, staff AI scientist at Together AI. “Once the adaptive speculator gains confidence, the speed improves over time.”

The key innovation lies in maintaining a balance between acceptance rate (the frequency at which the target model agrees with drafted tokens) and draft latency. As the adaptive model learns from traffic patterns, the controller increasingly relies on the lightweight speculator and extends the lookahead, leading to enhanced performance gains.
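The standard analysis of speculative decoding (Leviathan et al., 2023) makes this balance concrete: with acceptance rate alpha and lookahead k, each verification pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average, so longer lookaheads only pay off once acceptance is high. A quick illustration:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model pass, per the standard
    speculative decoding analysis (alpha = acceptance rate, k = lookahead)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Low acceptance: longer lookaheads mostly waste draft work.
print(round(expected_tokens_per_pass(alpha=0.5, k=3), 2))  # ~1.88
print(round(expected_tokens_per_pass(alpha=0.5, k=8), 2))  # ~2.00
# High acceptance (a well-adapted speculator): longer lookaheads pay off.
print(round(expected_tokens_per_pass(alpha=0.9, k=3), 2))  # ~3.44
print(round(expected_tokens_per_pass(alpha=0.9, k=8), 2))  # ~6.13
```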

No tuning is required on the customer side. “Users don’t need to make any adjustments,” stated Dao. “We handle the configurations on our end to deliver optimal speedup.”

Performance Comparable to Custom Silicon

Testing by Together AI shows ATLAS achieving 500 tokens per second on DeepSeek-V3.1 when fully adapted. Remarkably, those numbers, achieved on Nvidia B200 GPUs, match or exceed specialized inference chips like Groq’s custom hardware.

“Through software and algorithmic enhancements, we have narrowed the gap with specialized hardware,” noted Dao. “We have observed 500 tokens per second on these large models, which outperform some customized chips.”

The company’s claimed 400% inference speedup is cumulative, the result of Together’s Turbo optimization suite. FP4 quantization provides an 80% speedup over the FP8 baseline, the Turbo Speculator adds a further 80-100%, and the adaptive system builds on top of those gains. Each optimization amplifies the benefits of the others.
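Because those gains multiply rather than add, the compounding is easy to see with the figures quoted above; the adaptive term in this back-of-the-envelope calculation is an assumption, not a published number:

```python
# Illustrative compounding of the quoted optimizations (not official figures).
fp4_quantization = 1.8   # "80% speedup over the FP8 baseline"
turbo_speculator = 1.9   # midpoint of the "80-100%" further gain
adaptive_lift = 1.2      # assumed extra lift once ATLAS has fully adapted

print(f"combined: ~{fp4_quantization * turbo_speculator * adaptive_lift:.1f}x")  # ~4.1x
```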

When compared to standard inference engines like vLLM or Nvidia’s TensorRT-LLM, the improvement is significant. Together AI benchmarks against the stronger baseline between the two for each workload before applying speculative optimizations.

The Memory-Compute Tradeoff Explained

The performance enhancements stem from capitalizing on a fundamental inefficiency in modern inference: underutilized compute capacity.

Dao explained that during inference, a substantial portion of the compute power remains idle.

“Inference, which currently dominates workloads, heavily relies on the memory subsystem,” he stated.

Speculative decoding trades idle compute for reduced memory access. When a model generates tokens one at a time, it is memory-bound, with the GPU waiting on memory. When the speculator proposes multiple tokens and the target model verifies them in a single pass, compute utilization spikes while memory access stays roughly constant.

“Although the total compute required to generate five tokens remains the same, accessing memory only once, rather than five times, enhances efficiency,” explained Dao.
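A back-of-the-envelope model shows why this helps: in memory-bound decoding, every step must stream the model weights from GPU memory, so verifying k drafted tokens in one pass divides the weight traffic per token by roughly k. The numbers below are purely illustrative and ignore rejected drafts, KV-cache traffic, and real hardware details:

```python
# Illustrative memory-bound estimate, not a benchmark of any specific GPU.
weight_bytes = 70e9 * 2      # e.g., a 70B-parameter model stored in FP16
mem_bandwidth = 3.0e12       # bytes/s of GPU memory bandwidth (assumed)

t_sequential = weight_bytes / mem_bandwidth          # one full weight read per token
k = 4                                                # drafted tokens per verification
t_speculative = (weight_bytes / mem_bandwidth) / k   # one read now covers k tokens

print(f"sequential:   {t_sequential * 1e3:.1f} ms of memory time per token")
print(f"lookahead={k}: {t_speculative * 1e3:.1f} ms of memory time per token")
```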

Intelligent Caching for AI

For infrastructure teams familiar with traditional database optimization, adaptive speculators function akin to an intelligent caching layer, albeit with a crucial distinction.

Traditional caching systems such as Redis or memcached necessitate exact matches. You store and retrieve the exact query result when the specific query recurs. Adaptive speculators operate differently.

“It’s akin to intelligent caching, where instead of storing exact matches, we identify patterns in the data,” clarified Dao. “Broadly, we observe similarities in code or compute control, allowing us to predict outcomes more accurately.”

Instead of storing precise responses, the system learns patterns in token generation by the model. It recognizes that certain token sequences are more likely when editing Python files in a specific codebase. The speculator adjusts to these patterns, refining its predictions over time without requiring identical inputs.
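To see the contrast with exact-match caching, here is a deliberately simplified sketch: an exact cache only helps when the identical prompt recurs, while a pattern-style drafter (a toy n-gram table standing in for the adaptive speculator) keeps helping on new inputs that merely resemble earlier traffic. The n-gram approach is an illustration of the idea, not ATLAS's actual learning method:

```python
from collections import defaultdict

# Exact-match cache: only useful if the identical prompt shows up again.
exact_cache = {"def add(a, b):": " return a + b"}
print(exact_cache.get("def sub(a, b):"))   # None -> no help on a similar-but-new prompt

# Toy pattern learner: counts which token tends to follow each token pair.
class NgramDrafter:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def draft(self, a, b):
        followers = self.counts.get((a, b))
        return max(followers, key=followers.get) if followers else None

drafter = NgramDrafter()
drafter.observe("def add ( a , b ) : return a + b".split())
drafter.observe("def mul ( a , b ) : return a * b".split())
# A new, similar function still benefits from the learned pattern.
print(drafter.draft(")", ":"))   # -> "return"
```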

Use Cases: RL Training and Dynamic Workloads

Adaptive speculators offer significant benefits in two enterprise scenarios:

Reinforcement learning training: Static speculators struggle to keep pace as the policy evolves during training. ATLAS adapts continuously to the evolving policy distribution.

Dynamic workloads: As enterprises explore new AI applications, workload composition changes. “They may initially use AI for chatbots but then realize its coding capabilities, leading to a shift towards coding tasks,” noted Dao. “Or they discover that AI can operate tools, control computers, and handle accounting tasks.”

During a coding session, the adaptive system can specialize for the specific codebase being edited, even if these files were not part of the training data. This further enhances acceptance rates and decoding speed.
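How ATLAS measures that growing confidence has not been disclosed; as a purely hypothetical illustration, a controller could track a running acceptance rate per workload and hand more traffic to the adaptive speculator as that rate stabilizes:

```python
class AcceptanceTracker:
    """Hypothetical per-workload confidence signal: an exponential moving
    average of how often the adaptive speculator's drafts are accepted."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.rate = 0.0

    def record(self, accepted: int, drafted: int) -> float:
        batch_rate = accepted / drafted if drafted else 0.0
        self.rate = self.decay * self.rate + (1 - self.decay) * batch_rate
        return self.rate

tracker = AcceptanceTracker()
# Simulated feedback from a coding session on a new codebase: acceptance
# climbs as the adaptive speculator picks up the project's patterns.
for accepted in [2, 3, 5, 6, 7, 7]:
    confidence = tracker.record(accepted=accepted, drafted=8)
print(f"current confidence: {confidence:.2f}")
```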

Implications for Enterprises and the Inference Ecosystem

ATLAS is now accessible on Together AI’s dedicated endpoints as part of the platform at no extra cost. The company’s developer base of over 800,000 (up from 450,000 in February) can leverage this optimization.

However, the broader implications extend beyond a single vendor’s product. The transition from static to adaptive optimization signifies a fundamental shift in how inference platforms should operate. As enterprises deploy AI across diverse domains, the industry must progress from one-time trained models to systems that continuously learn and improve.

While the integrated ATLAS system remains proprietary, Together AI has a history of open-sourcing some of its research techniques and collaborating with projects like vLLM, so some of the underlying ideas could influence the broader inference ecosystem in the future.

For enterprises aiming to lead in AI, the message is clear: adaptive algorithms on standard hardware can rival custom silicon at a reduced cost. As this approach gains traction across the industry, software optimization increasingly outperforms specialized hardware.
