The Power of Retention: Uncovering Brumby-14B-Base, an Attention-Free Qwen3 Variant

Attention ISN'T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique

Revolutionizing AI: The Rise of Power Retention

Back in 2017, the introduction of the transformer architecture in the groundbreaking Google paper “Attention Is All You Need” marked a pivotal moment in the field of artificial intelligence. Since then, every major large language model (LLM) has been built upon some form of its central mechanism: attention. This mathematical operation allows models to analyze vast amounts of input data and identify the most relevant information.

Fast forward eight years, and the once-revered attention mechanism is showing its limits. While powerful, attention comes at a high cost in both compute and memory. As models are asked to analyze extensive documents, codebases, or video streams spanning hours or even days, attention has become a significant bottleneck.

On October 28, 2025, a relatively unknown AI startup called Manifest AI introduced a groundbreaking alternative. Their latest model, Brumby-14B-Base, is a reimagined version of the popular Qwen3-14B-Base transformer model. What sets Brumby apart is its abandonment of the attention mechanism in favor of a novel approach known as Power Retention.

Power Retention, a recurrent and hardware-efficient architecture developed by Manifest AI, offers a solution to the scalability issues posed by attention. Unlike attention, Power Retention retains information over long contexts without the steep growth in compute and memory that attention incurs as the context gets longer. This approach promises to deliver performance comparable to traditional transformer models while significantly reducing computational costs.

The Shift from Attention to Retention

At the heart of Manifest AI’s innovation lies the Power Retention layer. Where traditional transformers rely on attention’s all-pairs comparison between tokens, Power Retention introduces a recurrent state update: the layer maintains a fixed-size memory matrix that is updated as each new token arrives, instead of storing and re-scanning every past token.
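
As a rough mental model (not Manifest AI's released kernel), a retention-style layer can be pictured as exactly such a fixed-size memory matrix: it is updated once per token and then read with the current query. The NumPy sketch below illustrates that recurrence under those assumptions; the dimensions are arbitrary, and it omits the feature map and normalization a real layer would use.

```python
import numpy as np

def retention_forward(Q, K, V):
    """Toy recurrent-state sketch of a retention-style layer (illustrative only).

    Q, K, V have shape (seq_len, d). Instead of comparing every query with
    every key, the layer keeps a running memory matrix S of fixed size (d, d)
    and updates it once per token.
    """
    seq_len, d = Q.shape
    S = np.zeros((d, d))               # fixed-size state, independent of seq_len
    outputs = np.empty_like(V)
    for t in range(seq_len):
        S += np.outer(K[t], V[t])      # fold the new token into the memory matrix
        outputs[t] = Q[t] @ S          # read the state with the current query
    return outputs

# Per-token work is O(d^2) regardless of how long the sequence grows.
Q, K, V = (np.random.randn(1024, 64) for _ in range(3))
print(retention_forward(Q, K, V).shape)   # (1024, 64)
```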

By leveraging Power Retention, the computational cost per token remains constant regardless of the length of the input sequence, so total compute grows linearly rather than quadratically with context length. This property sets Power Retention apart from traditional transformers and positions it as a potential game-changer in AI architecture.

Moreover, Power Retention retains the expressive capabilities that made attention successful while introducing higher-order dependencies between past and present tokens. This results in an architecture that can handle long-term dependencies efficiently, combining the benefits of both RNNs and transformers.
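
One way to read those higher-order dependencies is that queries and keys are lifted through a degree-p "power" feature map before they enter the recurrence, so the state captures interactions among products of features rather than single features. The sketch below is a hypothetical illustration using a degree-2 tensor-product expansion; the exact feature map, state layout, and normalization in Power Retention may differ.

```python
import numpy as np

def power_features(x, p=2):
    """Degree-p tensor-product expansion of a vector, flattened.

    For p=2, a d-dimensional vector maps to its d*d pairwise products, which
    is what lets the recurrent state capture higher-order interactions.
    """
    feats = x
    for _ in range(p - 1):
        feats = np.outer(feats, x).ravel()
    return feats

def power_retention_forward(Q, K, V, p=2):
    """Same toy recurrence as before, but queries and keys are expanded with
    power_features first (illustrative only, not the real kernel)."""
    seq_len, d = Q.shape
    S = np.zeros((d ** p, d))          # state grows with the feature dimension
    outputs = np.empty_like(V)
    for t in range(seq_len):
        S += np.outer(power_features(K[t], p), V[t])
        outputs[t] = power_features(Q[t], p) @ S
    return outputs

Q, K, V = (np.random.randn(256, 16) for _ in range(3))
print(power_retention_forward(Q, K, V).shape)   # (256, 16)
```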

Retraining for Success

One of the most remarkable aspects of Brumby-14B’s training process is its efficiency. Manifest AI retrained the model in just 60 hours on 32 Nvidia H100 GPUs, at a cost of about $4,000. This represents a dramatic cost reduction compared to training a conventional model of similar scale from scratch.
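
Taken at face value, those figures imply the GPU-hour budget below (a back-of-the-envelope check using only the numbers quoted above):

```python
# Back-of-the-envelope check on the quoted retraining budget.
gpus = 32                 # Nvidia H100s, per the announcement
hours = 60                # wall-clock retraining time
total_cost_usd = 4_000    # reported total cost

gpu_hours = gpus * hours                          # 1,920 GPU-hours
cost_per_gpu_hour = total_cost_usd / gpu_hours
print(f"{gpu_hours} GPU-hours, ~${cost_per_gpu_hour:.2f} per H100-hour")
# -> 1920 GPU-hours, ~$2.08 per H100-hour
```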

However, it’s important to note that while Brumby was based on a transformer model, this breakthrough does not signal the end of the transformer era. Jacob Buckman, the founder of Manifest AI, highlighted that leveraging existing transformer models is crucial for accelerating the adoption of new architectural paradigms.

Through a brief retraining phase, Brumby was able to recalibrate its weights and align them with the Power Retention framework. This process allowed the model to quickly recover its performance, showcasing the adaptability and efficiency of attention-free systems.
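
Conceptually, the conversion resembles swapping each attention module in a pretrained transformer for a retention module that reuses the same projection weights, followed by a short fine-tuning run. The PyTorch sketch below is purely illustrative: the module and attribute names (ToyRetention, self_attn, q_proj, and so on) are assumptions made for demonstration, not Manifest AI's published conversion procedure.

```python
import torch
import torch.nn as nn

class ToyRetention(nn.Module):
    """Minimal retention-style stand-in for an attention module (illustrative).

    It keeps the same q/k/v/output projections as the attention layer it
    replaces, so pretrained weights can be copied over, but it mixes tokens
    through a recurrent state instead of a softmax attention matrix.
    """
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                        # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Causal recurrence as a cumulative sum of outer products k_t v_t^T.
        # (A real kernel would process this in chunks rather than materializing
        # one state per position.)
        state = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)
        y = torch.einsum("bsd,bsde->bse", q, state)
        return self.out_proj(y)

def convert_attention_to_retention(block, attn_attr="self_attn"):
    """Swap a block's attention module for ToyRetention, reusing its projection
    weights. Attribute names here are assumptions; real checkpoints differ."""
    attn = getattr(block, attn_attr)
    retention = ToyRetention(attn.q_proj.in_features)
    retention.q_proj.load_state_dict(attn.q_proj.state_dict())
    retention.k_proj.load_state_dict(attn.k_proj.state_dict())
    retention.v_proj.load_state_dict(attn.v_proj.state_dict())
    retention.out_proj.load_state_dict(attn.out_proj.state_dict())
    setattr(block, attn_attr, retention)
    return block

# Hypothetical usage with dummy modules standing in for a real checkpoint:
class DummyAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
        self.v_proj, self.out_proj = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)

class DummyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.self_attn = DummyAttention(d)

block = convert_attention_to_retention(DummyBlock(64))
print(block.self_attn(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```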

Benchmarking Success

Across various evaluation tasks, Brumby-14B-Base performs broadly on par with traditional transformer models of similar scale. It trails on some knowledge and coding benchmarks, but it excels in tasks requiring mathematical reasoning and long-context analysis.

| Task | Brumby-14B | Qwen3-14B | GLM-4.5-Air | Nemotron Nano (12B) |
| --- | --- | --- | --- | --- |
| ARC | 0.89 | 0.94 | 0.92 | 0.93 |
| GSM8K | 0.88 | 0.84 | 0.83 | 0.84 |
| GSM8K (Platinum) | 0.87 | 0.88 | 0.85 | 0.87 |
| HellaSwag | 0.77 | 0.81 | 0.85 | 0.82 |
| MATH | 0.62 | 0.54 | 0.47 | 0.26 |
| MBPP | 0.57 | 0.75 | 0.73 | 0.71 |
| MMLU | 0.71 | 0.78 | 0.77 | 0.78 |
| MMLU (Pro) | 0.36 | 0.55 | 0.51 | 0.53 |

While Brumby may face some challenges in certain evaluations, its performance in critical reasoning tasks underscores the potential of retention-based systems in handling complex dependencies.

Efficiency and Innovation

One of the key advantages of Brumby’s Power Retention design is its hardware efficiency. By utilizing local matrix operations for state updates, inference can be implemented with linear complexity in sequence length. This approach offers significant speedups over traditional attention mechanisms, making it a promising solution for processing long inputs.
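
A rough FLOP count makes that linear-versus-quadratic difference concrete. The snippet below compares the token-mixing cost of attention (roughly seq_len^2 * d) with that of a retention-style recurrence over a d-by-d state (roughly seq_len * d^2), ignoring projections, feed-forward layers, and constants; the power variant's larger state would change the constants but not the scaling.

```python
# Rough per-layer FLOP estimates for mixing tokens, ignoring constants,
# projection layers, and feed-forward blocks (scaling comparison only).
def attention_flops(seq_len, d):
    # QK^T scores plus the weighted sum over values: ~2 * seq_len^2 * d
    return 2 * seq_len**2 * d

def retention_flops(seq_len, d):
    # One state update and one state read per token: ~2 * seq_len * d^2
    return 2 * seq_len * d**2

d = 128
for seq_len in (1_000, 10_000, 100_000, 1_000_000):
    ratio = attention_flops(seq_len, d) / retention_flops(seq_len, d)
    print(f"{seq_len:>9} tokens: attention/retention ratio ~{ratio:,.0f}x")
# The gap grows linearly with length: ~8x at 1k tokens, ~7,800x at 1M tokens.
```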

Manifest AI’s Power Retention kernels demonstrate improved hardware utilization compared to other architectures, showcasing the efficiency gains of this innovative approach. The reduced computational complexity and memory requirements further enhance the performance of the model on extended sequences.

Training and Scalability

The low cost of training Brumby-14B highlights the scalability and efficiency of Power Retention. Buckman emphasized that retraining larger models becomes more straightforward as the parameter count increases, leading to reduced training steps and costs. This approach could democratize large-scale experimentation and research by making it more accessible to smaller organizations and research groups.

Moreover, the ease of integrating Power Retention into existing transformer models simplifies the transition to attention-free systems. By leveraging the existing knowledge and architecture of transformers, organizations can achieve significant performance gains with minimal training time and resources.

Future Outlook

Beyond the technical achievements, Manifest AI’s stated mission is to model all human output by reimagining the underlying intelligent processes. This ambitious goal requires a fundamental shift in how models are designed and trained, with Power Retention representing just the beginning of this transformative journey.

The release of Brumby-14B signifies a significant milestone in the evolution of AI architectures. By challenging the dominance of traditional transformers and showcasing the potential of retention-based systems, Manifest AI has paved the way for a new era of architectural diversity and innovation in artificial intelligence.

As Buckman aptly puts it, “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”
