
Unleashing Million-Token AI Reasoning: The Power of Markovian Thinking


Revolutionizing Language Models with Markovian Thinking

Researchers at Mila have introduced a technique that makes large language models (LLMs) dramatically more efficient at complex reasoning tasks. Known as Markovian Thinking, the approach lets LLMs reason at great length without the ballooning computational costs that currently make such long reasoning impractical.

The team’s implementation, a reinforcement-learning environment named Delethink, segments the reasoning process into fixed-size chunks, addressing the scalability problem that plagues very long LLM responses. Initial evaluations suggest that for a 1.5B-parameter model, the technique can cut training costs by more than two-thirds compared with the standard approach.

The Challenge of Long-Chain Reasoning

Complex problem-solving with LLMs typically involves generating a long series of intermediate “thinking” tokens, known as a chain of thought (CoT). Recent advances in using reinforcement learning (RL) to train models to produce longer CoTs, or LongCoTs, have significantly improved their reasoning abilities.

However, the standard approach has a critical drawback: the model’s “state,” the context its attention mechanism must process, grows with every new reasoning token, so total computational cost rises quadratically as the chain extends. Existing strategies cap how long the model thinks to control costs, but they remain bound by the quadratic nature of LongCoT.

Rather than trying to manage this growth, Mila’s solution sidesteps it entirely: the team redesigned the RL environment so that the quadratic problem never arises in the first place.

Introducing Delethink for Chunk-Based Reasoning

The researchers call their paradigm the “Markovian Thinker”: the model reasons while the size of its context window stays constant. This decouples “how long the model thinks” from “how much context it must process,” converting quadratic growth into linear compute and fixed memory demands.
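
To make the scaling difference concrete, here is a minimal back-of-the-envelope sketch (ours, not the paper’s) that proxies attention cost by the number of prior positions each new token attends to. The 8,000-token chunk size matches the figure reported in the article; the 512-token carryover is an illustrative assumption.

```python
# Back-of-the-envelope comparison: total attention cost, proxied by how
# many earlier positions each new token attends to, summed over the trace.

def longcot_cost(n_tokens: int) -> int:
    """LongCoT: token t attends to all t-1 earlier tokens,
    so the total grows roughly as n^2 / 2."""
    return sum(range(n_tokens))

def delethink_cost(n_tokens: int, chunk: int = 8_000, carryover: int = 512) -> int:
    """Chunked reasoning: the context is reset at every chunk boundary and
    reseeded with a short carryover, so each token attends to at most
    `chunk` positions and the total grows linearly in n_tokens.
    (carryover=512 is an illustrative size, not a value from the paper.)"""
    total, context = 0, 0
    for _ in range(n_tokens):
        total += context
        context += 1
        if context >= chunk:
            context = carryover  # reset: keep only the Markovian carryover
    return total

for n in (8_000, 24_000, 96_000):
    print(f"{n:>6} thinking tokens: LongCoT ~{longcot_cost(n):,}, "
          f"Delethink ~{delethink_cost(n):,}")
```

Running this shows LongCoT’s cost growing with the square of the thinking budget, while the chunked cost grows in direct proportion to it.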


Implemented through the Delethink environment, this method forces the model to reason in a series of fixed-size chunks, such as 8,000 tokens each. Within a chunk, the model reasons with standard attention. When it hits the chunk limit, the environment resets the context, issuing a new prompt that contains the original query plus a brief “carryover” from the previous chunk.

By restructuring the problem in this manner, the model learns to embed a summary of its progress, a “textual Markovian state,” in that carryover so it can pick up the reasoning in the next chunk without losing critical information from earlier steps.
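
The loop below is a minimal sketch of this chunked procedure, under stated assumptions: `generate` stands in for any text-completion callable you supply, the 8,000-token chunk size comes from the article, and the carryover size, prompt template, and stop marker are illustrative rather than the paper’s exact recipe.

```python
from typing import Callable

CHUNK_TOKENS = 8_000     # fixed reasoning window per chunk (from the article)
CARRYOVER_CHARS = 2_000  # assumed size of the textual Markovian state
MAX_CHUNKS = 3           # e.g. 3 x 8,000 = 24,000 total thinking tokens

def delethink_reason(query: str, generate: Callable[[str, int], str]) -> str:
    """Reason in fixed-size chunks, carrying only a short summary forward."""
    carryover = ""
    for _ in range(MAX_CHUNKS):
        # Every chunk restarts from the unaltered original query plus the
        # brief carryover; earlier chunks are never attended to again.
        prompt = query if not carryover else (
            f"{query}\n\n[Progress so far]\n{carryover}"
        )
        chunk = generate(prompt, CHUNK_TOKENS)
        if "FINAL ANSWER" in chunk:  # assumed completion marker
            return chunk
        # A trained Markovian Thinker learns to end each chunk with a summary
        # of its progress; here we simply keep the chunk's tail (a
        # character-level stand-in for a token-level carryover).
        carryover = chunk[-CARRYOVER_CHARS:]
    return carryover  # budget exhausted; return the latest state
```

In the actual RL setup, the environment performs these resets during training and the model is rewarded end to end, so it learns to write a useful state into the end of each chunk rather than depending on any hand-crafted summarizer.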

According to paper co-author Amirhossein Kazemnejad, the model learns during training to carry forward the task-relevant information it needs. Notably, the original input prompt is never altered, so every chunk starts from the same faithful statement of the problem.

Performance of Delethink in Action

The researchers tested their methodology by training R1-Distill-1.5B with Delethink on a dataset of competition-level math problems and evaluating it against various benchmarks. The model was trained to reason for up to 24,000 tokens in fixed 8,000-token chunks.

Compared against models trained with the standard LongCoT-RL approach, the Delethink-trained model, reasoning up to its 24,000-token budget, matched or outperformed them on math benchmarks. It also proved comparable or superior on coding tasks and PhD-level questions.

Delethink also scaled better beyond its training budget: allowed to keep thinking past 24,000 tokens at test time, it continued to improve where LongCoT-trained models plateaued. For enterprise applications, that linear compute profile translates directly into significantly lower training costs.


The efficiency carries over to inference, typically the dominant operational cost for enterprises, where the same fixed-memory, linear-compute profile applies after training. The researchers also found the approach can enhance the performance of off-the-shelf reasoning models, underscoring its compatibility with state-of-the-art LLMs.

Overall, the success of Markovian Thinking and Delethink marks a substantial step toward LLMs that can sustain extended reasoning efficiently, paving the way for next-generation capabilities in scientific discovery and complex problem-solving.

By decoupling how long a model thinks from how much context it must hold, Markovian Thinking gives developers a practical path to AI models that “think” over very long horizons, up to and beyond a million tokens.
