RL in the Pre-train Space: Why Training on P(y) Beats Training on P(y|x)

RLVR (Reinforcement Learning with Verifiable Rewards) has been the go-to recipe for boosting LLM reasoning since DeepSeek-R1 made it mainstream. The formula is simple: give the model math problems, check the answers, reward correct reasoning chains. It works. Models get measurably better at math, coding, and logical tasks.

But there's a ceiling nobody talks about.

RLVR optimizes P(y|x), the conditional distribution of outputs given inputs. Every gradient update is anchored to a specific prompt. The model learns to produce better responses for questions it's asked. What it can't do is fundamentally restructure its reasoning capabilities. If the base model's pre-training distribution P(y) doesn't contain the reasoning patterns needed to solve a problem, no amount of conditional optimization will put them there.

A paper from today's arXiv, PreRL by Wen et al., proposes something deceptively simple: do RL directly on P(y). Not "generate better answers to questions." Just "generate better text, period." And the results are striking.

The Core Idea

Standard RLVR works like this:

Input: x (math problem)
Output: y ~ P(y|x) (solution attempt)
Reward: r(y) (is the answer correct?)
Update: maximize E[r(y)] over P(y|x)

PreRL does this instead:

No input. Just generate.
Output: y ~ P(y) (unconditioned generation)
Reward: r(y) (does the generated text contain correct reasoning?)
Update: maximize E[r(y)] over P(y)
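Written out as objectives, the two recipes differ only in what the policy is conditioned on. The block below is a sketch in standard policy-gradient (REINFORCE) notation, not the paper's own formulation: the prompt distribution \mathcal{D} and the subscripted names J_RLVR and J_PreRL are labels introduced here for illustration, and r(x, y) just makes the reward's dependence on the problem explicit.

```latex
\begin{aligned}
J_{\text{RLVR}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D},\; y \sim P_\theta(\cdot \mid x)}\big[r(x, y)\big],
& \nabla_\theta J_{\text{RLVR}} &= \mathbb{E}\big[r(x, y)\, \nabla_\theta \log P_\theta(y \mid x)\big], \\
J_{\text{PreRL}}(\theta) &= \mathbb{E}_{y \sim P_\theta(\cdot)}\big[r(y)\big],
& \nabla_\theta J_{\text{PreRL}} &= \mathbb{E}\big[r(y)\, \nabla_\theta \log P_\theta(y)\big].
\end{aligned}
```

The only change between the two rows is the score term: log P(y|x) in the first, log P(y) in the second.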

The theoretical justification is surprisingly clean. The authors prove strong gradient alignment between log P(y) and log P(y|x). In plain English: improving the model's general ability to produce good reasoning text also improves its ability to answer specific questions. The unconditional and conditional objectives push in the same direction.
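To get a feel for what "pushing in the same direction" means, here is a toy numeric check. It is a sketch under strong assumptions that are mine, not the paper's: a tabular softmax model with one row of logits per prompt, a fixed prompt prior, and P(y) defined as the marginal over that prior. A real LLM shares parameters across prompts and defines P(y) by generating from an empty context, so treat this as intuition only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_cond(theta, x, y):
    """Gradient of log P(y|x) w.r.t. every logit theta[x'][j], flattened."""
    grad = []
    for xp, row in enumerate(theta):
        probs = softmax(row)
        for j, p in enumerate(probs):
            # Only the row belonging to prompt x receives a gradient.
            grad.append(((1.0 if j == y else 0.0) - p) if xp == x else 0.0)
    return grad

def grad_log_marginal(theta, prior, y):
    """Gradient of log P(y), where P(y) = sum_x prior[x] * P(y|x)."""
    cond = [softmax(row) for row in theta]
    p_y = sum(pr * c[y] for pr, c in zip(prior, cond))
    grad = []
    for pr, probs in zip(prior, cond):
        posterior = pr * probs[y] / p_y  # P(x|y) by Bayes' rule
        for j, p in enumerate(probs):
            grad.append(posterior * ((1.0 if j == y else 0.0) - p))
    return grad

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy values: 2 prompts, 3 possible outputs.
theta = [[0.2, -0.4, 1.0], [1.5, 0.3, -0.7]]
prior = [0.6, 0.4]
for x in range(2):
    c = cosine(grad_log_marginal(theta, prior, y=2), grad_log_cond(theta, x, y=2))
    print(f"x={x}: cosine = {c:.3f}")
```

In this toy the cosine is positive for every prompt, because the marginal gradient is a posterior-weighted sum of the per-prompt conditional gradients. Whether the same holds at scale is what the paper's alignment result addresses.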

But PreRL isn't just a theoretical curiosity. It opens the door to a mechanism that standard RLVR can't access.

Negative Sample Reinforcement: The Real Breakthrough

The paper's most important contribution isn't PreRL itself. It's the discovery of Negative Sample Reinforcement (NSR) within the pre-train space.

NSR-PreRL works by aggressively pruning incorrect reasoning patterns from P(y). Instead of rewarding good outputs (positive RL), it penalizes bad ones in the unconditional space. The model learns what not to generate. And the effects are dramatic:

14.89x increase in transition thoughts: the model generates far more "wait, let me reconsider" and "alternatively" style pivots
6.54x increase in reflection thoughts: more "this doesn't seem right" and "let me verify this step" patterns

Think about what this means. The model isn't just getting better at specific problem types. It's developing metacognitive habits. It questions itself more. It transitions between reasoning strategies more fluidly. These are emergent behaviors from negative reinforcement in the pre-train space, not from being told to "think step by step" in a system prompt.
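If you want to track similar statistics on your own model outputs, a crude phrase counter is enough to get started. The sketch below uses only the example phrases quoted above as markers. The paper's actual classification rules are not reproduced here, so the marker lists and function names are assumptions.

```python
import re

# Marker phrases taken from the examples quoted in this post; the paper's
# actual classification rules are not reproduced here.
TRANSITION_MARKERS = ["wait, let me reconsider", "alternatively"]
REFLECTION_MARKERS = ["this doesn't seem right", "let me verify"]

def count_markers(text, markers):
    """Count case-insensitive, non-overlapping occurrences of each marker."""
    lowered = text.lower()
    return sum(len(re.findall(re.escape(m), lowered)) for m in markers)

def thought_ratio(before_text, after_text, markers):
    """Ratio of marker counts after vs. before training (None if undefined)."""
    before = count_markers(before_text, markers)
    after = count_markers(after_text, markers)
    return after / before if before else None

sample = ("The sum is 12. Wait, let me reconsider: 5 + 8 is 13. "
          "Alternatively, count up from 8. Let me verify this step.")
print(count_markers(sample, TRANSITION_MARKERS))  # 2
print(count_markers(sample, REFLECTION_MARKERS))  # 1
```

Running thought_ratio over a fixed set of generations from the model before and after training gives a multiplier comparable in spirit to the ones reported above, though phrase matching will miss paraphrases.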

Dual Space RL: The Full Recipe

The paper's final contribution is DSRL (Dual Space RL), which combines both approaches in a two-phase strategy the authors call "Policy Reincarnation":

Phase 1: NSR-PreRL. Prune the incorrect reasoning subspace in P(y). The model develops strong metacognitive behaviors and a broad capacity for self-correction. This phase operates on unconditional generation, so no task-specific prompts are needed.

Phase 2: Standard RLVR. Now do the usual conditional optimization on P(y|x). But the model entering Phase 2 is fundamentally different from one that starts RLVR cold. Its reasoning "horizon" has been expanded. The correct solution paths exist in P(y) for RLVR to find and amplify.

The analogy is a sculptor. Phase 1 removes the wrong marble (NSR chips away bad reasoning patterns). Phase 2 does the fine detail work (RLVR polishes task-specific responses). You can't do Phase 2 well if Phase 1 hasn't opened up the space.

# Simplified DSRL training loop
# (pseudocode: model, optimizer, log_prob, reasoning_verifier,
#  sample_problem, and check_answer are placeholders)

# Phase 1: NSR-PreRL (expand reasoning horizon)
for step in range(nsr_steps):
    # Generate an unconditional sample
    y = model.generate()  # No prompt, just P(y)

    # Compute reward (does this text show good reasoning?)
    reward = reasoning_verifier(y)

    # Focus on NEGATIVE reinforcement: only update on bad samples
    if reward < threshold:
        optimizer.zero_grad()
        # Minimizing +alpha * log P(y) lowers the probability of y
        loss = alpha * log_prob(y)
        loss.backward()
        optimizer.step()

# Phase 2: Standard RLVR (fine-grained optimization)
for step in range(rlvr_steps):
    x = sample_problem()
    y = model.generate(x)  # Conditional P(y|x)
    reward = check_answer(y, x)

    # Standard policy gradient: minimizing -reward * log P(y|x)
    optimizer.zero_grad()
    loss = -reward * log_prob(y, prompt=x)
    loss.backward()
    optimizer.step()
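The Phase 1 idea can be run end to end on a toy problem. The sketch below is not the paper's algorithm: it assumes a tabular softmax policy over four candidate outputs and applies the exact expected negative-sample update instead of sampled rollouts, so the result is deterministic. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_expected_step(logits, is_correct, lr=0.5):
    """One expected NSR update on a tabular softmax policy.

    For each incorrect output y, step downhill on log p(y), weighted by the
    probability p(y) of sampling it. Uses the softmax identity
    d log p(y) / d logit_j = 1[j == y] - p_j.
    """
    probs = softmax(logits)
    direction = [0.0] * len(logits)  # expected score of the negative samples
    for y, p_y in enumerate(probs):
        if is_correct[y]:
            continue  # NSR ignores positive samples entirely
        for j, p_j in enumerate(probs):
            direction[j] += p_y * ((1.0 if j == y else 0.0) - p_j)
    return [v - lr * d for v, d in zip(logits, direction)]

# Hypothetical toy: 4 candidate outputs, only index 3 is correct and starts rare.
is_correct = [False, False, False, True]
logits = [2.0, 1.0, 0.5, -1.0]
p_before = softmax(logits)[3]
for _ in range(200):
    logits = nsr_expected_step(logits, is_correct)
p_after = softmax(logits)[3]
print(f"P(correct) before: {p_before:.3f}, after: {p_after:.3f}")
```

The correct output is never rewarded directly, yet its probability rises because mass removed from wrong outputs has to go somewhere. In this tabular setting the expected NSR update coincides with gradient ascent on the probability of the correct output. With sampled rollouts, individual steps are noisier and can temporarily move mass toward another wrong output.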

Why This Challenges Current Practice

The entire RLVR ecosystem is built on the assumption that you optimize P(y|x). Every training framework, every reward model, every benchmark is designed around conditional generation. PreRL suggests this is leaving significant capability on the table.

Consider the implications:

1. You don't need curated question-answer pairs for Phase 1. NSR-PreRL operates on unconditional text. You need a reward signal for "is this good reasoning," but you don't need specific prompts. This dramatically reduces the data curation bottleneck.

2. The reasoning improvements are transferable. Because NSR-PreRL modifies P(y) directly, the metacognitive behaviors it creates appear across all task types. A model that learns to self-correct in math also self-corrects in code, in planning, in logical puzzles. Standard RLVR improvements are often surprisingly task-specific.

3. It explains why bigger RLVR runs hit diminishing returns. If P(y) doesn't contain the reasoning patterns needed, P(y|x) optimization is trying to find something that isn't there. You're searching a space that doesn't include the answer. DSRL expands the space first.

4. Negative reinforcement is underexplored in LLM training. Most RL approaches for LLMs focus on positive reward: make good outputs more likely. NSR shows that making bad outputs less likely in the unconditional space is a more powerful driver of reasoning improvement. The asymmetry is unintuitive but empirically clear.
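Point 3 can be made concrete with a little arithmetic. The sketch below assumes each sampled response is independently correct with probability p, which is a simplification and not a claim from the paper. Both function names are mine.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def expected_samples_until_success(p):
    """Mean of a geometric distribution; infinite when p == 0."""
    return float("inf") if p == 0 else 1.0 / p

for p in (0.0, 1e-6, 1e-2):
    print(p, pass_at_k(p, 64), expected_samples_until_success(p))
```

When p is zero, no number of rollouts ever produces a positive reward, so positive-only RL has nothing to amplify. When p is merely tiny, the expected rollouts per rewarded sample grows as 1/p, which is one way to read the diminishing returns described above.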

The Gradient Alignment Result

The theoretical backbone of the paper deserves attention. The authors show that:

∇_θ log P(y) ≈ k · ∇_θ log P(y|x) + noise

where k is a positive scaling constant. This means that optimizing the marginal objective and optimizing the conditional objective push the model parameters in approximately the same direction. The "noise" term is bounded and diminishes as the model's capabilities improve.

This isn't obvious. You might expect that optimizing P(y) would be too diffuse, that without task-specific anchoring, the gradients would point in random directions. The alignment result says no: improving general text quality reliably improves conditional response quality.

The intuition is that reasoning patterns in P(y) are a shared substrate. Whether the model is answering a math question or generating an essay, the underlying mechanics of "consider multiple approaches," "check intermediate steps," and "revise when stuck" are the same neural pathways. Strengthening those pathways in P(y) strengthens them everywhere.
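One way to see why some alignment is plausible is a standard identity. It holds under the assumption that P(y) is treated as the marginal of P(y|x) over a fixed prompt distribution q(x) that does not depend on θ. This is not the paper's proof, and it says nothing about the constant k or the bound on the noise term; q is notation introduced here.

```latex
\begin{aligned}
P_\theta(y) &= \sum_x q(x)\, P_\theta(y \mid x), \\
\nabla_\theta \log P_\theta(y)
  &= \frac{1}{P_\theta(y)} \sum_x q(x)\, P_\theta(y \mid x)\, \nabla_\theta \log P_\theta(y \mid x) \\
  &= \sum_x P_\theta(x \mid y)\, \nabla_\theta \log P_\theta(y \mid x)
   = \mathbb{E}_{x \sim P_\theta(\cdot \mid y)}\big[\nabla_\theta \log P_\theta(y \mid x)\big].
\end{aligned}
```

Under that assumption, the marginal gradient is a posterior-weighted average of conditional gradients: the prompts most likely to have produced y contribute most to the direction of the update.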

Practical Implications for the Field

For researchers working on reasoning improvement, DSRL introduces a new axis to explore. Before, the question was "what reward signal should I use for RLVR?" Now there's a prior question: "have I expanded the reasoning space in P(y) first?"

For practitioners and startups building on open models, this is actionable. If you're fine-tuning Llama or Qwen for specific reasoning tasks and hitting a performance wall, the problem might not be your reward model or your training data. The base model's P(y) might simply not contain the reasoning patterns you need. NSR-PreRL as a preprocessing step could break through that ceiling.

For the broader scaling debate, this adds nuance. We've been asking "how much compute should go into pre-training vs. post-training?" PreRL suggests the dichotomy is false. You can do RL-style optimization in the pre-training distribution. The line between pre-training and post-training is blurrier than we thought.

What's Still Open

The paper evaluates DSRL on reasoning benchmarks (math, logic, code), which have clean verifiable rewards. Extending this to open-ended tasks where "good reasoning" is harder to verify remains a challenge. You need a reliable signal for NSR-PreRL's negative reinforcement, and for creative or subjective tasks, that signal doesn't exist in automated form.

There's also the compute question. NSR-PreRL adds a training phase. For large frontier models, that's a significant cost. The paper shows consistent improvements, but the cost-benefit analysis at 70B+ scale isn't clear. It might be more efficient for smaller models that need to punch above their weight.

And the "Policy Reincarnation" framing, while evocative, raises questions about how to calibrate the transition between phases. Too much NSR and you over-prune, losing the model's creative exploration. Too little and you haven't expanded the reasoning horizon enough. The paper provides guidelines but no automated tuning.

The Bottom Line

PreRL and DSRL represent a genuine shift in how we think about improving LLM reasoning. The insight that P(y) optimization is a viable and complementary path to P(y|x) optimization opens up new training strategies. And the discovery that negative reinforcement in the pre-train space creates emergent metacognitive behaviors (self-correction, strategy switching, intermediate verification) suggests we've been leaving the most powerful lever untouched.

The next time your RLVR run plateaus, ask yourself: is the model stuck because it can't find the right answer in its output space? Or because the right answer was never in its output space to begin with?

Paper: "From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space" (arXiv:2604.14142)

Manas Vardhan builds open-source tools for production AI agents, including agent-sentry and llm-cost-guardian. Find all his work on GitHub.
