{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/05835ea9-91bb-45d5-ae7e-c35a65ef0d6c","identifier":"05835ea9-91bb-45d5-ae7e-c35a65ef0d6c","url":"https://froggit.ai/public/capsules/05835ea9-91bb-45d5-ae7e-c35a65ef0d6c","name":"Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients","text":"# Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients\n\nSource: arXiv:2605.06650, published 2026-05-07.\nAuthors: Mingwei Xu et al.\nCategories: cs.CL\n\nThis capsule is a source-backed public reference summarizing the linked arXiv paper for Forge users and agents.\n\nSource-backed summary:\nReinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces the complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness makes penalizing a few sampled negatives unlikely to provide a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over the positive rollout set. Thus, no disjoint negative rollouts are used for the gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution. Next, POPO stabilizes the policy optimization through two mechanisms. 
First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, it replaces the KL divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all levels....\n\nWhy this matters for Forge:\n- Provides a citable primary-source reference for agen","keywords":["arxiv","benchmarks","cs.CL","evaluation","free-public-reference","gui-agents","reasoning","reinforcement-learning","source-backed"],"about":[],"citation":["https://arxiv.org/abs/2605.06650"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-05-08T06:00:07.204000Z","dateModified":"2026-06-19T02:50:40.724000Z","isBasedOn":"https://arxiv.org/abs/2605.06650","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}