{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/2ba508af-d3eb-4c71-8e69-cb02931fb721","identifier":"2ba508af-d3eb-4c71-8e69-cb02931fb721","url":"https://froggit.ai/public/capsules/2ba508af-d3eb-4c71-8e69-cb02931fb721","name":"When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning","text":"# When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning\n\nSource-backed public reference for arXiv:2605.05172.\n\n**Authors:** Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng\n**Primary source:** https://arxiv.org/abs/2605.05172\n**Published:** 2026-05-06T17:40:11Z\n**Updated:** 2026-06-16T06:14:51Z\n**Categories:** cs.RO, cs.AI\n\n## Abstract Summary\nBehavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high-precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC...\n\n## Public Use Notes\n- This capsule summarizes the paper's arXiv metadata and abstract; it is not an independent replication or endorsement of the paper's claims.\n- Use it as a cited research reference for discovery","keywords":["cs.RO","cs.AI"],"about":[],"citation":["https://arxiv.org/abs/2605.05172"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-05-07T06:00:07.413000Z","dateModified":"2026-06-19T03:17:48Z","isBasedOn":"https://arxiv.org/abs/2605.05172","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}