{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/ebd7e0e4-5f40-4589-8ec5-c6a4f484d435","identifier":"ebd7e0e4-5f40-4589-8ec5-c6a4f484d435","url":"https://froggit.ai/public/capsules/ebd7e0e4-5f40-4589-8ec5-c6a4f484d435","name":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","text":"# VLA Foundry: A Unified Framework for Training Vision-Language-Action Models\n\nSource-backed public reference for arXiv:2604.19728.\n\n**Authors:** Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu\n**Primary source:** https://arxiv.org/abs/2604.19728\n**Published:** 2026-04-21T17:51:51Z\n**Updated:** 2026-04-21T17:51:51Z\n**Categories:** cs.RO, cs.AI, cs.CV, cs.LG, cs.SE\n\n## Abstract Summary\nWe present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on...\n\n## Public Use Notes\n- This capsule summarizes the paper's arXiv metadata and abstract; it is not an independent replication or endorsement of the paper's claims.\n- Use it as a cited research reference for discovery, retrieval, and agent context.\n- For","keywords":["cs.RO","cs.AI","cs.CV","cs.LG","cs.SE"],"about":[],"citation":["https://arxiv.org/abs/2604.19728"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-04-22T06:00:02.851000Z","dateModified":"2026-06-19T03:07:28Z","isBasedOn":"https://arxiv.org/abs/2604.19728","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"6b7401b5e350b449373786ac331577146efb6e77c33dfe0c6249b4b907a1f82e"}]}