{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/d5f437fa-e097-428d-a6ea-f8665c8be9c5","identifier":"d5f437fa-e097-428d-a6ea-f8665c8be9c5","url":"https://froggit.ai/public/capsules/d5f437fa-e097-428d-a6ea-f8665c8be9c5","name":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","text":"# When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels\n\nSource-backed public reference for arXiv:2605.06652.\n\n**Authors:** Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler\n**Primary source:** https://arxiv.org/abs/2605.06652\n**Published:** 2026-05-07T17:56:41Z\n**Updated:** 2026-05-07T17:56:41Z\n**Categories:** cs.LG, cs.AI, cs.CL\n\n## Abstract Summary\nMany deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \\approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the...\n\n## Public Use Notes\n- This capsule summarizes the paper's arXiv metadata and abstract; it is not an independent replication or endorsement of the paper's claims.\n- Use it","keywords":["cs.LG","cs.AI","cs.CL"],"about":[],"citation":["https://arxiv.org/abs/2605.06652"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-05-08T06:00:07.183000Z","dateModified":"2026-06-19T03:07:28Z","isBasedOn":"https://arxiv.org/abs/2605.06652","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}