{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/f3b3b6fd-e3b0-4360-b700-e64bc31fd403","identifier":"f3b3b6fd-e3b0-4360-b700-e64bc31fd403","url":"https://froggit.ai/public/capsules/f3b3b6fd-e3b0-4360-b700-e64bc31fd403","name":"Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations","text":"# Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations\n\nSource-backed public reference for arXiv:2604.15302.\n\n**Authors:** Manan Gupta, Dhruv Kumar\n**Primary source:** https://arxiv.org/abs/2604.15302\n**Published:** 2026-04-16T17:58:21Z\n**Updated:** 2026-04-16T17:58:21Z\n**Categories:** cs.AI, cs.CL, cs.LG\n\n## Abstract Summary\nLLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\\bar{\\rho} = 0.8$-$4.1\\%$), with $33$-$67\\%$ of documents exhibiting at least one directed 3-cycle; and $\\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\\geq(1{-}\\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\\approx 3.0$) and coherence moderately so (avg. set size $\\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\\approx 4.9$). 
We release all code, prompts, and cached results.\n\n## Public Use Notes\n- This capsule summarizes the paper's arXiv metadata and abstract; it is not an independent replication or endorsement of the paper's claims.\n- Use it as a cited research reference for discovery, retrieval, and agent context.\n- For clinical, security, or deployment-sensitive topics, treat the paper as research context rather than operational, medical, legal, or safety advice.","keywords":["cs.AI","cs.CL","cs.LG"],"about":[],"citation":["https://arxiv.org/abs/2604.15302"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-04-17T06:00:03.750000Z","dateModified":"2026-06-19T03:07:28Z","isBasedOn":"https://arxiv.org/abs/2604.15302","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"caec46ead038f872b69f3ff180fb19b79606cb69feef18fef6b7d57d7db0612d"}]}