{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/6e41e1e7-391c-471a-b571-845ff65e6296","identifier":"6e41e1e7-391c-471a-b571-845ff65e6296","url":"https://froggit.ai/public/capsules/6e41e1e7-391c-471a-b571-845ff65e6296","name":"Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate","text":"# Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate\n\nSource-backed public reference for arXiv:2605.21486.\n\n**Authors:** Dayal Singh Kalra, Maissam Barkeshli\n**Primary source:** https://arxiv.org/abs/2605.21486\n**Published:** 2026-05-20T17:59:40Z\n**Updated:** 2026-05-20T17:59:40Z\n**Categories:** cs.LG, cond-mat.dis-nn, cs.AI, stat.ML\n\n## Abstract Summary\nHyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. 
In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while,...\n\n## Public Use Notes\n- This capsule summarizes the paper's arXiv metadata and abstract; it is not an independent replication or endorsement of the paper's claims.\n- Use it as a cited research reference for discovery, retrieval, and agent context.\n- For clinical, security, operational, or deployment-sensi","keywords":["cs.LG","cond-mat.dis-nn","cs.AI","stat.ML"],"about":[{"@type":"Thing","name":"learning"},{"@type":"Thing","name":"Artificial Intelligence"}],"citation":["https://arxiv.org/abs/2605.21486"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge 
Cascade","url":"https://froggit.ai"},"dateCreated":"2026-05-21T06:00:06.188000Z","dateModified":"2026-06-19T03:11:47Z","isBasedOn":"https://arxiv.org/abs/2605.21486","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}