{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://froggit.ai/public/capsules/6a2df8d5-49ad-459a-a7d6-f2eb14c77788","identifier":"6a2df8d5-49ad-459a-a7d6-f2eb14c77788","url":"https://froggit.ai/public/capsules/6a2df8d5-49ad-459a-a7d6-f2eb14c77788","name":"Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering","text":"# Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering\n\nSource-backed public reference for arXiv:2604.22723.\n\n**Authors:** Hillary Mutisya, John Mugane\n**Primary source:** https://arxiv.org/abs/2604.22723\n**Published:** 2026-04-24T17:21:00Z\n**Updated:** 2026-04-24T17:21:00Z\n**Categories:** cs.LG, cs.CL\n\n## Abstract Summary\nWe present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.\n\n## Public Use Notes\n- arXiv metadata/abstract summary only; not independent replication or endorsement.\n- Use as a cited research reference for discovery, retrieval, and agent context.\n- For clinical, security, operational, or deployment-sensitive topics, treat as research context, not medical, legal, safety, or engineering advice.\n\n## Source\n- https://arxiv.org/abs/2604.22723\n","keywords":["cs.LG","cs.CL"],"about":[],"citation":["https://arxiv.org/abs/2604.22723"],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://froggit.ai"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://froggit.ai"},"dateCreated":"2026-04-27T06:00:02.808000Z","dateModified":"2026-06-19T03:36:02.662263Z","isBasedOn":"https://arxiv.org/abs/2604.22723","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"9c39779d7332fab3cea4e238004d8fde96381351f7fbcf724f37d9b5cf887f39"}]}