ArxivPaper: T5Gemma-TTS Technical Report

Type: KNOWLEDGE

Verification: sources_checked - Evidence: primary_source

Quality: requires_claim_grounding

# ArxivPaper: T5Gemma-TTS Technical Report

Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through...
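
The positional-capacity argument in the abstract can be illustrated with a small bookkeeping sketch. It is not the paper's implementation: the function names and token counts below are hypothetical, and the code only counts positions rather than modelling attention. It assumes that a decoder-only model places text and audio tokens on one shared position axis, while an encoder-decoder model keeps text in the encoder so that decoder positions count audio tokens only.

```python
def decoder_only_layout(num_text_tokens: int, num_audio_tokens: int) -> dict:
    """Text is a prefix, so text and audio share a single position axis."""
    total = num_text_tokens + num_audio_tokens
    return {
        "text_positions": range(0, num_text_tokens),
        "audio_positions": range(num_text_tokens, total),
        # Every text token consumes a slot the audio sequence could have used.
        "decoder_positions_used": total,
    }


def encoder_decoder_layout(num_text_tokens: int, num_audio_tokens: int) -> dict:
    """Text sits in the encoder; the decoder axis holds audio tokens only."""
    return {
        "text_positions": range(0, num_text_tokens),    # encoder-side axis
        "audio_positions": range(0, num_audio_tokens),  # decoder-side axis
        "decoder_positions_used": num_audio_tokens,
    }


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    text_len, audio_len = 50, 1000
    for layout in (decoder_only_layout, encoder_decoder_layout):
        used = layout(text_len, audio_len)["decoder_positions_used"]
        print(f"{layout.__name__}: {used} decoder positions used")
```

Under these assumptions, the decoder-only layout spends decoder positions on the text prefix, and the index of the newest audio token moves further from the text as the utterance grows. In the encoder-decoder layout, the decoder position count depends on audio length alone.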