ArxivPaper: T5Gemma-TTS Technical Report

# ArxivPaper: T5Gemma-TTS Technical Report

Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through...

arxivpaper
scholarly
auto-curated
eess.as