
LLM-based Speech-to-Speech Translation (S2ST) Systems – An Overview

CS Seminar


Abstract

This talk presents a technical overview of Large Language Model (LLM)-based Speech-to-Speech Translation (S2ST) systems, exploring how they move beyond traditional cascaded pipelines. We discuss the role of self-supervised feature extraction and the adoption of decoder-only transformer architectures for unified speech understanding and generation. A central theme is the use of Chain-of-Thought (CoT) reasoning to decompose the translation task into interpretable sub-steps. We also cover how Residual Vector Quantization (RVQ) codes enable efficient discrete audio representations with disentangled semantic and acoustic content, and how these codes serve as LLM generation targets. The talk concludes with practical considerations around data preparation, model training, and inference, highlighting how these design choices come together to support real-time, controllable, and high-fidelity speech translation.
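As background for the RVQ discussion: residual vector quantization encodes a continuous vector as a short sequence of discrete codebook indices, where each stage quantizes the residual left by the previous stage, so the token sequence is exactly the kind of discrete target an LLM can generate. The sketch below is illustrative only, with randomly initialized codebooks (real neural codecs learn them jointly with an encoder/decoder); all names are hypothetical.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: each stage picks the codebook entry
    nearest to the residual left by the previous stages."""
    residual = x.astype(float).copy()
    codes = []
    for cb in codebooks:  # cb has shape (K, D): K entries of dimension D
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes  # one discrete token per stage -- an LLM-friendly target

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each stage."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

# Toy setup: 4 stages, 16-entry codebooks, 8-dim vectors. Later stages are
# scaled down, mimicking how each stage refines a finer level of detail.
rng = np.random.default_rng(0)
D, K, stages = 8, 16, 4
codebooks = [rng.normal(size=(K, D)) * (0.5 ** s) for s in range(stages)]

x = rng.normal(size=D)
codes = rvq_encode(x, codebooks)     # e.g. four integers in [0, 16)
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction of x
```

In a codec such as those used for LLM speech generation, the first stages tend to carry coarse (semantic) content and later stages acoustic detail, which is what makes the semantic/acoustic disentanglement mentioned in the abstract possible.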

Speaker:

Spike Zhang is an applied scientist in Amazon’s Artificial General Intelligence (AGI) department. His research interests include LLM-based audio/speech generative models, Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Speech-to-Speech (S2S) systems. His work at Amazon includes the launches of Alexa’s feminine- and masculine-sounding voices in Europe, Asia, and the Americas, bespoke voice assistants with Amazon’s auto partners (e.g. BMW, Stellantis, Mini), and automatic S2S translation/dubbing solutions for internal customers. Spike received his PhD in electrical engineering from Imperial College London in 2020, with a focus on network optimisation and deep reinforcement learning.

Location:

Harrison Building 209