
LLM-based Speech-to-Speech Translation (S2ST) Systems – An Overview

CS Seminar


Abstract

This talk presents a technical overview of Large Language Model (LLM)-based Speech-to-Speech Translation (S2ST) systems, exploring how they move beyond traditional cascaded pipelines. We discuss the role of self-supervised feature extraction and the adoption of decoder-only transformer architectures for unified speech understanding and generation. A central theme is the use of Chain-of-Thought (CoT) reasoning to decompose the translation task into interpretable sub-steps. We also cover how Residual Vector Quantization (RVQ) codes enable efficient discrete audio representations with disentangled semantic and acoustic content, and how these codes serve as LLM generation targets. The talk concludes with practical considerations around data preparation, model training, and inference, highlighting how these design choices come together to support real-time, controllable, and high-fidelity speech translation.
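As background for the RVQ discussion: residual vector quantization encodes a continuous vector as a short sequence of discrete codebook indices, where each stage quantizes the residual left by the previous stage, so the token sequence is exactly the kind of discrete target an LLM can generate. The sketch below is illustrative only, with randomly initialized codebooks (real neural codecs learn them jointly with an encoder/decoder); all names are hypothetical.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: each stage picks the codebook entry
    nearest to the residual left by the previous stages."""
    residual = x.astype(float).copy()
    codes = []
    for cb in codebooks:  # cb has shape (K, D): K entries of dimension D
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes  # one discrete token per stage -- an LLM-friendly target

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each stage."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

# Toy setup: 4 stages, 16-entry codebooks, 8-dim vectors. Later stages are
# scaled down, mimicking how each stage refines a finer level of detail.
rng = np.random.default_rng(0)
D, K, stages = 8, 16, 4
codebooks = [rng.normal(size=(K, D)) * (0.5 ** s) for s in range(stages)]

x = rng.normal(size=D)
codes = rvq_encode(x, codebooks)     # e.g. four integers in [0, 16)
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction of x
```

In a codec such as those used for LLM speech generation, the first stages tend to carry coarse (semantic) content and later stages acoustic detail, which is what makes the semantic/acoustic disentanglement mentioned in the abstract possible.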

Speaker:

Spike Zhang is an applied scientist in Amazon’s Artificial General Intelligence (AGI) department. His research interests include LLM-based audio/speech generative models, Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Speech-to-Speech (S2S) systems. His work at Amazon includes the launches of Alexa’s feminine- and masculine-sounding voices in Europe, Asia, and the Americas, bespoke voice assistants with Amazon’s auto partners (e.g. BMW, Stellantis, Mini), and automatic S2S translation/dubbing solutions for internal customers. Spike received his PhD in electrical engineering from Imperial College London in 2020, with a focus on network optimisation and deep reinforcement learning.

Location:

Harrison Building 209