Speaker: Prof. Chng Eng Siong, College of Computing and Data Science, Nanyang Technological University, Singapore
Time: 10:00 AM, July 1, 2025
Venue: Meeting Room 624, College of Information Science and Engineering, Hunan University
Abstract: The advent of large language models (LLMs) has revolutionized natural language processing, demonstrating unprecedented capabilities in text understanding, generation, and contextual modeling. Recent progress has further extended LLMs to multimodal settings covering non-text modalities such as audio, video, and images.
This talk focuses on integrating the speech modality with LLMs. The research community has proposed a variety of innovative approaches, such as using discrete representations, integrating pre-trained ASR encoders into LLM decoder architectures (e.g., Qwen-Audio), multitask learning, and multimodal pretraining. The talk will highlight NTU's recent progress in the following three directions:
1. The "Hyporadise" approach: using an LLM to perform generative error correction on the N-best hypotheses produced by a traditional ASR model, yielding more accurate transcriptions;
2. An extension of "Hyporadise": incorporating acoustic and textual noise information during training to improve robustness in noisy environments, especially under low-SNR conditions;
3. LLM-based generative speech enhancement.
Towards LLM-based ASR – experiences from NTU’s Speech Lab
Abstract: The advent of large language models (LLMs) has revolutionized natural language processing, offering unprecedented capabilities in understanding, generating, and contextualizing text. Recent advances have extended these capabilities to other modalities, such as audio, video, and images.
Our focus in this talk is the integration of the speech modality into LLMs. For this task, the research community has proposed various innovative approaches, e.g., applying discrete representations, integrating pre-trained ASR encoders into existing LLM decoder architectures (e.g., Qwen-Audio), multitask learning, and multimodal pretraining. In this talk, I will discuss NTU's recent approaches toward this goal, specifically in the following three areas:
(1) "Hyporadise": applying an LLM to the N-best hypotheses generated by traditional ASR models to perform generative error correction, so as to produce more accurate transcription output (see the illustrative sketch after this list);
(2) Extending Hyporadise to include acoustic and textual noise information during training, improving robustness in noisy scenarios, even under low-SNR speech conditions;
(3) Using LLMs for generative speech enhancement.
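As a brief illustration of item (1), the minimal Python sketch below shows how an N-best list from a conventional ASR system can be packed into a text prompt so that an LLM infers the underlying transcription. The prompt wording, the query_llm callable, and the stub backend are assumptions made here for illustration only; they are not the actual Hyporadise implementation.

# Minimal sketch of Hyporadise-style generative error correction (GER):
# the N-best ASR hypotheses are formatted into a prompt and an LLM is asked
# to output the most likely true transcription.
from typing import Callable, List

def build_ger_prompt(nbest: List[str]) -> str:
    # Number the hypotheses and wrap them in a short instruction.
    numbered = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "The following are N-best hypotheses from a speech recognizer for the "
        "same utterance. Infer the most likely true transcription and output "
        "only that sentence.\n"
        f"{numbered}\nTranscription:"
    )

def correct_transcription(nbest: List[str], query_llm: Callable[[str], str]) -> str:
    # query_llm is a placeholder for whatever LLM backend is used.
    return query_llm(build_ger_prompt(nbest)).strip()

if __name__ == "__main__":
    # Toy N-best list with typical ASR confusions; the "LLM" here is a stub
    # returning a fixed string so the sketch runs end to end.
    nbest = [
        "the whether today is fine",
        "the weather to day is fine",
        "the weather today is find",
    ]
    stub_llm = lambda prompt: "the weather today is fine"
    print(correct_transcription(nbest, stub_llm))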
Speaker biography: Dr. Chng Eng Siong is a Professor with the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU), Singapore. Before joining NTU in 2003, he worked at Knowles Electronics (USA), Lernout & Hauspie (Belgium), the Institute for Infocomm Research (I2R) in Singapore, and RIKEN in Japan. He received his BEng (Hons) in electronic engineering in 1991 and his PhD in 1996, both from the University of Edinburgh, U.K., specializing in digital signal processing. His research covers machine learning, speech technology, and applications of large language models (LLMs).
Prof. Chng also serves as Chief Scientist of the AI Singapore Speech Lab. At NTU he founded the Speech and Language Technology Lab and helped establish the Alibaba ANGEL Lab and the NTU-Rolls Royce joint lab, securing research funding from industry and agencies including Alibaba, NTU-Rolls Royce, Singapore's Ministry of Defence (MINDEF), the Ministry of Education (MOE), and the Agency for Science, Technology and Research (A*STAR). He was awarded the Tan Chin Tuan Fellowship in 2007 for a research visit to Tsinghua University, and a Japan Society for the Promotion of Science (JSPS) award in 2008 for a research visit to the Tokyo Institute of Technology. He has published 2 edited books and over 200 journal and conference papers, organized the 2019 IEEE Automatic Speech Recognition and Understanding (ASRU) workshop, and served as General Co-chair of ICAICTA 2024 and IEEE SLT 2024.
Prof. Chng Eng Siong, Nanyang Technological University, Singapore
Dr. Chng Eng Siong is Professor with the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU) in Singapore. Prior to joining NTU in 2003, he worked at Knowles Electronics (USA), Lernout & Hauspie (Belgium), the Institute for Infocomm Research (I2R) in Singapore, and RIKEN in Japan. He received both a PhD and a BEng (Hons) from the University of Edinburgh, U.K., in 1996 and 1991, respectively, specializing in digital signal processing. His areas of expertise include machine learning, speech research, and applications of large language models (LLMs).
He currently serves as the Principal Investigator (PI) of the AI Singapore Speech Lab from 2023 to 2025. Throughout his career, he has secured research grants from various organizations, including the Alibaba ANGEL Lab, NTU-Rolls Royce, MINDEF, MOE, and A*STAR; these grants were awarded under the Speech and Language Technology Program (SLTP) in the School of Computer Science and Engineering (SCSE) at NTU. In recognition of his expertise, he was awarded the Tan Chin Tuan Fellowship in 2007 to conduct research at Tsinghua University, and he received a JSPS travel grant in 2008 to visit the Tokyo Institute of Technology. His publication record includes 2 edited books and over 200 journal and conference papers. He served on the organizing committee of ASRU 2019 (Singapore) and as General Co-chair of ICAICTA 2024 and IEEE SLT 2024.
Host: Zhong Xionghu (钟雄虎)
Contact: Luo Juan (罗娟)(学)    Tel: 15700748750