Speech Translation with Large Language Models: An Industrial Practice


Zhichao Huang*, Rong Ye*, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li


{zhichao.huang, yerong}@bytedance.com

Abstract. Motivated by the success of large language models (LLMs) across a wide range of tasks, we introduce LLM-ST, a speech translation model built on a pre-trained LLM. By integrating the LLM with a speech encoder and applying multi-task instruction tuning, LLM-ST produces accurate timestamped transcriptions and translations, even for long audio inputs. We also find that Chain-of-Thought (CoT) prompting benefits LLM-ST. Experiments on English and Chinese datasets show that LLM-ST performs strongly, setting a new benchmark in speech translation.

This page is for research demonstration purposes only.


Model Overview

Figure. Overview of LLM-ST. The framework consists of two LM-based components: an Audio Encoder and an Audio-Conditioned LLM.
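
To make the multi-task instruction tuning and CoT prompting mentioned in the abstract more concrete, here is a minimal sketch of how training examples for the three tasks might be assembled. Everything in it is an assumption for illustration: the `<audio>` placeholder, the instruction wording, the `build_example` helper, and the reading of CoT as "transcribe first, then translate" are hypothetical and are not taken from the LLM-ST implementation.

```python
from dataclasses import dataclass

# Hypothetical marker for the position where audio-encoder features
# would be spliced into the LLM input sequence.
AUDIO_PLACEHOLDER = "<audio>"

# Hypothetical instruction templates, one per task (not from the paper).
TEMPLATES = {
    "asr": "Transcribe the speech into {src} text.",
    "st": "Translate the speech into {tgt} text.",
    "cot_st": "First transcribe the speech into {src} text, then translate it into {tgt}.",
}


@dataclass
class Example:
    prompt: str  # text the LLM is conditioned on (plus audio features)
    target: str  # text the LLM is trained to generate


def build_example(task: str, src: str, tgt: str,
                  transcript: str, translation: str) -> Example:
    """Build one instruction-tuning example for the given task."""
    instruction = TEMPLATES[task].format(src=src, tgt=tgt)
    prompt = f"{AUDIO_PLACEHOLDER}\n{instruction}"
    if task == "asr":
        target = transcript
    elif task == "st":
        target = translation
    else:
        # CoT-style target: transcript first, translation second.
        target = f"{transcript}\n{translation}"
    return Example(prompt=prompt, target=target)


if __name__ == "__main__":
    ex = build_example("cot_st", "English", "Chinese", "hello", "你好")
    print(ex.prompt)
    print(ex.target)
```

Under this assumed setup, the same audio can contribute several examples, one per task, and the CoT variant lets the model condition the translation on its own transcript.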

Examples of Speech-to-text Recognition and Translation

English-to-Chinese Translation

Chinese-to-English Translation

Examples of Speech Translations

Prosody helps the translation

English Speech ASR Cascaded System LLM-ST

Help! Help! ahhhhh!

帮助帮助。 啊。


One finger or two fingers?



You guys like dumplings?



To Vanessa, Natalia, Bianca, Capri, my wife and I will keep you close in our hearts and our prayers.

致凡妮莎·娜塔莉亚·比安卡·卡普里。 我和我的妻子会在我们的心中,在我们的祈祷中把你放在身边。


Context helps to resolve ambiguity

English Speech ASR Cascaded System LLM-ST

We're actually foundation, um, owned, so we have less of the quarterly pressure.

我们实际上建立了自己的基础。 所以我们的季度压力较小。


Stocks are not as complicated as those Wall Street firms make you believe.



When it's tough, will you give up, or will you be relentless?



Will you be a cynic, or will you be a builder?

你会成为一个愤世嫉俗的人 还是一个建筑工人?


All right, we're back in Damascus right now, after a two hour drive from Homs.



Yeah, I looked up before that, as long as you get the official taxi stand, then it should be about 100 RMB to the center of Beijing.

是的,在那之前我抬头看了看。 只要你有官方的出租车站,那么到北京市中心应该大约需要100元人民币。

是的,我之前查过,只要你到官方的出租车停靠站, 到北京市中心大约需要100元人民币。

Better Translations for code-switched speech

English Speech ASR Cascaded System LLM-ST

But in Taiwan, where I learned about Sanbeiji, Thai basil is always part of the dish.


但在台湾,我了解到三杯鸡, 泰国罗勒总是这道菜的一部分。

I tried that. Didn't work at all for me, so stick with Thai basil and enjoy your delicious Sanbeiji.

我试过了。那对我一点用都没有。 所以坚持用泰式罗勒,享受你美味的桑白吉。


Fu Qi Fei Pian literally means slices of husband and wife's lungs. Eh.

Uchife pan字面意思是丈夫和妻子的肺部切片。


It got this name because when the dish first originated, it used "Fei" or unwanted meats from cows, like cow tongues and lungs, and the dish was made best by a husband and wife duo.



Better Translations for named entities

English Speech ASR Cascaded System LLM-ST

And these are some of the basic concepts for the Metaverse.



The pop star and brains behind Fenty Beauty is worth an estimated 1.7 billion dollars on the 2022 list.


这位流行歌星和 Fenty Beauty 背后的智囊团在 2022 年的榜单上的身价估计为 17 亿美元。

After bursting onto the fashion scene with the brand collective Vetements, Demna took the helm of Balenciaga in 2015 as part of a radical modernization, including the Triple S, which set a fire the ugly sneaker trend, and helped Balenciaga surpass a billion dollars in revenue in 2019, spawning a thousand copycats.

在凭借品牌闯入时尚界后 集体兽医购物中心,Demna于2015年掌舵巴黎世家,作为激进现代化的一部分, 包括Triple s, 它点燃了丑陋的运动鞋潮流,并帮助巴黎世家在2019年的收入超过10亿美元,催生了1000名模仿者。

在以品牌集合 Vetements 闯入时尚界后,德姆纳于 2015 年执掌巴黎世家, 作为激进现代化的一部分, 包括 Triple S, 它点燃了丑陋运动鞋的潮流, 并帮助巴黎世家在 2019 年的收入超过了 10 亿美元, 催生出 1000 个模仿者。