ToB Enterprise Services App Marketplace: a ToB review and business social-networking industry platform

Title: Large Models, Part 27 - A Brief Look at Speech Recognition with Whisper

Author: 自由的羽毛    Time: 2024-9-19 22:14
Title: Large Models, Part 27 - A Brief Look at Speech Recognition with Whisper
Introduction to Whisper

Whisper is a multilingual speech recognition model open-sourced by OpenAI in September 2022. It currently supports 99 languages and is among the best-performing open-source multilingual ASR models. The first release was pretrained on 680,000 hours of labeled audio, while the training data for large-v3 exceeds 5 million hours. The paper does not disclose the detailed sources of the corpus; presumably some copyrighted data was crawled. The Hugging Face model card notes that the model generalizes strongly, handling new languages and tasks without task-specific training, and that fine-tuning can further improve recognition performance for a specific language.
The open-source Whisper releases are as follows:
| Size     | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed | Layers | Width | Heads |
|----------|------------|--------------------|--------------------|---------------|----------------|--------|-------|-------|
| tiny     | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           | 4      | 384   | 6     |
| base     | 74 M       | base.en            | base               | ~1 GB         | ~16x           | 6      | 512   | 8     |
| small    | 244 M      | small.en           | small              | ~2 GB         | ~6x            | 12     | 768   | 12    |
| medium   | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            | 24     | 1024  | 16    |
| large    | 1550 M     | N/A                | large (2022.09)    | ~10 GB        | 1x             | 32     | 1280  | 20    |
| large-v2 | 1550 M     | N/A                | large-v2 (2022.12) | ~10 GB        | 1x             | 32     | 1280  | 20    |
| large-v3 | 1550 M     | N/A                | large-v3 (2023.11) | ~10 GB        | 1x             | 32     | 1280  | 20    |

Only large-v3 was open-sourced at the end of 2023. Its encoder and decoder are identical to those of large and large-v2, with two differences:
  1. The input spectrogram uses 128 Mel frequency bins instead of the 80 bins used by earlier models.
  2. A new language token was added for Cantonese.

That said, v3 is not necessarily better than v2 in every case. For example, one user reported:
  1. I am currently working on a project where my objective is to transcribe audio calls from various languages into English. Until now, our application has been utilizing the large-v2 model, and we are considering migrating to the large-v3 model. However, upon testing both the large-v2 and large-v3 models on a set of 20 audio files, I observed that the large-v2 model generally produces better output compared to the large-v3 model, except in two instances where the large-v3 model performed better. Large-v2 transcripts are better by around 20 - 30%.
  2. I am trying to understand if there’s something I might be overlooking. The large-v3 model is purported to be an improvement, yet in my experience, it seems to be the opposite.
  3. For reference, I am using the code provided for the large-v3 model, which can be found here: huggingface[.]co/openai/whisper-large-v3.
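Reports like the one above are best settled by measuring word error rate (WER) on your own test set rather than by eyeballing transcripts. WER is the word-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length; libraries such as jiwer provide it, but a minimal pure-Python sketch shows the whole computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

Running both large-v2 and large-v3 over the same files and comparing average WER against reference transcripts turns "v2 seems 20-30% better" into a number you can act on.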
This means that before deploying Whisper for a given business scenario, you may need to evaluate each candidate model on that scenario, and where necessary fine-tune the model and add audio preprocessing. Building on this introduction to Whisper, this article walks through the fine-tuning process based on Hugging Face Transformers.
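One recurring piece of the Hugging Face fine-tuning workflow is the data collator: the log-Mel input features are already a fixed size, so they can simply be stacked, while the variable-length label token ids must be padded, with the padding masked as -100 so the cross-entropy loss ignores it. A minimal NumPy sketch of that idea (the function name and token ids here are illustrative, not the actual transformers API):

```python
import numpy as np

def collate_batch(feature_list, label_list, ignore_index=-100):
    """Batch Whisper fine-tuning examples.

    feature_list: list of (n_mels, 3000) log-Mel arrays (fixed size after padding audio)
    label_list:   list of variable-length target token-id sequences
    """
    input_features = np.stack(feature_list)          # (batch, n_mels, 3000)
    max_len = max(len(ids) for ids in label_list)
    labels = np.full((len(label_list), max_len), ignore_index, dtype=np.int64)
    for i, ids in enumerate(label_list):
        labels[i, : len(ids)] = ids                  # real tokens; padding stays -100
    return {"input_features": input_features, "labels": labels}

batch = collate_batch(
    [np.zeros((80, 3000)), np.zeros((80, 3000))],
    [[50258, 50260, 123], [50258, 50260, 7, 8, 9]],  # illustrative token ids
)
print(batch["input_features"].shape, batch["labels"].shape)  # (2, 80, 3000) (2, 5)
```

In a real transformers setup the same masking is done by a custom `DataCollator` passed to `Seq2SeqTrainer`; the -100 convention matches PyTorch's default `ignore_index` for cross-entropy.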
Whisper Model Architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It maps log-Mel audio spectrogram features to text tokens, which are then decoded into text.
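To make the input side concrete: Whisper resamples audio to 16 kHz, zero-pads or trims it to 30-second windows, and computes a log-Mel spectrogram with a 10 ms hop, giving a fixed-size (n_mels, 3000) feature matrix for the encoder (80 Mel bins up to large-v2, 128 for large-v3). A shape-only sketch of this preprocessing, assuming the constants from the open-source implementation:

```python
import numpy as np

SAMPLE_RATE = 16_000                      # Whisper resamples all audio to 16 kHz
CHUNK_LENGTH = 30                         # seconds per input window
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE    # 480_000 samples
HOP_LENGTH = 160                          # 10 ms hop -> 100 frames per second
N_MELS = 80                               # Mel bins for tiny..large-v2 (large-v3: 128)

def pad_or_trim(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or cut a waveform to exactly one 30 s window, as Whisper does."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))

def n_frames(n_samples: int, hop: int = HOP_LENGTH) -> int:
    """Spectrogram frames the encoder sees (the STFT's extra final frame is dropped)."""
    return n_samples // hop

audio = np.random.randn(5 * SAMPLE_RATE).astype(np.float32)  # fake 5 s clip
padded = pad_or_trim(audio)
print(padded.shape, (N_MELS, n_frames(len(padded))))  # (480000,) (80, 3000)
```

The decoder then autoregressively emits special tokens (language, task, timestamps) followed by text tokens conditioned on these encoder features.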