Artificial Intelligence - OpenAI Whisper Speech-to-Text in Practice

Posted by 小秦哥 on 2024-8-19 02:07:21

OpenAI Whisper Speech-to-Text in Practice

https://img-blog.csdnimg.cn/direct/9b929e0e86ed4b5e9503138f2d432adf.png
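To hold a spoken conversation with a large language model, you need speech recognition (Voice2Text) and speech output (Text2Voice). The technology feels fairly mature, and many organizations in China are working on it, yet finding something convenient to test with turned out to be surprisingly hard. Google is blocked, Microsoft requires registration, and there is little documentation for the domestic options, so in the end I chose OpenAI's Whisper.
Introduction to Whisper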

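Whisper is a speech processing system that OpenAI released in September 2022. It is strongest in English and supports 99 languages, including Chinese.
It comes in five model sizes, from smallest to largest: tiny, base, small, medium and large, to suit different scenarios.
The large model is about 2.88 GB, while the base model is roughly 140 MB. In my tests the large model was fairly slow and the base model was fast.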
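To compare the speed of two model sizes on your own machine, a rough timing sketch could look like the following. The `timed` and `compare_models` helpers are hypothetical names introduced for illustration, and the model names and audio path are placeholders to adjust.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_models(audio_path, model_names=("base", "large")):
    """Transcribe the same file with each model and print the time taken."""
    import whisper  # imported lazily so the timing helper works without it

    for name in model_names:
        model = whisper.load_model(name)  # downloads the weights on first use
        result, seconds = timed(model.transcribe, audio_path)
        print(f"{name}: {seconds:.1f}s -> {result['text'][:60]}")
```

Model loading is kept outside the timed call, so the numbers reflect transcription only and not the one-off download.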
Installing Whisper

pip install openai-whisper
Installing ffmpeg

Whisper relies on the ffmpeg program. On Windows it can be installed from PowerShell with the Chocolatey package manager:
choco install ffmpeg
Installing other modules
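The later examples also import speech_recognition, pyttsx3 and langchain, and microphone input in speech_recognition goes through PyAudio. To my knowledge these are published on PyPI as SpeechRecognition, pyttsx3, langchain and PyAudio. The sketch below only reports what is already importable; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names (not PyPI package names) used by the examples in this post
    needed = ["whisper", "speech_recognition", "pyttsx3", "langchain", "pyaudio"]
    print("Missing Python modules:", missing_modules(needed) or "none")
    # Whisper shells out to ffmpeg, so it must be on PATH
    print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
```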

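Test audio file

Chinese audio files seem hard to find online: they are either paid or do not match their description, so I took an English sample file from a site hosted on GitHub.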
audio-samples.github.io
Speech-to-text with Whisper

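import whisper
print("Start....")
# Load the largest model; the weights are downloaded on first use
whisper_model = whisper.load_model("large")
print("Begin...")
# Transcribe the sample file, telling Whisper that the audio is English
result = whisper_model.transcribe("E:/yao2024/sample-0.wav", language='en')
# Join the text of every recognized segment
print(", ".join([i["text"] for i in result["segments"] if i is not None]))
The first run has to download the model data, which takes a while.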
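Each entry in result["segments"] is a dict that, besides "text", carries "start" and "end" times in seconds. A small sketch that prints a transcript with timestamps follows; `format_segments` is a hypothetical helper, and the sample segments are invented so the snippet runs without loading a model.

```python
def format_segments(segments):
    """Turn Whisper segments into lines like '[0.00s -> 2.50s] text'."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # Whisper segment text usually starts with a space
        lines.append(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {text}")
    return lines


# Made-up segments with the same keys Whisper returns
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.0, "text": " How are you?"},
]
for line in format_segments(sample):
    print(line)
```

With a real transcription, pass result["segments"] instead of sample.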
LangChain voice assistant

LangChain has a voice assistant chain. It uses the pyttsx3 and speech_recognition libraries to convert text to speech and speech to text, respectively.
speech_recognition

speech_recognition is a speech recognition library that can call a number of recognition engines and APIs, including:

[*] CMU Sphinx (works offline)
[*] Google Speech Recognition
[*] Google Cloud Speech API
[*] Wit.ai
[*] Microsoft Azure Speech
[*] Microsoft Bing Voice Recognition (Deprecated)
[*] Houndify API
[*] IBM Speech to Text
[*] Snowboy Hotword Detection (works offline)
[*] Tensorflow
[*] Vosk API (works offline)
[*] OpenAI whisper (works offline)
[*] Whisper API
We chose the offline OpenAI Whisper option.
Implementation

pyttsx3 in practice
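Before wiring up the microphone, the offline Whisper backend can be tried on a WAV file through speech_recognition. This is a minimal sketch under the assumption that SpeechRecognition and openai-whisper are both installed; `transcribe_wav` is a hypothetical helper name, and the path in the usage comment reuses the sample file from earlier.

```python
def transcribe_wav(path, model="base"):
    """Transcribe a WAV file with speech_recognition's offline Whisper backend."""
    # Imported lazily; requires the SpeechRecognition and openai-whisper packages
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    return recognizer.recognize_whisper(audio, model=model)


# Example (assumes the sample file used earlier in this post):
# print(transcribe_wav("E:/yao2024/sample-0.wav"))
```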

import pyttsx3
# Speak the text aloud
pyttsx3.speak("How are you?")
pyttsx3.speak("I am fine, thank you")
pyttsx3.speak("太行,王屋二山，方七百里，高万仞，本在冀州之南，河阳之北。")
Dialogue program
import speech_recognition as sr
import pyttsx3
from langchain.chat_models import ErnieBotChat
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory
llm = ErnieBotChat(model_name='ERNIE-Bot',
                   # Fill in your own credentials; real keys should never be published
                   ernie_client_id='<your-client-id>',
                   ernie_client_secret='<your-client-secret>',
                   temperature=0.75,
                   )
template = """Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
Assistant is aware that human input is being transcribed from audio and as such there may be some errors in the transcription. It will attempt to account for some words being swapped with similar-sounding words or phrases. Assistant will also keep responses concise, because human attention spans are more limited over the audio channel since it takes time to listen to a response.
{history}
Human: {human_input}
Assistant:"""
prompt = PromptTemplate(
    input_variables=["history", "human_input"],
    template=template
)
# Chain the LLM with the prompt, remembering only the last 2 exchanges
chatgpt_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)

engine = pyttsx3.init()

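# Listen to microphone input, transcribe it, and speak the model's reply
def listen():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print('Calibrating...')
        r.adjust_for_ambient_noise(source, duration=10)
        # Optional: adjust the microphone sensitivity
        # r.energy_threshold = 200
        r.pause_threshold = 0.5
        print('Okay, go ahead!')
        while True:
            text = ''
            print('Listening...')
            try:
                audio = r.listen(source, timeout=10)
                print('Recognizing...')
                # Run speech recognition with the offline Whisper model
                text = r.recognize_whisper(audio)
            except Exception as e:
                # On failure the apology text itself is passed on to the model
                unrecognized_speech_text = f"Sorry, I didn't catch that. Error: {e}"
                text = unrecognized_speech_text
            print(text)
            # Generate a reply with the language model
            response_text = chatgpt_chain.predict(human_input=text)
            print(response_text)
            # Convert the reply to speech and play it
            engine.say(response_text)
            engine.runAndWait()

listen()
When I speak English it answers in English, and when I speak Chinese it answers in Chinese, but homophones are recognized poorly. I do not know how to improve the recognition of homophones.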
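One experiment that might help, untested here against the homophone problem: recognize_whisper uses the base model by default and lets Whisper guess the language. Passing a larger model and an explicit language hint is cheap to try. The wrapper name `recognize_with_hint` is hypothetical, and the model size and language code are assumptions to tune.

```python
def recognize_with_hint(recognizer, audio, model="small", language="zh"):
    """Run offline Whisper with an explicit model size and language hint."""
    # `recognizer` is expected to be a speech_recognition.Recognizer instance
    return recognizer.recognize_whisper(audio, model=model, language=language)
```

Inside listen(), this would mean replacing the r.recognize_whisper(audio) call with recognize_with_hint(r, audio). A larger model costs more time per utterance, so the trade-off between speed and accuracy noted earlier applies here too.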
