I am working on a project whose objective is to transcribe audio calls in various languages into English. Our application has been using the large-v2 model so far, and we are considering migrating to large-v3. However, after testing both models on a set of 20 audio files, I found that large-v2 generally produces better output than large-v3, except in two cases where large-v3 performed better. Overall, the large-v2 transcripts are roughly 20-30% better.
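(By "better" I mean the transcripts read closer to the reference text. A minimal sketch of how such a gap could be quantified with word error rate, assuming English reference transcripts exist for the test files and using the jiwer package, neither of which is part of my current setup:)

```python
# Sketch: comparing large-v2 vs. large-v3 output quality via word error rate (WER).
# References and hypotheses below are illustrative placeholders.
import jiwer

references = ["the agent confirmed the refund was processed today"]  # ground-truth transcripts
v2_outputs = ["the agent confirmed the refund was processed today"]  # large-v2 transcripts
v3_outputs = ["the agent confirmed the refund was process today"]    # large-v3 transcripts

wer_v2 = jiwer.wer(references, v2_outputs)
wer_v3 = jiwer.wer(references, v3_outputs)
print(f"large-v2 WER: {wer_v2:.3f}, large-v3 WER: {wer_v3:.3f}")
```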
I am trying to understand if there’s something I might be overlooking. The large-v3 model is purported to be an improvement, yet in my experience, it seems to be the opposite.
For reference, I am using the code provided on the large-v3 model card: https://huggingface.co/openai/whisper-large-v3.
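Concretely, my setup follows the transformers pipeline example from that model card, with the task set to translate so non-English calls come out in English. The file path and dtype handling below are illustrative rather than my exact production code:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

# Load the model and processor as shown on the model card.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# "translate" makes Whisper output English regardless of the source language.
result = pipe("call_recording.mp3", generate_kwargs={"task": "translate"})
print(result["text"])
```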