Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It comes in several versions. On MVSep we use the largest and most accurate one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and detects the language automatically. To apply the model to your audio, you have 2 options:
1) "Apply to original file" - the Whisper model is applied directly to the file you submit.
2) "Extract vocals first" - before Whisper runs, the BS Roformer model is applied to extract the vocals. This can remove unnecessary noise and improve Whisper's output.
The original model has some problems with transcription timings. This was fixed by @linto-ai. Their transcription is used by default (Option: New timestamped). You can return to the original timings by choosing the option "Old by whisper".
More information about the model can be found here: https://huggingface.co/openai/whisper-large-v3 and here: https://github.com/openai/whisper
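For readers who want to try the same model outside MVSep, here is a minimal sketch of running Whisper large-v3 locally with the open-source `openai-whisper` Python package from the repository linked above. It is not MVSep's own pipeline: the file name `audio.mp3` and the helper names `format_timestamp`, `format_segments` and `transcribe` are made up for illustration, and the sketch assumes `openai-whisper` and ffmpeg are installed. It uses Whisper's original timings, not the corrected ones from @linto-ai.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments) -> list[str]:
    """Turn Whisper-style segments (dicts with start, end, text) into text lines."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


def transcribe(path: str, model_name: str = "large-v3") -> dict:
    """Run Whisper on an audio file (assumes openai-whisper and ffmpeg)."""
    import whisper  # imported lazily; install with: pip install -U openai-whisper

    model = whisper.load_model(model_name)
    # No language is passed, so Whisper detects it from the audio.
    return model.transcribe(path)


if __name__ == "__main__":
    result = transcribe("audio.mp3")  # hypothetical input file
    print("Detected language:", result["language"])
    print("\n".join(format_segments(result["segments"])))
```

The result dictionary also contains the full transcription under the `"text"` key. The large-v3 checkpoint is several gigabytes, so the first call downloads it and a GPU is advisable.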