Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It comes in several versions. On MVSep we use the largest and most precise one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and detects the language automatically. To apply the model to your audio, you have 2 options:
1) "Apply to original file" - the Whisper model is applied directly to the file you submit
2) "Extract vocals first" - in this case, before Whisper is run, the MDX23C model is applied to extract the vocals. This removes unnecessary noise and can improve Whisper's output.
More info on the model can be found here: https://huggingface.co/openai/whisper-large-v3
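For readers who want to try the same model locally, here is a minimal sketch using the Hugging Face `transformers` library (assumed installed), loading the model id from the link above. The function names and the stubbed vocal-extraction step are hypothetical illustrations, not MVSep's actual code:

```python
# Sketch only: assumes the `transformers` package is available.
try:
    from transformers import pipeline
except ImportError:  # hedge: library may not be installed
    pipeline = None

MODEL_ID = "openai/whisper-large-v3"


def transcription_steps(extract_vocals_first: bool) -> list:
    """Return the processing chain for each of the two options above."""
    steps = ["whisper-large-v3"]
    if extract_vocals_first:
        # Option 2: a vocal-separation model (MDX23C on MVSep) runs first
        # to strip instrumentation/noise before transcription.
        steps.insert(0, "mdx23c-vocal-extraction")
    return steps


def transcribe(audio_path: str) -> str:
    """Run Whisper large-v3 on an audio file; the language is detected
    automatically by the model."""
    if pipeline is None:
        raise RuntimeError("transformers is not installed")
    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    # return_timestamps=True enables long-form (>30 s) transcription
    # in transformers.
    return asr(audio_path, return_timestamps=True)["text"]
```

Note that loading large-v3 downloads several gigabytes of weights; for quick experiments a smaller checkpoint such as "openai/whisper-small" can be substituted for `MODEL_ID`.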