This ensemble is based on the algorithm that took 2nd place in the Music Demixing Track of the Sound Demixing Challenge 2023. The main change compared to the contest version is the much better set of vocal models used here. We use the following models for vocals: UVR-MDX-NET-Voc_FT, Demucs4 Vocals 2023, the best MDX23C model, VitLarge23 and BS Roformer. For the 'bass', 'drums' and 'other' stems we use the following 4 models: htdemucs_ft, htdemucs, demucs_6s and demucs_mmi. The initial winning model is available here: https://github.com/ZFTurbo/MVSEP-MDX23-music-separation-model
It's an ensemble (vocals, instrumental, bass, drums, other) with more models included, such as guitars, piano, back/lead vocals and drumsep (4 stems extracted from drums: kick, toms, snare, cymbals). The algorithm works very slowly but gives the most precise result plus a lot of additional stems. Guitar, piano, drums, bass, etc. give better results because they use a filtered source derived from the other stems.
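For intuition, the sketch below averages vocal estimates from several separation models into a single ensemble output. The file names, model pair and weights are placeholders for illustration, not the exact MVSep configuration.

```python
import numpy as np
import soundfile as sf

# Hypothetical per-model vocal estimates for the same track, assumed to be
# time-aligned and at the same sample rate (file names are placeholders).
estimates = ["vocals_mdx23c.wav", "vocals_bs_roformer.wav"]
weights = [0.4, 0.6]  # illustrative weights; real ensembles tune these on validation data

mix, sr = None, None
for path, w in zip(estimates, weights):
    audio, sr = sf.read(path)  # shape: (frames, channels)
    mix = w * audio if mix is None else mix + w * audio

sf.write("vocals_ensemble.wav", mix, sr)
```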
The Demucs4 HT algorithm splits a track into 4 stems (bass, drums, vocals, other). It is currently the best option for bass/drums/other separation. It was released in 2022 and has 3 versions (a minimal usage sketch follows the list):
htdemucs_ft - best quality, but slow
htdemucs - lower quality, but fast
htdemucs_6s - it has 2 additional stems "piano" and "guitar" (quality for them is still so-so).
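A minimal usage sketch with the open-source demucs package ("track.mp3" is a placeholder file name); the same options are available from the `demucs` command-line tool.

```python
import demucs.separate

# Run each Demucs4 HT variant; by default results are written under ./separated/<model>/
for model in ["htdemucs_ft", "htdemucs", "htdemucs_6s"]:
    demucs.separate.main(["-n", model, "track.mp3"])

# A two-stem split (vocals / no_vocals) is also possible:
demucs.separate.main(["-n", "htdemucs_ft", "--two-stems", "vocals", "track.mp3"])
```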
BS Roformer model. Excellent quality for vocals/instrumental separation. It's a modified version of the initial BS Roformer model; the modifications were made by lucidrains on GitHub. The 2nd version of the weights (with better quality) was prepared by viperx. The latest version is a fine-tuned viperx model with better metrics on 3 different evaluation systems.
Algorithm for separating tracks into vocal and instrumental parts based on the MelBand Roformer neural network. The neural network was first proposed in the paper "Mel-Band RoFormer for Music Source Separation" by a group of scientists from ByteDance. The first high-quality weights were made publicly available by Kimberley Jensen. The neural network with open weights was then slightly modified and further trained by the MVSep team in order to improve quality metrics.
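For intuition, here is a minimal sketch of the band-split step that both BS Roformer and MelBand Roformer build on: the mixture STFT is cut into frequency sub-bands, each band is embedded, and transformer layers alternate attention over time and over bands (BS Roformer uses hand-picked bands, MelBand Roformer a mel-scale layout). The band edges below are illustrative, not the configuration of the released weights.

```python
import torch

n_fft, hop = 2048, 512
audio = torch.randn(2, 44100 * 5)  # placeholder stereo signal, 5 s at 44.1 kHz
spec = torch.stft(audio, n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft),
                  return_complex=True)  # (channels, 1025 freq bins, frames)

# Cut the frequency axis into coarse sub-bands (narrow at low frequencies,
# wider at high frequencies); these edges are purely illustrative.
edges = [0, 64, 128, 256, 512, 1025]
bands = [spec[:, lo:hi, :] for lo, hi in zip(edges[:-1], edges[1:])]

# In the real models each band is projected to a shared embedding and
# transformers alternate attention over the time axis and over the band axis;
# the predicted per-band masks are reassembled and inverted with torch.istft
# to obtain the separated stem.
for band in bands:
    print(band.shape)
```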
Quality metrics

| Algorithm name | Multisong dataset, SDR Vocals | Multisong dataset, SDR Instrumental | Synth dataset, SDR Vocals | Synth dataset, SDR Instrumental | MDX23 Leaderboard, SDR Vocals |
|---|---|---|---|---|---|
| MelBand Roformer (Kimberley Jensen) | 11.01 | 17.32 | 12.68 | 12.38 | 11.543 |
| MelBand Roformer (ver. 2024.08) | 11.17 | 17.48 | 13.34 | 13.05 | --- |
| Bas Curtiz edition | 11.18 | 17.49 | 13.89 | 13.60 | --- |
| unwa Instrumental | 10.24 | 16.54 | 12.25 | 11.95 | --- |
| unwa Instrumental v1e (note: max instrumental fullness, but noisy) | 10.05 | 16.36 | --- | --- | --- |
| unwa big beta v5e (note: max vocals fullness, but noisy) | | | | | |
The new set of MDX23C models is based on code released by kuielab for the Sound Demixing Challenge 2023. All models are full band, i.e. they don't cut high frequencies.
Algorithm for separating tracks into vocal and instrumental parts based on the SCNet neural network. The neural network was proposed in the paper "SCNet: Sparse Compression Network for Music Source Separation" by a group of scientists from China. The authors made the neural network code open source, and the MVSep team was able to reproduce results similar to those presented in the published article. First we trained a small version of SCNet, and some time later a heavier version was prepared. The quality metrics are quite close to those of the Roformer models (the top models at the moment), but still slightly inferior. However, in some cases the model can work better than the Roformers.
Demucs4 Vocals 2023 model - a Demucs4 HT model fine-tuned on a big vocal/instrumental dataset. It has better metrics for vocal separation than Demucs4 HT (the _ft version). It usually gives worse metrics than the MDX23C models, but can be useful in ensembles, since the model is very different from MDX23C.
The MDX-B Karaoke model was prepared as part of the Ultimate Vocal Remover project. The model produces high-quality lead vocal extraction from a music track. The model is available in two versions. In the first version, the neural network model is used directly on the entire track. In the second version, the track is first divided into two parts, vocal and instrumental, and then the neural network model is applied only to the vocal part. In the second version, the quality of separation is usually higher, and it becomes possible to additionally separate the backing vocals into their own track. The model was compared with two other models from UVR (also available on the website) on a large validation set. The metric used is SDR: the higher the better.
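SDR (signal-to-distortion ratio) is the metric used for all comparisons on this page. A minimal sketch of the usual computation, where `reference` is the ground-truth stem and `estimate` is the model output:

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-distortion ratio in dB; higher is better."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return float(10.0 * np.log10((num + eps) / (den + eps)))

# Example: a nearly perfect estimate scores high, a noisier one scores lower.
t = np.linspace(0.0, 1.0, 44100)
ref = np.sin(2 * np.pi * 440.0 * t)
print(sdr(ref, ref + 0.01 * np.random.randn(t.size)))
print(sdr(ref, ref + 0.10 * np.random.randn(t.size)))
```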
The MVSep Piano model is based on the MDX23C, MelRoformer and SCNet Large architectures. It produces high-quality separation of the piano and other stems. We provide a comparison with another public model, Demucs4 HT (6 stems). The metric used is SDR: the higher the better.
The MVSep Guitar model is based on the MDX23C, Mel Roformer and BSRoformer architectures. The model produces high-quality separation of music into a guitar part (including acoustic and electric) and everything else. The model was compared with the Demucs4 HT model (6 stems) on a guitar validation set. The metric used is SDR: the higher the better.
The MVSep Bass model is an ensemble of 2 models: HTDemucs4 and BSRoformer. The model produces high-quality separation of music into a bass part and everything else.
The model is available in two versions. In the first version, the bass model is applied directly to the entire track. In the second version, the track is first divided into two parts, vocal and instrumental, and then the bass model is applied only to the instrumental part. In the second version, the separation quality is usually slightly higher.
Multisong dataset, SDR bass: 13.25 (applied to the full track); 13.42 (if vocals are extracted first).
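A rough sketch of the second, two-stage variant described above. `separate_vocals` and `separate_bass` are hypothetical stand-ins for whichever vocal and bass models are used, not the actual MVSep API.

```python
import numpy as np

def separate_vocals(mixture: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical vocal model: returns (vocals, instrumental)."""
    raise NotImplementedError

def separate_bass(audio: np.ndarray) -> np.ndarray:
    """Hypothetical bass model: returns the bass estimate."""
    raise NotImplementedError

def extract_bass_two_stage(mixture: np.ndarray) -> np.ndarray:
    # Stage 1: split the mix into vocals and instrumental.
    _vocals, instrumental = separate_vocals(mixture)
    # Stage 2: run the bass model on the cleaner instrumental signal,
    # which usually yields a slightly higher SDR than using the full mix.
    return separate_bass(instrumental)
```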
The MVSep Drums model exists in 3 different variants based on the following architectures: HTDemucs4, MelRoformer and SCNet. The model produces high-quality separation of music into a drums part and everything else.
Quality metrics

| Algorithm name | Multisong dataset, SDR Drums | Multisong dataset, SDR Other | MDX23 Leaderboard, SDR Drums |
|---|---|---|---|
| HTDemucs4 | 12.04 | 16.56 | --- |
| MelBand Roformer | 12.76 | 17.28 | --- |
| SCNet Large | 13.01 | 17.53 | --- |
| MelBand + SCNet Ensemble | 13.48 | 18.00 | --- |
| MelBand + SCNet Ensemble (+extract from Instrumental) | | | |
The MVSep Strings model is based on the MDX23C architecture and separates music into bowed string instruments and everything else. SDR metric: 3.84
The MVSep Wind model produces high-quality separation of music into a wind part and everything else. It exists in 2 different variants based on the following architectures: MelRoformer and SCNet Large. Wind includes 2 categories of instruments: brass and woodwind. More specifically, the wind category includes: flute, saxophone, trumpet, trombone, horn, clarinet, oboe, harmonica, bagpipes, bassoon, tuba, kazoo, piccolo, flugelhorn, ocarina, shakuhachi, melodica, reeds, didgeridoo, musette, gaida.
Quality metrics

| Algorithm name | Wind dataset, SDR Wind | Wind dataset, SDR Other |
|---|---|---|
| MelBand Roformer | 6.73 | 16.10 |
| SCNet Large | 6.76 | 16.13 |
| MelBand + SCNet Ensemble | 7.22 | 16.59 |
| MelBand + SCNet Ensemble (+extract from Instrumental) | | |
Experimental model VitLarge23 based on Vision Transformers. In terms of metrics, it is slightly inferior to MDX23C, but may work better in some cases.
A unique model for removing crowd sounds from music recordings (applause, clapping, whistling, noise, laughter, etc.). Current metrics on our internal dataset for quality control:
| Algorithm name | Crowd dataset, SDR Crowd | Crowd dataset, SDR Other |
|---|---|---|
| Crowd model MDX23C (v1) | 5.57 | 18.79 |
| Crowd model MDX23C (v2) | 6.06 | 19.28 |
Examples of how the model works can be found here and here.
MVSep DnR v3 is a cinematic model for splitting tracks into 3 stems: music, sfx and speech. It is trained on the huge multilingual DnR v3 dataset. The quality metrics on the test data turned out to be better than those of a similar multilingual model, Bandit v2. The model is available in 3 variants: based on the SCNet architecture, the MelBand Roformer architecture, and an ensemble of these two models. See the table below:
The DrumSep model divides the drums stem into 4 types: 'kick', 'snare', 'cymbals', 'toms'. The model from this github repository is used. The model has two operating modes. The first (default) mode first applies the Demucs4 HT model to the track, which extracts only the drums part; next, the DrumSep model is applied. If your track consists only of drums, it makes sense to use the second mode, where the DrumSep model is applied directly to the uploaded audio.
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has several versions. On MVSep we use the largest and most precise one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It's a multilingual model and it detects the language automatically. To apply the model to your audio you have 2 options: 1) "Apply to original file" - the Whisper model is applied directly to the file you submit; 2) "Extract vocals first" - in this case, before using Whisper, the MDX23C model is applied to extract the vocals first. This can remove unnecessary noise and improve Whisper's output.
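A minimal sketch of the first option using the open-source openai-whisper package ("audio.mp3" is a placeholder; on MVSep this runs server-side):

```python
import whisper

# Load the largest, most precise checkpoint (weights are downloaded on first use).
model = whisper.load_model("large-v3")

# Transcribe; the language is detected automatically.
result = model.transcribe("audio.mp3")
print(result["language"])
print(result["text"])
```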
Medley Vox is a dataset for testing algorithms that separate multiple singers within a single music track. The authors of Medley Vox also proposed a neural network architecture for separating singers, but unfortunately they did not publish the weights. Later, their training process was reproduced by Cyru5, who trained several models and made the weights publicly available. The trained neural network is now available on MVSep.
MVSep Multichannel BS - this model is prepared for extracting vocals from multichannel sound (5.1, 7.1, etc.). The emphasis is on avoiding unnecessary transformation and loss of quality. After processing, the model returns multi-channel audio in the same format and at the same sample rate as the audio sent to the server.
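A rough sketch of channel-preserving processing of this kind; `separate_vocals_channel` is a hypothetical stand-in for the actual model and the file names are placeholders.

```python
import numpy as np
import soundfile as sf

def separate_vocals_channel(channel: np.ndarray, sr: int) -> np.ndarray:
    """Hypothetical single-channel vocal extraction; returns audio of the same length."""
    raise NotImplementedError

# Read multichannel audio (e.g. 5.1 -> shape (frames, 6)), process each channel
# independently, and write the result with the same channel count and sample rate.
audio, sr = sf.read("input_5_1.wav")
vocals = np.stack([separate_vocals_channel(audio[:, c], sr)
                   for c in range(audio.shape[1])], axis=1)
sf.write("vocals_5_1.wav", vocals, sr)
```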
Note: For version A only MUSDB18 training data was used, so the quality is worse than Demucs3 Model B. Demucs3 Model A and Demucs3 Model B have the same architecture but different weights.
Mel Band Roformer is a model proposed by employees of ByteDance for the Sound Demixing Challenge 2023, where they took first place on Leaderboard C. Unfortunately, the model was not made publicly available and was reproduced from the scientific article by the developer @lucidrains on GitHub. The vocal model was trained from scratch on our internal dataset. Unfortunately, we have not yet been able to achieve metrics similar to the authors'.
The LarsNet model divides the drums stem into 5 types: 'kick', 'snare', 'cymbals', 'toms', 'hihat'. The model is from this github repository and was trained on the StemGMD dataset. The model has two operating modes. The first (default) mode applies the Demucs4 HT model to the track at stage one, which extracts only the drum part; at the second stage, the LarsNet model is used. If your track consists only of drums, it makes sense to use the second mode, where the LarsNet model is applied directly to the uploaded audio. Unfortunately, subjectively, the separation quality is inferior to the DrumSep model.