An algorithm for separating tracks into vocal and instrumental parts, based on the MelBand Roformer neural network. The network was first proposed in the paper "Mel-Band RoFormer for Music Source Separation" by a group of researchers from ByteDance. The first high-quality weights were made publicly available by Kimberley Jensen. The MVSep team then slightly modified this open-weight network and trained it further to improve its quality metrics. High-quality weights are also provided by @Bas Curtiz, @unwa and @becruily.
Quality metrics
Algorithm name | Multisong dataset: SDR Vocals | Multisong dataset: SDR Instrumental | Synth dataset: SDR Vocals | Synth dataset: SDR Instrumental | MDX23 Leaderboard: SDR Vocals |
MelBand Roformer (Kimberley Jensen) | 11.01 | 17.32 | 12.68 | 12.38 | 11.543 |
MelBand Roformer (ver. 2024.08) | 11.17 | 17.48 | 13.34 | 13.05 | --- |
Bas Curtiz edition | 11.18 | 17.49 | 13.89 | 13.60 | --- |
unwa Instrumental | 10.24 | 16.54 | 12.25 | 11.95 | --- |
unwa Instrumental v1e (Note: max instrum fullness, but noisy) | 10.05 | 16.36 | --- | --- | --- |
unwa big beta v5e (Note: max vocals fullness, but noisy) | 10.59 | 16.89 | --- | --- | --- |
MelBand Roformer (ver. 2024.10) | 11.28 | 17.59 | 13.89 | 13.59 | --- |
becruily instrum max fullness (Note: max instrum fullness, but noisy) | 10.16 | 16.47 | --- | --- | --- |
becruily vocals max fullness (Note: max vocals fullness, but noisy) | 10.55 | 16.86 | --- | --- | --- |
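The SDR columns above are reported in dB, with higher values meaning a closer match to the reference stem. The page does not give the exact formula, so the following is only a minimal sketch assuming the common definition, 10·log10(‖reference‖² / ‖reference − estimate‖²). The function name `sdr`, the `eps` guard and the synthetic signals are illustrative assumptions, not MVSep's evaluation code.

```python
import numpy as np


def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-distortion ratio in dB (assumed definition, see lead-in)."""
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero for a perfect estimate or a silent reference.
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))


# Synthetic stand-ins for real stems (hypothetical data, not from any dataset).
rng = np.random.default_rng(0)
vocals = rng.standard_normal(44100)
instrumental = 2.0 * rng.standard_normal(44100)
mixture = vocals + instrumental

# Pretend model output: the true vocals plus a small error.
vocals_est = vocals + 0.1 * rng.standard_normal(44100)
# Assumption: the instrumental is taken as the residual of the mixture.
instrumental_est = mixture - vocals_est

print(f"SDR vocals:       {sdr(vocals, vocals_est):.2f} dB")
print(f"SDR instrumental: {sdr(instrumental, instrumental_est):.2f} dB")
```

When the instrumental is taken as the mixture minus the vocal estimate, both stems share the same error signal. The two SDR values then differ only by the energy ratio of the reference stems.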
Detailed statistics on Multisong dataset:
Model | Vocals fullness | Vocals bleedless | Vocals SDR | Vocals L1Freq | Instrum fullness | Instrum bleedless | Instrum SDR | Instrum L1Freq |
MelBand Roformer (Kimberley Jensen) | 16.66 | 36.51 | 11.01 | 38.96 | 27.71 | 46.72 | 17.32 | 39.77 |
MelBand Roformer (ver. 2024.08) | 16.39 | 39.13 | 11.18 | 39.26 | 27.74 | 47.07 | 17.49 | 40.16 |
Bas Curtiz edition | 16.30 | 38.94 | 11.18 | 39.18 | 27.49 | 47.00 | 17.49 | 40.15 |
MelBand Roformer (ver. 2024.10) | 16.92 | 37.78 | 11.28 | 39.41 | 27.71 | 47.29 | 17.59 | 40.29 |
unwa Instrumental v1 (SDR vocals: 10.24, SDR instrum: 16.54) | 15.89 | 27.48 | 10.24 | 36.06 | 35.44 | 38.02 | 16.55 | 38.67 |
unwa Instrumental v1e (SDR vocals: 10.05, SDR instrum: 16.36) | 14.67 | 26.83 | 10.06 | 34.37 | 38.85 | 35.68 | 16.37 | 38.31 |
unwa big beta v5e (SDR vocals: 10.59, SDR instrum: 16.89) | 20.78 | 32.02 | 10.59 | 38.53 | 25.65 | 45.90 | 16.90 | 37.31 |
becruily instrum high fullness (SDR instrum: 16.47) | 15.76 | 30.15 | 10.16 | 35.84 | 33.93 | 40.55 | 16.47 | 38.86 |
becruily vocals high fullness (SDR vocals: 10.55) | 20.72 | 31.25 | 10.55 | 38.84 | 28.28 | 40.85 | 16.86 | 38.24 |
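The detailed table suggests a trade-off: the models described as "max fullness, but noisy" score highest on fullness for their target stem and lower on bleedless. As a small sketch of how these numbers might be compared, the snippet below ranks the models by a chosen column. The values are copied from the table above; the helper name `rank_by` and the column keys are hypothetical, and it assumes that higher is better for both fullness and bleedless, which the page does not state explicitly.

```python
# Values copied from the detailed Multisong table above.
COLUMNS = ("vocals_fullness", "vocals_bleedless", "instrum_fullness", "instrum_bleedless")
ROWS = {
    "MelBand Roformer (Kimberley Jensen)": (16.66, 36.51, 27.71, 46.72),
    "MelBand Roformer (ver. 2024.08)": (16.39, 39.13, 27.74, 47.07),
    "Bas Curtiz edition": (16.30, 38.94, 27.49, 47.00),
    "MelBand Roformer (ver. 2024.10)": (16.92, 37.78, 27.71, 47.29),
    "unwa Instrumental v1": (15.89, 27.48, 35.44, 38.02),
    "unwa Instrumental v1e": (14.67, 26.83, 38.85, 35.68),
    "unwa big beta v5e": (20.78, 32.02, 25.65, 45.90),
    "becruily instrum high fullness": (15.76, 30.15, 33.93, 40.55),
    "becruily vocals high fullness": (20.72, 31.25, 28.28, 40.85),
}


def rank_by(metric: str) -> list[str]:
    """Return model names sorted from highest to lowest value of `metric`."""
    idx = COLUMNS.index(metric)
    return sorted(ROWS, key=lambda name: ROWS[name][idx], reverse=True)


# Print the top-scoring model for each column.
for metric in COLUMNS:
    print(f"{metric}: {rank_by(metric)[0]}")
```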