MVSep DnR v3 is a cinematic model that splits a track into three stems: music, sfx, and speech. It is trained on DnR v3, a large multilingual dataset. On the test data, the SCNet variant and the ensemble score higher than the comparable multilingual model Bandit v2 on all three stems; the MelBand Roformer variant scores higher on music and sfx and is roughly level with Bandit v2 on speech. The model is available in three variants: one based on the SCNet architecture, one based on the MelBand Roformer architecture, and an ensemble of the two. See the table below:
SDR metric on the DnR v3 leaderboard:

| Algorithm name | music (SDR) | sfx (SDR) | speech (SDR) |
|---|---|---|---|
| SCNet Large | 9.94 | 11.35 | 12.59 |
| Mel Band Roformer | 9.45 | 11.24 | 12.27 |
| Ensemble (Mel + SCNet) | 10.15 | 11.67 | 12.81 |
| Bandit v2 (for reference) | 9.06 | 10.82 | 12.29 |
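
To make the comparison easier to read, here is a minimal Python sketch that computes the per-stem difference between each variant and Bandit v2 from the values in the table. It also includes an `sdr` helper that assumes the plain energy-ratio definition of SDR in dB; the leaderboard's exact evaluation protocol is not described here, so treat that function as an illustration only. The names `SCORES`, `gain_over`, and `sdr` are hypothetical and are not part of any MVSep API.

```python
import math

def sdr(reference, estimate, eps=1e-9):
    """Assumed definition: 10 * log10(|ref|^2 / |ref - est|^2), in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))

# SDR values copied from the table: (music, sfx, speech)
SCORES = {
    "SCNet Large": (9.94, 11.35, 12.59),
    "Mel Band Roformer": (9.45, 11.24, 12.27),
    "Ensemble (Mel + SCNet)": (10.15, 11.67, 12.81),
    "Bandit v2": (9.06, 10.82, 12.29),
}

def gain_over(baseline, model):
    """Per-stem SDR difference (model minus baseline), rounded to 2 decimals."""
    return tuple(
        round(m - b, 2) for m, b in zip(SCORES[model], SCORES[baseline])
    )

if __name__ == "__main__":
    for name in SCORES:
        if name != "Bandit v2":
            print(name, gain_over("Bandit v2", name))
```

With the table's numbers, the ensemble leads Bandit v2 by 1.09 (music), 0.85 (sfx), and 0.52 (speech), while Mel Band Roformer trails it by 0.02 on speech.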