VitLarge23 is an experimental model based on Vision Transformers. Its metrics are slightly below those of MDX23C, but it may work better in some cases.

| Algorithm name | Multisong dataset: SDR Vocals | Multisong dataset: SDR Instrumental | Synth dataset: SDR Vocals | Synth dataset: SDR Instrumental | MDX23 Leaderboard: SDR Vocals |
|---|---|---|---|---|---|
| Vit Large 23 (512px) v1 | 9.78 | 16.09 | 12.33 | 12.03 | 10.47 |
| Vit Large 23 (512px) v2 | 9.90 | 16.20 | 12.38 | 12.08 | --- |
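
For readers unfamiliar with the SDR columns, the following is a minimal sketch of how a signal-to-distortion ratio is commonly computed, assuming the plain energy-ratio definition (reference energy over error energy, in dB). The function name `sdr`, the `eps` guard, and the synthetic signals are illustrative assumptions; the exact evaluation code behind the numbers above is not described here and may differ.

```python
import numpy as np


def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Signal-to-distortion ratio in dB (illustrative helper, not the official scorer).

    Assumes both arrays have the same shape, e.g. (samples,) or (samples, channels).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Energy of the ground-truth stem.
    signal_energy = np.sum(reference ** 2)
    # Energy of the difference between ground truth and the separated stem.
    error_energy = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero and log(0) on silent or perfect inputs.
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))


if __name__ == "__main__":
    # Synthetic example: the estimate is the reference scaled by 0.9,
    # so the error has 1% of the reference energy (about 20 dB).
    reference = np.ones(1000)
    estimate = 0.9 * reference
    print(f"SDR: {sdr(reference, estimate):.2f} dB")
```

Under this definition a higher value means the separated stem is closer to the ground truth, which is how the vocals and instrumental columns in the table are meant to be read.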