Bleedless and Fullness. These metrics are computed from the difference between the magnitude spectrograms of the estimated and reference signals, after mapping both to a mel frequency scale and to the decibel (dB) domain to align with human auditory perception. In this difference (estimate minus reference), positive values indicate unwanted added content (e.g., bleed, artifacts), while negative values indicate missing content (e.g., lost harmonics, missing instruments). The Bleedless metric penalizes the average positive difference, while the Fullness metric penalizes the average magnitude of the negative difference. In other words, Bleedless quantifies how little content from other sources bleeds into the estimated signal, and Fullness evaluates how completely the target source has been preserved. Both metrics are normalized to the range [0, 100], with higher values indicating better performance.
The figure below illustrates these concepts through color coding. Blue areas show content that is missing from the predicted stem compared to the original; more blue means more missing content and a lower Fullness score. Red areas show unwanted content in the predicted stem that is not present in the original. This represents bleed from the other stems; more red means more interference from other instruments and a lower Bleedless score.
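To make the computation concrete, the following is a minimal NumPy sketch of the procedure described above, not the reference implementation. Several details are assumptions made for illustration: the function names, the STFT and mel parameters, the hand-built triangular mel filterbank, the dB floor, and in particular the mapping 100 / (1 + mean difference in dB), which is one simple way to obtain scores in (0, 100] with higher values being better.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def magnitude_stft(x, n_fft, hop):
    """Magnitude STFT, shape (freq, time). Assumes len(x) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def mel_db(x, sr, n_fft, hop, n_mels, floor=1e-5):
    """Mel-scaled magnitude spectrogram in dB (floor value is an assumption)."""
    mel = mel_filterbank(sr, n_fft, n_mels) @ magnitude_stft(x, n_fft, hop)
    return 20.0 * np.log10(np.maximum(mel, floor))


def bleedless_fullness(reference, estimate, sr=44100, n_fft=2048, hop=512, n_mels=64):
    """Return (bleedless, fullness) for two mono signals of equal length."""
    diff = (mel_db(estimate, sr, n_fft, hop, n_mels)
            - mel_db(reference, sr, n_fft, hop, n_mels))
    added = diff[diff > 0]    # content present in the estimate only (bleed)
    missing = diff[diff < 0]  # content present in the reference only
    mean_added = added.mean() if added.size else 0.0
    mean_missing = -missing.mean() if missing.size else 0.0
    # Assumed normalization: 100 when nothing is added or missing,
    # approaching 0 as the average dB difference grows.
    bleedless = 100.0 / (1.0 + mean_added)
    fullness = 100.0 / (1.0 + mean_missing)
    return bleedless, fullness
```

Under this sketch, passing the same array as both reference and estimate gives (100.0, 100.0), since no bin has a positive or negative difference. An estimate that is only an attenuated copy of the reference keeps Bleedless at 100 and lowers Fullness, which matches the interpretation that content is missing but nothing has been added.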

Fullness and Bleedless usually trade off against each other during training: as one increases, the other tends to decrease, and both can fluctuate significantly. Some models emphasize fuller extraction of an instrument at the cost of added noise, while others produce a cleaner extraction at the cost of losing some of the instrument's content. The code implementation for all metrics is available on GitHub.