
Bark (Speech Gen)

Bark is a transformer-based model created by Suno. It is not just a traditional text-to-speech tool but a fully generative text-to-audio system. Its capabilities go well beyond plain narration: besides producing highly realistic speech in multiple languages, Bark can generate music, background noise, and simple sound effects. A distinctive feature is its ability to reproduce non-verbal sounds such as laughter, sighs, and crying, which makes the output sound more lifelike and natural.

To support the community, the developers have released pre-trained checkpoints that are ready for inference and may be used commercially. Keep in mind, however, that Bark was built primarily for research. As a fully generative model, it can behave unpredictably and sometimes deviates from the text prompt.

Official model repository: https://github.com/suno-ai/bark

Unlike classic TTS systems, Bark does not use SSML markup. Instead, it is trained to recognize specific text inserts (tags) as instructions for generating sounds.

Instructions for coding emotions and sounds in Bark

1. Basic Principle

All control commands are written in square brackets. Important: The tags themselves must be written in English, even if the main text you are generating is in Russian, Spanish, or any other language.

Syntax:

Text before effect [effect_tag] text after effect.
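The syntax above can be sketched as a small Python helper. This is only an illustration: the names `with_tag` and `extract_tags` are hypothetical and not part of Bark, and the ASCII check is a rough stand-in for the "tags must be in English" rule.

```python
import re

# Matches anything written in square brackets, e.g. "[laughter]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def with_tag(before: str, tag: str, after: str = "") -> str:
    """Join text, a bracketed tag, and more text into one prompt."""
    # Assumption: an ASCII-only tag is a reasonable proxy for "written in English".
    if not tag.isascii():
        raise ValueError(f"tag should be written in English: {tag!r}")
    parts = [before.strip(), f"[{tag.strip()}]", after.strip()]
    return " ".join(part for part in parts if part)


def extract_tags(prompt: str) -> list[str]:
    """Return the tags found in a prompt, without the brackets."""
    return TAG_PATTERN.findall(prompt)


# The main text may be in any language; only the tag stays in English.
prompt = with_tag("Привет!", "laughter", "Это было очень смешно.")
tags = extract_tags(prompt)
```

Here `prompt` is the string that would be sent to the model, and `tags` lists the control commands it contains.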

2. List of supported tags (Non-speech sounds)

The Bark documentation lists the following known tags for non-verbal sounds:

| Tag | Description | Usage example |
|---|---|---|
| [laughter] | Loud, distinct laugh | Hello! [laughter] That was so funny. |
| [laughs] | Short chuckle, giggling | Well yes, of course [laughs]. |
| [sighs] | Heavy sigh (fatigue, relief) | [sighs] I'm so tired of this work. |
| [music] | Instrumental music insertion | [music] (background music playing) |
| [gasps] | Sharp breath (fright, surprise) | [gasps] I didn't expect to see you here! |
| [clears throat] | Throat clearing (attracting attention) | [clears throat] Gentlemen, may I have your attention. |

Note: Variations like [man laughs] and [woman laughs] also exist, but they work most reliably when the gender of the selected speaker (Speaker History) matches the tag.
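A prompt can be checked against the tags listed in this section before it is sent to the model. The sketch below assumes the `bark` Python package from the official repository is installed for the synthesis step. `generate_audio`, `preload_models`, `SAMPLE_RATE`, and the `v2/en_speaker_6` preset come from that repository's README, while `KNOWN_TAGS`, `find_unknown_tags`, and `synthesize` are hypothetical helper names introduced here.

```python
import re

# Tags taken from the list in this section; not an exhaustive list.
KNOWN_TAGS = {"laughter", "laughs", "sighs", "music", "gasps", "clears throat"}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(prompt: str) -> list[str]:
    """Return bracketed tags that are not in KNOWN_TAGS."""
    found = TAG_PATTERN.findall(prompt)
    return [tag for tag in found if tag.strip().lower() not in KNOWN_TAGS]


def synthesize(prompt: str, speaker: str = "v2/en_speaker_6"):
    """Generate audio with Bark (needs the bark package and its model weights)."""
    # Imported lazily so the tag check above works without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()
    audio = generate_audio(prompt, history_prompt=speaker)
    return SAMPLE_RATE, audio


unknown = find_unknown_tags("Well yes, of course [laughs]. [whisper] Keep it secret.")
```

An unknown tag is not necessarily an error, since the model may still react to it, but it is a sign that the result should be checked by ear.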

3. Generating singing and music

To make the model "sing" the text rather than read it, musical notes are used.

  • Method: Wrap the text in musical note symbols ♪ (Shift+Alt+V on Mac, Alt+13 on Windows, or simply copy the symbol).

  • Example: ♪ In the jungle, the mighty jungle, the lion sleeps tonight ♪

  • Tip: This works best if you use English, as the training dataset contained many English songs, but results can be achieved in other languages too.
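The wrapping step can be sketched in Python as follows. The helper name `as_song` is hypothetical, and the model is not guaranteed to sing even when the notes are present.

```python
NOTE = "♪"


def as_song(lyrics: str) -> str:
    """Wrap lyrics in musical note symbols, avoiding doubled notes."""
    # Remove surrounding whitespace and any notes that are already there.
    core = lyrics.strip().strip(NOTE).strip()
    return f"{NOTE} {core} {NOTE}"


song_prompt = as_song("In the jungle, the mighty jungle, the lion sleeps tonight")
```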

4. Pauses and Intonation (Prosody)

Although there are no special tags for pauses (like the SSML <break> element), Bark is sensitive to punctuation and special characters, because it treats the text as structured input.

  • Ellipsis and dash (..., —): Use an ellipsis or an em dash to create pauses, hesitations, or hitches in speech.

    • Example: I... I'm not sure that's right.

  • CAPS LOCK: Sometimes (not guaranteed) writing a word in CAPITAL LETTERS can add emphasis or increase volume.
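Both techniques are plain text edits, so they can be applied programmatically. The sketch below is a hedged illustration: `emphasize` and `hesitate` are hypothetical helper names, and whether the model actually reacts to the result is not guaranteed.

```python
import re


def emphasize(text: str, word: str) -> str:
    """Uppercase every whole-word occurrence of `word` in `text`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda match: match.group(0).upper(), text)


def hesitate(first: str, rest: str) -> str:
    """Insert an ellipsis between two fragments to suggest a pause."""
    return f"{first}... {rest}"


stressed = emphasize("I really mean it", "really")
paused = hesitate("I", "I'm not sure that's right.")
```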

5. Important nuances of operation (Disclaimer)

  1. Probabilistic nature: Bark is a GPT-style model for audio. If you write [laughter], the model will most likely generate laughter, but it may sometimes ignore the tag or produce an odd sound.

  2. Context matters: The tag [laughter] will work more naturally after a joke than in the middle of a tragic sentence. The model "understands" the semantics of the text.

  3. Whispering: There is no official [whisper] tag. The community has noticed that adding words like "quietly" or using specific speakers (Speaker Prompts) sometimes helps, but this takes trial and error.

Site limitations: currently, all submitted texts are trimmed to 1000 characters.
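Because longer texts are cut off, it can help to shorten the text yourself at a sentence boundary before submitting it. The following is a minimal sketch of that idea. The name `fit_to_limit` is hypothetical, and the site's own trimming logic is not documented here, so this only keeps the text within the stated limit.

```python
SITE_LIMIT = 1000  # character limit stated above


def fit_to_limit(text: str, limit: int = SITE_LIMIT) -> str:
    """Shorten text to at most `limit` characters, preferring a sentence end."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Prefer the last sentence-ending mark inside the allowed window.
    cut = max(head.rfind(mark) for mark in ".!?")
    if cut != -1:
        return head[: cut + 1]
    # Otherwise fall back to the last space, then to a hard cut.
    space = head.rfind(" ")
    return head[:space] if space > 0 else head


long_text = "Hello there. " * 100  # 1300 characters
trimmed = fit_to_limit(long_text)
```

Note that this simple version treats each dot of an ellipsis as a possible sentence end, so a text that is cut near "..." may lose part of the pause.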
