Bark is a transformer-based model created by Suno. It is not just a traditional text-to-speech tool but a fully generative "text-to-audio" system. Its capabilities go well beyond ordinary speech synthesis: besides producing highly realistic speech in multiple languages, Bark can generate music, background noise, and simple sound effects. A distinctive feature of the model is its ability to reproduce non-verbal sounds such as laughter, sighs, and crying, which makes the output sound more lively and natural.
To support the community, the developers have released pre-trained checkpoints that are ready to use and permitted even for commercial use. Keep in mind, however, that Bark was created primarily for research purposes. As a fully generative model, it can behave unpredictably and sometimes deviates from the provided text prompt.
Official model repository: https://github.com/suno-ai/bark
Unlike classic TTS systems, Bark does not use SSML markup. Instead, it is trained to treat specific inline text tags as instructions for generating sounds.
How to encode emotions and sounds in Bark
1. Basic Principle
All control commands are written in square brackets. Important: The tags themselves must be written in English, even if the main text you are generating is in Russian, Spanish, or any other language.
Syntax:
Text before effect [effect_tag] text after effect.
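As a minimal sketch of this syntax in Python: `insert_tag` is a hypothetical helper (not part of Bark) that places an English tag between two pieces of text, using an ASCII check as a rough stand-in for "written in English". The `synthesize` function assumes the `bark` and `scipy` packages are installed and relies on the `generate_audio`, `preload_models`, and `SAMPLE_RATE` names shown in the official repository's README; it is defined here but not called, since it downloads the model weights.

```python
def insert_tag(before: str, tag: str, after: str = "") -> str:
    """Join text and a control tag as: before [tag] after."""
    tag = tag.strip().strip("[]").strip()
    if not tag:
        raise ValueError("tag must not be empty")
    # Rough proxy (assumption): English tags are plain ASCII.
    if not tag.isascii():
        raise ValueError("tags should be written in English (ASCII)")
    parts = [before.strip(), f"[{tag}]", after.strip()]
    return " ".join(p for p in parts if p)


def synthesize(prompt: str, out_path: str = "bark_out.wav") -> None:
    """Render a prompt to a WAV file (requires bark and scipy)."""
    # Imported lazily so the rest of this sketch runs without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(prompt)
    write_wav(out_path, SAMPLE_RATE, audio)


# The main text is Spanish, but the tag stays in English.
prompt = insert_tag("Eso fue inesperado", "laughs", "pero me gustó.")
print(prompt)
```

Calling `synthesize(prompt)` would then write the result to `bark_out.wav`, a file name chosen here only for illustration.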
2. List of supported tags (Non-speech sounds)
The Bark README lists the following known tags for non-verbal sounds:
- [laughter]
- [laughs]
- [sighs]
- [music]
- [gasps]
- [clears throat]
Note: Variations like [man laughs] and [woman laughs] also exist, but they tend to work most reliably when the gender of the selected speaker (Speaker History) matches the tag.
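Because a mistyped or unsupported tag is silently treated as ordinary text, it can help to check a prompt before submitting it. The sketch below is illustrative only: `KNOWN_TAGS` is an assumption based on the non-speech sounds listed in Bark's README, and `find_unknown_tags` is a hypothetical helper name.

```python
import re

# Assumption: tags taken from the known non-speech sounds in Bark's README.
KNOWN_TAGS = {"laughter", "laughs", "sighs", "music", "gasps", "clears throat"}

# Matches anything written in square brackets, without nesting.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(prompt: str) -> list[str]:
    """Return bracketed tags in the prompt that are not in KNOWN_TAGS."""
    found = TAG_PATTERN.findall(prompt)
    return [tag for tag in found if tag.strip().lower() not in KNOWN_TAGS]


print(find_unknown_tags("Well [sighs] that was close [whisper] right?"))
```

A tag reported here is not necessarily invalid. Variations such as [man laughs] would also be flagged, which only means they are outside the assumed list.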
3. Generating singing and music
To make the model "sing" the text rather than read it, musical notes are used.
- Method: Wrap the text in musical note symbols ♪ (Shift + Alt + V on Mac or Alt+13 on Windows, or just copy the symbol).
- Example: ♪ In the jungle, the mighty jungle, the lion sleeps tonight ♪
- Tip: This works best if you use English, as the training dataset contained many English songs, but results can be achieved in other languages too.
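The wrapping step can be automated with a small helper. This is a sketch only; `as_song` is a hypothetical name, and the function just builds the prompt string without guaranteeing that the model will actually sing.

```python
NOTE = "\u266a"  # the ♪ character (EIGHTH NOTE)


def as_song(lyrics: str) -> str:
    """Wrap lyrics in musical note symbols, avoiding doubled notes."""
    # Remove surrounding whitespace and any notes that are already present.
    text = lyrics.strip().strip(NOTE).strip()
    return f"{NOTE} {text} {NOTE}"


print(as_song("In the jungle, the mighty jungle, the lion sleeps tonight"))
```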
4. Pauses and Intonation (Prosody)
Although there are no special tags for pauses (like the SSML <break> element), Bark is sensitive to punctuation and special characters, because it treats the text as structured input.
- Ellipsis and dash (... and —): Use an ellipsis or an em dash to create pauses, hesitations, or hitches in speech.
  - Example: I... I'm not sure that's right.
- CAPS LOCK: Sometimes (not guaranteed) writing a word in CAPITAL LETTERS can add emphasis or increase volume.
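Both techniques are plain text transformations, so they can be applied programmatically. The helpers below (`hesitate` and `emphasize`) are hypothetical names used for illustration; they only edit the prompt, and whether Bark responds with a pause or extra emphasis is not guaranteed.

```python
import re


def emphasize(text: str, word: str) -> str:
    """Upper-case every whole-word occurrence of `word` (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), text)


def hesitate(text: str, word: str) -> str:
    """Add an ellipsis after the first whole-word occurrence of `word`."""
    pattern = re.compile(rf"\b({re.escape(word)})\b")
    return pattern.sub(r"\1...", text, count=1)


line = hesitate("I am not sure that's right.", "I")
line = emphasize(line, "not")
print(line)  # I... am NOT sure that's right.
```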
5. Important nuances of operation (Disclaimer)
- Probabilistic nature: Bark is essentially a GPT-style model for audio. If you write [laughter], the model will most likely generate laughter, but it may sometimes ignore the tag or produce a strange sound.
- Context matters: The [laughter] tag works more naturally after a joke than in the middle of a tragic sentence. The model "understands" the semantics of the text.
- Whispering: There is no official [whisper] tag. However, the community has noticed that adding words like "quietly" or using specific speakers (Speaker Prompts) sometimes helps, though this is a matter of trial and error.
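Since the output varies from run to run, one practical workaround is to generate several takes of the same prompt and keep the one that sounds best. The sketch below illustrates that idea and is not a method described by the Bark authors. `generate_takes` and `fake_generate` are hypothetical names; the generator is passed in as a parameter so the example runs without Bark installed.

```python
from typing import Callable, List, Optional, Sequence


def generate_takes(
    prompt: str,
    n_takes: int,
    generate_fn: Callable[..., Sequence[float]],
    speaker: Optional[str] = None,
) -> List[Sequence[float]]:
    """Call generate_fn n_takes times so the best take can be picked by ear."""
    if n_takes < 1:
        raise ValueError("n_takes must be at least 1")
    # Only pass a speaker prompt when one was requested.
    kwargs = {"history_prompt": speaker} if speaker else {}
    return [generate_fn(prompt, **kwargs) for _ in range(n_takes)]


def fake_generate(prompt: str, history_prompt: Optional[str] = None) -> List[float]:
    """Stand-in for a real generator: returns silence, one sample per character."""
    return [0.0] * len(prompt)


takes = generate_takes("That joke was great [laughter]", 3, fake_generate)
print(len(takes))  # 3
```

With Bark installed, `bark.generate_audio` could be passed as `generate_fn`, optionally with a speaker preset such as `speaker="v2/en_speaker_6"`, since that function accepts a `history_prompt` argument.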
Site limitations: currently, every submitted text is truncated to 1000 characters.
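To avoid losing the end of a longer text, it can be split into pieces that each fit the limit before submission. The following is a minimal sketch, assuming the limit counts Python string characters; `split_for_limit` is a hypothetical helper that prefers sentence boundaries and hard-splits only when a single sentence is too long.

```python
import re

MAX_CHARS = 1000  # the limit stated above


def split_for_limit(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


parts = split_for_limit("Hello world. " * 200)
print([len(part) for part in parts])
```

Each chunk can then be submitted separately and the resulting audio clips joined afterwards.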