MiniMax Speech (via Fal.ai)

MiniMax Speech is a family of text-to-speech models hosted on Fal.ai. The models support inline interjections and pauses, voice customization (speed, volume, pitch, emotion), and 34+ languages.

Quick Example

from tarash.tarash_gateway import generate_tts
from tarash.tarash_gateway.models import AudioGenerationConfig, AudioOutputFormat, TTSRequest

config = AudioGenerationConfig(
    provider="fal",
    model="fal-ai/minimax/speech-2.8-hd",
    api_key="YOUR_FAL_KEY",
)

request = TTSRequest(
    text="Hello! (laughs) This is a test of the MiniMax Speech model.",
    voice_id="Wise_Woman",
    output_format=AudioOutputFormat(format="mp3", sample_rate=44100, bitrate=128),
)

response = generate_tts(config, request)
print(f"Duration: {response.duration}s")

Supported Models

Model	Quality	Notes
`fal-ai/minimax/speech-2.8-hd`	HD	Highest quality, interjection support
`fal-ai/minimax/speech-2.8-turbo`	Turbo	Faster generation with streaming
`fal-ai/minimax/speech-2.6-hd`	HD	Previous HD version
`fal-ai/minimax/speech-2.6-turbo`	Turbo	Previous turbo version

Parameters

Parameter	TTSRequest field	Notes
Text	`text`	Supports `<#x#>` pauses (0.01–99.99s) and interjections: `(laughs)`, `(sighs)`, `(coughs)`, etc.
Voice	`voice_id`	Built-in voices: `Wise_Woman`, `Friendly_Person`, `Inspirational_girl`, etc.
Output format	`output_format`	`AudioOutputFormat(format="mp3", sample_rate=44100, bitrate=128)`, etc.
Language	`language_code`	Maps to `language_boost`: `English`, `French`, `Japanese`, `Chinese`, etc. (34+ languages)
Speed	`voice_settings={"speed": 1.2}`	Range: 0.5–2.0
Volume	`voice_settings={"vol": 0.8}`	Range: 0.01–10
Emotion	`voice_settings={"emotion": "happy"}`	`happy`, `sad`, `angry`, `fearful`, `disgusted`, `surprised`, `neutral`
Pitch	`voice_settings={"pitch": 3}`	Range: -12 to 12
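
The ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch rather than part of `tarash_gateway`: `validate_voice_settings` is a hypothetical helper, and it assumes the documented ranges are inclusive. Keys it does not recognize are passed through untouched.

```python
# Hypothetical helper, not part of tarash_gateway: checks a voice_settings
# dict against the ranges documented in the Parameters table.
# Assumption: the documented ranges are inclusive at both ends.
VOICE_SETTING_RANGES = {
    "speed": (0.5, 2.0),
    "vol": (0.01, 10),
    "pitch": (-12, 12),
}
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def validate_voice_settings(settings: dict) -> dict:
    """Return settings unchanged, or raise ValueError on a bad value."""
    for key, value in settings.items():
        if key == "emotion":
            if value not in EMOTIONS:
                raise ValueError(f"unknown emotion: {value!r}")
        elif key in VOICE_SETTING_RANGES:
            low, high = VOICE_SETTING_RANGES[key]
            if not low <= value <= high:
                raise ValueError(f"{key}={value} is outside {low} to {high}")
        # Any other key is left alone; this sketch only knows the four above.
    return settings


voice_settings = validate_voice_settings(
    {"speed": 1.2, "vol": 0.8, "emotion": "happy", "pitch": 3}
)
```

The returned dict can then be passed as `voice_settings=` when constructing a `TTSRequest`.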

Extra Parameters

Advanced features are passed through `extra_params`:

request = TTSRequest(
    text="Hello world",
    voice_id="Wise_Woman",
    voice_settings={"speed": 1.1, "emotion": "happy", "pitch": 3},
    extra_params={
        "voice_modify": {"pitch": 10, "intensity": 20, "timbre": -5},
        "pronunciation_dict": {"tone_list": ["hello/(heh-loh)"]},
    },
)

Extra param	Type	Notes
`voice_modify`	`dict`	Fine-grained pitch/intensity/timbre control (-100 to 100)
`pronunciation_dict`	`dict`	Custom pronunciation replacements
`normalization_setting`	`dict`	Loudness normalization (enabled, target_loudness, etc.)
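
Because each `voice_modify` value is limited to -100 to 100, it can be convenient to clamp values before building `extra_params`. This is a sketch under assumptions: `clamp_voice_modify` is a hypothetical helper (not part of `tarash_gateway`), and it assumes the range is inclusive and that clamping is preferable to rejecting the value.

```python
# Hypothetical helper, not part of tarash_gateway: clamps each voice_modify
# value into the documented -100 to 100 range (assumed inclusive).
def clamp_voice_modify(pitch: int = 0, intensity: int = 0, timbre: int = 0) -> dict:
    def clamp(value: int) -> int:
        return max(-100, min(100, value))

    return {
        "pitch": clamp(pitch),
        "intensity": clamp(intensity),
        "timbre": clamp(timbre),
    }


extra_params = {"voice_modify": clamp_voice_modify(pitch=10, intensity=20, timbre=-5)}
```

Clamping silently changes out-of-range input; raise an error instead if callers should be told about it.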

Interjections

The model supports natural interjections embedded in text:

(laughs) — laughter
(sighs) — sighing
(coughs) — coughing
(clears throat) — throat clearing
(gasps) — gasping
(sniffs) — sniffing
(groans) — groaning
(yawns) — yawning

request = TTSRequest(
    text="Well (sighs) I suppose we should get started. (clears throat) Hello everyone!",
    voice_id="Wise_Woman",
)
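
A misspelled tag is easy to miss in long scripts, so a quick pre-flight check can help. The sketch below uses a hypothetical helper, `unknown_interjections`, which is not part of `tarash_gateway`. It only knows the eight tags listed in this section; the model may accept others.

```python
import re

# Tags listed in this section. Assumption: this set is not exhaustive,
# so a flagged tag is a hint to double-check, not proof of an error.
KNOWN_INTERJECTIONS = {
    "laughs", "sighs", "coughs", "clears throat",
    "gasps", "sniffs", "groans", "yawns",
}


def unknown_interjections(text: str) -> list[str]:
    """Return parenthesized lowercase phrases not in KNOWN_INTERJECTIONS."""
    found = re.findall(r"\(([a-z ]+)\)", text)
    return [tag for tag in found if tag not in KNOWN_INTERJECTIONS]


text = "Well (sighs) I suppose we should get started. (clears throat) Hello everyone!"
suspicious = unknown_interjections(text)
```

Ordinary lowercase parenthetical asides such as `(see below)` are also flagged, so treat the result as a prompt for review.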

Pauses

Insert a pause with the `<#x#>` syntax, where `x` is the pause length in seconds (0.01–99.99):

request = TTSRequest(
    text="First point. <#1.5#> Second point. <#0.5#> And finally, the conclusion.",
    voice_id="Wise_Woman",
)
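
When pause lengths are computed rather than typed by hand, a small formatter keeps the markers within range. This is a sketch only: `pause` is a hypothetical helper (not part of `tarash_gateway`), and it assumes the 0.01–99.99 range is inclusive and that two decimal places is the finest resolution.

```python
# Hypothetical helper, not part of tarash_gateway: builds a <#x#> pause
# marker. Assumptions: range is inclusive, resolution is two decimals.
def pause(seconds: float) -> str:
    if not 0.01 <= seconds <= 99.99:
        raise ValueError(f"pause of {seconds}s is outside 0.01 to 99.99")
    # Round to two decimals and drop trailing zeros (1.50 becomes 1.5).
    return f"<#{round(seconds, 2):g}#>"


text = (
    f"First point. {pause(1.5)} Second point. {pause(0.5)} "
    "And finally, the conclusion."
)
```

The resulting string is the same as the hand-written example and can be passed as `text=` to `TTSRequest`.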