Fine-Tuning Text to Speech with Voice Director (How-To Series - Part 2)

In this guide, we'll show you how to use Replica's fine-grained controls for Text-to-Speech (TTS) and Voice AI. Whether you’re a developer, content creator, or audio enthusiast, you'll learn fine-tuning techniques that bring your digital voices to life with precision and style.

1. Fine-grained control of Voice AI

Take command of your TTS outputs with our robust, fine-grained controls. Our advanced settings let you tailor every nuance of your voice synthesis, ensuring your digital narrations are not just heard, but felt. Here’s how you can fine-tune your Text-to-Speech:

Pitch: Adjust the tonal quality of your audio with precision, ranging from -12st to +12st.*

Pace: Control the rhythm of your speech by adjusting the pace from 50% (slower, dramatic effect) to 150% (fast, energetic delivery). This flexibility is perfect for emphasizing key points or creating engaging narratives.

Volume: Set the perfect loudness with our volume control, ranging from -6dB to +6dB.*

Breaks: Enhance clarity and impact by inserting pauses anywhere between 0.1 and 2 seconds. Use breaks to create dramatic suspense or ensure every word lands with precision.

Say-as: Control how numbers, dates, and potential acronyms are interpreted.

* “st” stands for semitones, the musical intervals used to measure pitch. A change of 12st equals one octave, so this range allows you to significantly deepen or brighten your voice.
* “dB” refers to decibels, a unit of measurement for sound intensity. Adjusting within this range ensures your voice is delivered with just the right impact without distortion.
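To make the ranges and units above concrete, here is a minimal Python sketch. It is not part of Replica's product: the `LIMITS` table simply restates the ranges listed above, and the function names are hypothetical. The two conversions are standard acoustics: a shift of n semitones multiplies frequency by 2^(n/12), and a gain of g decibels multiplies amplitude by 10^(g/20).

```python
# Ranges restated from the list above; the dictionary and function names
# are illustrative assumptions, not part of any Replica API.
LIMITS = {
    "pitch_st": (-12.0, 12.0),   # semitones
    "pace_pct": (50.0, 150.0),   # percent of normal speed
    "volume_db": (-6.0, 6.0),    # decibels
    "break_s": (0.1, 2.0),       # seconds
}


def clamp(control: str, value: float) -> float:
    """Clamp a value to the documented range for the given control."""
    low, high = LIMITS[control]
    return max(low, min(high, value))


def semitones_to_frequency_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch shift: 12 st is one octave (x2)."""
    return 2.0 ** (semitones / 12.0)


def decibels_to_amplitude_ratio(decibels: float) -> float:
    """Amplitude multiplier for a gain change: +6 dB is roughly x2."""
    return 10.0 ** (decibels / 20.0)


if __name__ == "__main__":
    print(clamp("pitch_st", 15))                      # 12.0
    print(semitones_to_frequency_ratio(12))           # 2.0
    print(round(decibels_to_amplitude_ratio(6), 3))   # 1.995
```

In other words, the full pitch range spans one octave down to one octave up, and the full volume range spans roughly half to double the original amplitude.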


2. Control the Prosody of your Text to Speech

Highlight the section of text you want to modify, and a floating toolbar will appear. Adjust the pitch, pace, and volume using the intuitive controls, then listen to the result to confirm that your adjustments complement your content.

Figure: Adjusting the pitch during fine-tuning.

Figure: Adjusting the pace during fine-tuning.

Figure: Adjusting the volume during fine-tuning.
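If you prefer to think in markup, the same three adjustments map naturally onto the `<prosody>` element defined by the W3C SSML specification. The sketch below is an assumption about how such markup could be generated in Python; the helper name `prosody` is hypothetical, and the exact attribute values Replica's API accepts should be checked against its documentation.

```python
from xml.sax.saxutils import escape


def prosody(text, pitch_st=0, pace_pct=100, volume_db=0):
    """Wrap text in an SSML <prosody> element (hypothetical helper).

    Only attributes that differ from the neutral setting are emitted.
    """
    attrs = []
    if pitch_st:
        attrs.append(f'pitch="{pitch_st:+g}st"')
    if pace_pct != 100:
        attrs.append(f'rate="{pace_pct:g}%"')
    if volume_db:
        attrs.append(f'volume="{volume_db:+g}dB"')
    if not attrs:
        return escape(text)          # nothing to adjust: plain escaped text
    return f"<prosody {' '.join(attrs)}>{escape(text)}</prosody>"


print(prosody("really", pitch_st=2, pace_pct=80))
# <prosody pitch="+2st" rate="80%">really</prosody>
```

Emitting only the non-neutral attributes keeps the markup short, which is in the same spirit as preferring simple configurations over deeply nested ones.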

3. Inserting Breaks

You can insert a break before or after a selection. Highlight a word, or a space or comma character, to bring up the toolbar, then choose whether to insert the break before or after the selection.

Figure: Adding a break before a word.
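In SSML terms, a pause like this corresponds to the standard `<break>` element. The following Python sketch shows one way to place a break before or after a word; both helpers are hypothetical, the 0.1 to 2 second bounds restate the range given in section 1, and the input is assumed to be plain text without XML special characters.

```python
def break_tag(seconds):
    """Return an SSML <break> element for a pause of the given length."""
    if not 0.1 <= seconds <= 2.0:
        raise ValueError("break length must be between 0.1 and 2 seconds")
    return f'<break time="{round(seconds * 1000)}ms"/>'


def insert_break(text, word, seconds, before=True):
    """Insert a break before or after the first occurrence of `word`."""
    index = text.index(word)          # raises ValueError if word is absent
    if not before:
        index += len(word)
    return text[:index] + break_tag(seconds) + text[index:]


print(insert_break("And the winner is Ada.", "Ada", 0.5))
# And the winner is <break time="500ms"/>Ada.
```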

4. Using Say-as to Control Interpretation

spell-out: Pronounce each letter individually. For example, you can choose whether WHO is pronounced as the word 'who' or spelled out as 'W H O', as in the World Health Organization.

Figure: Specifying that WHO should be spelled out.

date: Ensure dates like 11/11/24 are read as “11th of November 2024” or “November 11th 2024.”

currency: Convert symbols like $10 into natural language (“ten dollars”).

number: Choose from:

• cardinal: “2025” becomes “two thousand and twenty-five.”

• digits: “2025” becomes “two zero two five.”

• year: “2025” becomes “twenty twenty-five.”

Figure: Alternate ways to pronounce dates.
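The Say-as options above correspond to the `<say-as>` element in SSML, which carries an `interpret-as` attribute and an optional `format` attribute. The Python sketch below is illustrative only: the helper names are hypothetical, and the attribute values simply mirror the toolbar option names, so whether an API accepts exactly these strings is an assumption to verify in the API docs. The small `read_as_digits` function shows what the "digits" interpretation does to a number.

```python
from xml.sax.saxutils import escape, quoteattr

DIGIT_WORDS = "zero one two three four five six seven eight nine".split()


def say_as(text, interpret_as, fmt=None):
    """Wrap text in an SSML <say-as> element (hypothetical helper)."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def read_as_digits(number):
    """Spell a string of digits one digit at a time."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in number)


print(say_as("WHO", "spell-out"))
# <say-as interpret-as="spell-out">WHO</say-as>
print(say_as("11/11/24", "date", fmt="dmy"))
# <say-as interpret-as="date" format="dmy">11/11/24</say-as>
print(read_as_digits("2025"))
# two zero two five
```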

5. Resetting Fine-Tuning

Made an adjustment you’d like to undo? No problem! Simply click on the marked-up word or phrase and select Reset. This action clears all fine-tuning adjustments, returning your text to its original, unaltered state, ready for a fresh start.

6. SSML with our API

Check out our API docs to see how to use SSML to fine-tune your API requests: https://docs.replicastudios.com
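Before sending SSML anywhere, it can help to confirm that the markup is well-formed XML, since an unclosed or mis-nested tag is an easy mistake to make by hand. The Python sketch below uses only the standard library. The `check_ssml` helper is hypothetical, and wrapping the fragment in a `<speak>` root follows the W3C SSML convention; whether Replica's API expects that root element, and how the request itself is made, are covered by the API docs rather than by this example.

```python
import xml.etree.ElementTree as ET


def check_ssml(fragment):
    """Wrap a fragment in <speak> and confirm it parses as XML.

    Returns the full document string, or raises ValueError if the markup
    is not well-formed (for example, an unclosed or mis-nested tag).
    """
    document = f"<speak>{fragment}</speak>"
    try:
        ET.fromstring(document)
    except ET.ParseError as error:
        raise ValueError(f"SSML is not well-formed: {error}") from error
    return document


ssml = check_ssml(
    'Welcome back. <break time="500ms"/>'
    'Your total is <say-as interpret-as="currency">$10</say-as>, '
    'which is <prosody pitch="+2st">great</prosody> news.'
)
print(ssml)
```

This check only catches structural errors; it does not validate element names or attribute values against any particular service.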

7. Best Practices for Fine-Tuning

Global vs. Fine-tuned Controls

Where appropriate, prefer our global controls for pitch, pace, and volume when adjusting the entire text. They produce consistent, high-quality output, while local controls are ideal for spotlighting specific words or phrases.

Keep It Simple

While our system supports complex nested fine-tuning, simpler configurations often yield the best results. Overcomplicating your adjustments might lead to unexpected outcomes.

Mind the Limits

Some voices may distort at the extreme ends of pitch or pace. Experiment within recommended ranges to maintain audio clarity and naturalness. If you really need to push pitch in a particular direction, consider combining our global and fine-tune pitch controls.

Find the Best Place to Break

Inserting a break in the middle of a group of words may make the generated audio sound jarring. Consider first using a comma or full stop to get a more natural result. If the pause is still not long enough, a comma or full stop is an ideal place to add an extra break.
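The advice above can be sketched in Python as a helper that adds breaks only at existing punctuation rather than between words. This is an illustration under stated assumptions: the `lengthen_pauses` name is hypothetical, the input is assumed to be plain text without markup, and the `<break>` syntax is the standard SSML form rather than anything specific to Replica.

```python
import re


def lengthen_pauses(text, seconds=0.5):
    """Add an SSML break after each comma or full stop that is
    followed by whitespace (hypothetical helper)."""
    tag = f'<break time="{round(seconds * 1000)}ms"/>'
    # Match ',' or '.' followed by whitespace, so decimals like 3.5
    # and a final full stop at the very end are left alone.
    return re.sub(r"([,.])(\s+)", lambda m: m.group(1) + tag + m.group(2), text)


print(lengthen_pauses("First, take a breath. Then begin."))
# First,<break time="500ms"/> take a breath.<break time="500ms"/> Then begin.
```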

Harness the Power of SSML

Our API supports SSML, offering unparalleled control over your TTS output. Dive into our API documentation for advanced customization and truly next-level Voice AI integration.

Conclusion: Transform Your Text-to-Speech with Fine-Grained Control

By leveraging these fine-tuning techniques and SSML capabilities, you can transform ordinary TTS outputs into dynamic, engaging audio experiences. Whether you’re producing content for marketing, education, or entertainment, mastering these controls will help you deliver clear, compelling, and emotionally resonant messages. Embrace the future of Voice AI with our comprehensive guide, and let every word you produce resonate with precision and passion. Get ready to elevate your audio content and make every narrative count! Happy fine-tuning, and enjoy creating brilliant, high-impact Text-to-Speech experiences!