Text-to-speech (TTS) technology has become a cornerstone of modern mobile apps, enabling accessibility, hands-free interaction, and richer user experiences. Apple’s AVSpeechSynthesizer, part of the built-in AVFoundation framework, lets developers convert text into spoken words directly on iOS devices—no internet connection required. This makes it a go-to choice for privacy-conscious apps, offline functionality, and seamless integration with Apple’s ecosystem.
Why Offline Text-to-Speech Matters #
Imagine using a navigation app in a remote area with no signal, or a language learning tool that reads aloud without relying on the cloud. Offline TTS ensures your app remains functional and private, even when connectivity is spotty or unavailable. Unlike cloud-based TTS services, which send your text to remote servers, AVSpeechSynthesizer processes everything on the device. This means your data never leaves the phone, making it ideal for apps that prioritize user privacy.
For example, apps like Personal LLM leverage on-device processing to keep all AI interactions private and offline. Just as Personal LLM runs large language models locally, AVSpeechSynthesizer lets your app speak without exposing user data to the internet.
How AVSpeechSynthesizer Works #
At its core, AVSpeechSynthesizer is like a digital narrator. You give it a piece of text, and it reads it aloud using a synthesized voice. The process is straightforward:
- Create a synthesizer: This is your main tool for generating speech.
- Prepare an utterance: This is the text you want spoken, along with optional settings like pitch, speed, and volume.
- Speak the utterance: The synthesizer converts your text into audio and plays it.
Here’s a simple analogy: Think of the synthesizer as a stage, the utterance as a script, and the voice as the actor. You write the script (text), choose the actor (voice), and the stage (synthesizer) brings it to life.
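Putting the analogy into code, here is a minimal end-to-end sketch; the SpeechNarrator class and say(_:) method are illustrative names, not part of the framework. Keeping the synthesizer as a stored property matters, because speech can stop if the synthesizer is deallocated mid-utterance. The individual steps are broken out one at a time in the next section.

```swift
import AVFoundation

// Illustrative wrapper; the type and method names are placeholders.
final class SpeechNarrator {
    // Keep a strong reference: speech can stop if the synthesizer is deallocated.
    private let synthesizer = AVSpeechSynthesizer()

    func say(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)              // the script
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")  // the actor
        synthesizer.speak(utterance)                                 // the stage
    }
}

// Usage:
// let narrator = SpeechNarrator()
// narrator.say("Hello, world!")
```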
Step-by-Step Implementation #
1. Setting Up the Synthesizer #
First, create an instance of AVSpeechSynthesizer. This is your entry point for all speech synthesis tasks.
```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()
```

2. Creating an Utterance #
Next, create an AVSpeechUtterance with the text you want spoken. You can customize how it sounds by adjusting properties like pitch, rate, and volume.
```swift
let utterance = AVSpeechUtterance(string: "Hello, world!")
utterance.pitchMultiplier = 1.25 // Higher pitch
utterance.rate = AVSpeechUtteranceDefaultSpeechRate // Normal speed
utterance.volume = 1.0 // Full volume
```

3. Speaking the Utterance #
Finally, pass the utterance to the synthesizer to start speaking.
```swift
synthesizer.speak(utterance)
```

You can queue multiple utterances for longer texts, allowing for pauses or emphasis between sections.
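As a sketch of that queueing behavior: AVSpeechSynthesizer keeps its own internal queue, so repeated calls to speak(_:) play in order, and the preUtteranceDelay/postUtteranceDelay properties insert pauses between sections. The sample text below is illustrative.

```swift
let sections = [
    "Chapter one.",
    "It was a bright cold day in April.",
    "Chapter two."
]

for text in sections {
    let utterance = AVSpeechUtterance(string: text)
    // Pause briefly after each section before the next one begins.
    utterance.postUtteranceDelay = 0.5
    synthesizer.speak(utterance) // utterances are queued and spoken in order
}
```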
Customizing Voices and Settings #
AVSpeechSynthesizer ships with dozens of built-in voices covering many languages and regional variants (for example, en-US versus en-GB). You can set the voice for each utterance to match your app’s needs.
```swift
if let voice = AVSpeechSynthesisVoice(language: "en-US") {
    utterance.voice = voice
}
```

You can also adjust the speech rate, pitch, and volume to create expressive or dramatic effects. For example, a slower rate might be used for educational apps, while a higher pitch could make a voice sound more cheerful.
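Here is a small sketch of two such presets; the specific rate and pitch values are illustrative starting points rather than Apple recommendations.

```swift
// Slower, clearer delivery for an educational app (values are illustrative).
let lesson = AVSpeechUtterance(string: "The mitochondria is the powerhouse of the cell.")
lesson.voice = AVSpeechSynthesisVoice(language: "en-US")
lesson.rate = AVSpeechUtteranceDefaultSpeechRate * 0.8
lesson.pitchMultiplier = 1.0

// Brighter, more cheerful delivery.
let greeting = AVSpeechUtterance(string: "Great job! You finished the lesson.")
greeting.voice = AVSpeechSynthesisVoice(language: "en-US")
greeting.pitchMultiplier = 1.2
greeting.volume = 1.0

synthesizer.speak(lesson)
synthesizer.speak(greeting)
```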
Advanced Features: SSML and Personal Voices #
Apple has expanded AVSpeechSynthesizer with support for Speech Synthesis Markup Language (SSML), which lets you add expressive elements like pauses, emphasis, and other prosody controls. SSML is like a script with stage directions for your digital narrator, allowing you to fine-tune how text is spoken.
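Here is a brief sketch of the SSML path, which uses the AVSpeechUtterance(ssmlRepresentation:) initializer available in iOS 16 and later; the markup itself is illustrative.

```swift
let ssml = """
<speak>
    Welcome back.
    <break time="500ms"/>
    You have <emphasis level="strong">three</emphasis> new messages.
</speak>
"""

// The initializer is failable: it returns nil if the SSML cannot be parsed.
if let utterance = AVSpeechUtterance(ssmlRepresentation: ssml) {
    synthesizer.speak(utterance)
}
```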
In iOS 17 and later, you can also integrate Personal Voice, a feature that lets users create a voice that sounds like them. This is especially useful for accessibility apps, where users want their device to speak in their own voice.
```swift
AVSpeechSynthesizer.requestPersonalVoiceAuthorization { status in
    if status == .authorized {
        // Personal voices appear alongside the system voices once authorized.
        let personalVoices = AVSpeechSynthesisVoice.speechVoices()
            .filter { $0.voiceTraits.contains(.isPersonalVoice) }
        // Use the personal voice in your utterance, for example:
        // utterance.voice = personalVoices.first
    }
}
```

Addressing Common Misconceptions #
- Myth: Offline TTS sounds robotic. Reality: Modern voices, especially with SSML and Personal Voice, can sound natural and expressive.
- Myth: Only Apple devices support offline TTS. Reality: While AVSpeechSynthesizer is Apple’s solution, other platforms like Android have similar APIs, and cross-platform tools like Flutter or React Native offer plugins for offline TTS.
- Myth: Offline TTS is limited to basic voices. Reality: With Personal Voice and SSML, you can create highly customized and lifelike speech.
Privacy and Offline Benefits #
The biggest advantage of AVSpeechSynthesizer is privacy. Since all processing happens on the device, your app never sends user data to the cloud. This is crucial for apps that handle sensitive information, like health or finance tools.
Apps like Personal LLM take this a step further by running entire AI models locally. Just as Personal LLM keeps your chats private and offline, AVSpeechSynthesizer ensures your text-to-speech features are secure and independent.
Real-World Use Cases #
- Accessibility: Helping visually impaired users navigate apps.
- Education: Reading aloud lessons or stories in language learning apps.
- Navigation: Providing turn-by-turn directions without internet.
- Entertainment: Bringing audiobooks or games to life.
Getting Started #
To start using AVSpeechSynthesizer, you’ll need Xcode and a basic understanding of Swift. Apple’s documentation and WWDC videos provide detailed guides and code samples. For cross-platform apps, consider using plugins like flutter_tts or react-native-tts, which wrap AVSpeechSynthesizer for easier integration.
Conclusion #
AVSpeechSynthesizer is a powerful, privacy-focused tool for adding text-to-speech to your iOS apps. Whether you’re building an accessibility feature, an educational tool, or a privacy-first app like Personal LLM, offline TTS ensures your users can interact with your app anytime, anywhere—without compromising their data. With expressive voices, customizable settings, and advanced features like Personal Voice, Apple’s framework makes it easy to create rich, engaging experiences that respect user privacy.