Overview: On-Device Text Embeddings with Apple’s NLP Framework #
Apple’s Natural Language framework provides developers with powerful, privacy-preserving tools for processing and understanding text directly on iOS, macOS, and other Apple devices. One of its most compelling features is the ability to generate text embeddings—numerical representations of words, sentences, or documents that capture semantic meaning. These embeddings enable a wide range of applications, from search and recommendation systems to classification and clustering, all while keeping user data on the device.
This guide explores how to use Apple’s Natural Language framework to create and leverage on-device text embeddings. It covers the fundamentals of embeddings, the built-in and custom embedding options available, and practical steps for integrating them into your app.
What Are Text Embeddings? #
Understanding Embeddings #
Text embeddings are vector representations of text, where each word, sentence, or document is mapped to a point in a high-dimensional space. The key idea is that semantically similar texts are close together in this space, while dissimilar ones are farther apart. This allows algorithms to reason about text in a way that goes beyond simple keyword matching.
For example, the words “car” and “automobile” might have very similar vectors, while “car” and “banana” would be far apart. This property makes embeddings useful for tasks like finding synonyms, clustering related documents, or classifying text.
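This notion of closeness can be made concrete with cosine similarity, the measure most embedding comparisons use under the hood. Here is a minimal, self-contained sketch; the vectors are made up purely for illustration:

```swift
// Cosine similarity: 1.0 = same direction (very similar),
// 0.0 = unrelated, -1.0 = opposite.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "vectors must have equal dimensions")
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (normA * normB)
}

// Toy three-dimensional "embeddings" (illustrative only; real
// embeddings have hundreds of dimensions).
let car = [0.9, 0.1, 0.05]
let automobile = [0.85, 0.15, 0.1]
let banana = [0.05, 0.9, 0.8]

print(cosineSimilarity(car, automobile)) // high, close to 1
print(cosineSimilarity(car, banana))     // much lower
```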
Why On-Device Embeddings Matter #
On-device embeddings offer several advantages:
- Privacy: User data never leaves the device.
- Performance: No network latency; processing is fast and responsive.
- Offline Capability: Apps can work without an internet connection.
Apple’s Natural Language framework is designed to make these benefits accessible to developers, with both built-in and custom embedding options.
Built-In Embeddings with NLEmbedding #
Getting Started #
The Natural Language framework provides built-in word and sentence embeddings for several languages, including English, Spanish, French, Italian, German, Portuguese, and Simplified Chinese. These embeddings are pre-trained and ready to use, making them a great starting point for many applications.
To use a built-in embedding, you first request an instance of NLEmbedding for the desired language:

```swift
import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Use the embedding
}
```

Working with Word Embeddings #
Once you have an embedding, you can get the vector representation of a word:

```swift
if let vector = embedding.vector(for: "apple") {
    print(vector)
}
```

You can also compute the distance between two words, which is a measure of their semantic similarity:
```swift
let distance = embedding.distance(between: "apple", and: "orange")
print(distance)
```

The distance is a cosine distance, where smaller values indicate greater similarity. If a word is not in the embedding's vocabulary, the distance is typically returned as 2.0, the maximum possible cosine distance.
Finding Nearest Neighbors #
A common use case is finding the most similar words to a given query. The framework provides a method to enumerate the nearest neighbors:
```swift
let neighbors = embedding.neighbors(for: "apple", maximumCount: 5)
for (word, distance) in neighbors {
    print("\(word): \(distance)")
}
```

This is useful for autocomplete, synonym suggestion, and other search-related features.
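As an illustration, the neighbor lookup can power a small synonym-suggestion helper. This is a hedged sketch: the suggestSynonyms function and its distance threshold are hypothetical choices, not part of the framework.

```swift
import NaturalLanguage

// Suggest up to `limit` words semantically close to `word`,
// dropping weak matches above a cosine-distance threshold.
func suggestSynonyms(for word: String, limit: Int = 5,
                     maxDistance: Double = 1.0) -> [String] {
    guard let embedding = NLEmbedding.wordEmbedding(for: .english) else {
        return []
    }
    return embedding.neighbors(for: word, maximumCount: limit)
        .filter { $0.1 <= maxDistance }
        .map { $0.0 }
}
```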
Sentence Embeddings #
For sentence-level embeddings, you can use NLEmbedding.sentenceEmbedding(for:):
```swift
if let sentenceEmbedding = NLEmbedding.sentenceEmbedding(for: .english) {
    if let vector = sentenceEmbedding.vector(for: "This is a sentence.") {
        print(vector)
    }
}
```

Sentence embeddings can be used for tasks like document similarity, clustering, or as input features for machine learning models.
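Sentence-level distance makes a quick paraphrase check possible. A sketch, assuming the English sentence-embedding assets are available on the device; the example sentences are illustrative:

```swift
import NaturalLanguage

if let sentenceEmbedding = NLEmbedding.sentenceEmbedding(for: .english) {
    let a = "The weather is lovely today."
    let b = "It is a beautiful, sunny day."
    let c = "The stock market fell sharply."

    // Smaller cosine distance means more similar meaning, so the
    // paraphrase pair (a, b) should score lower than (a, c).
    print(sentenceEmbedding.distance(between: a, and: b))
    print(sentenceEmbedding.distance(between: a, and: c))
}
```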
Custom Embeddings with Create ML #
When to Use Custom Embeddings #
While built-in embeddings are convenient, they may not cover all your needs. For example, you might want embeddings for a domain-specific vocabulary, a language not supported by Apple, or a custom model trained on your own data. In these cases, you can create custom embeddings using Create ML.
Creating a Custom Embedding #
To create a custom embedding, you need a dictionary of words and their corresponding vectors. These vectors can be generated using techniques like Word2Vec, GloVe, or BERT. Once you have your vectors, you can create an MLWordEmbedding object and save it as a Core ML model:
```swift
import CreateML

// Illustrative vectors; in practice these would come from Word2Vec,
// GloVe, or a similar training pipeline.
let customVectors: [String: [Double]] = [
    "apple": [0.1, 0.2, 0.3],
    "orange": [0.4, 0.5, 0.6],
    // ... more words
]

let embedding = try MLWordEmbedding(dictionary: customVectors)
try embedding.write(to: URL(fileURLWithPath: "/tmp/CustomEmbedding.mlmodel"))
```

Using Custom Embeddings in Your App #
To use a custom embedding in your app, compile the Core ML model, then load the compiled model with NLEmbedding:

```swift
import CoreML
import NaturalLanguage

// NLEmbedding expects a compiled model (.mlmodelc), so compile
// the .mlmodel first.
let modelURL = URL(fileURLWithPath: "/tmp/CustomEmbedding.mlmodel")
let compiledURL = try MLModel.compileModel(at: modelURL)
let customEmbedding = try NLEmbedding(contentsOf: compiledURL)
```

You can then use the custom embedding just like a built-in one, getting vectors and finding neighbors.
Advanced: Multilingual and Contextual Embeddings #
Multilingual Models #
Apple’s Natural Language framework supports multilingual embeddings, allowing you to work with text in multiple languages. This is particularly useful for apps that serve a global audience. The framework's multilingual models cover 27 languages across three scripts (Latin, Cyrillic, and CJK), making it possible to build applications that understand and process text in a wide range of languages.
Contextual Embeddings with NLContextualEmbedding #
For more advanced use cases, such as text classification or word tagging, you can use contextual embeddings. These embeddings take into account the context in which a word appears, providing more nuanced representations than static embeddings.
To use contextual embeddings, you can leverage the NLContextualEmbedding class. This allows you to load a pre-trained model, apply it to a piece of text, and get the resulting embedding vectors:
```swift
import NaturalLanguage

// Note: the model assets must already be downloaded to the device.
if let contextualEmbedding = NLContextualEmbedding(modelIdentifier: "your-model-id") {
    try contextualEmbedding.load()
    let text = "Your text here"
    let result = try contextualEmbedding.embeddingResult(for: text, language: .english)
    result.enumerateTokenVectors(in: text.startIndex..<text.endIndex) { vector, _ in
        print(vector)
        return true
    }
}
```

These vectors can then be used as input to your own machine learning models, such as those built with PyTorch or TensorFlow, and converted to Core ML models for on-device inference.
Practical Applications #
Search and Recommendation #
Embeddings are ideal for building search and recommendation systems. By representing queries and items as vectors, you can quickly find the most relevant results based on semantic similarity.
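As a sketch of the idea (the rank(items:against:) helper is hypothetical), a query can be matched against candidate items by sorting on sentence-embedding distance:

```swift
import NaturalLanguage

// Order candidate items from most to least semantically relevant
// to the query, using the built-in English sentence embedding.
func rank(items: [String], against query: String) -> [String] {
    guard let embedding = NLEmbedding.sentenceEmbedding(for: .english) else {
        return items
    }
    return items.sorted {
        embedding.distance(between: query, and: $0) <
        embedding.distance(between: query, and: $1)
    }
}
```

For large catalogs, item vectors would typically be precomputed and stored rather than recomputed on every query.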
Text Classification #
Embeddings can be used as features for text classification models. For example, you might classify emails as spam or not spam, or categorize news articles by topic.
Clustering and Topic Modeling #
By clustering documents based on their embeddings, you can discover topics and group related content together. This is useful for organizing large collections of text, such as user reviews or social media posts.
Privacy-Preserving AI #
On-device embeddings enable privacy-preserving AI applications, where sensitive user data is processed locally and never sent to a server. This is particularly important for apps that handle personal or confidential information.
Conclusion #
Apple’s Natural Language framework provides a robust set of tools for working with on-device text embeddings. Whether you’re using built-in embeddings for common languages, creating custom embeddings for specialized needs, or leveraging advanced contextual embeddings for complex tasks, the framework makes it easy to build powerful, privacy-preserving NLP applications. By understanding the concepts and APIs covered in this guide, you can unlock new possibilities for your apps and deliver better experiences to your users.