
AI21Labs/ai21-tokenizer


A SentencePiece-based tokenizer for production use with AI21's models



Prerequisites

  • To use the tokenizers for Jamba Mini or Jamba Large, you will need to request access to the relevant model's Hugging Face repository.
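Once access is granted, your local environment also needs to authenticate with Hugging Face so the tokenizer files can be downloaded. A minimal sketch using the standard Hugging Face tooling (the token value below is a placeholder for your own access token):

```shell
# Option 1: interactive login via the Hugging Face CLI
huggingface-cli login

# Option 2: export a token, e.g. for CI or other non-interactive environments
export HF_TOKEN="<your-access-token>"
```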

Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer

Usage

Basic Usage

from ai21_tokenizer import Tokenizer

# Create tokenizer (defaults to Jamba Mini)
tokenizer = Tokenizer.get_tokenizer()

# Encode text to token IDs
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(f"Encoded: {encoded}")

# Decode token IDs back to text
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

Specific Tokenizer Selection

from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

# Jamba Mini tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_MINI_TOKENIZER)

# Jamba Large tokenizer
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_LARGE_TOKENIZER)

Async Usage

import asyncio
from ai21_tokenizer import Tokenizer

async def main():
    tokenizer = await Tokenizer.get_async_tokenizer()

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

    print(f"Original: {text}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")

asyncio.run(main())

Advanced Token Operations

# Continuing from the Basic Usage example above
# (`tokenizer` and `encoded` are defined there)

# Convert between tokens and IDs
tokens = tokenizer.convert_ids_to_tokens(encoded)
print(f"Tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDs: {ids}")

Direct Class Usage

from ai21_tokenizer import SyncJambaTokenizer

# Using local model file
model_path = "/path/to/your/tokenizer.model"
tokenizer = SyncJambaTokenizer(model_path=model_path)

text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

Async Direct Class Usage

import asyncio
from ai21_tokenizer import AsyncJambaTokenizer

async def main():
    model_path = "/path/to/your/tokenizer.model"
    tokenizer = await AsyncJambaTokenizer.create(model_path=model_path)

    text = "Hello, world!"
    encoded = await tokenizer.encode(text)
    decoded = await tokenizer.decode(encoded)

asyncio.run(main())

For more examples, please see our examples folder.
