Variable: TokenEncoding

ts
const TokenEncoding: readonly [
  "gpt2",
  "r50k_base",
  "p50k_base",
  "p50k_edit",
  "cl100k_base",
  "o200k_base",
  "gemini",
  "llama2",
  "claude",
];

The set of supported token encoding identifiers.
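
Because the value is a readonly tuple of string literals, a union type of the identifiers can be derived from it and used to validate untrusted input. The following is a minimal sketch, assuming the array is declared with a const assertion; the names TokenEncodingName and isTokenEncoding are hypothetical and not part of the documented API.

```typescript
// Local copy of the documented tuple; `as const` keeps the literal types.
const TokenEncoding = [
  "gpt2",
  "r50k_base",
  "p50k_base",
  "p50k_edit",
  "cl100k_base",
  "o200k_base",
  "gemini",
  "llama2",
  "claude",
] as const;

// Hypothetical alias: the union "gpt2" | "r50k_base" | ... | "claude".
type TokenEncodingName = (typeof TokenEncoding)[number];

// Hypothetical type guard: narrows arbitrary input to a supported identifier.
function isTokenEncoding(value: unknown): value is TokenEncodingName {
  return (
    typeof value === "string" &&
    (TokenEncoding as readonly string[]).some((name) => name === value)
  );
}
```

A guard like this lets a caller reject an unsupported identifier, such as one read from a config file, before it reaches any estimation code.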

Remarks

Each value maps to a specific estimation backend:

  • gpt2, r50k_base, p50k_base, p50k_edit, cl100k_base, o200k_base — exact counts via js-tiktoken (OpenAI / tiktoken-compatible models).
  • gemini — exact counts via @lenml/tokenizer-gemini, which embeds Gemini's actual SentencePiece vocabulary locally with no API call required.
  • llama2 — exact counts via llama-tokenizer-js (Llama 1 and 2). Llama 3 and later use a different vocabulary, so they should use a separate llama3 identifier, which is not yet in this list, once a suitable synchronous backend is available.
  • claude — heuristic approximation using Anthropic's published ~3.5 chars/token ratio. No local tokenizer is available for Claude 3+ models; the Anthropic SDK's messages.countTokens() API is the only exact path but requires a network call.
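
The claude heuristic amounts to a single division. The sketch below illustrates the arithmetic only: the function name, the rounding direction, and the use of UTF-16 string length as the character count are assumptions, and the library's actual implementation may differ.

```typescript
// Ratio quoted in the remarks above: roughly 3.5 characters per token.
const CLAUDE_CHARS_PER_TOKEN = 3.5;

// Hypothetical helper. Rounding up is an assumption; it keeps any
// non-empty string from estimating to zero tokens.
function estimateClaudeTokens(text: string): number {
  // text.length counts UTF-16 code units, which is itself an approximation
  // of "characters" for text outside the Basic Multilingual Plane.
  return Math.ceil(text.length / CLAUDE_CHARS_PER_TOKEN);
}
```

Under these assumptions a 7-character string estimates to 2 tokens and an 8-character string to 3. When an exact figure matters, the network-based count described above remains the only exact option.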

When adding a new encoding, add a case to Tokenizable.estimateTokens.
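
One way to make a forgotten case fail at compile time is an exhaustive switch with a never check. This is a sketch of the pattern only, not the body of Tokenizable.estimateTokens: backendFor and the Backend type are hypothetical, and the function returns the backend name listed in the remarks instead of calling any tokenizer.

```typescript
// Local copy of the documented tuple; `as const` keeps the literal types.
const TokenEncoding = [
  "gpt2", "r50k_base", "p50k_base", "p50k_edit", "cl100k_base",
  "o200k_base", "gemini", "llama2", "claude",
] as const;

type TokenEncodingName = (typeof TokenEncoding)[number];

// Hypothetical labels for the backends named in the remarks.
type Backend =
  | "js-tiktoken"
  | "@lenml/tokenizer-gemini"
  | "llama-tokenizer-js"
  | "heuristic";

// Hypothetical dispatcher: maps each identifier to its backend label.
function backendFor(encoding: TokenEncodingName): Backend {
  switch (encoding) {
    case "gpt2":
    case "r50k_base":
    case "p50k_base":
    case "p50k_edit":
    case "cl100k_base":
    case "o200k_base":
      return "js-tiktoken";
    case "gemini":
      return "@lenml/tokenizer-gemini";
    case "llama2":
      return "llama-tokenizer-js";
    case "claude":
      return "heuristic";
    default: {
      // If a new identifier is added to the tuple without a case above,
      // `encoding` is no longer `never` here and this line stops compiling.
      const unhandled: never = encoding;
      throw new Error(`Unhandled encoding: ${String(unhandled)}`);
    }
  }
}
```

With this shape, adding an identifier to the array without a matching case produces a type error at the default branch, so the omission is caught before runtime.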