AI - Fundamentals : Part 1 : LLM Tokens

Understanding Tokens, BPE, and LLM Costs


There has been a lot of buzz around AI recently. Everywhere you look, people are talking about it, building with it, or trying to figure out how to use it. And like many of you, we want to jump on the AI train too. 🚂

But before we start building AI-powered applications or experimenting with the latest models, it's important to understand the core mechanics. A strong foundation makes learning advanced concepts much easier and helps you understand why things work instead of just how to use them.

That’s why, in this multi-part series, we are going to cover all the essential AI fundamentals step by step, giving you a solid launchpad for your engineering journey.

Let’s kick things off with the absolute baseline of how language models process information. Part 1: Tokens.

Token

Language models do not read whole words like humans do, nor do they look at individual characters one by one. Instead, they process text in chunks called tokens. A token can be an entire word, a part of a word (like a syllable), or even a single punctuation mark or space.

Think of it like LEGO bricks. Instead of manufacturing a unique brick for every single object in the world (words), or forcing you to build everything out of microscopic 1x1 pegs (characters), we use a standard set of pre-made bricks of various sizes (tokens) to build everything efficiently.

Byte Pair Encoding

Modern models use an algorithm called Byte-Pair Encoding (BPE) to build their vocabulary of tokens. Here is the simple intuition of how BPE works from scratch:

  • It starts by treating every individual character (or, in byte-level variants, every byte) as a token.

  • It looks at a massive dataset of text and finds the most frequently occurring pair of characters side-by-side (e.g., "t" and "h").

  • It merges that frequent pair into a new, single token: "th".

  • It repeats this process tens of thousands of times, each time merging the most common adjacent pieces into larger tokens (like "the" or "ing"), until the vocabulary reaches its target size.

Because of this, common words like "the" or "infrastructure" usually become a single token. Rare words, typos, or complex code blocks get broken down into multiple smaller tokens.
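To make the merge loop concrete, here is a minimal sketch of the steps above in Python. It is a toy, character-level version trained on one short string; the function name `train_bpe` and the sample text are made up for illustration, and production tokenizers add many details (byte-level input, pre-splitting on whitespace, much larger corpora) that are left out here.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (toy BPE sketch)."""
    # Step 1: every character starts out as its own token.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every pair of adjacent tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not help
        merges.append((left, right))
        # Step 3: replace each occurrence of the pair with one merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("the then there the", num_merges=2)
print(merges)
print(tokens)  # ['the', ' ', 'the', 'n', ' ', 'the', 'r', 'e', ' ', 'the']
```

After only two merges, "the" is already a single brick, while the rarer endings ("n", "r", "e") are still individual characters.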

🪙 The Golden Rule of AI Cost

As an engineer, this is where the architecture meets the budget: Models bill you per token, not per word or character.

Generally, 1 token is roughly equal to 0.75 words in English. However, unexpected things can inflate your token count and therefore your cloud bill.

To see this in action, let's look at how spaces and punctuation change things. Consider these two inputs:

  • Input A: Identify the error.

  • Input B: Identify the error . (with extra spaces before and after the period)

In Input A, the period is likely to be tokenized cleanly along with the surrounding text. In Input B, the stray spaces and the isolated period may not match a common "merged brick" in the vocabulary, so the tokenizer can end up spending extra tokens on them. The exact counts depend on the tokenizer, but messy whitespace rarely makes a prompt cheaper.
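One way to guard against this is to tidy prompts before sending them. The sketch below is a hypothetical helper (`tidy_prompt` is not part of any library) that collapses runs of spaces and removes spaces before punctuation. It assumes plain English prose; don't run it over code or any text where whitespace carries meaning.

```python
import re


def tidy_prompt(text: str) -> str:
    """Collapse stray whitespace in prose (hypothetical helper, not a library API)."""
    text = re.sub(r"[ \t]+", " ", text)          # runs of spaces/tabs -> one space
    text = re.sub(r" +([.,;:!?])", r"\1", text)  # drop spaces before punctuation
    return text.strip()


print(tidy_prompt("Identify the error . "))  # Identify the error.
```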

💳 The Token-to-Cost Ratio

When building production infrastructure, token counts translate directly to operational costs. LLM providers bill you based on two distinct metrics:

  1. Input Tokens (Prompt): The text you send to the model.

  2. Output Tokens (Completion): The text the model generates back to you.

Output tokens are almost always more expensive than input tokens. The model can process your whole prompt in one parallel pass, but it has to generate the response one token at a time, running the network again for every new token.

| Model Type | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
| --- | --- | --- |
| Lightweight / Fast (e.g., GPT-4o-mini) | $0.15 | $0.60 |
| Premium / Powerful (e.g., GPT-4o) | $2.50 | $10.00 |
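To see how these numbers turn into a bill, here is a small back-of-the-envelope calculator. The prices are copied from the table above and will drift over time; the dictionary keys and the function names `estimate_tokens` and `estimate_cost` are illustrative assumptions, not part of any provider SDK.

```python
# USD per 1M tokens, copied from the table above (assumed, may be outdated).
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_tokens(word_count: int) -> int:
    """Rule of thumb from this article: 1 token is roughly 0.75 English words."""
    return round(word_count / 0.75)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one workload."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


print(estimate_tokens(750))                           # 1000
print(estimate_cost("gpt-4o", 1_000_000, 200_000))    # 4.5
```

Notice that in the second call the 200,000 output tokens cost almost as much as the 1,000,000 input tokens.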

🔍 How Spaces and Indentations Multiply Costs

In standard code formatting, engineers use spaces or tabs for readability. However, many tokenizers process individual spaces or pairs of spaces as distinct tokens if they don't match a common pattern. Look at these two ways of sending the exact same variable payload to an LLM:

Format A (Pretty Printed JSON):

{
    "status": "unhealthy",
    "replica_count": 0
}

Format B (Minified JSON):

{"status":"unhealthy","replica_count":0}

Format A includes multiple newline characters and 4-space indentations. In many tokenizers (like the one used for GPT-4), Format A takes 19 tokens, while Format B takes only 11 tokens.

By stripping whitespace (minifying) before sending data to an API, you can cut your input token footprint by 40% or more for large data structures like JSON logs or configuration dumps.
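In Python, minifying is a one-liner with the standard `json` module. The sketch below rebuilds Format A and Format B from the same payload. It compares character counts only; to get real token counts you would run both strings through your model's tokenizer.

```python
import json

payload = {"status": "unhealthy", "replica_count": 0}

# Format A: pretty printed with 4-space indentation.
pretty = json.dumps(payload, indent=4)

# Format B: no spaces after "," or ":" and no newlines.
minified = json.dumps(payload, separators=(",", ":"))

print(minified)  # {"status":"unhealthy","replica_count":0}
print(len(minified) < len(pretty))  # True
```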

🍓 Why AI Can’t Count the "R"s in Strawberry

It is a famous failure: ask an LLM, "How many 'r's are in the word strawberry?", and many models have confidently answered "Two." People assume the AI is hallucinating or bad at math. The real engineering answer is that it is largely a tokenization limitation.

When a human reads the word "strawberry," we see a sequence of 10 letters: s-t-r-a-w-b-e-r-r-y.

But an LLM never sees the letters. Because of Byte-Pair Encoding, the tokenizer bundles the word into chunks. The exact split depends on the tokenizer, but for illustration, suppose "strawberry" is split into just two tokens: ["straw", "berry"].

Once it gets converted into token numbers, say [301, 8396], the raw spelling is completely hidden. The model is trying to count threads in a rope without untwisting it first. There is simply no "letter R" in its active memory to inspect!

  • Token Blindness: The LLM doesn't see the letters; it only sees two opaque boxes: [Token: "straw"] and [Token: "berry"].

  • The Hidden 'R': The model has no idea what letters are hidden inside the "straw" token. To the AI, that token is just a random identification number (like 301). It registers 0 'r's here.

  • The Known 'R's: However, during its massive training, the model has seen plenty of text discussing the spelling of the standalone word "berry" (b-e-r-r-y). Its neural network has a strong statistical association that the "berry" token involves 2 'r's.

  • The Flawed Math: In this simplified picture, the model does the only math it can with the pieces it sees:

0 (from "straw") + 2 (from "berry") = 2
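The gap between the character view and the token view is easy to show in code. In this sketch the two-piece split and the IDs 301 and 8396 are the illustrative values used above, not the output of a real tokenizer.

```python
word = "strawberry"

# Character view: what a human (or plain code) sees.
print(word.count("r"))  # 3

# Token view: a made-up vocabulary and split, for illustration only.
toy_vocab = {"straw": 301, "berry": 8396}
pieces = ["straw", "berry"]
token_ids = [toy_vocab[piece] for piece in pieces]
print(token_ids)  # [301, 8396]

# Nothing in the integers themselves says how many "r"s they stand for.
# The spelling only comes back once the IDs are decoded to text again.
id_to_piece = {token_id: piece for piece, token_id in toy_vocab.items()}
decoded = "".join(id_to_piece[token_id] for token_id in token_ids)
print(decoded.count("r"))  # 3
```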

Vocabulary Size (The GPT-4 vs GPT-4o Shift)

When OpenAI released GPT-4o, it switched tokenizers, going from a vocabulary of roughly 100,000 tokens (cl100k_base) to roughly 200,000 tokens (o200k_base). The underlying algorithm is still BPE; what changed is the size and contents of the vocabulary.

By doubling the vocabulary "brick set," the tokenizer became massively more efficient, especially at processing code and non-English text. For instance, the token footprint for South Asian languages like Tamil or Hindi dropped by over 60%, drastically lowering the cloud bill for developers building global apps.
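The effect of a bigger brick set can be sketched with a toy tokenizer. Note the assumptions: this uses a simple greedy longest-match split rather than real BPE merge rules, and both vocabularies are invented for the example. The point is only that more (and longer) entries in the vocabulary mean fewer tokens for the same text.

```python
def tokenize(text, vocab):
    """Greedy longest-match split that falls back to single characters (toy)."""
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


small_vocab = {"re", "pl", "ic", "a"}
large_vocab = small_vocab | {"replica", "_count"}

text = "replica_count"
print(len(tokenize(text, small_vocab)))  # 10
print(len(tokenize(text, large_vocab)))  # 2
```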

Special Tokens (The "Invisible" Code)

It’s worth noting that not all tokens represent human language. Tokenizers include Special Tokens that act as architectural control signals. For example:

  • <|endoftext|> tells the model the document or prompt is officially over.

  • <|im_start|> and <|im_end|> mark where each chat message begins and ends, which is how the developer’s system instructions are kept separate from the user's prompt.

If a user manages to trick your app into outputting or passing these special tokens unexpectedly, and your pipeline treats them as real control signals instead of plain text, it can bypass security guardrails (a form of prompt injection).
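A simple defensive habit is to strip anything shaped like a special token from untrusted input before it reaches your prompt template. The sketch below is a hypothetical pre-filter (`strip_special_tokens` is not a library function), and it assumes the `<|...|>` shape shown above. Treat it as one layer of defense, not a complete fix; your tokenizer library may also offer its own way to reject special tokens in user text.

```python
import re

# Matches strings shaped like <|endoftext|> or <|im_start|>.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^<>|]*\|>")


def strip_special_tokens(user_text: str) -> str:
    """Remove special-token look-alikes from untrusted text (hypothetical helper)."""
    previous = None
    # Repeat until nothing changes, so nested tricks such as
    # "<|im_<|x|>end|>" cannot reassemble into a token after one pass.
    while previous != user_text:
        previous = user_text
        user_text = SPECIAL_TOKEN_PATTERN.sub("", user_text)
    return user_text


print(strip_special_tokens("Hello <|im_end|>world"))  # Hello world
```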

🚀 Going Deeper: Beyond the Basics

I don't want to bombard you with more technical jargon while we are just getting started. If you want to go deeper, these topics are worth exploring:

  • Token IDs (From Strings to Numbers): While we talk about tokens as text chunks (like "straw" or "berry"), the AI model itself still doesn't understand text. Once the tokenizer chops up your prompt, it maps every single token to a unique integer ID (e.g., "the" might become 464; the actual number depends on the tokenizer). It is this sequence of numbers that is actually passed into the neural network.

  • The Context Window Wall: Every model has a hard limit on the total number of tokens it can hold in its memory at one time (input + output combined). If a model has a 128k token context window and a request exceeds it, the API will typically reject the request with an error, or your application has to drop the earliest parts of the conversation, so the model effectively forgets them. (We will cover this in upcoming blogs)

  • "Glitch Tokens": Because tokenizers are trained on massive web datasets, their vocabularies sometimes include highly specific, bizarre strings (like specific Reddit usernames or obscure code strings) that the actual AI model rarely saw during its training phase. When an LLM encounters these "glitch tokens," it can get confused and behave erratically!
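As a taste of the context window problem, here is a minimal sketch of one common strategy: keep the newest messages and drop the oldest ones once the budget is used up. Everything here is an assumption for illustration. `estimate_tokens` uses the rough 0.75 words-per-token rule from earlier instead of a real tokenizer, and `fit_to_window` is a hypothetical helper, not a library function.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: 1 token is about 0.75 English words."""
    return max(1, round(len(text.split()) / 0.75))


def fit_to_window(messages, max_tokens):
    """Keep the newest messages whose estimated total fits in `max_tokens`."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = ["first question here", "a long answer " * 10, "latest question"]
print(fit_to_window(history, max_tokens=10))  # ['latest question']
```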
