Token Counting Guide

A practical walkthrough of BPE tokenizers, why Korean is more expensive, and how to budget prompt cost.

Written by 김지광 (operator) | Last updated | balpekr micro SaaS

1. What is a token?

A token is the smallest unit an LLM sees. Modern frontier models use Byte-Pair Encoding (BPE) variants that fuse common character sequences into a single ID. English prose averages about 4 characters per token, while Korean Hangul often lands near 1 character per token — which is why Korean prompts are materially more expensive. Tokens determine three things at once: the bill you pay your model vendor, the context window you can fit, and the wall-clock latency of the response. Every optimisation in this guide ultimately reduces to counting tokens earlier and counting fewer of them.

2. The five tokenizer families on this page

  • OpenAI cl100k_base: GPT-3.5 Turbo and legacy GPT-4. 100k vocabulary, ported to JavaScript via js-tiktoken. Exact counts.
  • OpenAI o200k_base: GPT-4o family and o-series. 200k vocabulary with more multilingual merges; Korean is 20-30% cheaper than on cl100k. Exact counts.
  • Anthropic Claude: Anthropic's BPE behaves very similarly to cl100k for Latin text. Anthropic does not ship a browser tokenizer, so this page approximates; use the count_tokens API for exact billing.
  • Google Gemini: SentencePiece BPE with 256k vocabulary, trained on heavily multilingual data. Korean and Japanese are close to native efficiency. Google provides a countTokens REST endpoint for exact values.
  • Meta Llama 3: tiktoken-style byte-level BPE (not SentencePiece, which Llama 2 used), 128k vocab, mostly byte-level for non-Latin scripts. Worst Korean ratio among the five; included for reference.

3. Why Korean eats tokens

Each precomposed Hangul syllable block (가, 나, 다…) is a single Unicode code point built from 2-3 jamo, and it occupies 3 bytes in UTF-8. BPE tokenizers trained on English-heavy corpora often have no learned merge for a given syllable, so they split it into two or three byte-level tokens. On cl100k you typically see 1.5-2.0 tokens per Hangul character; on o200k this drops to ~1.0; Gemini gets close to 0.7 for clean prose. If you serve Korean users, the tokenizer choice is arguably a bigger cost lever than the model tier itself.
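The byte-level structure behind this can be inspected with the Python standard library alone. The sketch below is illustrative only: it looks at Unicode and UTF-8 structure, not at any vendor's tokenizer, and the helper name `hangul_profile` is hypothetical.

```python
import unicodedata

def hangul_profile(ch: str) -> dict:
    """Describe one character: code points, decomposed jamo, UTF-8 bytes."""
    return {
        "code_points": len(ch),
        # NFD decomposition splits a precomposed syllable into its jamo.
        "jamo": len(unicodedata.normalize("NFD", ch)),
        "utf8_bytes": len(ch.encode("utf-8")),
    }

for ch in ["a", "가", "한"]:
    print(ch, hangul_profile(ch))
```

A tokenizer with no learned merge for a syllable falls back to those three bytes, which is where the two-to-three tokens per character come from.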

4. Pricing math cheat-sheet

Each vendor lists prices per 1 million tokens. The per-request cost is:

cost_usd = input_tokens / 1_000_000 * input_price
         + output_tokens / 1_000_000 * output_price

Output tokens are 3-5x more expensive than input on every major vendor. Always cap generation with max_tokens, and stream so you can early-abort.
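As a minimal sketch, the formula translates directly into a small Python function. The prices in the usage line are hypothetical placeholders, not any vendor's list price.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Per-request cost; both prices are USD per 1 million tokens."""
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

# Hypothetical prices: $1.00 per 1M input tokens, $4.00 per 1M output tokens.
print(f"{cost_usd(2_000, 500, 1.00, 4.00):.6f}")
```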

5. Cutting cost without hurting quality

  • Enable prompt caching (Claude / GPT-4o): up to 90% off the cached prefix on Claude; the discount differs by vendor and model.
  • Split pipelines by difficulty: route cheap turns to Haiku / Flash / mini.
  • Strip markdown tables, emoji runs, and repeated whitespace before sending.
  • For RAG, truncate retrieved chunks to the top-K that actually answers.
  • Use JSON mode with tight schemas to shorten outputs.
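One item from the list above, stripping repeated whitespace, could look roughly like the sketch below. The function name `squeeze_whitespace` is hypothetical, and the rules are deliberately simple; it would also flatten indentation, so it should not be applied to code blocks or other whitespace-sensitive content.

```python
import re

def squeeze_whitespace(text: str) -> str:
    """Collapse redundant whitespace in prose before sending it to a model."""
    # Drop trailing spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces/tabs inside a line to a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Collapse three or more newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(repr(squeeze_whitespace("a   b  \n\n\n\nc\t\td ")))
```

Paste the before and after versions into the counter to see whether the saving is worth the extra preprocessing step.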

6. Using this counter — five concrete workflows

People reach the page from very different starting points. The most common one is "I have a prompt that I think is too expensive." Paste the prompt into the textarea at the top of the home page; within roughly 300 milliseconds (the debounce window) you will see every supported model's token count and the projected US-dollar cost for that single request. If your prompt is a 6,000-character Korean system message, you will typically see GPT-4o at around 4,300 tokens and Claude Haiku at roughly 5,900. That gap between models is the cost lever you actually control.

The second workflow is tokenizer shopping for a Korean product. Use the "Korean efficiency" panel: it tokenises the same fixed Korean paragraph through every model and shows tokens-per-character. Gemini 1.5 Flash and GPT-4o mini almost always win this comparison; cl100k-based models and Llama 3 are systematically worse for Korean and you can see the ratio at a glance.
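The tokens-per-character metric is simple enough to reproduce offline. The sketch below assumes a pluggable `count_tokens` callable; the `utf8_byte_counter` stand-in is hypothetical and models only the worst case of a byte-level BPE with no learned merges, so swap in a real tokenizer for meaningful numbers.

```python
from typing import Callable

def tokens_per_char(text: str, count_tokens: Callable[[str], int]) -> float:
    """Tokens divided by characters; lower means cheaper text."""
    if not text:
        raise ValueError("text must not be empty")
    return count_tokens(text) / len(text)

# Stand-in counter (assumption): one token per UTF-8 byte.
def utf8_byte_counter(text: str) -> int:
    return len(text.encode("utf-8"))

print(tokens_per_char("안녕하세요", utf8_byte_counter))  # 3.0 with this stand-in
print(tokens_per_char("hello", utf8_byte_counter))       # 1.0 with this stand-in
```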

The third workflow is cost projection for a feature that has not shipped yet. Multiply the per-request cost by your expected monthly volume in a spreadsheet. Most teams forget that output tokens are typically 3-5x more expensive than input: at a 4x price ratio, a 500-token completion appended to a 2,000-token prompt costs as much as the prompt itself.
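The output-versus-input arithmetic is easy to check before it goes into a spreadsheet. The sketch below uses hypothetical helper names; `price_ratio` is the output price divided by the input price, and the monthly volume is a made-up figure.

```python
def break_even_output_tokens(input_tokens: int, price_ratio: float) -> float:
    """Output length at which the output cost equals the input cost."""
    return input_tokens / price_ratio

def monthly_cost(per_request_usd: float, requests_per_month: int) -> float:
    """Project a per-request cost over an expected monthly volume."""
    return per_request_usd * requests_per_month

print(break_even_output_tokens(2_000, 4.0))  # 500.0
print(break_even_output_tokens(2_000, 5.0))  # 400.0
print(round(monthly_cost(0.004, 600_000), 2))
```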

The fourth workflow is budget governance. Save the dollar number for a representative prompt and use it as a per-request budget ceiling in your code. If a user request crosses that line, route to a cheaper model or refuse with a clear message. This is a one-day engineering task that pays for itself within a week on any consumer-facing LLM product.
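A per-request ceiling of this kind might be sketched as follows. The model names and prices in `MODELS` are placeholders, and `pick_model` is a hypothetical helper; the worst case assumes the model generates all of `max_output_tokens`.

```python
from typing import Optional

# Hypothetical model table: (name, input USD per 1M, output USD per 1M),
# ordered from most to least capable. Names and prices are placeholders.
MODELS = [
    ("large", 3.00, 15.00),
    ("small", 0.25, 1.25),
]

def pick_model(input_tokens: int, max_output_tokens: int,
               ceiling_usd: float) -> Optional[str]:
    """Return the first model whose worst-case cost fits the ceiling, else None."""
    for name, in_price, out_price in MODELS:
        worst_case = (input_tokens * in_price
                      + max_output_tokens * out_price) / 1_000_000
        if worst_case <= ceiling_usd:
            return name
    return None  # caller should refuse with a clear message

print(pick_model(2_000, 500, 0.02))   # large
print(pick_model(2_000, 500, 0.005))  # small
print(pick_model(2_000, 500, 0.001))  # None
```

Ordering the table by capability means the request only falls to a cheaper model when the preferred one would break the budget.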

The fifth workflow is multilingual A/B testing. Some teams write their system prompt in English even when serving Korean users because the English version uses fewer tokens. Paste both versions into the counter and you will see the real cost delta; in our experience a well-written English system prompt is 30-40% cheaper at runtime than the equivalent Korean one, with no quality loss when the user messages remain Korean.

7. Practical examples — three real scenarios

Example A: Korean customer-support chatbot. A typical turn carries an 800-character Korean retrieval chunk, a 1,200-character system prompt, and a 300-character user message. On cl100k this lands near 4,500 input tokens per turn; on o200k it drops to ~2,400; on Gemini Flash to ~1,800. At 20,000 turns per day, the difference between cl100k and Gemini Flash is (4,500 - 1,800) x 20,000 = 54M input tokens per day, a four-figure monthly invoice difference for the same product.
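The arithmetic in Example A can be reproduced in a few lines. The per-turn counts come from the example; the $1.00 per 1M tokens price is a hypothetical round number used only to show how the daily gap turns into a monthly figure.

```python
TURNS_PER_DAY = 20_000
TOKENS_PER_TURN = {"cl100k": 4_500, "o200k": 2_400, "gemini_flash": 1_800}

def daily_gap(a: str, b: str) -> int:
    """Extra input tokens per day when using tokenizer a instead of b."""
    return (TOKENS_PER_TURN[a] - TOKENS_PER_TURN[b]) * TURNS_PER_DAY

def monthly_usd(tokens_per_day: int, price_per_million: float,
                days: int = 30) -> float:
    """Monthly cost of a daily token volume at a price per 1M tokens."""
    return tokens_per_day * days / 1_000_000 * price_per_million

gap = daily_gap("cl100k", "gemini_flash")
print(gap)                     # 54000000
print(monthly_usd(gap, 1.00))  # 1620.0 at the hypothetical price
```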

Example B: Long-document summariser. A 200-page PDF is around 120,000 Korean characters. On Claude Sonnet (roughly cl100k-equivalent for Korean, about 1.75 tokens per character) that is ~210,000 input tokens, which exceeds the 200k context window. The same document fits comfortably in Gemini 1.5 Pro at ~130,000 tokens. Choosing the right tokenizer family is sometimes the difference between a feature working and not working at all.
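A pre-flight check along these lines can catch the overflow before the request is sent. This is a sketch under stated assumptions: the 1.75 tokens-per-character ratio is the one implied by the figures in Example B, the 1,000,000-token window used for Gemini 1.5 Pro is an assumption to verify against current documentation, and both helper names are hypothetical.

```python
def estimated_tokens(chars: int, tokens_per_char: float) -> int:
    """Rough token estimate from a character count and a measured ratio."""
    return round(chars * tokens_per_char)

def fits(tokens: int, context_window: int, reserve_for_output: int = 0) -> bool:
    """True if the prompt plus reserved output room fits the window."""
    return tokens + reserve_for_output <= context_window

doc_chars = 120_000
claude_tokens = estimated_tokens(doc_chars, 1.75)
print(claude_tokens, fits(claude_tokens, 200_000))  # 210000 False
print(fits(130_000, 1_000_000))                     # True (assumed window)
```

Reserving room for the output matters in practice: a prompt that fits the window exactly leaves no space for the summary.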

Example C: Prompt caching for a SaaS team. A B2B agent product re-uses a 6,000-token system prompt across every user request. Without caching, every request pays full price for those 6,000 tokens. With Claude prompt caching at a 90% discount on cached tokens, the cost of the system-prompt portion of each additional request drops by an order of magnitude. Use the counter to verify the system prompt's exact token count before you enable caching, because Anthropic bills the cache write at a premium over the base input price and subsequent cache reads at the discounted rate.
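The effect on the shared prefix can be estimated with a short sketch. The 90% read discount comes from the example; the 25% write premium and the $3.00 per 1M input price are assumptions to check against Anthropic's current pricing page, and the model assumes the cache stays warm so only the first call is a write.

```python
def prefix_cost(prefix_tokens: int, requests: int, input_price: float,
                cached: bool, read_discount: float = 0.90,
                write_premium: float = 0.25) -> float:
    """USD spent on the shared prefix across `requests` calls.

    Assumes the cache never expires between calls, so only the first
    call pays the (premium) write price and the rest pay the read price.
    """
    base = prefix_tokens / 1_000_000 * input_price
    if not cached or requests == 0:
        return base * requests
    write = base * (1 + write_premium)
    reads = base * (1 - read_discount) * (requests - 1)
    return write + reads

# 6,000-token system prompt, 100 requests, hypothetical $3.00 per 1M input.
print(round(prefix_cost(6_000, 100, 3.00, cached=False), 4))
print(round(prefix_cost(6_000, 100, 3.00, cached=True), 4))
```

If traffic is too sparse to keep the cache warm, more calls become writes and the saving shrinks accordingly.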

8. Common pitfalls

  • Counting just text and ignoring the chat envelope. Chat-completion APIs add roughly 3-7 tokens per message for role/message framing, plus another ~3 to prime the assistant's reply. This page counts raw text; the real invoice will be a touch higher.
  • Confusing tokens with words or characters. An English word is roughly 1.3 tokens on average; a Korean character is roughly 0.7-2.0 tokens depending on the tokenizer. Mixing these units in spreadsheets is the most common cause of wrong cost estimates.
  • Forgetting output cost. If your product generates long answers, your invoice is mostly output, not input. Capping max_tokens at the shortest acceptable length is the single highest-leverage cost optimisation.
  • Assuming all "GPT-4" models use the same tokenizer. GPT-3.5 Turbo and the original GPT-4 use cl100k; GPT-4o and o-series use o200k. Switching between them changes Korean token counts by 20-30%.
  • Treating Claude's approximate count as exact. Anthropic does not ship a browser tokenizer; this page approximates. For invoices that need to match the cent, call the count_tokens API.
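The chat-envelope pitfall above can be approximated with a small correction on top of the raw text counts. The default of 4 tokens per message is an assumed midpoint of the range quoted in the list, and `estimate_chat_tokens` is a hypothetical helper; actual framing overhead depends on the vendor and model.

```python
def estimate_chat_tokens(message_token_counts: list[int],
                         per_message_overhead: int = 4,
                         reply_priming: int = 3) -> int:
    """Raw text tokens plus an estimate of chat framing overhead."""
    return (sum(message_token_counts)
            + per_message_overhead * len(message_token_counts)
            + reply_priming)

# A 1,200-token system prompt plus a 300-token user message.
print(estimate_chat_tokens([1_200, 300]))  # 1511
```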

9. Squeezing more out of every Korean prompt — six tactics

Compressing Korean prompts is not just budget hygiene: shorter context also reduces attention dilution, so model answers often get crisper as the prompt gets leaner. The following six techniques have each shown a steady 5-20% saving in our own measurements without any quality regression.

  • Drop redundant particles. Trimming colloquial fillers like "것이다" down to "이다" can shave 3-5% across an entire system message with zero answer impact.
  • Use plain form in system prompts. Honorifics are invisible to the user inside a system message. Replacing "해 주십시오" with "하라" meaningfully cuts tokens; user-visible replies can still be polite via an explicit instruction.
  • Avoid markdown tables. Tables tokenise badly under cl100k-class tokenizers; converting the same data into bullet lists or JSON commonly saves over 30%. Paste both versions into the counter to confirm.
  • Trim code samples. When attaching code, keep only the relevant function bodies and reduce helper functions to their signatures.
  • Cap few-shot examples at 3-5. Marginal quality drops sharply past five examples while token cost keeps climbing. The counter makes this trade-off quantifiable in seconds.
  • Consolidate overlapping instructions. System prompts often contain two or three lines that mean the same thing ("be polite", "use formal tone"). Merge them and you typically reclaim 8-12% of the system message.
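The table tactic above can be automated for simple cases. The sketch below converts a basic pipe-delimited markdown table into one bullet per row; `table_to_bullets` is a hypothetical helper, the sample data is made up, and cells containing escaped pipes are not handled.

```python
def table_to_bullets(markdown_table: str) -> str:
    """Turn a simple markdown table into 'header: value' bullet lines."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row under the header.
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    header, *body = rows
    return "\n".join(
        "- " + ", ".join(f"{h}: {v}" for h, v in zip(header, row))
        for row in body
    )

table = "| model | price |\n|---|---|\n| mini | 0.15 |\n| flash | 0.10 |"
print(table_to_bullets(table))
```

Paste the table and the bullet version into the counter to confirm the saving for your own data before adopting the conversion.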

10. Update cadence and trust

LLM list prices change about once per quarter on average and occasionally on a major model launch. This page is refreshed quarterly and on every notable release; the snapshot date is shown at the top. Tokenizers themselves change far less often, but because billing depends on the price column, always check the snapshot date before quoting a number from this page in a budget document.

The tool is also fully neutral. Every model is rendered through the same pipeline, there is no advertising weighting, and the Korean-efficiency ranking is recomputed from the live tokenizers and the current price table on each render. No one can promote a specific vendor by tilting the numbers.
