Explore how Google’s Gemini processes text using subword tokenization. Use this tool to inspect SentencePiece log-likelihood scores for common and rare tokens.
Search engine optimization professionals diving into machine learning often need to understand how models like Google’s Gemini process text. It all starts with subword tokenization, which breaks words down into smaller pieces called tokens. This approach strikes a balance. Common words stay whole, while rare words are split into smaller, recognizable pieces so the model can still understand them.
Gemini uses a system called SentencePiece, which features a vocabulary of two hundred and fifty-six thousand distinct tokens. During training, every single token is assigned a score. This score represents how essential that piece was for reconstructing the training data. It is a global, context-independent measure of how common a token is.
For example, extremely common English subwords, like the word “the” or the suffix “ing”, receive very high scores. On the other hand, obscure symbols, rare emojis, and special control tokens sit at the very bottom of the scale.
When you analyze these scores, you are seeing the underlying building blocks of the language model. Understanding which tokens are common and which are rare helps you see exactly how Gemini interprets and processes the text you throw at it.
As a technical SEO, you might be diving into machine learning (ML) to understand how tools like Google’s Gemini process text. One foundational concept is subword tokenization—breaking words into smaller pieces called “tokens.” While tokens themselves are context-agnostic (they don’t consider surrounding words), they do carry an inherent bias: each token’s likelihood reflects how prominent that subword was in the training data. In other words, tokens that appeared frequently during training end up with higher scores, and this directly influences downstream ML models.
By using the following tool, you can inspect which subwords are common or rare, helping you anticipate how Google’s Gemini might treat certain tokens in content, prompts and search queries.

https://dejan.ai/tools/gemini-tokenizer
This tool is not a simulation. It uses Gemini’s actual trained SentencePiece model.
Before diving into scores, it helps to recall why we use subword tokenization at all:
SentencePiece’s unigram approach proceeds roughly as follows:
These learned log-likelihoods are the “raw scores” we’ll explore. In many applications (like our Streamlit demo), we normalize them across the entire vocabulary so that end users can see a “percentage-style” bar indicating each token’s relative importance during training.
It is tempting to read “log-likelihood” as simply “how often did this exact subword occur in the training data?” In reality, SentencePiece’s unigram training infers each piece’s probability by optimizing corpus reconstruction. Concretely:
[math]
\text{maximize } \prod_{w \in \text{corpus}} \sum_{\text{tokenizations } t \rightarrow w} \prod_{u \in t} P(u).
[/math]
During this optimization, each subword piece [math]u[/math] gets assigned a probability [math]P(u)[/math]. Taking the log yields the “log-likelihood” or “score” used internally.
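As a rough illustration of the inner sum in that objective, the sketch below adds up the probability of every way to segment a single word into known pieces. The piece probabilities are hypothetical toy values, not numbers from any real tokenizer, and `word_likelihood` is an illustrative helper rather than a SentencePiece function.

```python
import math

# Hypothetical toy piece probabilities P(u); not taken from any real model.
PIECE_PROBS = {
    "undo": 0.02, "un": 0.10, "do": 0.05,
    "u": 0.01, "n": 0.01, "d": 0.01, "o": 0.01,
}


def word_likelihood(word, probs):
    """Sum, over all tokenizations of `word`, the product of piece probabilities."""
    # total[i] holds the summed probability of every segmentation of word[:i].
    total = [0.0] * (len(word) + 1)
    total[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                total[end] += total[start] * probs[piece]
    return total[len(word)]


likelihood = word_likelihood("undo", PIECE_PROBS)
log_likelihood = math.log(likelihood)  # the per-word term in log space
```

The full objective multiplies this quantity over every word in the corpus, and training adjusts the values of P(u) to make that product as large as possible.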
When presenting these scores to readers or end users, it’s helpful to describe them as a “likelihood of the token appearing in the training data”, with these caveats:
[math]
\text{Normalized}(u) = \frac{\log P(u) - \min \log P}{\max \log P - \min \log P}.
[/math]
Render “Normalized” as a percentage (0 % = least likely piece; 100 % = most likely piece).
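A minimal sketch of that min-max mapping, assuming the raw scores are available as a plain dictionary. The three scores are hypothetical round numbers chosen to keep the arithmetic easy to follow, and `normalize_scores` is an illustrative name.

```python
def normalize_scores(scores):
    """Linearly map raw log-scores to [0, 1] using the global min and max."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate case: every score is identical, so there is no range to map.
        return {piece: 0.0 for piece in scores}
    return {piece: (s - lo) / (hi - lo) for piece, s in scores.items()}


# Hypothetical raw log-scores for a three-piece vocabulary.
raw = {"the": -2.0, "fox": -6.0, "zyx": -10.0}
normalized = normalize_scores(raw)
percentages = {piece: round(100 * v, 1) for piece, v in normalized.items()}
```

Because the mapping uses the global minimum and maximum, adding or removing an extreme piece changes every other piece's percentage.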
Avoiding Misinterpretation
Because some readers might confuse this with “the probability a model would generate this token next,” emphasize:
“These are unnormalized log-probabilities from tokenizer training (unigram), not the conditional probabilities you’d get from a full language model.”
Framing as “Importance”
You can say, for instance:
> “A higher-scoring token was more central to reconstructing the training data and thus was retained in the final vocabulary.”
In other words, “importance during tokenizer training” and “likelihood of appearing” are two sides of the same coin under the unigram model.
Token Likelihood (Unigram Score).
Each subword piece in our SentencePiece-based Gemini tokenizer carries a unigram log-likelihood—a number learned during tokenizer training to maximize the model’s ability to reconstruct the corpus. Intuitively, tokens that appeared more frequently (or that helped reconstruct many different words) receive higher log-probabilities. In our visualization, we then linearly map these raw log-scores into a [math][0,1][/math] range and display them as percentages (0 % = lowest “importance,” 100 % = highest). Note that this is a global, context-agnostic measure: it does not depend on what comes before or after. Rather, it reflects how “likely” that piece was under the SentencePiece unigram model of the training data.
#### Token Likelihoods in Action
When you type a sentence like “The quick brown fox jumps over the lazy dog”, our interface will break it into subword pieces such as:
[“ĠThe”, “Ġquick”, “Ġbrown”, “Ġfox”, “Ġjumps”, “Ġover”, “Ġthe”, “Ġlazy”, “Ġdog”]
For each subword, we look up its learned unigram log-likelihood (e.g., [math]“Ġthe”[/math] might have [math]\log P = -2.1[/math], [math]“Ġquick”[/math] [math]\log P = -5.3[/math], [math]“Ġfox”[/math] [math]\log P = -6.2[/math]). After computing the global min and max over the entire vocabulary, we map these values into [math][0,1][/math]. Suppose the global minimum is [math]-9.8[/math] and the global maximum is [math]-1.5[/math]. Then for [math]“Ġthe”[/math]:
[math]
\text{Normalized} = \frac{-2.1 - (-9.8)}{-1.5 - (-9.8)} = \frac{7.7}{8.3} \approx 0.928 \,(\approx 92.8\%).
[/math]
For [math]“Ġfox”[/math]:
[math]
\text{Normalized} = \frac{-6.2 - (-9.8)}{-1.5 - (-9.8)} = \frac{3.6}{8.3} \approx 0.434 \,(\approx 43.4\%).
[/math]
Visually, [math]“Ġthe”[/math] will show a long, nearly full bar (indicating it was extremely common), while [math]“Ġfox”[/math] will be roughly halfway (moderately common).
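To make the bar picture concrete, here is a small text-only sketch of how a normalized value could be drawn. It reuses the hypothetical numbers from the worked example (minimum -9.8, maximum -1.5). `render_bar` is an illustrative helper and is not the code behind the actual tool.

```python
def render_bar(fraction, width=20):
    """Render a fraction in [0, 1] as a fixed-width text bar with a percentage label."""
    fraction = min(1.0, max(0.0, fraction))  # clamp out-of-range input
    filled = round(fraction * width)
    return "#" * filled + "." * (width - filled) + f" {fraction * 100:.1f}%"


# Hypothetical values from the worked example: global min and max log-scores.
lo, hi = -9.8, -1.5
bar_the = render_bar((-2.1 - lo) / (hi - lo))
bar_fox = render_bar((-6.2 - lo) / (hi - lo))
print(bar_the)
print(bar_fox)
```

With a width of 20 characters, the first bar is almost completely filled while the second stops a little short of the midpoint.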
Framing these SentencePiece scores as a “likelihood of the token appearing in the training data” is accurate when you emphasize:
By clarifying these points in your article, readers will gain a clear understanding of why some subword pieces are deemed more “important,” how the normalization step works, and what these bars truly signify. This transparent framing helps set proper expectations and prevents misinterpretation: the bars represent global importance during tokenizer training, not “the probability that your model will output this next.”
Below is an in-depth look at the actual gemini-1.5-pro-002.spm.model file (a SentencePiece “unigram” tokenizer).
We’ll cover:
.spm.model File
When you load gemini-1.5-pro-002.spm.model with SentencePieceProcessor (using sp.Load("…/gemini-1.5-pro-002.spm.model")), you discover:
sp.GetPieceSize() ➔ 256000
In other words, this tokenizer defines 256000 distinct “subword” pieces.
- <pad> (ID 0)
- <unused0>, <unused1>, …, <unused99>
- <0x5E>, <0x6A>, etc.
zero_count = sum(1 for i in range(sp.GetPieceSize()) if sp.GetScore(i) == 0.0)
# zero_count ➔ 506
Any piece with a score of 0.0 is reserved (not “learned” from the corpus) and typically used for padding, special markers, or placeholders.
Each subword piece [math]u[/math] in a SentencePiece unigram model carries a log-likelihood [math]\log P(u)[/math]. In this particular .spm.model, the raw scores range from –255494.0 (minimum) to 0.0 (maximum).
In Python one can confirm:
import numpy as np
scores = np.array([sp.GetScore(i) for i in range(sp.GetPieceSize())], dtype=float)
min_score, max_score = float(scores.min()), float(scores.max())
# min_score ➔ –255494.0
# max_score ➔ 0.0
mean_score = float(scores.mean()) # ≈ –127494.9991
median_score = float(np.median(scores)) # ≈ –127494.5
When you display these as “percentages” in a UI, you usually normalize:
[math]
\text{Normalized}(u) = \frac{\log P(u) - (-255494)}{0 - (-255494)} = \frac{\log P(u) + 255494}{255494}.
[/math]
After normalization, the most frequent/important token(s) map to 100 %, while the rarest mapped pieces approach 0 %.
If you sort all 256000 pieces by their raw score descending (i.e. most common first), you’ll find that the very highest log-score (0.0) belongs to special control tokens, for example:
[('<pad>', 0.0),
('<unused99>', 0.0),
('<0x5E>', 0.0),
… (total of ~506 pieces with 0.0) …]
However, ignoring control tokens, the most frequent real subwords (those whose negative log-scores are closest to 0.0) might look like:
('the', –702.0)
('ing', –758.0)
('and', –810.5)
(' of', –825.2)
(' to', –841.9)
…
For example:
# Find index/score for “the” (no leading “Ġ”, since this model uses raw pieces):
pieces = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())] # piece text for every ID
idx = pieces.index("the") # ➔ 1175
score_the = sp.GetScore(idx) # ➔ –702.0
[math]
\text{Normalized} = \frac{-702.0 - (-255494)}{0 - (-255494)} = \frac{254792}{255494} \approx 0.997 \,(\approx 99.7\%).
[/math]
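The ranking step described in this section (sort by raw score, skip the zero-score control pieces) could be sketched as follows. The sample pairs reuse illustrative values quoted in this article, and `top_learned` is a hypothetical helper, not part of the SentencePiece API.

```python
def top_learned(pairs, k):
    """Return the k highest-scoring pieces, skipping reserved pieces with score 0.0."""
    learned = [(piece, score) for piece, score in pairs if score != 0.0]
    return sorted(learned, key=lambda item: item[1], reverse=True)[:k]


# Illustrative sample. With a loaded processor `sp`, the pairs could instead be built as
# [(sp.IdToPiece(i), sp.GetScore(i)) for i in range(sp.GetPieceSize())].
sample = [
    ("𝕳", -255494.0),
    ("and", -810.5),
    ("<pad>", 0.0),
    ("the", -702.0),
    ("ing", -758.0),
]
top3 = top_learned(sample, 3)
```

Filtering on a score of exactly 0.0 relies on the observation above that reserved pieces all carry that value. A stricter version might check each piece's type instead.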
At the other extreme, the rarest or least “useful” subwords—often obscure Unicode glyphs or extremely rare sequences—have scores around –255494.0. For instance:
('𝕳', –255494.0)
('𝕏', –255493.0)
('𖧵', –255492.0)
('𓂸', –255491.0)
('𐍆', –255490.0)
('↑', –255489.0)
('﹅', –255488.0)
('כּ', –255487.0)
('שׂ', –255486.0)
('', –255485.0)
These are typically either:
.spm.model File
A SentencePiece .spm.model is a Protocol Buffer that contains two main sections:
vocab List
- string piece (the text of the subword),
- float score (the learned log-likelihood for that piece).
<unk>, <s>, </s>, etc.).
When you call:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("gemini-1.5-pro-002.spm.model")
internally SentencePiece deserializes the Protocol Buffer into:
- ModelProto object (containing every piece + its log-score),
Under the hood, each piece’s log-probability was learned by the Unigram LM trainer:
The resulting binary file is about 4.24 MB on disk. When unpacked into memory, it occupies slightly more, but SentencePieceProcessor is extremely efficient about lookups and decoding.
- log_score = 0.0, including <pad>, <unused#>, <0x##> code‐point markers, etc.
- log_score ≈ –702.0, which normalizes to ~99.7 %).
In other words, this section peels back the curtain on Gemini’s SentencePiece vocabulary: each token has a learned log-likelihood (reflecting global frequency/importance) and a unique textual form (including standard English subwords, punctuation, Unicode code‐points, and special placeholders). Understanding these internal stats helps you see exactly which building blocks Gemini will use when it tokenizes any text you throw at it.