How Long Are Web Pages?

Research

An analysis of 44,684 web pages reveals a median content length of 3,201 tokens and an average of 10,403 tokens, highlighting implications for AI systems.


How big is the average web page? A recent analysis of nearly forty-five thousand web pages reveals a surprising gap between what we expect and what is actually out there.

The median web page contains about thirty-two hundred tokens, which is roughly twenty-four hundred words. But the average is much higher, coming in at over ten thousand tokens. This is because the web has a very long tail of massive documents. While half of all pages sit between one thousand and five thousand tokens, the top one percent exceed one hundred and forty thousand tokens.

This distribution has major implications for artificial intelligence. If you are building Retrieval-Augmented Generation systems, ninety-five percent of web pages will fit comfortably within a one-hundred-twenty-eight-thousand-token context window. But because of those massive outliers, average processing costs run about three times the median.

Most people underestimate how much content is on a typical page. To build efficient systems, you need to design for the typical three-thousand-token article, while making sure your system can handle the occasional giant document.

A Token Count Analysis of 45,000 Real-World URLs

We recently analyzed 44,684 web pages and measured their content length using Gemini’s token counter. The results reveal fascinating insights about the true scale of web content—and why it matters for AI applications.

| Metric | Value |
| --- | --- |
| Total Pages Analyzed | 44,684 |
| Page Content Tokens | 464,854,727 |
| Total Tokens (all) | 541,062,817 |

The median web page contains roughly 3,200 tokens—equivalent to about 2,400 words or approximately 5 pages of text. However, the average is significantly higher at 10,400 tokens, indicating a strong right-skew from lengthy documents.

| Metric | Tokens |
| --- | --- |
| Median | 3,201 |
| Average | 10,403 |
| 25th percentile | 1,396 |
| 75th percentile | 8,207 |
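The figures in the two tables above can be cross-checked with a few lines of arithmetic. This is a minimal sketch: the words-per-token and tokens-per-printed-page constants are assumptions inferred from the text (3,200 tokens ≈ 2,400 words ≈ 5 pages), and the function names are illustrative rather than part of the original analysis.

```python
# Published totals from the tables above.
PAGES_ANALYZED = 44_684
PAGE_CONTENT_TOKENS = 464_854_727
MEDIAN_TOKENS = 3_201

# Assumed conversion ratios, inferred from "3,200 tokens ~ 2,400 words ~ 5 pages".
WORDS_PER_TOKEN = 0.75
TOKENS_PER_PRINTED_PAGE = 640


def mean_tokens(total_tokens: int, page_count: int) -> float:
    """Arithmetic mean of tokens per page."""
    return total_tokens / page_count


def tokens_to_words(tokens: float) -> int:
    """Rough word count for a token count, using the assumed ratio."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_printed_pages(tokens: float) -> float:
    """Rough printed-page count for a token count, using the assumed ratio."""
    return tokens / TOKENS_PER_PRINTED_PAGE


average = mean_tokens(PAGE_CONTENT_TOKENS, PAGES_ANALYZED)
skew_ratio = average / MEDIAN_TOKENS

print(round(average))                  # 10403
print(tokens_to_words(MEDIAN_TOKENS))  # 2401
print(round(skew_ratio, 2))            # 3.25
```

A mean more than three times the median is the signature of the right-skew described above: a small number of very long documents pull the average up while leaving the typical page unchanged.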

Distribution Breakdown

Half of all web pages fall between 1,000 and 5,000 tokens. This represents the “typical” article, blog post, or informational page.

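| Token Range | Pages | Percentage |
| --- | --- | --- |
| 0 – 1,000 | 6,229 | 13.9% |
| 1,000 – 5,000 | 22,299 | 49.9% |
| 5,000 – 10,000 | 6,629 | 14.8% |
| 10,000 – 50,000 | 8,048 | 18.0% |
| 50,000 – 100,000 | 806 | 1.8% |
| 100,000 – 500,000 | 657 | 1.5% |
| 500,000+ | 16 | 0.04% |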
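A breakdown like the one above can be produced by bucketing raw token counts against fixed edges. The sketch below is one way to do it, not the original pipeline: it assumes each range includes its lower bound and excludes its upper bound (the table does not say how boundaries were handled), and the sample list is made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of every bucket except the open-ended last one.
BUCKET_EDGES = [1_000, 5_000, 10_000, 50_000, 100_000, 500_000]
BUCKET_LABELS = [
    "0 - 1,000", "1,000 - 5,000", "5,000 - 10,000", "10,000 - 50,000",
    "50,000 - 100,000", "100,000 - 500,000", "500,000+",
]


def bucket_index(tokens: int) -> int:
    """Index of the bucket a page falls in (lower bound inclusive, an assumption)."""
    return bisect_right(BUCKET_EDGES, tokens)


def bucket_histogram(token_counts: list[int]) -> list[int]:
    """Number of pages per bucket, in BUCKET_LABELS order."""
    counts = Counter(bucket_index(t) for t in token_counts)
    return [counts.get(i, 0) for i in range(len(BUCKET_LABELS))]


def percentages(counts: list[int]) -> list[float]:
    """Share of pages per bucket, in percent, rounded to two decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


# Hypothetical sample, only to show the bucketing behaviour.
sample = [0, 999, 1_000, 3_201, 141_410, 3_004_502]
print(bucket_histogram(sample))  # [2, 2, 0, 0, 0, 1, 1]

# Page counts as published in the table above.
PUBLISHED_COUNTS = [6_229, 22_299, 6_629, 8_048, 806, 657, 16]
for label, pct in zip(BUCKET_LABELS, percentages(PUBLISHED_COUNTS)):
    print(f"{label:>18}: {pct}%")
```

The published counts sum to 44,684, matching the total number of pages analyzed.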

Nearly 1 in 5 pages (18%) contain between 10,000 and 50,000 tokens—these are longer articles, comprehensive guides, or pages with significant supplementary content.

The Long Tail

Percentile analysis reveals the extreme outliers:

| Percentile | Tokens |
| --- | --- |
| 90th | 21,839 |
| 95th | 35,852 |
| 99th | 141,410 |
| Maximum | 3,004,502 |
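For readers who want to run the same kind of percentile analysis on their own crawl, here is a minimal sketch using the nearest-rank definition. The post does not say which percentile method was used, so nearest-rank is an assumption, and the function name is illustrative. The second half expresses the published tail values as multiples of the median.

```python
import math


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Published figures from the tables above.
MEDIAN_TOKENS = 3_201
TAIL_TOKENS = {"90th": 21_839, "95th": 35_852, "99th": 141_410, "max": 3_004_502}

# How many times larger than the median each tail value is.
tail_multiples = {name: tokens / MEDIAN_TOKENS for name, tokens in TAIL_TOKENS.items()}

for name, multiple in tail_multiples.items():
    print(f"{name:>4}: {multiple:.1f}x the median")
```

Calling `nearest_rank_percentile(token_counts, 99)` on a list of per-page token counts returns the value that 99% of pages do not exceed.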

The top 1% of pages exceed 140,000 tokens, roughly 220 pages of text at the ratio above (3,200 tokens ≈ 5 pages). These are typically:

Full PDF documents (research papers, reports)
Documentation sites
Long-form educational content
Scraped book chapters

The largest page in our dataset contained over 3 million tokens. At about 0.75 words per token that is roughly 2.25 million words, on the order of 20 full-length novels.

Implications for AI Systems

Context Window Considerations

With major LLMs offering context windows from 32K to 2M tokens, our findings suggest:

95% of web pages are under 36K tokens, so they fit comfortably in a 128K context window
The median page (3,201 tokens) leaves ample room for multi-page retrieval
Only 0.04% of pages (16 in this sample) exceed 500,000 tokens, a size that only the largest context windows can hold
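These checks are easy to encode. The sketch below assumes a 128,000-token window (some models define 128K as 131,072) and a hypothetical 8,000-token reserve for the system prompt, the question, and the model's answer; both numbers and the function names are placeholders to adjust for a real deployment.

```python
# Assumed limits, not values from the original analysis.
CONTEXT_WINDOW = 128_000
RESERVED_TOKENS = 8_000  # hypothetical headroom for prompt and output


def fits_in_window(page_tokens: int,
                   window: int = CONTEXT_WINDOW,
                   reserved: int = RESERVED_TOKENS) -> bool:
    """True if a single page fits in the window after reserving headroom."""
    return page_tokens <= window - reserved


def pages_per_window(page_tokens: int,
                     window: int = CONTEXT_WINDOW,
                     reserved: int = RESERVED_TOKENS) -> int:
    """How many pages of a given size fit in the window after reserving headroom."""
    return (window - reserved) // page_tokens


print(fits_in_window(35_852))    # 95th percentile page: True
print(fits_in_window(141_410))   # 99th percentile page: False
print(pages_per_window(3_201))   # median-sized pages per window: 37
```

Under these assumptions a single window holds a few dozen median-sized pages, while a 99th-percentile page does not fit on its own and would need chunking or truncation.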

RAG System Design

For Retrieval-Augmented Generation systems:

Chunk wisely: The median page is ~3K tokens—consider this when designing chunk sizes
Handle outliers: The 99th percentile is 44x the median. Long-form content needs different treatment
Budget for variety: A 10-document retrieval could range from 14K tokens (ten 25th-percentile pages) through 32K tokens (ten median pages) to 350K+ tokens (ten 95th-percentile pages)
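The retrieval budget range can be checked against the percentile tables by multiplying each published value by the number of retrieved documents. In this short sketch the 14K end of the range corresponds to ten 25th-percentile pages and the 350K+ end to ten 95th-percentile pages; the dictionary and function names are illustrative.

```python
# Published per-page token counts at each percentile.
PERCENTILE_TOKENS = {
    "25th": 1_396,
    "median": 3_201,
    "75th": 8_207,
    "90th": 21_839,
    "95th": 35_852,
    "99th": 141_410,
}


def retrieval_budget(tokens_per_doc: int, num_docs: int = 10) -> int:
    """Total tokens if every retrieved document has the given size."""
    return tokens_per_doc * num_docs


budgets = {name: retrieval_budget(tokens) for name, tokens in PERCENTILE_TOKENS.items()}

for name, total in budgets.items():
    print(f"10 x {name:>6}: {total:>9,} tokens")
```

Real retrievals mix page sizes, so these are bounds for uniform batches rather than predictions for any particular query.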

Methodology Notes

Pages were processed using Gemini’s url_context tool
Token counts reflect the model’s native tokenization
Sample includes a diverse mix of content types: articles, academic papers, product pages, documentation, and PDFs
Zero-token pages (5 total) represent failed fetches or blocked content

While the typical page sits around 3,000 tokens, the distribution has a remarkably long tail. AI systems consuming web content need to account for this variance—both for context management and cost optimization.

For practical applications:

Design for the median (3K tokens) but handle the 99th percentile (140K tokens)
Expect high variance between sources
Budget conservatively: average costs will be roughly 3x median costs (10,403 vs. 3,201 tokens per page) due to outliers
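To see what the gap between median and average means for a budget, here is a small sketch that estimates processing cost from a per-page token figure. The price of 0.10 USD per million tokens and the 100,000-page crawl size are placeholders, not real pricing or data from the analysis.

```python
MEDIAN_TOKENS = 3_201
AVERAGE_TOKENS = 10_403

# Placeholder values for illustration only.
USD_PER_MILLION_TOKENS = 0.10
NUM_PAGES = 100_000


def estimated_cost(num_pages: int, tokens_per_page: int,
                   usd_per_million_tokens: float) -> float:
    """Estimated input-token cost in USD for processing num_pages pages."""
    total_tokens = num_pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


median_based = estimated_cost(NUM_PAGES, MEDIAN_TOKENS, USD_PER_MILLION_TOKENS)
average_based = estimated_cost(NUM_PAGES, AVERAGE_TOKENS, USD_PER_MILLION_TOKENS)

print(f"median-based estimate:  ${median_based:,.2f}")
print(f"average-based estimate: ${average_based:,.2f}")
print(f"ratio: {average_based / median_based:.2f}x")
```

Because total cost scales with the sum of tokens, the average is the right per-page figure for budgeting; an estimate built on the median would come in at less than a third of the actual bill.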

What Did People Guess?

Before publishing this analysis, I ran a poll on LinkedIn asking people to predict the average page size in tokens:

| Guess | Votes | Percentage |
| --- | --- | --- |
| 100 | 27 | 21% |
| 1,000 | 50 | 38% |
| 10,000 | 45 | 34% |
| 100,000 | 9 | 7% |

131 people voted. The most popular answer was 1,000 tokens (38%), followed closely by 10,000 tokens (34%). The actual answer? 10,403 tokens on average.

Only a third of respondents got it right. The majority underestimated—perhaps expecting a page of text to be shorter than it actually is when tokenized. What’s interesting is that the median (3,201 tokens) would have made “1,000” a more defensible answer, but averages get skewed heavily by those outlier documents.

The 7% who guessed 100,000 weren’t entirely wrong either—they just described the 99th percentile rather than the average.

Dan Petrovic · Dec 14, 14:11