
How Long Are Web Pages?

An analysis of 44,684 web pages reveals a median content length of 3,201 tokens and an average of 10,403 tokens, highlighting implications for AI systems.


How big is the average web page? A recent analysis of nearly forty-five thousand web pages reveals a surprising gap between what we expect and what is actually out there.

The median web page contains about thirty-two hundred tokens, which is roughly twenty-four hundred words. But the average is much higher, coming in at over ten thousand tokens. This is because the web has a very long tail of massive documents. While half of all pages sit between one thousand and five thousand tokens, the top one percent exceed one hundred and forty thousand tokens.

This distribution has major implications for artificial intelligence. If you are building Retrieval-Augmented Generation systems, ninety-five percent of web pages will fit comfortably within a standard context window. But because of those massive outliers, average processing costs can be three times higher than the median.

Most people underestimate how much content is on a typical page. To build efficient systems, you need to design for the typical three-thousand-token article, while making sure your system can handle the occasional giant document.

A Token Count Analysis of 45,000 Real-World URLs

We recently analyzed 44,684 web pages and measured their content length using Gemini’s token counter. The results reveal fascinating insights about the true scale of web content—and why it matters for AI applications.

| Metric | Value |
| --- | --- |
| Total Pages Analyzed | 44,684 |
| Page Content Tokens | 464,854,727 |
| Total Tokens (all) | 541,062,817 |

The median web page contains roughly 3,200 tokens—equivalent to about 2,400 words or approximately 5 pages of text. However, the average is significantly higher at 10,400 tokens, indicating a strong right-skew from lengthy documents.

| Metric | Tokens |
| --- | --- |
| Median | 3,201 |
| Average | 10,403 |
| 25th percentile | 1,396 |
| 75th percentile | 8,207 |
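The gap between the median and the average is what a right-skewed distribution produces. As a minimal sketch, the four statistics above could be computed from a list of per-page token counts with Python's standard library. The sample here is a synthetic log-normal draw chosen only for illustration; it is an assumption, not the dataset behind this analysis, and `summarize` is a hypothetical helper name.

```python
import random
import statistics


def summarize(token_counts):
    """Return median, mean, and quartiles for per-page token counts."""
    p25, _, p75 = statistics.quantiles(token_counts, n=4)
    return {
        "median": statistics.median(token_counts),
        "mean": statistics.fmean(token_counts),
        "p25": p25,
        "p75": p75,
    }


# Synthetic right-skewed sample (log-normal). NOT the article's data.
rng = random.Random(0)
sample = [int(rng.lognormvariate(8.0, 1.4)) for _ in range(10_000)]

stats = summarize(sample)
for name, value in stats.items():
    print(f"{name:>6}: {value:,.0f}")
print("mean / median:", round(stats["mean"] / stats["median"], 2))
```

In a skewed sample like this, a small number of very large values pulls the mean well above the median while the quartiles barely move.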

Distribution Breakdown

Half of all web pages fall between 1,000 and 5,000 tokens. This represents the “typical” article, blog post, or informational page.

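| Token Range | Pages | Percentage |
| --- | --- | --- |
| 0 – 1,000 | 6,229 | 13.9% |
| 1,000 – 5,000 | 22,299 | 49.9% |
| 5,000 – 10,000 | 6,629 | 14.8% |
| 10,000 – 50,000 | 8,048 | 18.0% |
| 50,000 – 100,000 | 806 | 1.8% |
| 100,000 – 500,000 | 657 | 1.5% |
| 500,000+ | 16 | 0.04% |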
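The percentages can be reproduced from the page counts, and a running total shows how much of the sample sits below each threshold. The sketch below uses only the counts from the table; the helper name `shares` is hypothetical.

```python
# Bucket counts copied from the distribution table.
buckets = [
    ("0 - 1,000", 6_229),
    ("1,000 - 5,000", 22_299),
    ("5,000 - 10,000", 6_629),
    ("10,000 - 50,000", 8_048),
    ("50,000 - 100,000", 806),
    ("100,000 - 500,000", 657),
    ("500,000+", 16),
]


def shares(buckets):
    """Return (label, share %, cumulative share %) for each bucket."""
    total = sum(count for _, count in buckets)
    rows, running = [], 0
    for label, count in buckets:
        running += count
        rows.append((label, 100 * count / total, 100 * running / total))
    return rows


total_pages = sum(count for _, count in buckets)
rows = shares(buckets)
print(f"total pages: {total_pages:,}")
for label, pct, cumulative in rows:
    print(f"{label:>18}  {pct:6.2f}%  cumulative {cumulative:6.2f}%")
```

On these counts, the buckets sum to the 44,684 pages analyzed, and about 98.5% of pages fall below 100,000 tokens.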

Nearly 1 in 5 pages (18%) contain between 10,000 and 50,000 tokens—these are longer articles, comprehensive guides, or pages with significant supplementary content.

The Long Tail

Percentile analysis reveals the extreme outliers:

| Percentile | Tokens |
| --- | --- |
| 90th | 21,839 |
| 95th | 35,852 |
| 99th | 141,410 |
| Maximum | 3,004,502 |

The top 1% of pages exceed 140,000 tokens, which is roughly 220 pages of text at the ratio used above (3,200 tokens to about 5 pages). These are typically:

  1. Full PDF documents (research papers, reports)
  2. Documentation sites
  3. Long-form educational content
  4. Scraped book chapters

The largest page in our dataset contained over 3 million tokens. At the ratio used above, that is roughly 2.25 million words, or more than twenty novels of a typical 80,000 to 100,000 words each.
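The word and page equivalents in this article follow from one rough ratio: 3,200 tokens to about 2,400 words to about 5 pages. A minimal sketch of that conversion is below. The constants are derived from that single example and are approximations; real ratios vary with language and content type, and the function names are hypothetical.

```python
# Ratios implied by the article's median example:
# 3,200 tokens ~ 2,400 words ~ 5 pages.
WORDS_PER_TOKEN = 2_400 / 3_200  # 0.75
WORDS_PER_PAGE = 2_400 / 5       # 480


def tokens_to_words(tokens):
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN


def tokens_to_pages(tokens):
    """Approximate printed pages for a given token count."""
    return tokens_to_words(tokens) / WORDS_PER_PAGE


for label, tokens in [
    ("median", 3_201),
    ("99th percentile", 141_410),
    ("maximum", 3_004_502),
]:
    words = tokens_to_words(tokens)
    pages = tokens_to_pages(tokens)
    print(f"{label:>16}: {words:>12,.0f} words, {pages:>8,.0f} pages")
```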

Implications for AI Systems

Context Window Considerations

With major LLMs offering context windows from 32K to 2M tokens, our findings suggest:

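  1. 95% of web pages are under 36K tokens and about 98.5% are under 100K, so they fit comfortably in a 128K context window
  2. The median page (3,201 tokens) leaves ample room for multi-page retrieval
  3. Only 0.04% of pages (16 of 44,684) exceed 500K tokens, a size that only the largest context windows can hold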
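A quick way to test these claims against a specific model is to compare the percentile values with the window size after setting aside room for the prompt and the response. The sketch below assumes round window sizes (for example, 128,000 for "128K") and a hypothetical reserve of 8,000 tokens; neither figure comes from the analysis.

```python
# Percentile token counts taken from the tables in this article.
PERCENTILES = {
    "p50": 3_201,
    "p90": 21_839,
    "p95": 35_852,
    "p99": 141_410,
    "max": 3_004_502,
}

RESERVED_TOKENS = 8_000  # hypothetical allowance for prompt and output


def fits(page_tokens, window_tokens, reserved=RESERVED_TOKENS):
    """True if the page fits after reserving prompt/output tokens."""
    return page_tokens <= window_tokens - reserved


def pages_per_window(page_tokens, window_tokens, reserved=RESERVED_TOKENS):
    """How many pages of the given size fit in one window."""
    return max(0, (window_tokens - reserved) // page_tokens)


for window in (32_000, 128_000, 1_000_000, 2_000_000):
    fitting = [name for name, t in PERCENTILES.items() if fits(t, window)]
    median_pages = pages_per_window(PERCENTILES["p50"], window)
    print(f"{window:>9,} window: fits {', '.join(fitting) or 'none'}; "
          f"about {median_pages} median-sized pages")
```

Under these assumptions, a 128K window holds a 95th-percentile page with room to spare but not a 99th-percentile page.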

RAG System Design

For Retrieval-Augmented Generation systems:

  1. Chunk wisely: The median page is ~3K tokens—consider this when designing chunk sizes
  2. Handle outliers: The 99th percentile is 44x the median. Long-form content needs different treatment
  3. Budget for variety: A 10-document retrieval could range from 14K tokens (ten 25th-percentile pages) through 32K tokens (ten median pages) to 350K+ tokens (ten 95th-percentile pages)
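The budgeting point comes down to multiplying a per-page token count by the number of retrieved documents, then deciding what to do when the total exceeds the budget. The sketch below shows that arithmetic for the percentiles reported in this article, plus one possible overflow policy (keep documents in ranked order and truncate the first one that does not fit). The policy and the function names are illustrative assumptions, not the method used in this analysis.

```python
# Percentile token counts taken from the tables in this article.
PERCENTILES = {
    "p25": 1_396,
    "p50": 3_201,
    "p75": 8_207,
    "p90": 21_839,
    "p95": 35_852,
}


def retrieval_budget(k, tokens_per_doc):
    """Total tokens for k retrieved documents of the same size."""
    return k * tokens_per_doc


def pack(doc_tokens, budget):
    """Keep documents in ranked order; truncate the one that overflows.

    Returns the number of tokens kept from each included document.
    """
    kept, remaining = [], budget
    for tokens in doc_tokens:
        if remaining <= 0:
            break
        take = min(tokens, remaining)
        kept.append(take)
        remaining -= take
    return kept


for name, tokens in PERCENTILES.items():
    print(f"10 docs at {name}: {retrieval_budget(10, tokens):>8,} tokens")

# Example: three ranked documents packed into a 30,000-token budget.
print(pack([3_000, 20_000, 140_000], budget=30_000))
```

Truncating the overflowing document is the simplest policy; chunk-level retrieval or summarizing long pages are alternatives when the tail of a document matters.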

Methodology Notes

  1. Pages were processed using Gemini’s url_context tool
  2. Token counts reflect the model’s native tokenization
  3. Sample includes a diverse mix of content types: articles, academic papers, product pages, documentation, and PDFs
  4. Zero-token pages (5 total) represent failed fetches or blocked content

While the typical page sits around 3,000 tokens, the distribution has a remarkably long tail. AI systems consuming web content need to account for this variance—both for context management and cost optimization.

For practical applications:

  1. Design for the median (3K tokens) but handle the 99th percentile (140K tokens)
  2. Expect high variance between sources
  3. Budget conservatively—average costs will be 3x median costs due to outliers
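The 3x figure follows directly from the totals reported earlier: total page tokens divided by the number of pages gives the average, and cost scales with the average rather than the median. A minimal sketch of that comparison is below. The price per million tokens is a hypothetical placeholder, not a quoted rate from any provider.

```python
# Figures from this article.
TOTAL_PAGE_TOKENS = 464_854_727
TOTAL_PAGES = 44_684
MEDIAN_TOKENS = 3_201

# Hypothetical input price in dollars per million tokens.
PRICE_PER_MILLION = 0.50


def cost(tokens, price_per_million=PRICE_PER_MILLION):
    """Dollar cost of processing the given number of tokens."""
    return tokens / 1_000_000 * price_per_million


mean_tokens = TOTAL_PAGE_TOKENS / TOTAL_PAGES
median_based = cost(MEDIAN_TOKENS * TOTAL_PAGES)  # every page at the median
actual = cost(TOTAL_PAGE_TOKENS)                  # what the corpus really costs
ratio = actual / median_based

print(f"mean tokens per page: {mean_tokens:,.0f}")
print(f"median-based estimate: ${median_based:,.2f}")
print(f"actual cost:           ${actual:,.2f}")
print(f"actual / estimate:     {ratio:.2f}x")
```

The ratio does not depend on the price chosen, so the underestimate is the same at any per-token rate.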

What Did People Guess?

Before publishing this analysis, I ran a poll on LinkedIn asking people to predict the average page size in tokens:

| Guess | Votes | Percentage |
| --- | --- | --- |
| 100 | 27 | 21% |
| 1,000 | 50 | 38% |
| 10,000 | 45 | 34% |
| 100,000 | 9 | 7% |

131 people voted. The most popular answer was 1,000 tokens (38%), followed closely by 10,000 tokens (34%). The actual answer? 10,403 tokens on average.

Only a third of respondents got it right. The majority underestimated—perhaps expecting a page of text to be shorter than it actually is when tokenized. What’s interesting is that the median (3,201 tokens) would have made “1,000” a more defensible answer, but averages get skewed heavily by those outlier documents.

The 7% who guessed 100,000 weren’t entirely wrong either—they just described the 99th percentile rather than the average.

Dan Petrovic · Dec 14, 14:11

The more I learn from your research studies, the less I think OpenAI can win (I’m saying this with $ on the table)

Thank you for doing both median, mode, and explaining those edge cases.

It’s funny how the outlier data you see in school – the ones you’d throw out – are in practice the hard problems that tell you what organizations are shooting to be that 99th percentile…

Because when you operate in the trillions, your errors are in the millions.

Scale always messes with my mind.

It’ll be very interesting to see what comes next on the web now that machine translations become more acceptable. Will we see model collapse or will there finally be better multilingual results and answers?

Thank you for sharing the kind of 99th percentile work as well.

*No AI was used and I should get back to sleep.

Victor Pan · Dec 14, 09:58

Yeah those million token URLs really broke my pipeline and I was wondering if there was a bug in my code, spent days trying to figure it out and then I LOOKED AT THE DATA and was like… oh…..

Dan Petrovic · Dec 25, 05:10

I wonder if cost is the same always or maybe the initial chunking cost more but – as long as they don’t want to rerank the answer – it doesn’t cost at all.

Łukasz Rogala · Dec 19, 19:50

I’d say their chunking pipeline is the most efficient one on the planet and would love to get my hands on it 🙂

Dan Petrovic · Dec 25, 05:11