
Michael Rübcke

Your article’s conclusion, that LLMs interact with the web through highly curated text chunks rather than full HTML pages, is a foundational technical fact for modern AI search.

Questions:
Was this analysis performed exclusively using the native Web Search tool within the OpenAI Assistants API, or does it reflect broader observations across different proprietary LLM grounding systems? Are the open() and snippet size limitations consistent across models like GPT-4, GPT-4o, and other vendors?

When the model calls open(), how is the window/chunk determined? Is it based purely on a fixed line count, or does the system use any semantic chunking (e.g., stopping the chunk at the nearest H-tag or the end of a complete paragraph) to ensure the returned window is meaningful?

Can you elaborate on the quality of the plaintext extraction? Are complex elements like semantic tables (not just visual layout tables) or structured data (Schema markup) reliably translated into a usable, consumable plaintext format for the model?

As LLMs become truly multimodal, how do you foresee them incorporating image/audio/video context? Will the tool evolve to also return an image/audio/video URL and a descriptive ALT text snippet alongside the main text chunk, or will the visual aspect remain entirely separate?

If content producers deliberately try to ‘spam’ or over-optimize the plaintext extraction to force their content into snippets, what defensive measures do you anticipate the LLM providers will implement to ensure quality and relevance?
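
To make the chunking question above concrete, here is a minimal sketch of the two behaviours I am asking about. It is not how any vendor's open() actually works. The function names, the "#" heading marker, and the sample page are all assumptions made up for illustration.

```python
def fixed_window(lines, start, size):
    """Return `size` lines beginning at index `start`, ignoring structure."""
    return lines[start:start + size]


def semantic_window(lines, start, size):
    """Return at least `size` lines, extended to the next paragraph or heading boundary."""
    end = min(start + size, len(lines))
    # Keep extending until a blank line (paragraph end) or a heading line.
    # Assumption: headings survive extraction as lines starting with "#".
    while end < len(lines) and lines[end].strip() and not lines[end].startswith("#"):
        end += 1
    return lines[start:end]


# Hypothetical extracted plaintext of a page.
page = [
    "# Pricing",
    "The basic plan costs 10 euros",
    "per month and includes support.",
    "",
    "# Contact",
    "Write to us by email.",
]

print(fixed_window(page, 0, 2))     # stops in the middle of the paragraph
print(semantic_window(page, 0, 2))  # runs on to the end of the paragraph
```

With a fixed count of two lines, the pricing sentence is cut in half. The semantic variant returns the complete paragraph at the cost of a less predictable window size.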

on: A Technical Walkthrough of Web Search, Snippets, Expansions, Context Sizes,...
SupportsQuestions · · Nov 22, 07:08

In my opinion, your central thesis is correct. I would add a few points that I consider relevant:

➡️ From Rankings to Retrieval: The future of SEO will be less about the traditional SERP and more about “getting cited” by AI. This means the focus will shift from keyword density and backlink profiles (though these will still be important) to semantic authority, structured data, and content comprehensiveness. Content that is clearly written, factually accurate, and well-organized will be prioritized by the retrieval layer of AI models.

➡️ The Importance of Structured Content: The “Community” quotes in your article are particularly telling. The emphasis on “hyper-curated” inputs and the use of tools suggests a future where content creators will need to think like data architects. Implementing schema markup, creating clear FAQ sections, and using structured headings will be more important than ever to help AIs understand and extract information.

➡️ A “Quality First” Mindset: Your article’s premise reinforces a long-standing SEO principle: quality content wins. If an LLM’s retrieval system is designed to find the most accurate, comprehensive, and authoritative source to answer a user’s question, then the content that embodies those qualities will be the most successful. This moves SEO away from manipulative tactics and toward a focus on genuine expertise and value creation.

➡️ New Metrics and Measurement: As pointed out, traditional measurement approaches such as scraping API data may become less useful for benchmarking visibility. SEOs will need new tools and frameworks to measure their influence. This could involve tracking how often their content is cited by LLMs, analyzing the “knowledge graphs” of models, and understanding how a brand’s narrative is being represented in AI-generated answers.

➡️ Local AI Solution: When an AI model runs on a personal computer, the user’s data (a personal journal, confidential business documents, or a code repository) never leaves the device. This is crucial wherever strict data protection rules apply, such as HIPAA in US healthcare or the GDPR in Europe.

🔺 Implications for SEO: The privacy-first nature of local AIs means that SEO professionals will need to consider how their content is accessed and used. A local AI might be trained on a user’s personal documents and their web history, meaning the “grounding” for a query could be a mix of public and private data. For businesses, this means that providing a local, trustworthy, and well-documented AI solution (e.g., a fine-tuned model for internal use) becomes a competitive advantage.

🔺 Custom Knowledge Bases: A company can train a local AI on its internal knowledge base, including detailed product documentation, sales data, customer support tickets and feedback, and proprietary research. This creates an “expert” AI that has access to information no public model can replicate. For example, a financial firm could have a local AI that understands its specific investment strategies and provides insights that a general-purpose model would miss.

🔺 Demonstrating E-E-A-T: For SEO, this suggests a new way to demonstrate expertise. Instead of just creating publicly visible blog posts, a company can create a “local knowledge pack” or a downloadable, fine-tuned model that users can run on their own devices. This would be a tangible and highly effective way to showcase deep expertise and trustworthiness, as the user is in direct control of the data. The “Experience” component of E-E-A-T is particularly relevant here, as an AI grounded in a specific, firsthand dataset (e.g., a doctor’s medical research) can provide insights that a generalist AI cannot.

➡️ SEO’s Role in the Hybrid Model: The SEO professional’s job will be to ensure their client’s public content is optimized for the retrieval layer of the large, generalist LLMs, while also advising on how to build and maintain the proprietary knowledge bases that power the local, specialist AIs. This means a two-pronged strategy:

🔺 Public Optimization: Ensuring that public-facing content is a primary, authoritative source for AI “grounding” on the open web.

🔺 Private Optimization: Creating well-structured, clean, and machine-readable data for local and enterprise-level AI systems, turning internal company data into a strategic asset.
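
As a small illustration of the schema markup and FAQ point above, here is a minimal sketch that builds schema.org FAQPage markup as JSON-LD. The helper name faq_jsonld and the sample question are assumptions for illustration. Whether a given LLM retrieval layer actually reads this markup is an open question, so treat it as an example of machine-readable structure and not a guaranteed ranking lever.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Hypothetical FAQ content.
faq = faq_jsonld([("What does the basic plan cost?", "10 euros per month.")])

# The serialized result would go inside a <script type="application/ld+json"> tag.
print(json.dumps(faq, indent=2))
```

The same question and answer pairs can feed both the visible FAQ section and the markup, which keeps the human-readable and machine-readable versions in sync.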

on: OpenAI’s latest model is trained to be intelligent, not knowledgeable. Wait,...
SupportsSuggests · · Aug 11, 07:07