TL;DR

The pieces in this series so far have been about how AI engines work: the retrieval pipeline, fan-out, dense ranking, the patents underneath. This piece is about what your site has to look like for the pipeline to be able to use it.

iPullRank's "Technical Foundations and Setup for AI Search" lays out a framework around RAG (retrieval-augmented generation) and five technical pillars. Most of the content below is a tighter, opinionated version of that framework, with notes on what to audit and how BeCited measures the same signals.

RAG, in one paragraph

An AI engine receives a query. It retrieves a candidate set of documents from an index using a hybrid of lexical search (keyword overlap, typically BM25) and semantic search (vector cosine similarity over embeddings). It fuses the two ranked lists, often with reciprocal rank fusion. It feeds the top-ranked passages, together with the query, into a language model that composes the answer. Citations are selected from the passages that informed the answer.

The model can only cite what made it into the candidate set. Everything else (no matter how authoritative, no matter how well-linked) is invisible to that query.
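The gating effect can be sketched in a few lines. The documents, query, and three-dimensional "embeddings" below are all illustrative (real systems use learned vectors with hundreds of dimensions and BM25 rather than raw token overlap):

```python
# Toy hybrid retrieval: a document must rank in at least one list
# to enter the fused candidate set. Everything else is invisible.

def lexical_score(query, doc):
    """Crude BM25 stand-in: count of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "hybrid retrieval fuses lexical and semantic search",
    "d2": "our quarterly report covers revenue growth",
    "d3": "semantic search ranks passages by vector similarity",
}
# Pretend 3-dim embeddings; real pipelines use 768+ dimensions.
emb = {"query": [1.0, 0.2, 0.0],
       "d1": [0.9, 0.3, 0.1], "d2": [0.0, 0.1, 1.0], "d3": [0.8, 0.1, 0.2]}

query = "how does semantic search rank passages"
lex = sorted(docs, key=lambda d: -lexical_score(query, docs[d]))
sem = sorted(docs, key=lambda d: -cosine(emb["query"], emb[d]))

top_k = 2
candidates = set(lex[:top_k]) | set(sem[:top_k])
print(candidates)  # d2 never enters the set; the model cannot cite it
```

Note that d2 could be the most authoritative page on the site; for this query it is simply not in the candidate set, so it does not exist.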

"95% of SEO tools still only perform lexical analysis."

— iPullRank, Technical Foundations and Setup for AI Search


The vendor gap matters because most teams still measure their content against tools that only see half the system. A passage can be invisible in a keyword-overlap audit and dominant in vector retrieval, or the reverse. Auditing only one side gives a false signal.

The five pillars

1. Site information architecture

Treat the site as a navigable knowledge graph. Three things matter most:

The framing iPullRank uses: structure content so it reads well for humans and parses cleanly for crawlers. Both audiences should be able to follow the relationships.

2. Entity mapping and structured data

RAG pipelines ingest structured data with high confidence because it tokenizes cleanly. JSON-LD is the dominant format. The patterns that matter:

The five entity types that matter most across the typical site: organization (the brand itself), person (authors, executives, advisors), product or service (what you sell), event or location (where you operate), and article (the page itself).

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#org",
      "name": "Example Inc.",
      "url": "https://example.com",
      "sameAs": [
        "https://en.wikipedia.org/wiki/Example_Inc",
        "https://www.linkedin.com/company/example"
      ]
    },
    {
      "@type": "Article",
      "@id": "https://example.com/post#article",
      "headline": "...",
      "author": { "@id": "https://example.com/team/jane#person" },
      "publisher": { "@id": "https://example.com/#org" }
    }
  ]
}

The pattern to notice: the article references the publisher and the author by @id rather than restating their attributes. The graph is one document; the relationships are explicit.
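One way to sanity-check the pattern is to verify that every bare @id reference in a @graph resolves to a node declared in the same document. A minimal sketch (the graph is a trimmed version of the example above, with the author node deliberately missing so the check has something to find):

```python
import json

# Collect declared @ids, then walk the graph for bare references
# ({"@id": ...} with no other keys) and report any that dangle.

doc = json.loads("""
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Organization", "@id": "https://example.com/#org",
     "name": "Example Inc."},
    {"@type": "Article", "@id": "https://example.com/post#article",
     "publisher": {"@id": "https://example.com/#org"},
     "author": {"@id": "https://example.com/team/jane#person"}}
  ]
}
""")

declared = {node["@id"] for node in doc["@graph"] if "@id" in node}

def references(value):
    """Yield @id values used as bare references ({"@id": ...} only)."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for v in value.values():
                yield from references(v)
    elif isinstance(value, list):
        for item in value:
            yield from references(item)

dangling = {r for node in doc["@graph"] for r in references(node)} - declared
print(dangling)  # the author reference resolves to nothing in this graph
```

A dangling reference is not fatal (the @id may be declared on another page), but within one document every reference you can close, you should.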

3. Crawlability

Crawlability for AI search is harder than for classical search because the bot population is larger and the failure modes are different. Items to audit:

4. Rendering

Rendering strategy is the highest-leverage technical decision for AI search. The model needs the answer in the initial HTML response.

Rendering strategies

| Strategy | Best for | AI search risk |
| --- | --- | --- |
| Static (SSG) | Evergreen content, blog posts, docs | Lowest. HTML is complete on first response. |
| Server-side rendering (SSR) | Dynamic pages with fresh data | Low. HTML is complete; fresh data still arrives in the initial response. |
| Hybrid (SSR + hydration) | Apps with critical above-the-fold content | Low to medium, as long as the citable content is server-rendered. |
| Pure client-side (CSR) | Internal apps, dashboards | Highest. Many crawlers do not execute JS; Googlebot defers JS-rendered content to a second-pass index. |
| Dynamic rendering for bots | Legacy SPA where SSR migration is expensive | Medium. Works, but adds infrastructure and divergent-content risk. |

The rendering rule. Whatever you want the model to cite has to be in the initial HTML response. If the answer arrives via fetch after page load, the answer is invisible to half the crawlers in the population.
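The rule reduces to a check you can run yourself: fetch the page without executing JavaScript and look for the passage you want cited. The fetch is left to the caller here; the sample SSR and CSR payloads are illustrative:

```python
# Sketch of the rendering rule as a test: does the raw HTML response
# (no JavaScript executed) already contain the citable passage?

def visible_to_non_js_crawlers(initial_html: str, passage: str) -> bool:
    """True if the passage is present in the server-rendered HTML."""
    return passage.lower() in initial_html.lower()

ssr_page = "<html><body><p>LCP should stay under 2.5 seconds.</p></body></html>"
csr_shell = ('<html><body><div id="root"></div>'
             '<script src="/app.js"></script></body></html>')

print(visible_to_non_js_crawlers(ssr_page, "under 2.5 seconds"))   # True
print(visible_to_non_js_crawlers(csr_shell, "under 2.5 seconds"))  # False
```

The CSR shell is what a non-JS crawler sees: an empty root div and a script tag. The answer is in the bundle, which the crawler never runs.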

5. Performance

Slow elements produce three compounding problems for AI search:

The thresholds (per Google's web.dev guidance): LCP under 2.5s, INP under 200ms, CLS under 0.1. Above 4s LCP, 500ms INP, or 0.25 CLS, the page is failing the standard most engines reference.
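Those two cut-offs per metric map directly onto pass/warn/fail bands. A small sketch (the band names echo this article's audit language; the thresholds are the web.dev values quoted above):

```python
# Banding Core Web Vitals: below the first threshold passes, above
# the second fails, anything between needs improvement ("warn").
# Units: LCP in seconds, INP in milliseconds, CLS unitless.

THRESHOLDS = {
    "lcp": (2.5, 4.0),
    "inp": (200, 500),
    "cls": (0.1, 0.25),
}

def band(metric: str, value: float) -> str:
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "pass"
    if value <= poor:
        return "warn"
    return "fail"

print(band("lcp", 2.4), band("inp", 450), band("cls", 0.3))
```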


Writing for synthesis: structural and content rules

iPullRank distinguishes structural rules (how pages and entities are organized) from content rules (how sentences and paragraphs are written). Both matter, but the content rules are the ones most teams undertrain on.

Structural rules

Content rules

The triples rule. Read your most important paragraphs and underline every subject-predicate-object claim you can extract. If the paragraph yields zero clean triples, rewrite it. Models retrieve and cite the triples; they ignore the connective tissue.
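The underlining exercise can be approximated in code. The sketch below is deliberately naive: it only counts sentences shaped "Subject verb object" where the verb comes from a small illustrative list, whereas real extraction uses a dependency parser. It is enough to show the difference between triple-dense and triple-free prose:

```python
import re

# Toy triple counter: matches "Capitalized subject + known verb + object"
# per sentence. The verb list is illustrative, not exhaustive.

VERBS = r"(runs|covers|supports|requires|returns|measures|grades|audits)"

def rough_triples(paragraph: str):
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
        m = re.match(
            rf"([A-Z][\w']*(?:\s+[\w']+)*?)\s+{VERBS}\s+(.+?)[.!?]?$",
            sentence,
        )
        if m:
            triples.append((m.group(1), m.group(2), m.group(3)))
    return triples

good = "BeCited audits four AI engines. The scan grades fifteen signals."
vague = "It helps with visibility and it can be quite useful overall."
print(rough_triples(good))
print(rough_triples(vague))  # [] -- no clean triples; rewrite the paragraph
```

The vague paragraph fails twice over: the subject is a pronoun and the verbs are hedges. Nothing in it survives extraction.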

The path forward

"Your answers are going to come from everywhere — traditional search, chatbots, voice assistants, Reddit, YouTube."

— iPullRank, Technical Foundations and Setup for AI Search

The implication is that technical foundations are not a Google problem; they are a multi-engine problem. The signals that make a site retrievable to ChatGPT search, Perplexity, Claude, and Gemini overlap, but each engine has its own crawler population and its own indexing cadence. Getting the foundations right benefits all four; getting them wrong fails all four together.

How BeCited audits these foundations

Our site readiness check covers 15 signals across six tiers, mapped against the five-pillar framework above:

Each check returns pass, warn, or fail and contributes to a weighted letter grade. Profile-level overrides change weights per business type (local services drop agentic readiness to zero; SaaS bumps it to ten).
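The roll-up logic can be sketched as follows. The weights, signal names, and grade bands here are illustrative stand-ins, not BeCited's actual methodology; the point is how profile overrides (like zeroing a weight) change the grade:

```python
# Weighted pass/warn/fail roll-up into a letter grade.
# All weights and bands below are hypothetical examples.

CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}

def letter_grade(results: dict, weights: dict) -> str:
    total = sum(weights.values())
    earned = sum(weights[s] * CREDIT[results[s]] for s in weights)
    pct = 100 * earned / total
    for floor, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if pct >= floor:
            return grade
    return "F"

results = {"robots": "pass", "schema": "warn", "cwv": "fail", "agentic": "pass"}
# Hypothetical local-services profile: agentic readiness weighted to zero.
weights = {"robots": 10, "schema": 10, "cwv": 5, "agentic": 0}
print(letter_grade(results, weights))
```

With a zero weight, the agentic check contributes nothing either way; bump its weight and the same results produce a different grade, which is the whole purpose of per-profile overrides.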

Frequently asked questions

What is RAG and why does it matter for SEO?

Retrieval-augmented generation (RAG) is the architecture behind most AI answer engines. The model retrieves relevant documents from an index, embeds them and the query as vectors, ranks by similarity, and feeds the top passages to a language model that composes the answer. The implication for SEO is that the retrieval layer matters as much as the ranking layer. Sites that are not crawlable, not chunked into clean passages, or not structurally clear cannot enter the retrieval set, which means they cannot be cited regardless of authority signals.

What is hybrid retrieval and reciprocal rank fusion (RRF)?

Hybrid retrieval combines lexical retrieval (keyword matching, typically BM25) with semantic retrieval (vector similarity over embeddings) into one ranked list. Reciprocal rank fusion merges the two ranked lists by summing 1/(k+rank) for each document across the lists. The fused ranking captures exact-match queries (where lexical wins) and semantic queries (where vector wins) without requiring the system to choose between them.
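The fusion formula is short enough to implement directly (k = 60 is the constant used in the original RRF paper; the document IDs and list orders below are illustrative):

```python
# Reciprocal rank fusion: each document scores the sum of 1/(k + rank)
# over every ranked list it appears in; fused order is by total score.

def rrf(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d3", "d2"]   # BM25 order
semantic = ["d1", "d4", "d3"]  # vector-similarity order
print(rrf([lexical, semantic]))  # docs ranked in both lists fuse to the top
```

Note that d3, ranked second and third, fuses ahead of d4 and d2, each of which appears in only one list: agreement between retrievers outweighs a single strong rank.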

Why is JSON-LD with @graph and @id important for AI search?

JSON-LD is the structured-data format AI engines parse most cleanly because it is JSON, which embeds and tokenizes well, and it expresses graph relationships explicitly. The @graph keyword lets you declare multiple connected entities (organization, person, product, article) in one block. The @id keyword gives each entity a stable URI that other JSON-LD blocks across the site can reference. The combination produces a navigable entity graph the engine can traverse.

What rendering strategy is best for AI crawlers?

Static site generation is best for evergreen content because the HTML contains the full text at fetch time. Server-side rendering works for dynamic pages that need fresh data; the HTML is still complete on first response. Pure client-side rendering is the highest-risk pattern for AI search: many crawlers do not execute JavaScript, and even Googlebot, which does, defers JS-rendered content to a second rendering pass. The safe rule: the answer the model needs should be in the initial HTML response.

What are semantic triples and why do they matter for content?

A semantic triple is a subject-predicate-object structure: "BeCited [subject] runs audits across [predicate] four AI engines [object]." Vector retrieval and entity-extraction systems read content more reliably when claims are written as triples than as compound or implied statements. The practical writing rule: name the subject, name the action, name the object. Avoid pronouns that obscure the subject. Specific triples produce extractable facts; vague prose produces no extractable facts.

What technical signals does BeCited audit?

BeCited's site readiness check covers 15 technical signals across six tiers: crawlability and discovery (robots.txt with AI-bot classification, llms.txt, sitemap), structured metadata (JSON-LD schema.org, OpenGraph meta tags, heading structure), content extraction signals (quotable claims, semantic HTML, FAQ format, E-E-A-T signals), content quality (freshness audit, quotability score), entity and agent signals (Wikipedia/Wikidata presence, Organization schema, sameAs, agentic readiness for SaaS), and Core Web Vitals (LCP, INP, CLS via the PageSpeed Insights API).

15 technical signals, graded

Audit your site against the five technical pillars in 60 seconds.

BeCited's free site scan grades every signal in this article (robots, schema, freshness, quotability, Core Web Vitals) with pass/warn/fail bands and a weighted letter grade. No sales call required.

Run Free Site Scan · See $2k Full Audit

Sources cited. The five-pillar framing, the RAG primer, the rendering-strategy table, the writing-for-synthesis rules (semantic triples, structural and content guidance), and the "answers come from everywhere" framing are drawn from iPullRank's Technical Foundations and Setup for AI Search. The 15-signal audit framework, AI-bot classification (training vs retrieval), and weighted-grade methodology reflect BeCited's own site-readiness implementation. Core Web Vitals thresholds reference Google's web.dev guidance.