- BeCited grades every audited site against 15 AI-readiness checks across six tiers, summing to a 100-point score that maps to a letter grade.
- The single biggest technical mistake we see is blocking training bots (GPTBot, Google-Extended) in a way that also blocks retrieval bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot). 71% of sites that block training bots have this anti-pattern.
- Schema.org JSON-LD (12 points) is the single highest-weighted check, and entity readiness (8 points) sits in the top band. Both are usually fixable in a week.
- Core Web Vitals (LCP, INP, CLS) are part of the audit at 6 points; SaaS profiles bump them to 8 because web app performance is a primary user signal.
- Profile-specific overrides matter: local services zero out agentic readiness; SaaS bumps it from 5 to 10. Apply the weights that fit your context.
Site readiness is the technical layer of GEO. It is what determines whether AI crawlers can reach your content, parse it cleanly, quote it accurately, and understand who you are as an entity.
BeCited runs 15 checks before any prompts are captured. The result is a letter grade with check-by-check pass/warn/fail signals, weighted contributions, and prioritized recommendations. The 15 checks are grouped into six tiers, mapped to the three pillars of GEO (retrievability, citability, recognizability).
The full list with weights
All weights default to a 100-point scale, but each business profile can override individual weights. Setting a weight to 0 excludes the check from the grade denominator entirely — the check still runs and reports, but does not affect the grade. Local services, for instance, zero out agentic readiness. SaaS bumps agentic readiness from 5 to 10 because MCP and agent-discovery signals are critical for SaaS visibility, and bumps Core Web Vitals from 6 to 8 because web app performance is a primary user signal.
| Check | Tier | Weight |
|---|---|---|
| robots.txt | Crawlability | 8 |
| llms.txt / llms-full.txt | Crawlability | 4 |
| sitemap.xml | Crawlability | 5 |
| JSON-LD schema.org | Structured metadata | 12 |
| OpenGraph & meta tags | Structured metadata | 8 |
| Heading structure | Structured metadata | 5 |
| Quotable claims | Content extraction | 8 |
| Semantic HTML | Content extraction | 7 |
| FAQ content format | Content extraction | 3 |
| E-E-A-T signals | Content extraction | 5 |
| Content freshness | Content quality | 8 |
| Quotability score | Content quality | 8 |
| Entity readiness | Entity & agent | 8 |
| Agentic readiness | Entity & agent | 5 |
| Core Web Vitals | Performance | 6 |
| Total | | 100 |
The default weights sum to 100. Profile overrides change the denominator (a zeroed-out check is removed entirely; a bumped check increases the available points), and each effective grade denominator is reported alongside the score so the math stays defensible. The grade itself is a percentage of points earned over points available, mapped to a letter (A–F).
Tier 1: Crawlability & discovery
If AI crawlers cannot reach your content, nothing else matters. This tier is the foundation of retrievability.
1. robots.txt — AI crawler classification (8 pts)
BeCited classifies AI bots as training (GPTBot, Google-Extended, ClaudeBot, CCBot) versus retrieval (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot). Blocking training bots is a defensible policy choice. Blocking retrieval bots makes you invisible to AI search.
The most common anti-pattern: brands block GPTBot to "protect" content but accidentally block OAI-SearchBot too.
71% of sites that block training bots also block retrieval bots like OAI-SearchBot and PerplexityBot, making them invisible to AI search. (BuzzStream, 2025)
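To make the fix concrete, here is a minimal robots.txt sketch of the defensible configuration: training bots blocked, retrieval bots explicitly allowed. Bot names match the classification above; the domain is a placeholder.

```text
# Retrieval bots: allow these or disappear from AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Training bots: blocking these is a policy choice
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```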
2. llms.txt & llms-full.txt (4 pts)
An emerging standard for telling LLMs which content is most quotable, structured for fast ingestion. Adoption is still low and the spec is informal, but the upside is real: a clean llms.txt gives engines a curated index of your most authoritative pages.
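Because the spec is informal, treat the following as one reasonable sketch rather than a canonical format; it follows the llmstxt.org convention of a markdown index, and every page and description is hypothetical.

```markdown
# Example Co

> Example Co is a field-service scheduling platform for plumbing and HVAC companies.

## Key pages
- [Pricing](https://example.com/pricing): plans, limits, and billing FAQ
- [Product overview](https://example.com/product): core features and integrations

## Docs
- [API quickstart](https://example.com/docs/quickstart): authentication and first request
```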
3. sitemap.xml (5 pts)
The classic sitemap is still load-bearing for AI crawlers. We check that it exists, that it is reachable from robots.txt, and that lastmod values are populated — freshness signals feed directly into the freshness check below.
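In sketch form, this is what the check wants to see (URLs and dates hypothetical); the sitemap itself should be referenced from robots.txt via the Sitemap: directive, as in the robots.txt sketch above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/pricing</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/ai-readiness-checklist</loc>
    <lastmod>2025-02-02</lastmod>
  </url>
</urlset>
```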
Tier 2: Structured metadata
Once a crawler has your page, structured metadata tells it what the page is about. This is the highest-weighted tier in the audit.
4. JSON-LD schema.org (12 pts)
The biggest single technical lever in the audit. Independent studies put the citation uplift from proper structured data at 1.8–3.2x. We check for type-appropriate schema (LocalBusiness or Service for local; SoftwareApplication for SaaS; Product for consumer goods) plus Organization, FAQPage, AggregateRating, and Review where relevant.
1.8–3.2x citation uplift for pages with proper JSON-LD schema.org markup vs. pages without. (Princeton GEO & LSEO)
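A minimal sketch for a local-services profile; every value is a placeholder, and the properties shown (address, aggregateRating) are standard schema.org vocabulary.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Example Plumbing Co",
  "url": "https://example.com",
  "telephone": "+1-512-555-0100",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "100 Example St",
    "addressLocality": "Austin",
    "addressRegion": "TX",
    "postalCode": "78701"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "reviewCount": "214"
  }
}
</script>
```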
5. OpenGraph & meta tags (8 pts)
Title, description, og:title, og:description, og:image. AI engines often quote the meta description verbatim when the page is cited. A weak description is a wasted billboard.
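A sketch of the tag set with placeholder copy; note the description is written as a liftable sentence, not a keyword list.

```html
<title>Emergency Plumbing in Austin | Example Plumbing Co</title>
<meta name="description" content="Example Plumbing Co provides 24/7 emergency plumbing in Austin with a 60-minute response guarantee, licensed and insured since 2009.">
<meta property="og:title" content="Emergency Plumbing in Austin | Example Plumbing Co">
<meta property="og:description" content="24/7 emergency plumbing in Austin with a 60-minute response guarantee.">
<meta property="og:image" content="https://example.com/img/og-card.png">
```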
6. Heading structure (5 pts)
One H1 per page, descriptive H2s, and a consistent hierarchy. Headings phrased as questions tend to earn 40% more citations because they map directly to user prompts.
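A sketch of the shape this check rewards, with question-phrased H2s (topic and wording hypothetical):

```html
<h1>Water Heater Replacement in Austin</h1>
<h2>How much does water heater replacement cost?</h2>
<h2>Tank vs. tankless: which lasts longer?</h2>
<h3>Upfront cost</h3>
<h3>Lifetime efficiency</h3>
<h2>How long does installation take?</h2>
```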
Tier 3: Content extraction signals
This tier is research-backed against the Princeton GEO paper and LSEO content extraction studies. It governs whether your content is actually quotable, not just present.
7. Quotable claims (8 pts)
Self-contained 50–150-word chunks with answer-first structure. AI engines pull blocks, not paragraphs. A page full of conversational prose with no extractable claims will lose to a competitor with one well-formed answer block.
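A sketch of a well-formed answer block; the claim and numbers are invented for illustration.

```markdown
**How long does a tankless water heater last?**

A tankless water heater typically lasts 20 years or more, roughly double the
8-12 year lifespan of a tank unit, because there is no stored hot water
corroding a vessel year-round. Budget for an annual descaling flush; skipping
it is the most common reason units fail early.
```

Answer first, one claim, self-contained: an engine can lift that paragraph without needing anything else on the page.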
8. Semantic HTML (7 pts)
<main>, <article>, <section>, lists, tables — not div soup. Engines parse semantic elements faster and trust them more. We count semantic tags vs. unstructured div containers and flag the imbalance.
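The same content both ways, as a sketch:

```html
<!-- Div soup: parsers must guess what matters -->
<div class="wrap"><div class="col"><div class="txt">Starter: $29/mo</div></div></div>

<!-- Semantic: roles are explicit -->
<main>
  <article>
    <section>
      <h2>Pricing</h2>
      <ul>
        <li>Starter: $29/mo</li>
        <li>Pro: $79/mo</li>
      </ul>
    </section>
  </article>
</main>
```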
9. FAQ content format (3 pts)
FAQPage schema, native HTML <details> elements, or Q&A-style headings. FAQs are over-represented in AI citations because their structure aligns with the prompt-and-answer pattern of generative search.
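The native-HTML variant, with hypothetical content:

```html
<details>
  <summary>Do you charge a call-out fee?</summary>
  <p>No. Estimates are free within the Austin metro, and work begins only after a written quote.</p>
</details>
```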
10. E-E-A-T signals (5 pts)
Person/author schema, byline markup, credential language ("certified", "licensed", "N years of experience"). Per AI Overview research, 96% of cited pages have strong E-E-A-T signals. A position-6 page with E-E-A-T markup beats a position-1 page without.
96% of pages cited in Google AI Overviews carry strong E-E-A-T signals — author markup, credential language, Person schema. (AI Overview research)
Tier 4: Content quality
Two checks that govern the quality of the content itself, beyond formatting.
11. Content freshness audit (8 pts)
Last-Modified HTTP headers, JSON-LD dateModified, and sitemap lastmod. Perplexity weights content under 30 days old at roughly 3.2x. Half of all AI citations come from content less than 11 months old. A site that refreshes nothing earns a shrinking share of the answer.
3.2x citation weight Perplexity applies to content updated within the last 30 days vs. older content. (Digital Bloom & AirOps)
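The three freshness signals side by side, in sketch form (dates hypothetical). The Last-Modified header lives in the HTTP response and the lastmod in the sitemap, not the page markup; both are shown as comments for orientation.

```html
<!-- HTTP response header: Last-Modified: Tue, 04 Feb 2025 09:30:00 GMT -->
<!-- sitemap.xml entry: <lastmod>2025-02-04</lastmod> -->

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "AI Readiness Checklist",
  "datePublished": "2024-06-10",
  "dateModified": "2025-02-04"
}
</script>
```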
12. Quotability score (8 pts)
A composite of paragraph length distribution, answer-first pattern detection, statistic density, and self-contained chunk count. Engines reward content that is easy to lift; this score quantifies how lift-friendly your pages are.
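The production formula is internal to BeCited, but a toy sketch shows the flavor of such a composite; every threshold and weight below is invented for illustration.

```python
import re

def quotability_sketch(text: str) -> float:
    """Toy quotability composite; NOT BeCited's actual formula."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    # Share of paragraphs in the liftable 50-150 word band
    sized = sum(1 for p in paragraphs if 50 <= len(p.split()) <= 150)
    size_score = sized / len(paragraphs)
    # Answer-first: penalize paragraphs that open with a windup
    windup = re.compile(r"(?i)^(in today's|as we all know|when it comes to)")
    answer_first = sum(1 for p in paragraphs if not windup.match(p)) / len(paragraphs)
    # Statistic density: numbers per 100 words, capped at 1.0
    words = len(text.split())
    stats = len(re.findall(r"\d+(?:\.\d+)?%?", text))
    stat_score = min(stats / max(words / 100, 1), 1.0)
    # Equal-weighted composite on a 0-1 scale
    return round((size_score + answer_first + stat_score) / 3, 2)
```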
Tier 5: Entity & agent signals
This tier covers whether engines understand who you are at the entity level — and, for SaaS, whether agentic systems can interact with your product.
13. Entity readiness (8 pts)
Wikipedia presence, Wikidata entry, Organization schema with sameAs links to authoritative profiles, and consistent brand naming across the web. AI engines are reluctant to recommend entities they cannot disambiguate.
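The disambiguation signal in sketch form; the sameAs URLs are placeholders for your real profiles.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Co",
  "url": "https://example.com",
  "sameAs": [
    "https://en.wikipedia.org/wiki/Example_Co",
    "https://www.wikidata.org/wiki/Q00000000",
    "https://www.linkedin.com/company/example-co",
    "https://github.com/example-co"
  ]
}
</script>
```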
14. Agentic readiness (SaaS only, 5 pts)
AGENTS.md, OpenAPI spec, public API documentation, and MCP manifest. As Anthropic's Model Context Protocol, Google's UCP, and Visa's Agentic Ready standards mature, this signals to AI engines that your product can be invoked, not just described. Local service businesses zero out this check; SaaS profiles bump it to 10.
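These conventions are young and none are fully standardized, so treat this as one hedged sketch of an AGENTS.md: a short index pointing agents at a product's machine-readable surfaces, with every URL hypothetical.

```markdown
# Example Co Agent Guide

## What agents can do
- Check plan limits and pricing via the public API
- Create and query scheduling jobs programmatically

## Machine-readable surfaces
- OpenAPI spec: https://example.com/openapi.json
- API docs: https://example.com/docs/api
- MCP server: https://mcp.example.com (see docs for connection details)
```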
Tier 6: Performance (Core Web Vitals)
The newest tier. Performance is a primary user signal that applies across every business profile, but it carries different weight depending on whether your site is a brick-and-mortar marketing page or a web app.
15. Core Web Vitals (6 pts)
BeCited measures the three Core Web Vitals via the Google PageSpeed Insights API: Largest Contentful Paint (LCP), Interaction to Next Paint (INP), and Cumulative Layout Shift (CLS). Pass thresholds (per Google web.dev) are LCP under 2.5s, INP under 200ms, and CLS under 0.1. Warn thresholds are 4s, 500ms, and 0.25; above those, fail.
Network or rate-limit failures degrade to warn rather than breaking the audit. SaaS profiles bump this check from 6 to 8 because web app performance is a primary user signal; local service profiles keep it at 6.
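A sketch of the field-data lookup, assuming the PageSpeed Insights v5 endpoint and its published CrUX metric keys (verify field names against Google's API reference; note the API reports the CLS percentile multiplied by 100):

```python
import requests

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def core_web_vitals(url: str, api_key: str) -> dict:
    """Grade CWV field data against web.dev thresholds (sketch)."""
    def grade(value: float, good: float, poor: float) -> str:
        return "pass" if value <= good else ("warn" if value <= poor else "fail")
    try:
        resp = requests.get(PSI_ENDPOINT, params={"url": url, "key": api_key}, timeout=60)
        resp.raise_for_status()
        metrics = resp.json()["loadingExperience"]["metrics"]
        return {
            "LCP": grade(metrics["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"], 2500, 4000),
            "INP": grade(metrics["INTERACTION_TO_NEXT_PAINT"]["percentile"], 200, 500),
            # CLS percentile comes back multiplied by 100
            "CLS": grade(metrics["CUMULATIVE_LAYOUT_SHIFT_SCORE"]["percentile"] / 100, 0.1, 0.25),
        }
    except (requests.RequestException, KeyError):
        # Network or rate-limit failure: degrade to warn, don't break the audit
        return {"LCP": "warn", "INP": "warn", "CLS": "warn"}
```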
How the grade is calculated
The grade is a weighted percentage, not a checklist score. Each check returns pass / warn / fail. Pass earns full weight, warn earns half, fail earns zero. Total points earned divided by sum of effective weights gives a 0–100 score, which maps to A (85+), B (70–84), C (55–69), D (40–54), or F (under 40).
Profile overrides change the denominator, not the grading curve. If a check is set to weight 0, it is excluded entirely from the grade calculation; the check still runs and surfaces in the report, but does not affect the letter.
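The arithmetic is small enough to show directly; this sketch follows the pass/warn/fail credit and letter boundaries stated above, with check names abbreviated and hypothetical.

```python
CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}

def grade(results: dict[str, str], weights: dict[str, float]) -> tuple[float, str]:
    """results maps check -> pass/warn/fail; weights are the effective per-profile weights."""
    # Weight 0 removes a check from the denominator entirely
    active = {check: w for check, w in weights.items() if w > 0}
    earned = sum(w * CREDIT[results[check]] for check, w in active.items())
    score = 100 * earned / sum(active.values())
    for cutoff, letter in [(85, "A"), (70, "B"), (55, "C"), (40, "D")]:
        if score >= cutoff:
            return score, letter
    return score, "F"

# A SaaS profile failing schema and warning on agentic readiness (weight bumped to 10)
score, letter = grade(
    {"robots_txt": "pass", "json_ld": "fail", "agentic": "warn"},
    {"robots_txt": 8, "json_ld": 12, "agentic": 10},
)  # -> 43.3, "D"
```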
The output of all 15 checks lives in site-readiness.json alongside the prompt-based audit data, so technical fixes and content fixes are graded together. A brand can have an A+ GEO Score on prompts but a D on site readiness if their robots.txt blocks retrieval bots; we would still flag that as the top action item.
Where most brands fail first
Across audits we have run, three patterns recur:
- Robots.txt over-blocks. Either the site blocks all AI bots indiscriminately, or (more often) it blocks the wrong subset. A retrieval-bot block is a self-inflicted invisibility cloak.
- Schema is missing or wrong-typed. Many sites have JSON-LD for WebSite or BreadcrumbList but nothing for the entity that matters — LocalBusiness, SoftwareApplication, or Product. AI engines cannot ground claims to a missing entity.
- Quotability is low. Long meandering paragraphs, no answer-first structure, no statistics. The competitor with shorter, denser blocks gets quoted; the long-form site gets ignored even when its content is better.
None of the 15 checks are theoretical. Each one corresponds to a measurable change in citation behavior we (and the broader research community) can document. Fixing the highest-weighted, lowest-effort checks first is usually the fastest way to move a GEO Score in 60 days.
Frequently asked questions
What is site readiness in a GEO audit?
Site readiness is the technical layer of GEO. It is the set of signals that determine whether AI crawlers can reach your content, parse it cleanly, quote it accurately, and understand who you are at the entity level. BeCited grades every audited site against 15 checks across six tiers, each contributing weighted points to a 100-point letter grade.
What is the most common technical mistake brands make?
Blocking training bots like GPTBot and Google-Extended in a way that also blocks retrieval bots like OAI-SearchBot, ChatGPT-User, PerplexityBot, and Claude-SearchBot. A BuzzStream analysis found that 71% of sites that block training bots have this misconfiguration. The retrieval-bot block makes the site invisible to AI search even when the brand is well known.
Why does JSON-LD schema carry the highest weight?
JSON-LD schema.org markup is weighted at 12 points, the largest single check, because independent studies put the citation uplift from proper structured data at 1.8 to 3.2x. AI engines use schema to ground claims to specific entities. Without LocalBusiness, SoftwareApplication, or Product schema, the engine has no entity to anchor the claim to and is far less likely to recommend the brand by name.
What is the difference between training and retrieval AI bots?
Training bots crawl content for model training. The major ones are GPTBot, Google-Extended, ClaudeBot, and CCBot. Retrieval bots fetch live content to answer a user query in real time. The major ones are OAI-SearchBot, ChatGPT-User, PerplexityBot, and Claude-SearchBot. Blocking training bots is a defensible policy choice. Blocking retrieval bots makes you invisible to AI search at the moment of the query.
Are Core Web Vitals part of AI readiness?
Yes. Core Web Vitals are the 15th check, weighted at 6 points. BeCited measures Largest Contentful Paint (LCP), Interaction to Next Paint (INP), and Cumulative Layout Shift (CLS) via the Google PageSpeed Insights API. Pages that come in under 2.5 seconds for LCP, 200 milliseconds for INP, and 0.1 for CLS earn full credit; warn and fail thresholds follow Google web.dev. Performance matters for AI readiness because retrieval bots and human users converge on the same page; a page that times out is invisible to both.
How do profile-specific weight overrides work?
Each business profile can override individual check weights. Setting a weight to 0 excludes the check from the grade denominator entirely; the check still runs and reports, but does not affect the letter. Local services zero out agentic readiness because brick-and-mortar businesses do not need AGENTS.md or MCP manifests. SaaS bumps agentic readiness from 5 to 10 because MCP and agent-discovery signals are critical for SaaS visibility, and bumps Core Web Vitals from 6 to 8 because web app performance is a primary user signal.
The article lists the checks. The audit tells you which ones you fail.
Every BeCited audit runs all 15 site readiness checks plus 100–300 buying-intent prompts across ChatGPT, Gemini, Perplexity, and Claude. Results come with a calibrated rubric (Cohen's κ = 0.722) and a prioritized action plan.
Sources cited. Robots.txt misconfiguration rate (71%) is from BuzzStream's bot-traffic analysis. JSON-LD citation uplift (1.8–3.2x) aggregates findings from the Princeton GEO paper and LSEO. Heading-question citation uplift (40%) is from internal BeCited audit data corroborated against the Princeton paper. The 96% E-E-A-T statistic is from AI Overview research. Perplexity freshness weighting (3.2x for content under 30 days) and the half-of-citations-under-11-months figure come from Digital Bloom and AirOps. The full check list, weights, and profile-specific overrides are documented in CLAUDE.md.