- 14,014 ranking attributes across 2,596 modules leaked from Google's Content Warehouse API in May 2024. Mike King of iPullRank published the first detailed analysis.
- The leak directly contradicted Google's public statements on five fronts: domain authority (it exists, called siteAuthority), clicks as a ranking signal (NavBoost uses goodClicks, badClicks, lastLongestClicks), Chrome data, the sandbox for new sites, and dwell-time measurement.
- Reranking happens through Twiddlers: small filter functions (FreshnessTwiddler, QualityBoost, NavBoost, dozens of demotions) that adjust the order after the core ranker runs.
- The same trust signals that drive classical ranking (site authority, click engagement, topical cohesion, freshness tiers) drive AI search visibility, because retrieval pipelines reuse them as source-quality input.
- If your content strategy is built around what Google's spokespeople have said publicly, parts of it are pointed at the wrong target.
On May 27, 2024, Mike King published one of the most detailed dissections of Google's ranking system ever made public. He did not have to break in. The internal documentation for Google's Content Warehouse API had been accidentally pushed to a public GitHub repository, googleapis/elixir-google-api, on March 27 and removed on May 7. By the time the deletion happened, an external documentation service had cached the entire schema. Apache 2.0 license. 14,014 attributes. 2,596 modules.
14,014
Ranking attributes exposed across 2,596 modules when Google's Content Warehouse API documentation accidentally went public.
Google API leak, May 2024
It is not a leak of training data or scoring weights. It is the equivalent of finding a parts catalog for an engine you have only ever heard running. You cannot see the tuning, but you can see exactly what parts exist.
What the leak actually contains
The schema names every module that participates in Google's pipeline, the attributes those modules read and write, and in many cases a one-line description of what each attribute is for. King grouped them into five functional layers.
Crawling & indexing
Trawler manages crawl queues. Alexandria is the primary index. SegIndexer places documents into tiers. TeraGoogle is long-term disk storage.
Rendering
HtmlrenderWebkitHeadless executes JavaScript (the engine moved from WebKit to Chromium).
Link processing
LinkExtractor reads outbound and inbound links. WebMirror handles canonicalization and deduplication.
Ranking
Mustang is the primary scorer. Ascorer is the pre-ranking pass. NavBoost reranks on click signals. FreshnessTwiddler adjusts for recency.
Serving & assembly
SuperRoot routes queries. SnippetBrain generates snippets. Glue assembles the SERP using user-behavior data. Cookbook generates runtime signals.
Quality & demotions
QualityBoost, RealTimeBoost, Baby Panda V2, and a long list of demotion modules (anchor mismatch, SERP demotion, exact-match domain, product review).
The architecture matters because it shows that ranking is not one model. It is a pipeline. Mustang produces a base score, Ascorer feeds the candidate set, and then a series of Twiddlers reorder, boost, and demote. Each Twiddler is small, easy to ship, and easy to roll back. That is how Google can experiment quickly without rebuilding the index.
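The Twiddler pattern described above can be sketched in a few lines. This is an illustrative reconstruction, not leaked code: the class names echo the leak (FreshnessTwiddler, NavBoost), but every multiplier, threshold, and field here is an assumption made up for the sketch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    url: str
    score: float          # base score from the core ranker ("Mustang" in the leak)
    age_days: int = 0
    good_clicks: int = 0
    bad_clicks: int = 0

# A "Twiddler" here is just a function that nudges one result's score.
# The 1.1 freshness boost and the click weighting are invented for illustration.
def freshness_twiddler(r: Result) -> float:
    return 1.1 if r.age_days < 30 else 1.0

def navboost_twiddler(r: Result) -> float:
    clicks = r.good_clicks + r.bad_clicks
    if clicks == 0:
        return 1.0
    # Demote results whose click history skews toward bad clicks.
    return 0.8 + 0.4 * (r.good_clicks / clicks)

TWIDDLERS: list[Callable[[Result], float]] = [freshness_twiddler, navboost_twiddler]

def rerank(results: list[Result]) -> list[Result]:
    # Each Twiddler multiplies into the base score; none rebuilds the index.
    for r in results:
        for t in TWIDDLERS:
            r.score *= t(r)
    return sorted(results, key=lambda r: r.score, reverse=True)
```

The design property the leak highlights survives even in this toy: each Twiddler is independent, so one can be shipped or rolled back without touching the core ranker.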
Five things Google denied that turned out to be true
The most widely quoted finding from the leak is the gap between public statements and the schema.
| Public claim | Spokesperson | Status | Evidence in leak |
|---|---|---|---|
| "There is no domain authority" | Gary Ilyes, John Mueller | Contradicted | siteAuthority in CompressedQualitySignals |
| "We don't use clicks for rankings" | Ilyes, Paul Haahr, Mueller | Contradicted | NavBoost with goodClicks, badClicks, lastLongestClicks, unicornClicks |
| "There is no sandbox for new sites" | John Mueller | Contradicted | hostAge attribute used "to sandbox fresh spam" |
| "We don't use Chrome data for search" | Matt Cutts, Mueller | Contradicted | chromeInTotal at site level; "Chrome Visits" feeding RealTimeBoost |
| "Dwell time is made-up crap" | Gary Ilyes | Partially contradicted | lastLongestClicks measures extended engagement |
"Lied is harsh, but it's the only accurate word to use here."
— Mike King, iPullRank
King's framing is sharp on purpose. The implication for content strategists is concrete: if you architected your program around the official narrative, parts of it are pointed at the wrong signals. Sites that were demoting click-friendly headlines because "clicks don't matter" were actively suppressing their own NavBoost score.
The signals that matter most (and why GEO inherits them)
Once you accept that the leaked schema is real, several signals jump from "speculative" to "load-bearing." They matter for classical SEO. They matter even more for GEO because AI engines that use Google Search Grounding (Gemini and indirectly Google AI Overviews) inherit the same retrieval and quality layers, and independent engines like ChatGPT and Perplexity reuse the underlying logic.
Site authority is real
The siteAuthority attribute lives inside CompressedQualitySignals, the same module that holds Panda demotions and other site-level adjustments. Two related attributes, homepagePagerankNs and homePageInfo (with trust levels NOT_HOMEPAGE, NOT_TRUSTED, PARTIALLY_TRUSTED, FULLY_TRUSTED), apply homepage authority to new pages on the same site.
The practical takeaway: building authority on your homepage and hub pages is not vanity. It is what gives every new page a head start on retrieval.
Clicks and engagement are reranking input
NavBoost is the click-based reranker. It uses goodClicks, badClicks, lastLongestClicks, unsquashedClicks (pre-normalized), and unicornClicks (clicks from a high-trust user segment). It runs on a 13-month rolling window. Reference query counts from NavBoost feed into Panda and other systems.
13 mo
Rolling click window NavBoost uses for reranking — despite years of public denials that clicks are a ranking signal.
Google API leak, May 2024
The implication for AI search: the same engagement signals attest to source quality and pull a page into AI answers. If your pages get clicked and read, they get cited. If they get clicked and bounced, they get demoted and become less likely to be selected as a source in the first place.
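The 13-month rolling window can be pictured as a simple pruning aggregator. This is a guess at the mechanics: the leak names the click attributes and the window length, but the aggregation logic and the ClickWindow class below are invented for illustration.

```python
from collections import deque
from datetime import date, timedelta

WINDOW = timedelta(days=30 * 13)  # ~13-month rolling window described in the leak

class ClickWindow:
    """Keeps only clicks inside the rolling window, counted by type.
    Click-type names mirror leaked attributes; everything else is assumed."""

    def __init__(self):
        self.clicks = deque()  # (day, kind), appended in chronological order

    def add(self, day: date, kind: str):
        self.clicks.append((day, kind))

    def counts(self, today: date) -> dict:
        # Drop clicks that have aged out of the window.
        while self.clicks and today - self.clicks[0][0] > WINDOW:
            self.clicks.popleft()
        out = {"goodClicks": 0, "badClicks": 0, "lastLongestClicks": 0}
        for _, kind in self.clicks:
            out[kind] += 1
        return out
```

The point of the sketch: a "Panda refresh" under this model is nothing more than the window sliding forward, which is exactly how the leak characterizes it.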
Topical cohesion: siteFocusScore and siteRadius
The schema includes a topic embedding for each site. siteFocusScore measures how concentrated a site is around a topic. siteRadius measures how far each page deviates from that topic embedding. Unfocused sites get penalized. Niche authority gets rewarded.
Why this matters for AI. Vector retrieval pipelines look for the strongest passage embedding match. Sites with tight topical cohesion produce passages that cluster more tightly in vector space, which makes them easier to retrieve consistently. The leak gives a name to a pattern AI engines reinforce.
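One plausible way to operationalize these two attributes: score focus as the mean similarity of page embeddings to the site centroid, and radius as the worst deviation from it. The leak supplies only the names siteFocusScore and siteRadius; the formulas, function names, and toy 2-D vectors below are assumptions.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def site_focus_and_radius(page_embeddings):
    """Hypothetical: focus ~ mean similarity of pages to the site centroid,
    radius ~ how far the most off-topic page drifts from it."""
    c = centroid(page_embeddings)
    sims = [cosine(v, c) for v in page_embeddings]
    focus = sum(sims) / len(sims)
    radius = 1 - min(sims)
    return focus, radius
```

Under this model the strategic advice writes itself: every off-topic page drags the minimum similarity down, widening the radius for the whole site.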
Freshness has multiple date signals
Google does not pick one date for a document. It uses bylineDate (explicit publication date), syntacticDate (extracted from URL or title), semanticDate (content-derived estimate), and "last good click date" as a content decay signal. The presence of all four implies that the system actively distrusts a single declared date.
For AI search visibility: keeping dateModified in your JSON-LD aligned with actual content changes, and updating long-form pages on a real schedule, holds the freshness signal across both classical and AI pipelines.
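A minimal way to keep dateModified honest is to generate it from the same data that tracks real edits. The schema.org property names (datePublished, dateModified) are standard; the helper function itself is just a sketch.

```python
import json
from datetime import date

def article_jsonld(headline: str, published: date, modified: date) -> str:
    """Minimal schema.org Article JSON-LD. The point: dateModified should be
    driven by actual content changes, not bumped on every deploy."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)
```

Because Google cross-checks the declared date against syntacticDate and semanticDate, a dateModified that moves without matching content changes is exactly the kind of inconsistency the multiple-date design exists to catch.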
Document truncation
The leak documents that Mustang caps token count and truncates documents past a threshold. The corollary: top-of-page placement of your most important claims is critical. A specific, quotable fact buried in section 11 may not even be in the indexed copy of the page.
This is also true for AI extraction. Models that pull passages preferentially weight earlier tokens. If your TL;DR contains the answer, it has a much higher chance of being lifted than a paragraph deep in the article.
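Truncation is easy to reason about with a toy model. The real token cap and tokenizer are unknown; the whitespace split and the 50-token cap below are placeholders chosen only to make the effect visible.

```python
def indexed_copy(text: str, token_cap: int = 50) -> str:
    """Crude stand-in for Mustang's document cap: whatever falls past the
    cap simply never reaches the index. Cap value is invented."""
    tokens = text.split()
    return " ".join(tokens[:token_cap])
```

Run against a page whose key fact sits after sixty words of preamble, and the fact is absent from the indexed copy, which is the whole argument for putting quotable claims at the top.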
The Panda formula, finally explainable
The leak made one thing visible that SEO professionals have been guessing at for over a decade: the Panda update is not a black box. It is approximately:
(Independent Links / Reference Queries) × Modifier
Calculated at domain, subdomain, and subdirectory level. Reference queries come from NavBoost's rolling click window. "Panda refreshes" are window updates: when the rolling window shifts, the ratio shifts, and some sites move.
That gives an interpretable definition of "low-quality site": one that has received many links but few queries that result in successful clicks, relative to its peers. The same definition translates almost word-for-word to AI search: a site that other sources reference but that users do not actually engage with is a site that retrieval systems eventually deprioritize.
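The reconstructed formula is simple enough to compute directly. The ratio structure comes from the leak analysis; the function name, the zero-query handling, and the example numbers are assumptions.

```python
def panda_score(independent_links: int, reference_queries: int,
                modifier: float = 1.0) -> float:
    """(Independent Links / Reference Queries) x Modifier, per the leak's
    reconstruction. A high ratio means many links but little click-attested
    demand -- the interpretable definition of a low-quality site."""
    if reference_queries == 0:
        return float("inf")  # links with zero successful-click queries
    return (independent_links / reference_queries) * modifier
```

Note the window dependence: reference_queries comes from NavBoost's rolling window, so the score moves when the window moves even if the link graph is static.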
What to change, concretely
Three actions transfer directly from the leak's findings to GEO performance.
- Treat siteAuthority as real. Concentrate brand-building work where the leak says authority is calculated: the homepage, hub pages, and the trusted sub-sections that link out to deeper content. Do not rely on the homepage to rank for everything; rely on it to lend credit to everything else.
- Optimize for engagement, not just clicks. lastLongestClicks measures post-click dwell. A high CTR with a fast bounce is a worse signal than a moderate CTR with long sessions. Restructure pages so users find what they came for early and stay longer.
- Tighten topical cohesion. If siteRadius measures deviation from your topic embedding, every off-topic post pulls the radius wider. Either commit to a topic and prune drift, or split unrelated lines of business onto separate properties.
The leak did not invent any of these levers. It confirmed them. For BeCited audits, the most useful effect of the leak was that it ended the debate about whether engagement, authority, and topical cohesion matter. They are in the source code of the system. They matter for Google. And because every major AI engine reuses pieces of that pipeline, they matter for AI search.
Frequently asked questions
What was the 2024 Google API leak?
In March 2024, internal documentation for Google Search's Content Warehouse API was accidentally published to a public code repository (googleapis/elixir-google-api) under an Apache 2.0 license. It exposed 14,014 attributes across 2,596 modules describing Google's ranking, indexing, and reranking systems. Mike King of iPullRank published the first detailed analysis on May 27, 2024.
What is siteAuthority and why does it matter?
siteAuthority is a site-wide authority score that appears in the leaked Compressed Quality Signals module. Its existence directly contradicts years of Google statements that there is no such thing as overall domain authority. The signal informs how new pages on a site are scored before they accumulate their own page-level data, and it matters for AI search because the same trust signals influence which sources retrieval systems lean on.
Does Google use clicks for rankings?
Yes. The leak documents NavBoost, a click-based reranking system that uses goodClicks, badClicks, lastLongestClicks, unsquashedClicks, and unicornClicks (clicks from a high-trust user segment). This contradicts repeated public statements from Google representatives that clicks are not used in rankings. NavBoost has its own 13-month rolling click window and feeds reference query counts into other systems.
What are Twiddlers?
Twiddlers are reranking functions that operate after the Ascorer pre-ranking pass and before the final SERP is assembled. They behave like WordPress filters: each one can boost or demote results based on a specific signal (freshness, quality, click data, diversity). Twiddlers include FreshnessTwiddler, QualityBoost, RealTimeBoost, NavBoost, and dozens of demotion modules.
What does the leak imply for AI search visibility?
AI engines that use Google Search Grounding (notably Gemini and indirectly Google AI Overviews) inherit the same ranking and trust signals exposed in the leak: site authority, NavBoost click signals, Chrome engagement data, fresh content tiers, and entity embeddings. Independent AI engines like ChatGPT and Perplexity use different retrieval pipelines, but the same underlying logic applies: third-party trust signals, click-attested engagement, and topical cohesion are what pull a source into the answer.
What should we change in our SEO strategy because of the leak?
Three things. First, treat siteAuthority as real and concentrate brand-building work where the leak says authority is calculated (homepage and trusted sub-sections). Second, optimize for engagement, not just clicks: lastLongestClicks suggests post-click dwell matters. Third, tighten topical cohesion: siteRadius measures deviation from your site's topic embedding, and unfocused sites get penalized.
The leak names the signals. Our audits measure them on your site.
BeCited audits score your site against the trust, engagement, and cohesion signals that drive both Google ranking and AI source selection. 95% confidence intervals. Root-cause attribution on every gap.
Sources cited. Module names, attribute lists, the confirmed-vs-contradicted table, the Panda formula, and the Twiddler architecture are drawn from Mike King's Secrets from the Algorithm: Google Search's Internal Engineering Documentation Has Leaked (iPullRank, May 27, 2024). The translation of those signals to AI retrieval pipelines reflects BeCited's audit methodology.