Utsav Gandhi discusses the findings of the May 2024 Google SEO leak, which gave analysts a novel, albeit speculative, look into how Google might promote and demote content. The findings have possible implications for businesses and news organizations struggling to compete for views and suggest that transparency could become an increasingly important factor as new, artificial intelligence-powered competitors enter the search market.


From the launch of Google in 1998 to the subsequent widespread adoption of mobile technology, the rise of voice assistants, and today’s rapidly evolving artificial intelligence landscape, the experience of searching for information online seems to have remained consistent and reliable. But perhaps for the first time, Google Search, which has dominated the internet search market for over two decades, is at a pivotal crossroads, and its dominance could unravel before our eyes in the next handful of years.

Google faces several pitfalls in the suddenly evolving market for its core product: It has come under intense scrutiny for monopolizing the search market, reducing clicks to news websites in favor of AI-generated summaries, and playing fast and loose with data privacy. In addition, users, particularly businesses that rely on Google Search for publicity and content dissemination, have long been frustrated by the secrecy surrounding Google Search’s underlying architecture and by the signals they must send to help users find their content, a practice known as “search engine optimization” (SEO).

Playing a crucial role in how information is ranked, presented, and displayed to consumers, SEO has spawned a $74 billion industry, mainly in consulting and marketing practices that leverage Google’s rules to organize the web’s content. The rules and algorithms determining Google’s SEO often result in make-or-break moments for small businesses or media organizations. Over the years, Google has remained mostly tight-lipped about how SEO functions, but a recent leak of internal documents (code dated March 2024) could provide some of the most comprehensive revelations yet. While the leak reveals only the variable names used in the raw data for SEO, and nothing about the underlying algorithms per se, it could have implications for the future of Google Search in a market that is flirting with the introduction of new and serious competitors. Google, for its part, has confirmed the leak’s authenticity but expressed that some documents might be outdated or incomplete.

The insights from the leak can be thought of as the ingredients Google uses in SEO rather than the recipe of how those ingredients are weighed in the final outcome. More than 2,500 modules (or pages representing different components of SEO) were leaked in API code documentation from Google’s internal “Content API Warehouse,” shedding light on more than 14,000 attributes (features or signals that Google may use to determine ranking). Documentation such as this “warehouse” exists at almost every technology company, helping familiarize internal staff on a project with the data available for it. However, it is rarely seen by the public. SEO experts Rand Fishkin and Mike King first revealed information on the leak and independently published analyses of the documents and their contents.

The new information reveals attributes that marketers and SEO experts suspected existed, as well as items they didn’t even know could be tracked: 

1. It confirmed the existence of “twiddlers”—re-ranking algorithms that can boost or demote content with penalties or rewards. It seems intuitive that these algorithms define how the internet is structured and presented to us, but the issue here is the lack of further transparency in the nature, scope, and impact of these algorithms. If these are “reranking” algorithms, when and how exactly are they deployed? When precisely is content demoted or boosted?

2. Some attributes exposed in the leak suggest that Google detects how commercial a page or document is, and this can be used to prevent a page from being considered for a query with informational intent. Take, for example, a user searching for the “Stanley Cup” (the trophy awarded to the winner of the National Hockey League championship) versus searching for Stanley cups (giant tumblers that have gone viral on TikTok). This seems useful, but further data on error rates (false positives and false negatives) would be helpful, especially for researchers.

3. The leak confirmed the importance of previously known ranking factors, such as content quality (“E-A-T,” or expertise-authoritativeness-trustworthiness, as it is known in the SEO world), backlinks (hyperlinks from one website to another), regularly updated content, and user interaction metrics (clicks, time spent on the site, etc.). The more of these factors a web page exhibits, the higher it ranks. The leak also showed that Google keeps track of the topics that a web page publishes (for example, ProMarket publishing extensively on antitrust) and how much each page (on ProMarket, for example) differs from that larger topic (antitrust). Again, these factors are where the value of the leak is notable, even if limited to the “ingredients” but not the “recipe.”
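The “twiddler” idea described above, a re-ranking pass that applies boosts and penalties on top of base relevance scores, can be illustrated with a minimal sketch. Everything here is hypothetical: the leak exposes attribute names, not this logic, and the flag names and weights below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    url: str
    base_score: float               # relevance score from the core ranker
    flags: set = field(default_factory=set)

# Hypothetical twiddlers: each inspects a result and returns a multiplier.
def freshness_boost(r):
    return 1.2 if "recently_updated" in r.flags else 1.0

def spam_demotion(r):
    return 0.5 if "suspected_spam" in r.flags else 1.0

TWIDDLERS = [freshness_boost, spam_demotion]

def rerank(results):
    """Apply every twiddler's boost or penalty, then sort by adjusted score."""
    def adjusted(r):
        score = r.base_score
        for twiddler in TWIDDLERS:
            score *= twiddler(r)
        return score
    return sorted(results, key=adjusted, reverse=True)

results = [
    Result("a.example", 0.9, {"suspected_spam"}),
    Result("b.example", 0.7, {"recently_updated"}),
    Result("c.example", 0.8),
]
# b.example rises above a.example despite a lower base score.
print([r.url for r in rerank(results)])
```

The point of the sketch is the transparency question the article raises: the final ordering depends entirely on which twiddlers run and how heavily they boost or demote, and none of that is visible from the attribute names alone.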

The leak also reveals information contrary to several claims about SEO that Google has made over the years:

1. Historically, Google has denied that click-through rates are important in SEO ranking (i.e., if the third result on a page is clicked more often than the first, it will, over time, rise to the second or first position). The leak, say SEO analysts, suggests otherwise. This could have consequences (at least in the pre-AI search preview era) for “clickbait,” because what users click on a search results page is essentially the title of a web page.

2. Google has claimed that website ranking does not follow a “sandbox” pattern—that newer websites are not made to wait before they can rank highly. The leak suggests otherwise by revealing a metric called “hostAge.” Why would Google collect data on a website’s age if it doesn’t use it?

3. Google has also claimed that if lots of people are clicking on a website, its webpage will rank highly and that it doesn’t use data from Google Chrome for ranking. The leak suggests otherwise: mechanisms for Google to collect Chrome data have been in place for years, which raises questions about the purpose of collecting data if not to use it. For example, the initial motivation behind launching Google Chrome was to gather more clickstream data, a detailed log of a user’s activity, including the pages they visit, how long they spend on each page, and where they go next. Recent research has also shed more light on how Chrome helps Google solidify its dominance.

4. Perhaps more importantly for smaller websites, the leaked documents indicate that while Google is not necessarily torching their visibility, it is also not going out of its way to value them highly. In a piece published recently, a business called HouseFresh that evaluates and reviews air purifiers describes how it has “virtually disappeared” from search results: its search traffic has decreased 91 percent in recent months, from around 4,000 visitors a day in October 2023 to 200 a day in 2024. This drop in traffic to HouseFresh has coincided with a series of Google algorithm modifications, after which HouseFresh reviews started getting buried below recommendations from brand-name publications. “It seemed as if media companies were making a grab for affiliate revenue without the expertise that sites like HouseFresh had worked hard to cultivate—and it looked as if Google was rewarding them for doing so,” explains this analysis.
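The click-through effect that analysts infer from the leak, where a result clicked more often than expected for its position drifts upward over time, can be sketched as follows. The expected-CTR curve and the adjustment rule here are illustrative assumptions, not Google’s actual mechanism:

```python
# Hypothetical expected click-through rate by position (position 1 first).
EXPECTED_CTR = [0.30, 0.15, 0.10, 0.07, 0.05]

def rerank_by_clicks(results):
    """results: list of (url, observed_ctr) in current ranked order.

    Re-sort by how much each result over- or under-performs the
    click-through rate expected at its current position."""
    performance = [
        (observed / EXPECTED_CTR[position], url)
        for position, (url, observed) in enumerate(results)
    ]
    return [url for _, url in sorted(performance, reverse=True)]

current = [("first.example", 0.28), ("second.example", 0.12), ("third.example", 0.14)]
# third.example outperforms the CTR expected at position 3, so it rises to the top.
print(rerank_by_clicks(current))
```

Under this toy model, the third result earning clicks above its positional expectation overtakes the first, which is exactly the behavior Google long denied and the leak appears to contradict.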

Other scholars have raised important concerns about Google’s chokehold on political and health information simply because of its dominance in search. Relatedly, the leak revealed that during the Covid-19 pandemic, Google employed whitelists for websites that could appear high in the results for Covid-related searches. Similarly, during democratic elections, Google has employed whitelists for sites that should be shown (or demoted) for election-related information. There are references in several places to flags for “isCovidLocalAuthority” and “isElectionAuthority” in the documentation. Again, the leak does not provide further information on how these authorities are determined. 

The leak underscores the complexity and opaqueness that small business owners and media organizations must navigate to maintain an online presence and generate revenue. As the dominant search engine (with over 90 percent of the search market in the United States), Google determines what news content and businesses people see. Google’s SEO is one of the main determinants of competition in the online and offline economy. Without access to its rules, businesses play a guessing game to compete with one another.

The leak also raises concerns about taking corporate messaging at face value and underscores the need for marketers to continue experimenting in coordination with user experience design and content communications. It raises questions about whether the same rules apply to Google’s own web properties, such as Travel, Shopping, and Flights.

The leak also has ramifications for Google’s precarious position within the rapidly evolving search landscape. A group of new rivals have emerged that run on nascent AI software. These include OpenAI’s newly announced direct competitor to Google (“SearchGPT”), Microsoft’s long-ignored search engine Bing (now powered by ChatGPT) and its AI assistant Copilot, and Perplexity, a high-profile AI-powered chatbot. The quality of these new search engines varies, but they inject competition into a long-stagnant market. Google’s lack of transparency over its search SEO rules has frustrated its users. For over a decade, users had little choice in their search engine, and Google could potentially manipulate its SEO without losing consumers. That may no longer be the case. 

At the recent Stigler Center antitrust and competition conference, author and advocate Cory Doctorow revealed that he has started paying $10 monthly to use a new search engine called Kagi. Instead of monetizing users for targeted marketing, Kagi offers three monthly pricing tiers that provide an ad-free, personalized search experience where the “incentives of your information provider are aligned with what’s best for you, not what’s best for advertisers.” Doctorow said at the conference, “The problem isn’t that Google scrapes us. The problem is that we can’t scrape Google.” In other words, we don’t know why Google shows us the information it does. Google’s SEO leak has not produced outrage, but it has raised questions about transparency. Even if Google ignores the leak and the questions it raises, it will have a much harder time ignoring the new search competitors who may offer higher quality and perhaps more transparent services to consumers.

Author Disclosure: The author reports no conflicts of interest. You can read our disclosure policy here.

Articles represent the opinions of their writers, not necessarily those of the University of Chicago, the Booth School of Business, or its faculty.