Crawlers, search engines and the sleaze of generative AI companies

The boom of generative AI products over the past few months has prompted many websites to take countermeasures.

The basic concern goes like this:

AI products depend on consuming large volumes of content to train their language models (the so-called large language models, or LLMs for short), and this content has to come from somewhere. AI companies see the openness of the web as permitting large-scale crawling to obtain training data, but some website operators disagree, including Reddit, Stack Overflow and Twitter.

The answer to the underlying question, whether the openness of the web permits large-scale crawling for training data, will no doubt be litigated in courts around the world.

This article will explore this question, focusing on the business and technical aspects. But before we dive in, a few points:

Although this topic touches on legal arguments, and I include some in this article, I am not a lawyer, I am not your lawyer, and I am not giving you any advice of any sort. Talk to your favorite lawyer cat if you need legal advice.
I used to work at Google many years ago, mostly in web search. I do not speak on behalf of Google in any way, shape or form, even when I cite some Google examples below.
This is a fast-moving topic. It is guaranteed that between the time I’ve finished writing this and the time you are reading it, something major will have happened in the industry, and it’s guaranteed I will have missed something!

The ‘deal’ between search engines and websites

We begin with how a modern search engine, like Google or Bing, works. In overly simplified terms, a search engine works like this:

The search engine has a list of URLs. Each URL has metadata (sometimes called “signals”) that indicate the URL may be important or useful to show in the search engine’s results pages.
The search engine has a crawler, a bot, which is a program that fetches these URLs in an order of “importance” determined by the signals. For web search, Google’s crawler is called Googlebot and Bing’s is Bingbot (and both have many more for other purposes, like ads). Both bots identify themselves in the user-agent header, and websites can verify both programmatically to be sure that content is being served to the real search engine bot and not a spoof.
Once the content is fetched, it is indexed. Search engine indices are complicated databases that contain the page content along with a huge amount of metadata and other signals used to match and rank the content to user queries. An index is what actually gets searched when you type in a query in Google or Bing.
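The programmatic verification mentioned above is usually done with a pair of DNS lookups: a reverse lookup on the requesting IP address, a check that the hostname belongs to the search engine, and a forward lookup to confirm the hostname resolves back to the same IP. Here is a minimal Python sketch of that idea. The function name, the injectable lookup parameters and the suffix table are my own illustration, and the hostname suffixes reflect my understanding of what Google and Bing publish, so check their current documentation before relying on them.

```python
import socket

# Hostname suffixes the search engines are understood to document for their
# crawlers. Treat this table as an assumption and confirm it against the
# engines' own verification guides.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def verify_crawler(ip, bot, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check for `bot`.

    The lookup functions are injectable (a hypothetical convenience) so the
    logic can be exercised without touching the network.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    # gethostbyname_ex only returns IPv4 addresses; a fuller version would
    # use getaddrinfo to cover IPv6 as well.
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse DNS record: cannot be verified

    # Step 1: the hostname must belong to the search engine's domain.
    if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES[bot]):
        return False

    # Step 2: the hostname must resolve back to the original IP address.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A server would call `verify_crawler(request_ip, "googlebot")` only when the user-agent header claims to be Googlebot, and would cache the result, since DNS lookups on every request are slow.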

Modern search engines, the good polite ones at least, give the website operator full control over the crawling and indexing.

The Robots Exclusion Protocol is how this control is implemented, through the robots.txt file, and meta tags or headers on the web page itself. These search engines voluntarily obey the Robots Exclusion Protocol, taking a website’s implementation of the Protocol as a directive, an absolute command, not a mere hint.

Importantly, the default position of the Protocol is that all crawling and indexing are allowed – it is permissive by default. Unless the website operator actively takes steps to implement exclusion, the website is deemed to allow crawling and indexing.
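To make the permissive default concrete, here is a small sketch using the robots.txt parser from Python’s standard library. The bot name ExampleBot, the example.com URL and the helper function are hypothetical. Real crawlers use their own parsers and may resolve edge cases differently, so take this as an illustration of the default rather than of any particular search engine’s behavior.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Check a URL against robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/some-article"

# An empty robots.txt contains no exclusions, so everything is allowed.
EMPTY = ""

# Exclusion only happens when the operator writes a rule.
BLOCK_ONE_BOT = """\
User-agent: ExampleBot
Disallow: /
"""

print(can_crawl(EMPTY, "ExampleBot", URL))          # True
print(can_crawl(BLOCK_ONE_BOT, "ExampleBot", URL))  # False
print(can_crawl(BLOCK_ONE_BOT, "OtherBot", URL))    # True
```

Note the last line: a crawler that is not named in the file, and not covered by a wildcard group, is allowed by default.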

This gives us the basic framework of the deal between the search engines and websites: By default, a website will be crawled and indexed by a search engine, which, in turn, points searchers directly to the original website in its search results for relevant queries.

This deal is fundamentally an economic exchange: the costs of producing, hosting, and serving the content are incurred by the website, but the idea is that the traffic it gets in return pays that back with a profit.

Note: I’m intentionally ignoring a whole slew of related arguments here, about who has more power in this exchange, who makes more money, fairness, and much more. I’m not belittling these – I just don’t want to distract from the core topic of this article.

This indexing for traffic approach comes up elsewhere, for example when search engines are allowed to index content behind a paywall. It’s the same idea: the website shares content in return for being shown in search results that point searchers back to the website directly.

And at each step of this deal, if the publisher wants to block all or some crawling or indexing in any way, the publisher has several tools available through the Robots Exclusion Protocol. Anything the publisher still allows to be crawled and indexed is allowed because the website gets a direct benefit from being shown in the search results.

This argument in some form has actually been used in courts, in what has come to be known as the “robots.txt defense”, and has basically been upheld; see this short list of court cases, many involving Google, and this write-up from 2007 that’s not entirely happy about it.

LLMs are not search engines

It should now be very clear that an LLM is a different beast from a search engine.

A language model’s response does not directly point back to the website(s) whose content was used to train the model. There is no economic exchange like we see with search engines, and this is why many publishers (and authors) are upset.

The lack of direct source citations is the fundamental difference between a search engine and an LLM, and it is the answer to the very common question of “why should Google and Bing be allowed to scrape content but not OpenAI?” (I’m using a more polite phrasing of this question.)

Google and Bing are trying to show source links in their generative AI responses, but these sources, if shown at all, are not the complete set.

This opens up a related question: Why should a website allow its content to be used to train a language model if it doesn’t get anything in return?

That’s a very good question – and probably the most important one we should answer as a society.

LLMs do have benefits despite the major shortcomings with the current generation of LLMs (such as hallucinations, lying to the human operators, and biases, to name a few), and these benefits will only increase over time while the shortcomings get worked out.

But for this discussion, the important point is to realize that a fundamental pillar of how the open web functions right now is not suited for LLMs.

The sleaziness

That’s apparently not a problem for AI companies who are interested in training large models only for their own economic benefit.

OpenAI used several datasets as training data inputs (details here for GPT3), and OpenAI intentionally does not disclose the training datasets for GPT4.

Although OpenAI uses many arguments to justify not disclosing information about GPT4’s training data (discussed here), the key point for us remains: We don’t know which content was used to train it, and OpenAI does not show that in the ChatGPT responses.

Does OpenAI’s data collection obey the Robots Exclusion Protocol? Does it include copyrighted text, like textbooks or other books? Did they get permission from any website or publisher? They don’t say.

Brave Software’s super shady approach

If OpenAI’s approach is problematic, Brave Software (the maker of Brave browser and the Brave search engine) takes an even more problematic approach and stance when it comes to search and AI training data.

The Brave search engine depends heavily on what’s called the Web Discovery Project. The approach is quite elaborate and documented here, but I’ll highlight one key fact: Brave does not appear to have a centralized crawler they operate, and none of the crawls identify themselves as crawlers for Brave, and (sit down for this) Brave sells the scraped content with rights that Brave gives the buyer for AI training.

There is a lot in that sentence, so let’s parse it out.

Brave search uses the Brave browser as a distributed crawler. This help article documents it in the following FAQ question and answer:

Is the Web Discovery Project a crawler?

In a way, yes. The Web Discovery Project processes fetch jobs from Brave’s web crawler. Every few seconds or minutes, the browser might be instructed to fetch a webpage and send the HTML back to Brave. However, this fetching has no impact on your browsing history or cookies—it’s done as a private fetch API call. For extra safety, the fetch job domains are pre-selected from a small set of innocuous and reputable domains.

What is the Web Discovery Project? – Brave Search

The Fetch API is a web standard functionality built into modern browser engines, including the one Brave uses. Its common use is to fetch content to show to users in the browser. For our purposes, we immediately know it is a user’s browser requesting the website’s content on behalf of Brave’s search engine.

Interestingly, a Reddit thread from June 2021 adds more details and confusion. One reply from a Brave representative is very interesting (highlights mine):

We have our own crawler, but it doesn’t contain an user-agent string (just as Brave, the browser, also does not contain a unique user-agent string) to avoid potential discrimination. That said, we have talked about potentially identifying the crawler to admins would who would like to know when/where it lands on their properties. We also respect robots.txt too, so if you don’t want Brave Search crawling your site, it won’t.

This is a goldmine of facts:

They have their own crawler, which may be referring to a centralized one or the distributed browser-based Web Discovery Project.
This crawler doesn’t identify itself as a crawler, but it somehow obeys the Robots Exclusion Protocol (in the form of the robots.txt file). How can a website operator write a robots exclusion directive if the browser does not identify itself? Which user agent token (as it’s called) would be used in the robots.txt file to specify directives specific for Brave’s crawler? I have not been able to find any documentation from Brave.
What they’re calling discrimination is actually how publishers would control crawling. The Robots Exclusion Protocol is a mechanism for publishers to discriminate between what users and crawlers are allowed to access, and discriminate between different crawlers (for example, allow Bingbot to crawl but not Googlebot). By claiming they want to avoid discrimination, Brave is actually saying that they get to decide what they crawl and index, not the publisher.

Going back to the Fetch API: By default, the Fetch API uses the browser’s user-agent string. We already know that the Brave browser does not identify itself with a unique user-agent header, using, instead, the generic user-agent string produced by the underlying browser engine.

The user-agent string can be customized, for the browser in general and the Fetch API, yet I have not found any indication that Brave does that (and indeed, the Reddit reply cited above explicitly says there is no unique identifier).
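To see why the missing identifier matters, consider what a server can do with the user-agent header alone. The sketch below is a deliberately naive check, written for illustration only: the function name and token list are hypothetical, the Googlebot string follows the format Google publishes, and the browser string is a typical Chromium-style example (the version number is arbitrary). I am not claiming it is the exact string Brave sends.

```python
# Tokens that self-identifying crawlers include in their user-agent header.
KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot")

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# A generic Chromium-style browser string, used here as a stand-in for any
# request that carries no crawler-specific token.
GENERIC_BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
)


def crawler_token(user_agent):
    """Return the crawler token found in the header, or None if there is none."""
    lowered = user_agent.lower()
    for token in KNOWN_CRAWLER_TOKENS:
        if token in lowered:
            return token
    return None


print(crawler_token(GOOGLEBOT_UA))        # googlebot
print(crawler_token(GENERIC_BROWSER_UA))  # None
```

A request that returns `None` here is indistinguishable, by header alone, from a person reading the page. That is exactly the position a website is in when a fetch arrives with the browser engine’s generic user-agent string.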

Furthermore, Brave goes on to sell the scraped data specifically for AI training, not just as search results (for example, to power a site search feature).

Visiting the Brave Search API homepage shows several price tiers, including some called “Data for AI”. These data plans include options for “Data with storage rights” that allow the subscriber to “Cache/store data to train AI models”, with the data including “Extra alternate snippets for AI” and with “Rights to use data for AI inference.”

In summary, based on Brave’s public statements and lack of documentation, Brave crawls the web in a stealthy way, without an obvious way to control or block it, and goes on to resell the crawled content for AI training.

Or to rephrase this more bluntly, Brave has appointed itself as a for-profit distributor of copyrighted content without license or permission from website publishers.

Is this acceptable? I see it as a sleazy scraper as a service.

Google’s Publisher Controls initiative

There may be a new type of web crawler coming soon, one specifically for generative AI.

It appears that Google has recognized the incompatibility discussed above, that using the content Googlebot fetched for web search may not be suitable for training AI models.

Google has announced they want to start a community discussion to create AI Web Publisher Controls (hey, Google, I signed up, let me in please!). I wholeheartedly support having this conversation, and well done Google for opening the door to it.

As we’re in the early days, it’s important to flag that the defaults and capabilities of such controls will be critical to their success or failure. I suspect many publishers and authors will have strong opinions that we need to hear about how these AI controls should work.

What about open-source LLMs?

An important aspect of the argument above is the economic exchange. But what if the organization behind the language model releases the model freely without benefit to itself?

There are many such open-source models, and they are trained on datasets that substantially overlap the datasets used to train commercial proprietary models. Many open-source models are good enough for some use cases right now, and they’re only getting better.

Still: Is it right that a website’s content is used without permission to train an open-source LLM?

That’s possibly a trickier question, and I think the answer currently rests on what the Robots Exclusion Protocol allows. It is possible that a better answer emerges in the form of a well-designed approach from Google’s AI Web Publisher Controls or some other similar initiative.

Watch this space.

So what can a publisher do now?

This current situation is one which many publishers neither want nor accept. What can they do?

Here we need to go back to old-school crawler/bot blocking. There are generally two types of crawlers:

Crawlers that identify themselves. They may or may not obey the Robots Exclusion Protocol, but at least the server has an identifier to check to decide whether to block the request or not. Examples include Googlebot and Bingbot.
Stealth crawlers, which are not used for polite search engines. They don’t identify themselves and/or do not obey the Robots Exclusion Protocol. Examples are any script kiddie’s spam scraper or Brave Search’s crawler.

There are two complementary things you can do:

1. If the crawler obeys the Robots Exclusion Protocol, you can block it if you think the content it crawls feeds into AI training data. There are two approaches here:
   a. Block all crawlers and allow only the ones you want to allow for your needs (like Googlebot and Bingbot). This is dangerous for a website’s performance in organic search. You need to be extremely careful with it, but it is effective for these crawlers.
   b. Allow all crawling and block the ones you want to block. This more permissive approach is less dangerous, but of course your content may be scraped by AI or other crawlers you may not want.
2. Use a server-side stealth bot detector, and use it to block such crawlers. Many products can do this. If you’re using a content distribution network (CDN) like many publishers do, it’s likely this kind of functionality is available through that (e.g. Akamai, Cloudflare, Fastly).
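The two robots.txt approaches above can be sketched as two small files, checked here with Python’s standard-library parser. ExampleAIBot and UnknownScraper are hypothetical names used for illustration, and the helper function is mine. This only models crawlers that obey the Protocol; a stealth crawler ignores the file entirely, which is why the server-side detector is the complementary step.

```python
from urllib.robotparser import RobotFileParser

# Restrictive approach: name the crawlers you want, block everyone else.
ALLOWLIST = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

# Permissive approach: block named crawlers, allow everyone else.
BLOCKLIST = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, url="https://example.com/article"):
    """Return True if a Protocol-obeying crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for name in ("Googlebot", "Bingbot", "ExampleAIBot", "UnknownScraper"):
    print(name, allowed(ALLOWLIST, name), allowed(BLOCKLIST, name))
```

Under the restrictive file, only Googlebot and Bingbot get through, and any new or unknown crawler is blocked without further edits. Under the permissive file, an unknown crawler is allowed until you learn its token and add a rule, which is the trade-off described above.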

The approach I’m starting to take with the websites I operate, and discuss with clients, is a combination of options (1a) and (2), namely to use a restrictive robots.txt file along with CDN controls.

This may not be the best approach for each publisher, but I think it’s worth seriously considering.

What does all this mean?

We’re living through a period that will go down as one of the most influential in history. People are literally predicting the doom of humanity from AI. We all have a part to play in shaping the future.

For our part as creators of original content, we need to think about how to respond, and keep up and adapt to this fast-moving part of the industry. Deciding how the content we author gets created, distributed, and consumed is now a complicated mix of strategy, technology, finances, ethics and more.

However you respond, you are taking a stance at a historic moment. I feel your burden.

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.
