Hacker News
Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?
28 points by hnroo99 on May 21, 2024 | 45 comments
The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?


Google scrapes like a maniac. And for profit. Many others do the same.

A website can put up a TOS prohibiting such use, but my understanding is that it's essentially unenforceable if the site is publicly accessible.

The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...

If you’re trying to prevent scraping of your data, your best option is to not make it public.


I have a website that gets around 5k Bing visits every single day. Bing is basically my most expensive user, compared to Google with about 70 visits daily.

I've randomly blocked their IPs, tried some stuff with robots.txt, and even banned it completely in the past because I thought it must be something else. It would just show up with new IPs and proceed.

The few times I checked, it looked like official IPs. If I knew how, I would sue Microsoft. They have no business scraping my website 3-5 times a day when they send me basically no traffic.

Edit:// it's also not my only website where Bing goes crazy. And it's not new; this has been going on for several years now (so not AI scraping, I guess)
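For what it's worth, Bing's crawler is one of the bots that documents support for the Crawl-delay directive in robots.txt, which is a gentler lever than IP blocking since it throttles the official crawler rather than trying to chase its IPs. A sketch (the 10-second value is just an example):

```
User-agent: bingbot
Crawl-delay: 10
```

This only helps against the real, well-behaved bingbot, of course; anything ignoring robots.txt won't care.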


> The few times I checked it looked like official IPs

Considering Microsoft now runs a cloud service, it may very well be their cloud provider users and not official Bing scrapers.
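One way to tell the two apart: Microsoft recommends verifying Bingbot via reverse DNS rather than trusting the user agent or eyeballing IP ranges. Resolve the IP to a hostname, check it's under search.msn.com, then forward-resolve to confirm it maps back. A rough sketch (function name is mine):

```python
import socket

def is_official_bingbot(ip: str) -> bool:
    """Check whether an IP really belongs to Bingbot via reverse DNS."""
    try:
        # 1. Reverse-resolve the IP to a hostname.
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    # 2. Official Bingbot hostnames live under search.msn.com.
    if not host.endswith(".search.msn.com"):
        return False
    try:
        # 3. Forward-resolve the hostname and confirm it maps back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A cloud VM spoofing Bingbot's user agent fails step 2, since its reverse DNS won't land under search.msn.com.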


They absolutely should be paying you if they're single-handedly abusing your resources


There is a YouTube video about a content creator who got a huge AWS bill from Bytespider (ByteDance's scraper) indexing his site an incredible number of times.


Or maybe just don't host your stuff in such a way that any internet user can bankrupt you just from sending GET requests?


If GET requests increase my AWS bill, does that make them not idempotent any more?


This sounds like a scoping issue


This, so much. While it's annoying to waste resources for nothing, all I really had to do to make this a non-issue was upgrade my VPS to a slightly larger box.

If I were paying per request, I couldn't sleep anymore


On the other hand, the amount of traffic he's mentioning is about 3.4 requests per minute, which is effectively background noise. They're probably getting many more hits from actually malicious scanners than from Bing's misbehaving spider.


Usually it's robots.txt that 'prohibits' such use, but you're right it's not enforceable.


If you can paste the URL in a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a human assistant to do it and email you the result.


The first distinction I can think of is, “Who has agreed to the terms of service of the site being visited?”


If I visit a website and there is a tiny link to the terms of service on the bottom of it, there is no reasonable interpretation that I have ever agreed to them.


Because a stupid terms-of-service link being on a site doesn't mean anyone agreed to the terms; they don't actually hold up in a court of law. Now, if you sign up for an account and agree, then MAYBE they can be enforced.

So no, if it's on the internet and it's publicly viewable, I don't see why a bot like ChatGPT should somehow be blind to a site that a human can see, lol. Hell, Microsoft made their new AI system see your screen. Do you also want the AIs to somehow black out the screen area that has the website open and... know there's a TOS somewhere on the page?


The CEO of the scraping company. Are we good?


These terms are legally void


That's fair. Given that you can't do this programmatically with their API (it disallows scraping prompts), it feels less prone to abuse. And even if a bad actor tried to leverage their web UI instead of their official API to get around prompt limitations, they could easily just be banned.


I've encountered a couple of robots.txt that specifically block popular llms for certain areas. Example:

https://www.sigmaaldrich.com/robots.txt


My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.



Hey, that's great!

But wait, we already had a working mechanism to signal exactly this type of opting out[1] so let me rephrase the OP question: why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

It certainly does seem as if they're trying to position themselves as a new standard against which content owners have to actively opt out, thus disregarding the already existing opt-out signals. But that would mean that they don't actually care about privacy, and their opt-out signal is disingenuous! That can't be right, can it?? Surely everything they do is in good faith, just like every other corporation ever!

Anyway, the fact that they disregarded existing privacy standards and rolled out their own definitely gives me a lot of confidence that they will forever follow the privacy standards they themselves created!

Now excuse me, but I have to go get treatment for terminally metastasized sarcasm.

[1] https://en.m.wikipedia.org/wiki/Robots.txt


I'm... pretty sure OpenAI respects robots.txt, as explained in the link GP shared?


Whether it respects robots.txt is irrelevant if its existence is secret for the entire time it's doing the scraping.


I'm sorry, but I'm confused by this comment. What exactly is secret?


The several years that they were scraping the web to build their models and they weren't telling anyone about it.


I don't think that's how robots.txt or scraping really works. Do you expect scrapers to announce every bot they run? Do you expect webmasters to add a robots rule for every bot?

If someone doesn't want OpenAI or anyone else scraping their site, it doesn't matter whether the scraper announces itself, as long as it respects robots.txt and you have rules to catch unannounced scrapers.
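As for "rules to catch unannounced scrapers": since a misbehaving bot won't identify itself honestly, per-IP rate limiting is the usual catch-all. A minimal sketch, assuming nginx (the zone name and rates are examples):

```nginx
# In the http {} context: track clients by IP, allow at most 5 req/s each.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # Allow a burst of 20 requests, then return 503 to the excess.
        limit_req zone=perip burst=20 nodelay;
    }
}
```

This throttles any aggressive client regardless of what user agent it claims, at the cost of occasionally slowing legitimate heavy users behind shared IPs.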


What I'm saying is that it doesn't matter if you disallow them access now, because they've already gotten everything they want, whether you wanted it or not.

The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

When the little guy does it, that's called piracy and theft. When billion dollar corporate entities do it, it's called a technological marvel.


> The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

This doesn't seem accurate at all. Plenty of businesses are built on scraping data; see: Google.

> The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

I think the questions of fair use might keep us busy for hours.

> There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

I think a more fair comparison would be if someone used my website as reference/inspiration/etc when writing a book.


I'll accept my reading comprehension mistake if you can quote the passages you're referring to, but I don't see what you're talking about in the GP link


https://platform.openai.com/docs/gptbot

  Disallowing GPTBot

  To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt:

  User-agent: GPTBot
  Disallow: /


I stand corrected. Thank you.


> why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

1) because those mechanisms are not law

2) because you too can ignore robots.txt


Am I missing something?

I thought it was obvious that Microsoft is clearly about to establish the next "standard" with near-Windows levels of ubiquity. It will end up being our primary starting point for using Microsoft stuff - we won't open apps, Copilot will.

Actions speak louder than words tho - look at how obvious they are being

Copilot is included with Windows, they added a button for it to all keyboards made from here on out, built it into Edge and Office, made it a standalone app and part of their search engine, and now their Xbox games' NPCs will be AI-powered, prolly open to all their Game Pass studios.

If it goes the way I expect, Microsoft will be essentially done positioning themselves for a world we talk to and expect to listen to us - and to organize, track and recall anything I talk to it about. Perfect for the smart glasses we're all about to buy

Tbh, I think this will be the end of computing as we conceive it now - just not for the reason I expected originally.

Folders, for example - I think Copilot will end folders and all the file-organization stuff for normal users. I shouldn't need to ever know where that stuff is on my PC after a future date, or manage it in any way.

Instead we'll have "real-time" folders, created from our own saved content, assembled to our inquiry and according to our preferences all named, topic labeled, and dated - but not by us.

Stored and retrieved by AI - a lot like human memory, actually.

Because we'd then NEED Copilot just to access our stuff - I think that is most definitely coming sooner rather than later


Because robots.txt is a standard people can choose to follow; it's not a law


Scraping and violating TOS are not illegal to do, but they can get you blocked.


"Not illegal" requires a jurisdiction reference.


Nope, using a web browser is not illegal. If you don't want your website to be accessed, don't put it on the internet.


Which jurisdiction are you referring to? The internet is not above the law.


Not illegal in the US!


I believe this is current precedent around scraping:

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Terms of service enforcement is a matter of civil law.

Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.


Preventing scraping also entrenches Google for eternity.


The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.


Why? It's another user agent. Curl does the same thing, as do Chrome and Firefox



