Hacker News
Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?
28 points by hnroo99 on May 21, 2024 | 45 comments
The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?


Google scrapes like a maniac. And for profit. Many others do the same.

A website can put up a TOS prohibiting such use, but my understanding is that it's essentially unenforceable if the site is publicly accessible.

The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...

If you’re trying to prevent scraping of your data, your best option is to not make it public.


I have a website that gets around 5k Bing visits every single day. Bing is basically my most expensive user, compared to Google with about 70 visits daily.

I've randomly blocked their IPs, tried some stuff with robots.txt, and even banned it completely in the past because I thought it must be something else. It would just show up with new IPs and proceed.

The few times I checked, it looked like official IPs. If I knew how, I would sue Microsoft. They have no business scraping my website 3-5 times a day when they send me basically no traffic.

Edit:// it's also not my only website where Bing goes crazy. And it's not new; this has been going on for several years now (so not AI scraping, I guess)
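For what it's worth, Bing's crawler is one of the bots that documents support for the Crawl-delay directive in robots.txt, which is a gentler lever than IP blocking since it throttles the official crawler rather than trying to chase its IPs. A sketch (the 10-second value is just an example):

```
User-agent: bingbot
Crawl-delay: 10
```

This only helps against the real, well-behaved bingbot, of course; anything ignoring robots.txt won't care.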


> The few times I checked it looked like official IPs

Considering Microsoft now runs a cloud service, it may very well be their cloud provider users and not official Bing scrapers.
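One way to tell the two apart: Microsoft recommends verifying Bingbot via reverse DNS rather than trusting the user agent or eyeballing IP ranges. Resolve the IP to a hostname, check it's under search.msn.com, then forward-resolve to confirm it maps back. A rough sketch (function name is mine):

```python
import socket

def is_official_bingbot(ip: str) -> bool:
    """Check whether an IP really belongs to Bingbot via reverse DNS."""
    try:
        # 1. Reverse-resolve the IP to a hostname.
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    # 2. Official Bingbot hostnames live under search.msn.com.
    if not host.endswith(".search.msn.com"):
        return False
    try:
        # 3. Forward-resolve the hostname and confirm it maps back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A cloud VM spoofing Bingbot's user agent fails step 2, since its reverse DNS won't land under search.msn.com.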


They absolutely should be paying you if they're single-handedly abusing your resources


There is a YouTube video about a content creator who got a huge AWS bill from Bytespider (ByteDance's scraper) indexing his site an incredible number of times.


Or maybe just don't host your stuff in such a way that any internet user can bankrupt you just from sending GET requests?


If GET requests increase my AWS bill, does that make them not idempotent any more?


This sounds like a scoping issue


This, so much. While it's annoying to waste resources for nothing, all I really had to do to make this a non-issue was upgrade my VPS to a slightly larger box.

If I were paying per request, I couldn't sleep anymore


On the other hand, the amount of traffic he's mentioning is about 3.4 requests per minute, which is effectively background noise. They're probably getting many more hits from actually malicious scanners than from Bing's misbehaving spider.


Usually it's robots.txt that 'prohibits' such use, but you're right it's not enforceable.


If you can paste the URL in a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a human assistant to do it and email you the result.


The first distinction I can think of is, “Who has agreed to the terms of service of the site being visited?”


If I visit a website and there is a tiny link to the terms of service on the bottom of it, there is no reasonable interpretation that I have ever agreed to them.


Because a stupid terms-of-service link being on a site doesn't mean anyone agreed to the terms; they don't actually hold up in a court of law. Now, if you sign up for an account and agree, then MAYBE they can be enforced.

So no, if it's on the internet and it's publicly viewable, I don't see why a bot like ChatGPT should somehow be blind to a site that a human can see, lol. Hell, Microsoft made their new AI system see your screen. Do you also want the AIs to somehow black out the screen area that has the website open and... know there's a TOS somewhere on the page?


The CEO of the scraping company. Are we good?


These terms are legally void


That's fair. Given that you can't do this programmatically with their API (it disallows scraping prompts), it feels less prone to abuse. And even if a bad actor tried to leverage their web UI instead of their official API to get around prompt limitations, they could easily just be banned.


I've encountered a couple of robots.txt that specifically block popular llms for certain areas. Example:

https://www.sigmaaldrich.com/robots.txt


My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.



Hey, that's great!

But wait, we already had a working mechanism to signal exactly this type of opting out[1] so let me rephrase the OP question: why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

It certainly does seem as if they're trying to position themselves as a new standard against which content owners have to actively opt out, thus disregarding the already existing opt-out signals. But that would mean that they don't actually care about privacy, and their opt-out signal is disingenuous! That can't be right, can it?? Surely everything they do is in good faith, just like every other corporation ever!

Anyway, the fact that they disregarded existing privacy standards and rolled out their own definitely gives me a lot of confidence that they will forever follow the privacy standards they themselves created!

Now excuse me, but I have to go get treatment for terminally metastasized sarcasm.

[1] https://en.m.wikipedia.org/wiki/Robots.txt


I'm... pretty sure OpenAI respects robots.txt, as explained in the link GP shared?


Whether it respects robots.txt is irrelevant if its existence is secret for the entire time it's doing the scraping.


I'm sorry, but I'm confused by this comment. What exactly is secret?


The several years that they were scraping the web to build their models and they weren't telling anyone about it.


I don't think that's how robots.txt or scraping really works. Do you expect scrapers to announce every bot they run? Do you expect webmasters to add a robots rule for every bot?

If someone doesn't want OpenAI or anyone else scraping their site, it doesn't matter whether the scraper announces itself, as long as it respects robots.txt and you have rules to catch unannounced scrapers.
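As for "rules to catch unannounced scrapers": since a misbehaving bot won't identify itself honestly, per-IP rate limiting is the usual catch-all. A minimal sketch, assuming nginx (the zone name and rates are examples):

```nginx
# In the http {} context: track clients by IP, allow at most 5 req/s each.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # Allow a burst of 20 requests, then return 503 to the excess.
        limit_req zone=perip burst=20 nodelay;
    }
}
```

This throttles any aggressive client regardless of what user agent it claims, at the cost of occasionally slowing legitimate heavy users behind shared IPs.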


What I'm saying is that it doesn't matter if you disallow them access now, because they've already gotten everything they want, whether you wanted it or not.

The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

When the little guy does it, that's called piracy and theft. When billion dollar corporate entities do it, it's called a technological marvel.


> The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

This doesn't seem accurate at all. Plenty of businesses are built on scraping data; see: Google.

> The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

I think the questions of fair use might keep us busy for hours.

> There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

I think a more fair comparison would be if someone used my website as reference/inspiration/etc when writing a book.


I'll accept my reading comprehension mistake if you can quote the passages you're referring to, but I don't see what you're talking about in the GP link


https://platform.openai.com/docs/gptbot

  Disallowing GPTBot

  To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt:

  User-agent: GPTBot
  Disallow: /


I stand corrected. Thank you.


> why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

1) because those mechanisms are not law

2) because you too can ignore robots.txt


Am I missing something?

I thought it was obvious that Microsoft is clearly about to establish the next "standard" with near-Windows levels of ubiquity. It will end up being our primary starting point for using Microsoft stuff - we won't open apps, Copilot will.

Actions speak louder than words tho - look at how obvious they are being

Copilot is included with Windows, they added a button for it to all keyboards made from here on out, built it into Edge and Office, made it a standalone app and part of their search engine, and now their Xbox games' NPCs will be AI-powered, prolly open to all their Game Pass studios.

If it goes the way I expect, Microsoft will be essentially done positioning themselves for a world we talk to and expect to listen to us - and to organize, track and recall anything I talk to it about. Perfect for the smart glasses we're all about to buy

Tbh, I think this will be the end of computing as we conceive it now - just not for the reason I expected originally.

Folders, for example - I think Copilot will end folders and all the file-organization stuff for normal users. I shouldn't need to ever know where that stuff is on my PC after a future date, or manage it in any way.

Instead we'll have "real-time" folders, created from our own saved content, assembled to our inquiry and according to our preferences all named, topic labeled, and dated - but not by us.

Stored and retrieved by AI - a lot like human memory, actually.

Because we'd then NEED Copilot just to access our stuff - I think that is most definitely coming sooner rather than later


Because robots.txt is a standard people can choose to follow; it's not a law


Scraping and violating TOS are not illegal to do, but they can get you blocked.


"Not illegal" requires a jurisdiction reference.


Nope, using a web browser is not illegal. If you don't want your website to be accessed, don't put it on the internet.


Which jurisdiction are you referring to? The internet is not above the law.


Not illegal in the US!


I believe this is current precedent around scraping:

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Terms of service enforcement is a matter of civil law.

Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.


Preventing scraping also entrenches Google for eternity.


The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.


Why? It's another user agent. Curl does the same thing, as do Chrome and Firefox



