One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent...

MetaWhirledPeas · on Dec 30, 2024

> Cloudflare also has a feature to block known AI bots and even suspected AI bots

In addition to other crushing internet risks, add wrongly blacklisted as a bot to the list.

kmeisthax · on Dec 31, 2024

This is already a thing for basically all of the second[0] and third worlds. A non-trivial amount of Cloudflare's security value is plausible algorithmic discrimination and collective punishment as a service.

[0] Previously Soviet-aligned countries; i.e. Russia and eastern Europe.

shark_laser · on Jan 1, 2025

Yep. Same for most of Asia too.

Cloudflare's filters are basically straight up racist.

I have stopped using so many sites due to their use of Cloudflare.

brianwawok · on Jan 2, 2025

If 90% of your problem users come from 1-2 countries, seems pretty sensible to block that country. I know I have 0 paying users in those countries, so why deal with it? Let them go fight it out doing bot wars in local sites

lazide · on Jan 2, 2025

Keep in mind, this is literally why stereotypes and racism exists. It’s the exact same process/reasoning.

qeternity · on Jan 2, 2025

No, racism would be “I won’t deal with customers of Chinese ethnicity irrespective of their country of operation”.

Blocking Chinese (or whatever) IPs because they are responsible for a huge amount of malicious behavior is not racist.

Frankly I don’t care what the race of the Chinese IP threat actor is.

lazide · on Jan 2, 2025

You really might want to re-read my comment.

lazide · on Jan 2, 2025

Well, not racist per-se - if you visit the countries (regardless of race) you’re screwed too.

Geo-location-ist?

ls612 · on Dec 31, 2024

People hate collective punishment because it works so well.

eckesicle · on Dec 31, 2024

Anecdatally, by default, we now block all Chinese and Russian IPs across our servers.

After doing so, all of our logs, like ssh auth etc, are almost completely free and empty of malicious traffic. It’s actually shocking how well a blanket ban worked for us.

citrin_ru · on Jan 1, 2025

Being slightly annoyed by noise in SSH logs I’ve blocked APNIC IPs and now see a comparable number of brute force attempts from ARIN IPs (mostly US ones). Geo blocks are totally ineffective against TAs which use a global network of proxies.

macintux · on Jan 1, 2025

~20 years ago I worked for a small IT/hosting firm, and the vast majority of our hostile traffic came from APNIC addresses. I seriously considered blocking all of it, but I don’t think I ever pulled the trigger.

TacticalCoder · on Jan 1, 2025

> Anecdatally, by default, we now block all Chinese and Russian IPs across our servers.

This. Just get several countries' entire IP address space and block these. I've posted I was doing just that only to be told that this wasn't in the "spirit" of the Internet or whatever similar nonsense.

In addition to that only allow SSH in from the few countries / ISPs legit trafic shall legitimately be coming from. This quiets the logs, saves bandwidth, saves resources, saves the planet.

xp84 · on Jan 2, 2025

I agree with your approach. It’s easy to empathize with innocent people in say, Russia, blocked from a site which has useful information to them. However the thing these “spirit/openness” people miss is that many sites have a narrow purpose which makes no sense to open it up to people across the world. For instance, local government. Nobody in India or Russia needs to see the minutes from some US city council meeting, or get building permit information. Likewise with e-commerce. If I sell chocolate bars and ship to US and Canada, why wouldn’t I turn off all access from overseas? You might say “oh, but what if some friend in $COUNTRY wants to order a treat for someone here?” And the response to that is always “the hypothetical loss from that is minuscule compared to the cost of serving tons of bot traffic as well as possible exploits those bots might do.

(Yes, yes, VPNs and proxies exist and can be used by both good and bad actors to evade this strategy, and those are another set of IPs widely banned for the same reason. It’s a cat and mouse game but you can’t argue with the results)

dgfitz · on Jan 1, 2025

[flagged]

brianwawok · on Jan 2, 2025

That is not at all the reason for the great firewall.

saagarjha · on Jan 1, 2025

Putting everyone in jail also works well to prevent crime.

singleshot_ · on Jan 2, 2025

Having a door with a lock on it prevents other people from committing crime in my house. This metaphor has the added benefit of making some amount of sense in context.

panic · on Jan 1, 2025

Works how? Are these blocks leading to progress toward solving any of the underlying issues?

forgetfreeman · on Jan 1, 2025

It's unclear that there are actors below the regional-conglomerate-of-nation-states level that could credibly resolve the underlying issues, and given legislation and enforcement regimes sterling track record of resolving technological problems realistically it seems questionable that solutions could exist in practice. Anyway this kind of stuff is well outside the bounds of what a single org hosting an online forum could credibly address. Pragmatism uber alles.

victorbjorklund · on Jan 2, 2025

The underlying issue is that countries like russia support abuse like this. So by blocking them perhaps the people there will demand that their govt stops supporting crimes and absuse so that they can be allowed back into the internet.

(In the case of russians though i guess they will never change)

petre · on Jan 2, 2025

> people there will demand that their govt stops supporting crimes and absuse so that they can be allowed back into the internet

Sure. It doesn't work that way, not in Russia or China. First they have to revert back to 1999 when Putin took over. Then they have to extradite criminals and crack down on cybercrime. Then maybe they could be allowed back onto the open Internet.

In my country one would be exradited to the US in no time. In fact the USSS came over for a guy who had been laundering money through BTC from a nearby office. Not a month passed and he got extradited to the US, never to be heard from again.

anonym29 · on Jan 1, 2025

Innocent people hate being punished for the behavior of other people, whom the innocent people have no control over.*

FTFY.

zdragnar · on Jan 1, 2025

The phrase "this is why we can't have nice things" springs to mind. Other people are the number one cause of most people's problems.

thwarted · on Jan 1, 2025

Tragedy of the Commons Ruins Everything Around Me.

grishka · on Jan 1, 2025

I have a growing Mastodon thread of this shit: https://mastodon.social/@grishka/111934602844613193

It's of course trivially bypassable with a VPN, but getting a 403 for an innocent get request of a public resource makes me angry every time nonetheless.

neop1x · on Jan 4, 2025

Exactly. I have to use a VPN just for this kind of bu**it. :/

QuadmasterXLII · on Jan 1, 2025

The difference between politics and diplomacy is that you can survive in politics without resorting to collective punishment.

d0mine · on Jan 1, 2025

unrelated: USSR might have been 2nd world. Russia is 3rd world (since 1991) -- banana republic

crote · on Jan 2, 2025

No, Russia is by definition the 2nd world. It's about spheres of influence, not any kind of economic status. The First World is the Western Bloc centered around the US, the Second World is the Eastern Bloc centered around then-USSR and now-Russia (although these days more centered on China), the Third World is everyone else.

d0mine · on Jan 2, 2025

By which definition? Here’s the first result in google: “The term "second world" was initially used to refer to the Soviet Union and countries of the communist bloc. It has subsequently been revised to refer to nations that fall between first and third world countries in terms of their development status and economic indicators.” https://www.investopedia.com/terms/s/second-world.asp#:~:tex....

Notice the word economic in it.

throwaway290 · on Dec 30, 2024

What do you mean crushing risk? Just solve these 12 puzzles by moving tiny icons on tiny canvas while on the phone and you are in the clear for a couple more hours!

homebrewer · on Dec 30, 2024

If you live in a region which it is economically acceptable to ignore the existence of (I do), you sometimes get blocked by website r̶a̶c̶k̶e̶t̶ protection for no reason at all, simply because some "AI" model saw a request coming from an unusual place.

benhurmarcel · on Dec 30, 2024

Sometimes it doesn’t even give you a Captcha.

I have come across some websites that block me using Cloudflare with no way of solving it. I’m not sure why, I’m in a large first-world country, I tried a stock iPhone and a stock Windows PC, no VPN or anything.

That’s just no way to know.

dannyw · on Dec 31, 2024

That’s probably a page/site rule set by the website owner. Some sites block EU IPs as the costs of complying with GDPR outweigh the gain.

throwaway290 · on Dec 31, 2024

I saw GDPR related blockage like literally twice in a few years and I connect from EU IP almost all the time

Overload of captcha is not about GDPR...

but the issue is strange. @benhurmarcel I would check if there is somebody or some company nearby abusing stuff and you got under the hammer. Maybe unscrupulous VPN company. Using a good VPN can in fact make things better (but will cost money) or if you have a place to put your own all the better. otherwise check if you can change your IP with provider or change providers or move I guess...

not to excuse CF racket but as this thread shows the data hungry artificial stupidity leaves no choice to some sites

EVa5I7bHFq9mnYK · on Jan 1, 2025

I found it's best to use VPSes from young and little known hosting companies, as their IP is not yet on the blacklists.

benhurmarcel · on Dec 31, 2024

Does it work only based on the IP?

I also tried from a mobile 4G connection, it’s the same.

throwaway290 · on Jan 1, 2025

This may be too paranoid, but if your mobile IP is persistent and phone was compromised and is serving as a proxy for bots then it could explain why your IP fell out of favor

EVa5I7bHFq9mnYK · on Jan 1, 2025

You don't get your own external IP with the phone, it's shared, like NAT.

scarface_74 · on Jan 1, 2025

I get a different IPv4 and IPv6 address every time I toggle airplane mode on and off.

lazide · on Jan 2, 2025

Externally routable IPv4, or just a different between-a-cgnat address?

scarface_74 · on Jan 2, 2025

Externally routable IPv4 as seen by whatismyip.com.

throwaway290 · on Jan 1, 2025

Depends on provider/plan

benhurmarcel · on Dec 31, 2024

One of the affected websites is a local cafe in the EU. It doesn’t make any sense to block EU IPs.

gs17 · on Dec 30, 2024

If it clears you at all. I accidentally set a user agent switcher on for every site instead of the one I needed it for, and Cloudflare would give me an infinite loop of challenges. At least turning it off let me use the Internet again.

JohnMakin · on Dec 30, 2024

These features are opt-in and often paid features. I struggle to see how this is a "crushing risk," although I don't doubt that sufficiently unskilled shops would be completely crushed by an IP/userAgent block. Since Cloudflare has a much more informed and broader view of internet traffic than maybe any other company in the world, I'll probably use that feature without any qualms at some point in the future. Right now their normal WAF rules do a pretty good job of not blocking legitimate traffic, at least on enterprise.

MetaWhirledPeas · on Dec 30, 2024

The risk is not to the company using Cloudflare; the risk is to any legitimate individual who Cloudflare decides is a bot. Hopefully their detection is accurate because a false positive would cause great difficulties for the individual.

neilv · on Dec 31, 2024

For months, my Firefox was locked out of gitlab.com and some other sites I wanted to use, because CloudFlare didn't like my browser.

Lesson learned: even when you contact the sales dept. of multiple companies, they just don't/can't care about random individuals.

Even if they did care, a company successfully doing an extended three-way back-and-forth troubleshooting with CloudFlare, over one random individual, seems unlikely.

CalRobert · on Jan 1, 2025

We’re rapidly approaching a login-only internet. If you’re not logged in with google on chrome then no website for you!

Attestation/wei enable this

neop1x · on Jan 4, 2025

And not just a login but soon probably also the real verified identity tied to it. The internet is becoming a worse place than the real world.

bodantogat · on Dec 30, 2024

I see a lot of traffic I can tell are bots based on the URL patterns they access. They do not include the "bot" user agent, and often use residential IP pools. I haven't found an easy way to block them. They nearly took out my site a few days ago too.

echelon · on Dec 30, 2024

You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.

marcusb · on Dec 30, 2024

Self plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple markov generator to obfuscate content (static-site friendly, no server-side dynamic generation required) and an optional link-maze to send incorrigible bots to 100% markov-generated non-sense (requires a server-side component.)

gagik_co · on Dec 30, 2024

This is cool! It'd have been funny for this to become mainstream somehow and mess with LLM progression. I guess that's already happening with all the online AI slop that is being re-fed into its training.

gs17 · on Dec 30, 2024

I tested it on your site and I'm curious, is there a reason why the link-maze links are all gibberish (as in "oNvUcPo8dqUyHbr")? I would have had links be randomly inserted in the generated text going to "[random-text].html" so they look a bit more "real".

marcusb · on Dec 30, 2024

Its unfinished. At the moment, the links are randomly generated because that was an easy way to get a bunch of unique links. Sooner or later, I’ll just get a few tokens from the markov generator and use those for the link names.

I’d also like to add image obfuscation on the static generator side - as it stands now, anything other than text or html gets passed through unchanged.

tivert · on Dec 30, 2024

> You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

> Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

I agree, and not just to discourage them running up traffic bills. The end-state of what they hope to build is very likely to be extremely for most regular people [1], so we shouldn't cooperate in building it.

[1] And I mean end state. I don't care how much value you say you get from some AI coding assistant today, the end state is your employer happily gets to fire you and replace you with an evolved version of the assistant at a fraction of your salary. The goal is to eliminate the cost that is our livelihoods. And if we're lucky, in exchange we'll get a much reduced basic income sufficient to count the rest of our days from a dense housing project filled with cheap minimum-quality goods and a machine to talk to if we're sad.

danlugo92 · on Jan 2, 2025

If your employer can run their companies without employees in the future it also means you can have your own company with no employees.

If anything this will level the playing field, and creativity will prevail.

tivert · on Jan 2, 2025

> If your employer can run their companies without employees in the future it also means you can have your own company with no employees.

No, you still need money. Lots of money.

> If anything this will level the playing field, and creativity will prevail.

That's a fantasy. The people that already have money will prevail (for the most part).

tyre · on Dec 30, 2024

Their problem is they can’t detect which are bots in the first place. If they could, they’d block them.

echelon · on Dec 30, 2024

Then have the users solve ARC-AGI or whatever nonsense. If the bots want your content, they'll have to solve $3,000 of compute to get it.

Tostino · on Dec 30, 2024

That only works until The benchmark questions and answers are public. Which they necessarily would be in this case.

EVa5I7bHFq9mnYK · on Jan 1, 2025

Or maybe solve a small sha2(sha2()) leading zeroes challenge, taking ~1 second of computer time. Normal users won't notice, and bots will earn you Bitcoins :)

endofreach · on Dec 30, 2024

> Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills

Or just wait for after the AI flood has peaked & most easily scrapable content has been AI generated (or at least modified).

We should seriously start discussing the future of the public web & how to not leave it to big tech before it's too late. It's a small part of something i am working on, but not central. So i haven't spend enough time to have great answers. If anyone reading this seriously cares, i am waiting desperately to exchange thoughts & approaches on this.

jorvi · on Jan 1, 2025

Very tangential but you should check out the old game “Hacker BS Replay”.

It’s basically about how in 2012, with the original internet overrun by spam, porn and malware, all the large corporations and governments got together and created a new, tightly-controlled clean internet. Basically how modern Apple & Disneyland would envision the internet. On this internet you cannot choose your software, host your own homepage or have your own e-mail server. Everyone is linked to a government ID.

We’re not that far off:

- SaaS

- Gmail blocking self-hosted mailservers

- hosting your own site becoming increasingly cumbersome, and before that MySpace and then Meta gobbled up the idea of a home page a la GeoCities.

- Secure Boot (if Microsoft locked it down and Apple locked theirs, we would have been screwed before ARM).

- Government ID-controlled access is already commonplace in Korea and China, where for example gaming is limited per day.

In the Hacker game, as a response to the new corporate internet, hackers started using the infrastructure of the old internet (“old copper lines”) and set something up called the SwitchNet, with bridges to the new internet.

llm_trw · on Dec 30, 2024

You will be burning through thousands of dollars worth of compute to do that.

lazide · on Jan 2, 2025

The biggest issue is at least 80% of internet users won’t be capable of passing the test.

araes · on Jan 3, 2025

Agree. The bots are already significantly better at passing almost every supposed "Are You Human?" test than the actual humans. "Can you find the cars in this image?" Bots are already better. "Can you find the incredibly convoluted text in this color spew?" Bots are already better. Almost every test these days is the same "These don't make me feel especially 'human'. Not even sure what that's an image of. Are there even letters in that image?"

Part of the issue, the humans all behaved the same way previously. Just slower.

All the scraping, and web downloading. Humans have been doing that for a long time. Just slower.

It's the same issue with a lot of society. Mean, hurtful humans, made mean hurtful bots.

Always the same excuses too. Company / researchers make horrible excrement, knowing full well its going harm everybody on the world wide web. Then claim they had no idea. "Thoughts and prayers."

The torture that used to exist on the world wide web of copy-pasta pages and constant content theft, is now just faster copy-pasta pages and content theft.

kmoser · on Dec 31, 2024

My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any URLs in robots.txt. It's not a perfect strategy but it gives me pretty good results given the simplicity to implement.

Capricorn2481 · on Dec 31, 2024

I don't understand this. You don't have routes your users might need in robots.txt? This article is about bots accessing resources that other might use.

IncreasePosts · on Dec 31, 2024

It seems better to put fake honeypot urls in robots.txt, and block any up that accesses those.

trod1234 · on Jan 1, 2025

Blocking will never work.

You need to impose cost. Set up QoS buckets, slow suspect connections down dramatically (almost to the point of timeout).

Capricorn2481 · on Dec 31, 2024

Ah I see

Beijinger · on Jan 1, 2025

How can I implement this?

kmoser · on Jan 1, 2025

Too many ways to list here, and implementation details will depend on your hosting environment and other requirements. But my quick-and-dirty trick involves a single URL which, when visited, runs a script which appends "deny from foo" (where foo is the naughty IP address) to my .htaccess file. The URL in question is not publicly listed, so nobody will accidentally stumble upon it and accidentally ban themselves. It's also specifically disallowed in robots.txt, so in theory it will only be visited by bad bots.

aorth · on Jan 1, 2025

Another related idea: use fail2ban to monitor the server access logs. There is one filter that will ban hosts that request non-existent URLs like WordPress login and other PHP files. If your server is not hosting PHP at all it's an obvious sign that the requests are from bots that are probing maliciously.

acheong08 · on Dec 31, 2024

TLS fingerprinting still beats most of them. For really high compute endpoints I suppose some sort of JavaScript challenge would be necessary. Quite annoying to set up yourself. I hate cloudflare as a visitor but they do make life so much easier for administrators

petre · on Jan 2, 2025

You rate limit them and then block the abusers. Nginx allows rate limiting. You can then block them using fail2ban for an hour if they're rate limited 3 times. If they get blocked 5 times you can block them forever using the recidive jail.

I've had massive AI bot traffic from M$, blocked several IPs by adding manual entries into the recidive jail. If they come back and disregard robots.txt with disallow * I will run 'em through fail2ban.

herbst · on Jan 2, 2025

Whatever M$ was doing still baffles me. I still have several azure ranges in my blocklist because whatever this was appeared to change strategie once I implemented a ban method.

petre · on Jan 2, 2025

They were hammering our closed ticketing system for some reason. I blocked an entire C block and an individual IP. If needed I will not hesitate banning all their ranges, which means we won't get any mail from Azure, M$ office 365, since this is also our mail server. But scew'em, I'll do it anyway until someone notices, since it's clearly abuse.

newsclues · on Dec 30, 2024

The amateurs at home are going to give the big companies what they want: an excuse for government regulation.

throwaway290 · on Dec 30, 2024

If it doesn't say it's a bot and it doesn't come from a corporate IP it doesn't mean it's NOT a bot and not run by some "AI" company.

bodantogat · on Dec 30, 2024

I have no way to verify this, I suspect these are either stealth AI companies or data collectors, who hope to sell training data to them

datadrivenangel · on Dec 30, 2024

I've heard that some mobile SDKs / Apps earn extra revenue by providing an IP address for VPN connections / scraping.

odo1242 · on Jan 2, 2025

Chrome extensions too

int_19h · on Jan 1, 2025

Don't worry, the governments are perfectly capable of coming up with excuses all on their own.

CoastalCoder · on Dec 30, 2024

I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

Would that make subsequent accesses be violations of the U.S.'s Computer Fraud and Abuse Act?

betaby · on Dec 30, 2024

Crashing wasn't the intent. And scraping is legal, as I remember per Linkedin case.

azemetre · on Dec 30, 2024

There’s a fine line between scrapping and DDOS’ing I’m sure.

Just because you manufacture chemicals doesn’t mean you can legally dump your toxic waste anywhere you want (well shouldn’t be allowed to at least).

You also shouldn’t be able to set your crawlers causing sites to fail.

acedTrex · on Dec 30, 2024

intent is likely very important to something like a ddos charge

gameman144 · on Dec 30, 2024

Maybe, but impact can also make a pretty viable case.

For instance, if you own a home you may have an easement on part of your property that grants other cars from your neighborhood access to pass through it rather than going the long way around.

If Amazon were to build a warehouse on one side of the neighborhood, however, it's not obvious that they would be equally legally justified to send their whole fleet back and forth across it every day, even though their intent is certainly not to cause you any discomfort at all.

layer8 · on Dec 30, 2024

So is negligence. Or at least I would hope so.

RF_Savage · on Dec 30, 2024

So have the stressor and stress testing DDoS for hire sites changed to scraping yet?

acedTrex · on Dec 31, 2024

The courts will likely be able to discern between "good faith" scraping and a DDoS for hire masquerading as scraping.

iinnPP · on Dec 30, 2024

Wilful ignorance is generally enough.

herbst · on Jan 2, 2025

It's like these AI companies have to invent scraping spiders again from scratch. I don't know how often I have been ddosed to complete site failure and still ongoing by random scrapers just the last few months.

franga2000 · on Dec 30, 2024

If I make a physical robot and it runs someone over, I'm still liable, even though it was a delivery robot, not a running over people robot.

If a bot sends so many requests that a site completely collapses, the owner is liable, even though it was a scraping bot and not a denial of service bot.

stackghost · on Dec 30, 2024

The law doesn't work by analogy.

maximinus_thrax · on Jan 1, 2025

Except when it does https://en.wikipedia.org/wiki/Analogy_(law)

echelon · on Dec 30, 2024

Then you can feed them deliberately poisoned data.

Send all of your pages through an adversarial LLM to pollute and twist the meaning of the underlying data.

cess11 · on Dec 30, 2024

The scraper bots can remain irrational longer than you can stay solvent.

optimiz3 · on Dec 30, 2024

> I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

Depends how much money you are prepared to spend.

jahewson · on Dec 30, 2024

No, fortunately random hosts on the internet don’t get to write a letter and make something a crime.

throwaway_fai · on Dec 30, 2024

Unless they're a big company in which case they can DMCA anything they want, and they get the benefit of the doubt.

BehindBlueEyes · on Dec 30, 2024

Can you even DMCS takedown crawlers?

throwaway_fai · on Dec 30, 2024

Doubt it, a vanilla cease-and-desist letter would probably be the approach there. I doubt any large AI company would pay attention though, since, even if they're in the wrong, they can outspend almost anyone in court.

Nevermark · on Jan 1, 2025

Small claims court?

viraptor · on Dec 30, 2024

You can also block by IP. Facebook traffic comes from a single ASN and you can kill it all in one go, even before user agent is known. The only thing this potentially affects that I know of is getting the social card for your site.

jandrese · on Dec 30, 2024

If a bot ignores robots.txt that's a paddlin'. Right to the blacklist.

nabla9 · on Dec 30, 2024

The linked article explains what happens when you block their IP.

gs17 · on Dec 30, 2024

For reference:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It's really absurd that they seem to think this is acceptable.

viraptor · on Dec 30, 2024

Block the whole ASN in that case.

therealdrag0 · on Dec 30, 2024

What about adding fake sleeps?

petee · on Dec 30, 2024

Silly question, but did you try to email Meta? Theres an address at the bottom of that page to contact with concerns.

> webmasters@meta.com

I'm not naive enough to think something would definitely come of it, but it could just be a misconfiguration

TuringNYC · on Dec 30, 2024

>> One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

Are they not respecting robots.txt?

eesmith · on Dec 30, 2024

Quoting the top-level link to geraspora.de:

> Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki.

vasco · on Dec 31, 2024

Edit history of a wiki sounds much more interesting than the current snapshot if you want to train a model.

eesmith · on Jan 1, 2025

Does that information improve or worsen the training?

Does it justify the resource demands?

Who pays for those resources and who benefits?

candlemas · on Dec 30, 2024

The biggest offenders for my website have always been from China.

tonyedgecombe · on Jan 2, 2025

[flagged]

lanstin · on Jan 2, 2025

Or invisible text to humans about such topics.

ryandrake · on Dec 31, 2024

> My solution was to add a Cloudflare rule to block requests from their User-Agent.

Surely if you can block their specific User-Agent, you could also redirect their User-Agent to goatse or something. Give em what they deserve.

globalnode · on Dec 31, 2024

cant you just mess with them? like accept the connection but send back rubbish data at like 1 bps?

PeterStuer · on Jan 2, 2025

Most administrators have no idea or no desire to correctly configure Cloudflare, so they just slap it on the whole site by default and block all the legitimate access to e.g. rss feeds.

coldpie · on Dec 30, 2024

Imagine being one of the monsters who works at Facebook and thinking you're not one of the evil ones.

Aeolun · on Dec 30, 2024

Well, Facebook actually releases their models instead of seeking rent off them, so I’m sort of inclined to say Facebook is one of the less evil ones.

echelon · on Dec 30, 2024

> releases their models

Some of them, and initially only by accident. And without the ingredients to create your own.

Meta is trying to kill OpenAI and any new FAANG contenders. They'll commoditize their complement until the earth is thoroughly salted, and emerge as one of the leading players in the space due to their data, talent, and platform incumbency.

They're one of the distribution networks for AI, so they're going to win even by just treading water.

I'm glad Meta is releasing models, but don't ascribe their position as one entirely motivated by good will. They want to win.

int_19h · on Jan 1, 2025

FWIW, there's considerable doubt that the initial LLaMA "leak" was accidental, based on Meta's subsequent reaction.

I mean, the comment with a direct download link in their GitHub repo stayed up even despite all the visibility (it had tons of upvotes).

throwaway290 · on Dec 30, 2024

Or ClosedAI.

Related https://news.ycombinator.com/item?id=42540862

throwaway_fai · on Dec 30, 2024

[flagged]

FrustratedMonky · on Dec 30, 2024

The Banality of Evil.

Everyone has to pay bills, and satisfy the boss.

EVa5I7bHFq9mnYK · on Jan 1, 2025

Yeah, super convenient, now every second web site blocks me as "suspected AI bot".

devit · on Dec 30, 2024

[flagged]

jsheard · on Dec 30, 2024

That's right, getting DDOSed is a skill issue. Just have infinite capacity.

devit · on Dec 30, 2024

DDOS is different from crashing.

And I doubt Facebook implemented something that actually saturates the network, usually a scraper implements a limit on concurrent connections and often also a delay between connections (e.g. max 10 concurrent, 100ms delay).

Chances are the website operator implemented a webserver with terrible RAM efficiency that runs out of RAM and crashes after 10 concurrent requests, or that saturates the CPU from simple requests, or something like that.

adamtulinius · on Dec 30, 2024

You can doubt all you want, but none of us really know, so maybe you could consider interpreting people's posts a bit more generously in 2025.

atomt · on Dec 30, 2024

I've seen concurrency in excess of 500 from Metas crawlers to a single site. That site had just moved all their images so all the requests hit the "pretty url" rewrite into a slow dynamic request handler. It did not go very well.

markerz · on Dec 30, 2024

Can't every webserver crash due to being overloaded? There's an upper limit to performance of everything. My website is a hobby and has a budget of $4/mo budget VPS.

Perhaps I'm saying crash and you're interpreting that as a bug but really it's just an OOM issue cause of too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.

iamacyborg · on Dec 30, 2024

I suspect if the tables were turned and someone managed to crash FB consistently they might not take too kindly to that.

ndriscoll · on Dec 30, 2024

I wouldn't expect it to crash in any case, but I'd generally expect that even an n100 minipc should bottleneck on the network long before you manage to saturate CPU/RAM (maybe if you had 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low powered modern hardware can do thousands to tens of thousands.

troupo · on Dec 30, 2024

You completely ignore the fact that they are also requesting a lot of pages that can be expensive to retrieve/calculate.

ndriscoll · on Dec 30, 2024

Beyond something like running an ML model, what web pages are expensive (enough that 1-10 requests/second matters at all) to generate these days?

x0x0 · on Dec 30, 2024

I've worked on multiple sites like this over my career.

Our pages were expensive to generate, so what scraping did is blew out all our caches by yanking cold pages/images into memory. Page caches, fragment caches, image caches, but also the db working set in ram, making every single thing on the site slow.

smolder · on Dec 30, 2024

Usually ones that are written in a slow language, do lots of IO to other webservices or databases in a serial, blocking fashion, maybe don't have proper structure or indices in their DBs, and so on. I have seen some really terribly performing spaghetti web sites, and have experience with them collapsing under scraping load. With a mountain of technical debt in the way it can even be challenging to fix such a thing.

ndriscoll · on Dec 30, 2024

Even if you're doing serial IO on a single thread, I'd expect you should be able to handle hundreds of qps. I'd think a slow language wouldn't be 1000x slower than something like functional scala. It could be slow if you're missing an index, but then I'd expect the thing to barely run for normal users; scraping at 2/s isn't really the issue there.

troupo · on Dec 30, 2024

Run a mediawiki, as described in the post. It's very heavy. Specifically for history I'm guessing it has to re-parse the entire page and do all link and template lookups because previous versions of the page won't be in any cache

ndriscoll · on Dec 30, 2024

The original post says it's not actually a burden though; they just don't like it.

If something is so heavy that 2 requests/second matters, it would've been completely infeasible in say 2005 (e.g. a low power n100 is ~20x faster than the athlon xp 3200+ I used back then. An i5-12600 is almost 100x faster. Storage is >1000x faster now). Or has mediawiki been getting less efficient over the years to keep up with more powerful hardware?

troupo · on Dec 30, 2024

Oh, I was a bit off. They also indexed diffs

> And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users.

ndriscoll · on Dec 30, 2024

Does MW not store diffs as diffs (I'd think it would for storage efficiency)? That shouldn't really require much computation. Did diffs take 30s+ to render 15-20 years ago?

For what it's worth my kiwix copy of Wikipedia has a ~5ms response time for an uncached article according to Firefox. If I hit a single URL with wrk (so some caching at least with disks. Don't know what else kiwix might do) at concurrency 8, it does 13k rps on my n305 with a 500 us average response time. That's over 20Gbit/s, so basically impossible to actually saturate. If I load test from another computer it uses ~0.2 cores to max out 1Gbit/s. Different code bases and presumably kiwix is a bit more static, but at least provides a little context to compare with for orders of magnitude. A 3 OOM difference seems pretty extreme.

Incidentally, local copies of things are pretty great. It really makes you notice how slow the web is when links open in like 1 frame.

troupo · on Dec 31, 2024

> Different code bases

Indeed ;)

> If I hit a single URL with wrk

But the bots aren't hitting a single URL

As for the diffs...

According to MediaWiki it gzips diffs [1]. So to render a previous version of the page I guess it'd have to unzip and apply all diffs in sequence to render the final version of the page.

And then it depends on how efficient the queries are at fetching etc.

[1] https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture

layer8 · on Dec 30, 2024

The alternative of crawling to a stop isn’t really an improvement.

adamtulinius · on Dec 30, 2024

No normal person has a chance against the capacity of a company like Facebook

Aeolun · on Dec 30, 2024

Anyone can send 10k concurrent requests with no more than their mobile phone.

aftbit · on Dec 30, 2024

Yeah, this is the sort of thing that a caching and rate limiting load balancer (e.g. nginx) could very trivially mitigate. Just add a request limit bucket based on the meta User Agent allowing at most 1 qps or whatever (tune to 20% of your backend capacity), returning 429 when exceeded.

Of course Cloudflare can do all of this for you, and they functionally have unlimited capacity.

layer8 · on Dec 30, 2024

Read the article, the bots change their User Agent to an innocuous one when they start being blocked.

And having to use Cloudflare is just as bad for the internet as a whole as bots routinely eating up all available resources.

aftbit · on Jan 1, 2025

I did read the article. I'm skeptical of the claim though. The author was careful to publish specific UAs for the bots, but then provided no extra information of the non-bot UAs.

>If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

I'm also skeptical of the need for _anyone_ to access the edit history at 10 qps. You could put an nginx rule on those routes that just limits the edit history page to 0.5 qps per IP and 2 qps across all IPs, which would protect your site from both bad AI bots and dumb MediaWiki script kiddies at little impact.

>Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.

And caching would fix this too, especially for pages that are guaranteed not to change (e.g. an edit history diff page).

Don't get me wrong, I'm not unsympathetic to the author's plight, but I do think that the internet is an unsafe place full of bad actors, and a single bad actor can easily cause a lot of harm. I don't think throwing up your arms and complaining is that helpful. Instead, just apply the mitigations that have existed for this for at least 15 years, and move on with your life. Your visitors will be happier and the bots will get boned.