Every device with FB app is now DDoSing recursive DNS resolvers (twitter.com/blazejkrajnak)
460 points by doener on Oct 4, 2021 | 109 comments


Oh great, and at some point when key ISP DNS servers crack under the load, more and more websites will appear to "go down" from a user's perspective: suddenly gmail.com and outlook.com are not working. More and more people reload websites, restart devices etc. and increase the load even further. People fall back to using SMS/telephone, but since the phone network is not used to the heavy load of 2021, phone calls soon fail. With FB, WA, Email and Phone "down", engineers can't be reached to fix this. And if they can, they fail to call the Uber to get them somewhere. And even if they could, the streets are congested with people who cannot communicate remotely and so try to get somewhere to convey messages in person.

Hope they will just go back to fix FB and this is just in my head :)


I wrote this as a pure joke, but now that I learned that SERVFAIL is not cached on browsers, clients, intermediate DNS servers [0] etc. I am curiously wondering what will be going on. It is not only FB apps, it is basically every website request (that uses FB JS for ads, tracking, etc.) that triggers a DNS request, which will be forwarded 1:1 from the ISP's DNS to the null-routed FB Subnet. This should put orders of magnitude more load on resolving DNS servers than usual.

[0]: https://serverfault.com/questions/479367/how-long-a-dns-time...


> but now that I learned that SERVFAIL is not cached on [...] intermediate DNS servers [0]

I thought so too, but at least some do cache according to Cloudflare's blog post:

> Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

https://blog.cloudflare.com/october-2021-facebook-outage/


Facebook is so large that even caching the SERVFAIL with a TTL measured in seconds should cut down a lot of traffic.
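
A rough sketch of why even a few seconds of SERVFAIL caching would matter. The query rate, the 60-second window, and the `count_upstream_queries` helper are made-up illustration values, not measurements from the outage:

```python
def count_upstream_queries(arrival_times_ms, servfail_ttl_ms):
    """Count client queries that a resolver must forward upstream.

    Assumes every upstream lookup fails (SERVFAIL) and that the resolver
    remembers that failure for servfail_ttl_ms milliseconds.
    """
    forwarded = 0
    cached_until_ms = -1  # no failure cached yet
    for now_ms in arrival_times_ms:
        if now_ms < cached_until_ms:
            continue  # answered from the cached SERVFAIL
        forwarded += 1  # cache miss: ask upstream, which fails again
        cached_until_ms = now_ms + servfail_ttl_ms
    return forwarded


# Hypothetical load: one query per millisecond for 60 seconds.
arrivals = range(60_000)
print(count_upstream_queries(arrivals, servfail_ttl_ms=0))      # 60000
print(count_upstream_queries(arrivals, servfail_ttl_ms=5_000))  # 12
```

Under these assumptions a 5-second negative TTL turns 60,000 forwarded queries into 12 for a single name on a single resolver.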


As I understand it, my favorite resolver, unbound, has a separate "infra" cache that also caches RTTs and the "down" status of name servers:

https://nlnetlabs.nl/documentation/unbound/unbound.conf/#inf...


I enjoyed the comment. You might like my short story along the same lines from a few years ago: https://bowaggoner.com/writeups/robust.html


Very fun little story. Thanks for writing it.


Isn’t there an easy fix to just add a bullshit record for FB to DNS until this blows over?

I am also thinking that all the poorly coded sites that don’t work unless the Share on Facebook button loads are also going to hemorrhage money. So are all the e-commerce sites that rely on Login with FB.

I hope this results in everyone rethinking adding all that shit to their infrastructure before the next time.


The problem isn't (wasn't) facebook.com's A records, it's that the authoritative nameservers for facebook.com are (were) unreachable. In theory someone could change the NS records for facebook on the .com nameservers to point somewhere else and serve up a fake facebook.com domain, but... 1) those NS records have a 6 hour TTL, so it would take a while to be effective, and 2) who has the authority to do that?


> those NS records have a 6 hour TTL

Why on earth would any system need a TTL beyond 1 hour? The longer the TTL the longer your outage period will be if a cache is poisoned.


On the flip side, it’s the longer you can survive an outage by relying on enough clients having the cached records. So can go both ways a bit.


.com (used for Facebook authoritative DNS) and .net (used for WhatsApp) servers serve glue records with a 2 day TTL, and domain holders don't get to choose TTL. (the root servers serve glue records for the TLDs with a 2 day TTL as well).

So, two days is the TTL you get for your NS records. Of course, it takes way more than two days for people to stop querying your old records. I gave up after 30-45 days.


Heh. I remember the day when a TTL of a week was pretty standard.

Anyway, since the NS records for a domain come from the next level up, i.e. in this case the .com nameservers, which are also the root nameservers, if they had a short TTL then those root nameservers would get hit pretty hard. Also, it isn't your choice what the TTL on your NS records is; it's up to the admins of your top-level domain.


"Why on earth would any system need a TTL beyond 1 hour?"

Load. Lower TTLs mean shorter term caches and more traffic.

This is a big deal, especially for root servers -- which is why the TTL is not customizable.


A shorter TTL increases poisoning risk. Anyway, an attacker can set an arbitrary TTL if they succeed in poisoning, so a longer TTL won't increase risk.


Could something with a normal 1 hour TTL be poisoned by a cache entry with a 30 day TTL?


  > The problem isn't (wasn't) facebook.com's A records, it's that the
  > authoritative nameservers for facebook.com are (were) unreachable.
The problem isn't (wasn't) DNS, it's that a large number of websites rely on a single point of failure in order to function properly.


I would assume DNS providers were on call with FB staff to manually trigger flushes across the system to prop the records ASAP. Of course this doesn't affect client TTLs but resolution failures generally aren't cached anyway.


Speaking of flushing, today I learned that Google public DNS has an open "flush" function:

https://dns.google.com/cache



DNS providers are basically every ISP in existence; it would be impractical to reach them all by phone.


My understanding is that the Facebook network itself was unreachable from the internet because of BGP. So even if an IP was resolved from DNS, that IP wouldn’t get routed to Facebook because it withdrew its routes from its peers where it connects to ISPs via BGP


Not if the IP was 127.0.0.1...


That's precisely what my ISP did. In fact, they went a step further and appear to have hijacked all unencrypted DNS queries. I was able to run DNS queries using 'example.com' as my DNS server and get a response.


If it's any consolation, I find SMS and telephone to be remarkably robust in these situations. In college during football games, the campus population would swell to probably 200k people within a few square miles, each with a cell phone in a pocket. 3G and LTE would be worthless. Campus area wifi would be worthless. The only thing that would work was shutting all that off on your phone and resorting to SMS and calling people over EDGE, but it worked flawlessly even with all the people stressing out a handful of towers at once.


Football games are planned events, and the carriers plan capacity accordingly. I remember SMS and cell service going down at several music festivals with 20,000+ participants (especially when they ended at around midnight). Only some carriers supported those "spontaneous" gatherings in the middle of nowhere with mobile cell towers that would keep connectivity going, but low-cost carriers never did.


During planned events, COWs (cells on wheels) are often deployed to provide extra mobile bandwidth. Sports arenas will have design features to help (e.g. dedicated installed backhaul).

https://en.wikipedia.org/wiki/Mobile_cell_sites

I had a friend who worked deploying them many years ago to events for one of the cellphone providers in NZ.


It would be cool to see Matrix's 100bps experiment allow full apps (with E2EE and other modern features, as opposed to SMS) to handle horrid network conditions.

https://matrix.org/blog/2019/03/12/breaking-the-100-bps-barr...

(I think this was posted already, and it's from 2019, but some people must have missed it - I only saw it this year)


German carrier O2 was/is notorious for offering somewhat decent-ish service in urban areas under normal conditions - but major events that happen to have lots of people moving around the city like fans congregating to a soccer game with public transport, political rallies or your average drunkard festival (=Oktoberfest)? Instant collapse...


For those who understand Swiss German: Mani Matter – I han es Zündhölzli azündt https://www.youtube.com/watch?v=PkGatIgXERI


I lit a match / And it made a fire. / And for my cigarette / I wanted to take the fire from the match. / But the match slipped from my hand and landed on the rug. / And it almost made a hole in the rug.

Well, you know what can happen / If you're not careful with fire. / And for the light on a cigarette / A rug feels rather too expensive. / And from the rug the fire, alas, / Might have spread to the whole house / And who knows what would have happened thereafter?

There would have been a fire in the district / And the fire fighters would have had to come. / They would have honked in the streets / And unloaded their pipes. / And they would have sprayed fire / And it would have been in vain / And the whole city would have burned with nothing to protect it.

And the people would have jumped around / Fearing for their possessions. / They would have thought somebody had started a fire. / They would have grabbed assault rifles. / Everyone would have shouted: "Whose fault is it?" / The whole country would have rioted. / And they would have shot at the ministers behind the lecterns.

The UN would have become involved / And the UN enemies as well. / To preserve peace in Switzerland / Both would have come with tanks. / It would have spread, little by little, / To Europe, to Africa. / There would have been a World War and humans would be no more.

I lit a match / And it made a fire. / And for my cigarette / I wanted to take the fire from the match. / But the match slipped from my hand and landed on the rug. / Thankfully, I picked it back up.


It's already the case. For some time now, some websites have randomly been hard to reach for me. Changing DNS helped, but the best move has been a VPN to a faraway country.


Never having explicitly queried fb.com before I never noticed how they (face:b00c) got clever with their IPv6 address:

2a03:2880:f1ff:83:face:b00c:0:25de


Facebook also used brute force to get facebookcorewwwi.onion on Tor a while back. https://en.wikipedia.org/wiki/Facebook_onion_address


Vanity onion domains are usually created via brute force, yes, but it isn't something looked down upon. Heaps of services do this.


Is it really brute force when they had to put a big list of keywords in and make a backronym for whatever combo they happened to find first?


That's just brute forcing from both directions


Yes? From what I read they had an entire datacenter work on this for multiple days.


facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion is the latest


I would like to know how much pollution was created for that vanity url.


Would you?

You know (or could find out) the hashing algorithm. I’m sure you could come up with a model estimating “pollution” based on hash rate, kWh per hash, “greenness” of the energy source, and estimated hashes until a prefix collision.

Excited to see what you come up with.
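
A starting point for that model, as a sketch only. It assumes only the 8-character prefix `facebook` was targeted, that each generated key matches a given base32 character with probability 1/32, and the `keys_per_second` and `watts` inputs are placeholders you would have to fill in yourself:

```python
BASE32_ALPHABET_SIZE = 32  # onion addresses are base32-encoded


def expected_attempts(prefix):
    """Expected number of random keys tried before one matches the prefix."""
    return BASE32_ALPHABET_SIZE ** len(prefix)


def estimated_kwh(prefix, keys_per_second, watts):
    """Energy estimate from a hypothetical key rate and power draw."""
    seconds = expected_attempts(prefix) / keys_per_second
    return seconds * watts / 3_600_000  # joules -> kWh


print(expected_attempts("facebook"))  # 1099511627776
```

Multiply the result of `estimated_kwh` by a grid carbon-intensity figure of your choice to get to "pollution".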


Is there no response code for a DNS to say "I don't have what you want right now, come back later but wait at least xxx seconds"?

I guess alternatively you could return garbage (127.0.0.1) with a 5 min TTL or so to get clients to back off, but that's also problematic.


> you could return garbage (127.0.0.1) with a 5 min ttl

I use 0.0.0.0, though I'm not sure if some layer in that mess would interpret it creatively. Has worked on my machine for years at least:

  $ ping facebook.com
  connect: Invalid argument
(If you do this, please set the TTL to at least a month, and preferably upwards of a decade.)


lots of ISPs straight up ignore ttl


No, the protocol didn't support that when it was first written, and that absence has continued through today. Internet protocols are typically written from the standpoint that the transport/protocol layer handles all this for them, so the real question is how DNS-over-HTTPS can help manage that, since there's no hope of altering classical DNS. Unfortunately, the MDN page for the Retry-After header seems incomplete, so I have no idea how browsers typically handle it, much less how a DNS resolver's "DNS-over-HTTPS" shim library will. It might be possible to use the QoS functions of HTTP2 or HTTP3 to accept the connection with a receive maxlen of 0 and send the 503 Retry-After at that point. This also assumes a client coded to behave in a kind manner rather than in a greedy manner, which is generally not a safe bet when "time is money" is a factor.


> no hope of altering classical DNS

Does DNS support extensibility?

If so, an option could be responding with a fake answer like 0.0.0.0 + some TTL, plus an extension that tells modern clients "hey, this is actually a SERVFAIL but please cache it for the TTL".


DNS is distributed. That's the problem: you can return nonsense, but it might be ignored by the legion of other DNS services that are caching your responses.

Because of this, there are lots of settings in most servers to ignore and sanitise shitty responses.


Clients should really be using something like exponential backoff ethernet-style.


Just imagine how annoying that would be. Your local resolver has a blip of connectivity and during the time it cannot contact an origin nameserver it burns all of the reasonable retry intervals. After a few seconds its link comes back up but the resolver is still waiting because of "backoff".

Exponential backoff is a really bad antipattern that harms users. There are other, better ways to shed load.

Also, there's no evidence that any DNS infrastructure was overloaded during this event, so what are we even discussing?


6 seconds of wait for a 3-second blip still sounds reasonable.


Not really; if I unplug a cable, connections break; but if I plug it back in they can and should resume much faster than 6 seconds.

And what about a five minute "blip"? Arguably it should resume proper operations as soon as the link goes back up, not some (unknown to user) time afterwards.


Yes but event driven is way different. That should be instant because the software can be told by the system that connectivity was restored and respond immediately since it’s aware of the hardware being plugged in.


What should people use instead of exponential backoff?


Capped Fibonacci backoff with random delay inserted at the cap?
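
One way to read that suggestion, as a sketch. The 60-second cap, the 5-second jitter window, and the `fibonacci_backoff` name are all arbitrary choices for illustration:

```python
import random
from itertools import islice


def fibonacci_backoff(cap_seconds=60.0, jitter_seconds=5.0, rng=random.random):
    """Yield retry delays 1, 1, 2, 3, 5, ... until the cap is reached.

    Once the next Fibonacci delay would reach the cap, every later delay is
    the cap plus a random extra in [0, jitter_seconds), so that clients which
    failed at the same moment do not all retry at the same moment.
    """
    current, following = 1.0, 1.0
    while True:
        if current >= cap_seconds:
            yield cap_seconds + rng() * jitter_seconds
        else:
            yield current
            current, following = following, current + following


# First few delays; the capped ones differ from run to run.
print(list(islice(fibonacci_backoff(), 12)))
```

Compared with doubling, the Fibonacci sequence grows by a factor of about 1.6 per step, so it reaches the cap a little more slowly.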


This technology problem is a good metaphor for Facebook overall as a company. There is nothing fundamentally wrong with having your app regularly polling for DNS records when it can't find them, but that can be an actively harmful approach when you are the size of Facebook. Being that size comes with a whole swath of extra responsibilities to ensure that your behavior doesn't end up harming society as a whole.


There is something inherently wrong with that - it's why exponential backoff exists.


> Being that size comes with a whole swath of extra responsibilities

No! "that size" changes over time, so you have to be responsible all the time.

analogy: read up on the '-f' flag for ping


Isn't this "every device with an app using the FB SDK"?


Is it just me, or every time there's an issue with facebook server side, does the FB SDK just absolutely damage any software with it installed as a dependency? The thought put into failover states under the hood is very lacking. It speaks to an ethos that if developers aren't a working data pipeline to facebook services, they and their own products can go pound sand.


Every app with facebook SDK used for whatever reason, like login or metrics or ads...


Excellent opportunity to set up a DNS-by-mail service. Just send me a letter with the names you want and I’ll get back to you within 3 to 5 business days!


Or go back to host files.


And even Hacker news is strangely slow to respond.


Lots of people are interested in the event, and a normal web response queries their SQL database for user data like upvotes, which is their bottleneck.

Logging out, which I believe they just forced for all users, sends you to the cache, and it'll be fast. dang@ mentioned it during the giant S3 outage a few years ago.


Ah nice - I noticed it just got a lot more snappy to respond.


Yep, with FB and IG down, people spending time on a much more important site.... This site.


yeah I thought I was the only one, never really noticed this on HN before...


really lol? HN goes down every full moon day or something like that


I'd thank anybody who would post a tutorial/config file to set up a DNS server (dnsmasq?) that forcefully caches even failed requests for a configurable timeout, with large cache sizes. We might need them in case DNS servers go down under the load of requests from "smart devices" :)


> forcefully caching even failed requests to a configurable timeout

I've been doing ~SRE for 1.5 years and I've worked or helped on 3 outages related to negative DNS. Please don't use negative cache if you don't know enough about DNS and can't monitor it.


I wouldn't suggest any ISP should do that (and I am not one), but I would probably host this for my own personal use/home network. If recursive DNS servers go down under the load of "smart devices", having a local copy of a larger set of IPs I usually visit might come in handy (and none of my requests would worsen the issue of server overload).


This is 'cache', and my OpenWRT router has this for thousands of records, but negative cache means: "remember this domain doesn't exist and don't retry asking other DNS providers". This is very dangerous.

Your browser AND operating system AND router already provide DNS caching; it's not something the average user should even think about. You might want to consider it when things at your ISP go wrong (hello BT), or when the majority of computers request the same domains frequently, but then again, your router should do it already.


Well, now in retrospect, Cloudflare claims to have enabled SERVFAIL caching to mitigate, and other commenters here say their ISPs did similar. You might want to reconsider the negative caching strategy.


This does a good job explaining how SERVFAIL caching works: https://serverfault.com/questions/479367/how-long-a-dns-time...


From your linked Server Fault post, the accepted answer concludes:

"In summary, SERVFAIL is unlikely to be cached, but even if cached, it'll be at most a double- or even a single-digit number of seconds."

That would be fatal right now, wouldn't it? That would mean every major ISP's DNS server right now forwards millions of identical DNS resolve requests to the (currently null-routed) Facebook DNS servers. These must be millions, as heck, every larger website uses FB tracking tools, "like buttons" etc. Are they at least smart enough to throttle based on a domain/ip hash? Else it could happen that DNS servers of major ISPs are soon overloaded as (constantly failing and thus uncached) requests to FB DNS would eat up all bandwidth/resources?


Not that fatal. I think at least some recursive servers will do 'collapsed forwarding', where additional requests to resolve the same name while the first request is in progress will wait for the first request to finish and send the same results to all clients at that point. Although, perhaps that's just wishful thinking on my part.
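
To make the 'collapsed forwarding' idea concrete, here is a small asyncio sketch. `CoalescingResolver`, `slow_upstream`, and the 1,000-client figure are invented for illustration; this is not how any particular resolver implements it:

```python
import asyncio


class CoalescingResolver:
    """Share one in-flight upstream lookup among all clients asking for a name."""

    def __init__(self, upstream):
        self._upstream = upstream  # async callable: name -> answer
        self._in_flight = {}       # name -> Task for the pending lookup
        self.upstream_calls = 0

    async def resolve(self, name):
        task = self._in_flight.get(name)
        if task is None:
            # First client for this name: start the real lookup.
            self.upstream_calls += 1
            task = asyncio.create_task(self._lookup(name))
            self._in_flight[name] = task
        # Later clients just wait for the same task to finish.
        return await task

    async def _lookup(self, name):
        try:
            return await self._upstream(name)
        finally:
            # Nothing is kept once the lookup completes, so this is not a cache.
            self._in_flight.pop(name, None)


async def slow_upstream(name):
    await asyncio.sleep(0.05)  # stand-in for an upstream that eventually fails
    return "SERVFAIL"


async def demo():
    resolver = CoalescingResolver(slow_upstream)
    lookups = (resolver.resolve("facebook.com") for _ in range(1000))
    answers = await asyncio.gather(*lookups)
    return resolver.upstream_calls, set(answers)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (1, {'SERVFAIL'})
```

The entry is dropped as soon as the lookup finishes, so a client arriving a moment later triggers a fresh upstream query.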

Then you have port limits, usually each request goes out on a new port, a recursive resolver can only have 64k requests outstanding to any given authoritative (or upstream) server IP for each IP the recursive uses. Facebook runs with 4 hostnames listed, so that's a limit of 256k requests outstanding, 512k if your recursive does IPv4 and v6 (and 1 M if they're also making whatsapp requests).
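
The arithmetic behind those figures, spelled out. It treats all 2^16 port numbers as usable, which is an upper bound, and assumes WhatsApp lists the same number of nameservers:

```python
PORTS_PER_SOURCE_IP = 2 ** 16   # upper bound; real ephemeral ranges are smaller
FACEBOOK_NAMESERVERS = 4        # hostnames listed, per the paragraph above

v4_only = PORTS_PER_SOURCE_IP * FACEBOOK_NAMESERVERS
dual_stack = v4_only * 2        # each nameserver reachable over IPv4 and IPv6
with_whatsapp = dual_stack * 2  # assumption: same nameserver count for WhatsApp

print(v4_only, dual_stack, with_whatsapp)  # 262144 524288 1048576
```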

DNS services for both domains appear to be back up by the way.

On the authoritative side, it's not too hard to manage this load. If you can't handle the big crush to start with, drop all requests, and then accept all the requests from 1.0.0.0/8, and add one /8 at a time as CPU permits until you're allowing everything. Once you handle the initial crush from a resolver, it should go back to normal load, and there should be some distribution of load across the various /8s. I wouldn't expect it to be evenly distributed, but it should be even enough.

Disclosure: I worked at WhatsApp, but left August 2019. I don't know anything about this outage other than idle speculation. I don't know if FB has a procedure to slow start DNS, but the theory is simple; the practice is complicated by the DNS ips being used in Anycast.
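
The slow-start idea can be sketched as an admission filter keyed on the first octet of the source address. `SlowStartFilter` and its method names are hypothetical, IPv4 only, and a real deployment would do this in the packet path rather than in Python:

```python
import ipaddress


class SlowStartFilter:
    """Admit DNS queries one source /8 at a time."""

    def __init__(self):
        self.allowed_first_octets = set()  # start by dropping everything
        self._next_octet = 1

    def admit_next_block(self):
        """Open one more /8, to be called whenever CPU has headroom."""
        if self._next_octet <= 255:
            self.allowed_first_octets.add(self._next_octet)
            self._next_octet += 1

    def accepts(self, source_ip):
        first_octet = int(ipaddress.IPv4Address(source_ip)) >> 24
        return first_octet in self.allowed_first_octets


dns_filter = SlowStartFilter()
print(dns_filter.accepts("1.1.1.1"))   # False: nothing admitted yet
dns_filter.admit_next_block()          # now accepting 1.0.0.0/8
print(dns_filter.accepts("1.1.1.1"))   # True
print(dns_filter.accepts("8.8.8.8"))   # False until 8.0.0.0/8 is opened
```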


> servers will do 'collapsed forwarding', [...] perhaps that's just wishful thinking on my part

I think it is wishful thinking, because that would basically be caching which is not allowed by the RFC. In 2017 the BIND implementation changed to a default cache time of 1s which would certainly ease the problem.

> then you have port limits, usually each request goes out on a new port, a recursive resolver can only have 64k

I'm unsure if this helps or worsens the situation, depending on whether the 'collapsed forwarding'/1s caching is in place. If this is not the case, ephemeral port exhaustion would kick in, at which point the DNS server will not be able to serve other requests.

> On the authoritative side, it's not too hard to manage this load

Of course not, all you need to do is just present any response which will be cached by downstream resolvers. No smartphone/end user device will query the authoritative side as long as there is just any (even stale) response.


> If this is not the case, ephemeral port exhaustion would kick in, at which point the DNS server will not be able to serve other requests.

You can use the same local ip/port to contact multiple server ip/ports, so filling up connections to FB ips shouldn't prevent you from connecting to others (but there are plenty of ways to do that wrong, I guess)

>> On the authoritative side, it's not too hard to manage this load

> Of course not, all you need to do is just present any response which will be cached by downstream resolvers.

You need to present a response before the resolver times out. One can certainly imagine a situation where the incoming packet processing results in enough delay that the responses arrive too late and are discarded. In the right conditions, this queuing delay would never clear and things just get worse. If it doesn't happen, great, but if it does, dropping most of the requests so you can timely handle the few you accept is a good way to get moving.



It’s not so much the load as the DNS servers having to maintain state for all those queries until they time out. Must consume tremendous RAM and servers that are not event-driven could also be generating large numbers of threads.


My Pi-hole is rate limiting my partner's phone: 20k+ requests to Facebook et al. in the last hour. The rest of my non-standard DNS is holding up so far. Time will tell though; if the big boys get overwhelmed, you might be correct.


Is it just the people trying to connect, or does the app itself keep polling and trying to send information from devices to Facebook servers continuously?


Both.


Is this specifically an issue with the Facebook app? Or is it just a predictable consequence of DNS responses no longer being cached due to query failures for a site as popular as Facebook?


It is certainly not specific to Facebook, but the scale at which Facebook is referenced across websites and apps is pretty unique (I can only think of a few key players like Google who would cause a similar load.)

And to clarify a bit, the queries aren't "no longer being cached due to query failures", it's because their TTL expires and the resulting SERVFAIL from the next query (which fails) isn't cached at all.


Yeah, I now know every user with the FB app installed. It's just wild to watch the log of all the phones asking for facebook.com.


... or with the facebook-sdk in any installed app.

btw, going through the (DNS) logs of people you personally know is somewhat rude


I'm the sysadmin looking at a number, and I don't identify people individually unless I have a witness and a direct order to do so based on the policy manual. I take that very seriously. I will not break people's trust. I was asked about how it was affecting us.


my apologies then!


this does not make sense to me. ttls tend to be short lived, but as i understand expirys are typically measured in days for this exact reason. if the ttl expires, and the upstream cannot be contacted, retry every retry seconds, and in the meantime, if you have the records (every isp has/had records for fb before the outage) that have not expired, then serve them.

why did the public user dns servers (isps/google/cloudflare/etc) drop the facebook records? dns is designed to be able to handle an unreachable server. (or was some mitigation for some sort of abuse added along the line somewhere that invalidates zones if they're associated with completely unroutable addresses?)


At this point people hope it will just restart so that we can all resume a normal life. There can only be 2 options. It will be fixed very soon or it will be a hell of a night with a lot of coffee.


Is it possible for facebook to instead rely on an anycast IP rather than DNS for their (non-web) phone apps?


No. But FB's DNS is anycasted.

FB's eggs were all in one basket, and the basket broke.


Yes and no. Yes they could technically hardcode an anycasted IP address, however it'd be less reliable. Also you'd run into issues with TLS certificates. It'd be very inflexible and would probably result in more outages.

But even if they did hardcode an IP, the underlying infrastructure for Facebook was also down not just DNS resolution of facebook.com. So even if the FB app didn't need to resolve a hostname, it would still be broken.


So, if Facebook becomes sentient we can’t shut it down because it will break the internet?


The whole point of the doomsday machine is lost if you keep it a secret!


Could anyone explain this for people with no DNS knowledge?


The internet phone book that converts .coms to ip addresses only scales up to the load level of the internet because results are cached at multiple layers.

your browser asks your computer, which asks your router, which asks your isp, which asks .com's dns servers and then facebook's dns servers for facebook's ip address.

each layer will cache the results, so even if 100,000 people in seattle want to know facebook.com's ip address, only the 5 or so isps who provide internet in seattle have to ask for facebook.com's ip address: 100,000 requests, but only 5 actual requests.

even the per-device and per-home cache is helpful, because 500 page loads in 15 minutes still only results in 1 actual dns request.

Here's the issue:

Failures aren't cached.

so while 100,000 people in 1 second trying to get facebook's ip used to result in only 5 requests going to the core dns servers, it now results in 100,000 requests in 1 second going to the core servers.
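
A toy model of that fan-in, using the numbers from the explanation above. The function name and the round-robin assignment of clients to ISPs are assumptions made for the sketch:

```python
def queries_reaching_core(num_clients, num_isp_resolvers, answer_cached):
    """Count lookups for one name that get past the ISP resolvers.

    answer_cached=True models normal operation, where each ISP resolver keeps
    the answer after its first lookup. answer_cached=False models the outage
    as described above, where failures are not kept at all.
    """
    resolvers_with_answer = set()
    forwarded = 0
    for client in range(num_clients):
        resolver = client % num_isp_resolvers  # spread clients across ISPs
        if resolver in resolvers_with_answer:
            continue  # served from that resolver's cache
        forwarded += 1
        if answer_cached:
            resolvers_with_answer.add(resolver)
    return forwarded


print(queries_reaching_core(100_000, 5, answer_cached=True))   # 5
print(queries_reaching_core(100_000, 5, answer_cached=False))  # 100000
```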


> Failures aren't cached.

i thought so too but according to Cloudflare's blog post, at least some of them do:

> Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

https://blog.cloudflare.com/october-2021-facebook-outage/


BGP is like the mail service.

DNS is a translation from human readable addresses to machine addresses.

BGP determines how to find those addresses from your server to theirs.


ISPs keep floating charging Netflix for their own customers' data traffic; by the same logic DNS operators should be charging Facebook.


Could anyone explain this so people with no DNS knowledge could understand?


Looks like they moved fast...

AND BROKE THINGS!


DDoS is a strong way to put this. Are we talking malware that is sending thousands of requests per device, or a bug from a connectivity issue?


It's still a DDoS, just not an attack. Slashdotting a site is a DDoS, but usually not intended as a deliberate attack. Now take every WhatsApp, Facebook, Instagram, and Facebook Messenger user, every app that uses FB for user authentication, every site that does the same, every app and site that serves FB ads, every app and site that uses FB for metrics, and we have an unintentional DDoS just waiting to happen.


The word "attack" has multiple definitions, including "act harmfully on" e.g., a heart attack. It does not require intent or aggression.


If you're an ISP that is anything bigger than a mom-and-pop operation, you should have at least 3 or 4 geographically distributed anycast recursive resolvers.

Recursive DNS is pretty easy to do for really large volumes on a $600 1U server. It's not like the days of 15 years ago...


Ah, but what about all the marketing analytics ISPs deploy on their DNS servers so that if your browser ever looked up viagra.com, forever will it grace your web browsing retargeting ad units? /s



