Oh great, and at some point when key ISP DNS servers crack under the load, more and more websites will appear to "go down" from a user's perspective; suddenly gmail.com and outlook.com stop working. More and more people reload websites, restart devices, etc., and increase the load even further. People fall back to using SMS/telephone, but since that infrastructure isn't used to 2021-scale load, soon phone calls fail too. With FB, WA, email, and phone "down", engineers can't be reached to fix this. And if they can, they fail to call the Uber to get them somewhere. And even if they could, the streets are congested with people who cannot communicate remotely and so try to get somewhere to convey messages in person.
Hope they will just go back to fix FB and this is just in my head :)
I wrote this as a pure joke, but now that I've learned that SERVFAIL is not cached by browsers, clients, intermediate DNS servers [0], etc., I am genuinely curious what will happen. It is not only the FB apps; basically every website request (to a site that uses FB JS for ads, tracking, etc.) triggers a DNS request, which will be forwarded 1:1 from the ISP's DNS to the null-routed FB subnet. This should put orders of magnitude more load on resolving DNS servers than usual.
Isn’t there an easy fix to just add a bullshit record for FB to DNS until this blows over?
I am also thinking that all the poorly coded sites that don’t work unless the Share on Facebook button loads are also going to hemorrhage money. So are all the e-commerce sites that rely on Login with FB.
I hope this results in everyone rethinking adding all that shit to their infrastructure before the next time.
The problem isn't (wasn't) facebook.com's A records, it's that the authoritative nameservers for facebook.com are (were) unreachable. In theory someone could change the NS records for facebook on the .com nameservers to point somewhere else and serve up a fake facebook.com domain, but... 1) those NS records have a 6 hour TTL, so it would take a while to be effective, and 2) who has the authority to do that?
.com (used for Facebook authoritative DNS) and .net (used for WhatsApp) servers serve glue records with a 2 day TTL, and domain holders don't get to choose TTL. (the root servers serve glue records for the TLDs with a 2 day TTL as well).
So, two days is the TTL you get for your NS records. Of course, it takes way more than two days for people to stop querying your old records. I gave up after 30-45 days.
Heh. I remember the day when a TTL of a week was pretty standard.
Anyway, since the NS records for a domain come from the next level up (in this case the .com nameservers), if they had a short TTL then those nameservers would get hit pretty hard. Also, it isn't your choice what the TTL on your NS records is; that's up to the admins of your top-level domain.
I would assume DNS providers were on call with FB staff to manually trigger flushes across the system to propagate the records ASAP. Of course this doesn't affect client TTLs, but resolution failures generally aren't cached anyway.
My understanding is that the Facebook network itself was unreachable from the internet because of BGP. So even if an IP was resolved from DNS, that IP wouldn’t get routed to Facebook because it withdrew its routes from its peers where it connects to ISPs via BGP
That's precisely what my ISP did. In fact, they went a step further and appear to have hijacked all unencrypted DNS queries. I was able to run DNS queries using 'example.com' as my DNS server and get a response.
If it's any consolation, I find SMS and telephone to be remarkably robust in these situations. In college during football games, the campus population would swell to probably 200k people within a few square miles, each with a cell phone in a pocket. 3G and LTE would be worthless. Campus-area wifi would be worthless. The only thing that would work was shutting all that off on your phone and resorting to SMS and calling people over EDGE, but that worked flawlessly even with all those people stressing a handful of towers at once.
Football games are planned events, and the carriers plan capacity accordingly. I remember SMS and cell phone service going down at several music festivals with 20,000+ participants (especially when they ended around midnight). Only some carriers supported those "spontaneous" gatherings in the middle of nowhere with mobile cell towers that would keep connectivity going; low-cost carriers never did.
During planned events, COWs (cells on wheels) are often deployed to provide extra mobile bandwidth. Sports arenas will have design features to help (e.g. dedicated installed backhaul).
It would be cool to see Matrix's 100bps experiment allow full apps (with E2EE and other modern features, as opposed to SMS) to handle horrid network conditions.
German carrier O2 was/is notorious for offering somewhat decent-ish service in urban areas under normal conditions. But major events that pull lots of people through the city, like fans taking public transport to a soccer game, political rallies, or your average drunkard festival (= Oktoberfest)? Instant collapse...
I lit a match /
And it made a fire. /
And for my cigarette /
I wanted to take the fire from the match. /
But the match slipped from my hand and landed on the rug. /
And it almost made a hole in the rug.
Well, you know what can happen /
If you're not careful with fire. /
And for the light on a cigarette /
A rug feels rather too expensive. /
And from the rug the fire, alas, /
Might have spread to the whole house /
And who knows what would have happened thereafter?
There would have been a fire in the district /
And the fire fighters would have had to come. /
They would have honked in the streets /
And unloaded their pipes. /
And they would have sprayed fire /
And it would have been in vain /
And the whole city would have burned with nothing to protect it.
And the people would have jumped around /
Fearing for their possessions. /
They would have thought somebody had started a fire. /
They would have grabbed assault rifles. /
Everyone would have shouted: "Whose fault is it?" /
The whole country would have rioted. /
And they would have shot at the ministers behind the lecterns.
The UN would have become involved /
And the UN enemies as well. /
To preserve peace in Switzerland /
Both would have come with tanks. /
It would have spread, little by little, /
To Europe, to Africa. /
There would have been a World War and humans would be no more.
I lit a match /
And it made a fire. /
And for my cigarette /
I wanted to take the fire from the match. /
But the match slipped from my hand and landed on the rug. /
Thankfully, I picked it back up.
It's already the case. It's been some time since some websites became randomly hard to reach for me. Changing DNS helped, but the best move has been a VPN to a faraway country.
You know (or could find out) the hashing algorithm. I’m sure you could come up with a model estimating “pollution” based on hash rate, kWh per hash, “greenness” of the energy source, and estimated hashes until a prefix collision.
No, the protocol didn't support that when it was first written, and that absence has continued through today. Internet protocols are typically written from the standpoint that the transport/protocol layer handles all this for them, so the real question is how DNS-over-HTTPS can help manage that, since there's no hope of altering classical DNS. Unfortunately, the MDN page for it seems incomplete, so I have no idea how browsers typically handle that header, much less how a DNS resolver's "DNS-over-HTTPS" shim library will. It might be possible to use the QoS functions of HTTP2 or HTTP3 to accept the connection with a receive maxlen of 0 and send the 503 Retry-After at that point. This also assumes a client coded to behave in a kind manner rather than in a greedy manner, which is generally not a safe bet when "time is money" is a factor.
If so, an option could be responding with a fake answer like 0.0.0.0 + some TTL, plus an extension that tells modern clients "hey, this is actually a SERVFAIL but please cache it for the TTL".
DNS is distributed; that's the problem. You can return nonsense, but it might be ignored by the legion of other DNS servers that are caching your responses.
Because of this, most servers have lots of settings to ignore and sanitise shitty responses.
Just imagine how annoying that would be. Your local resolver has a blip of connectivity and during the time it cannot contact an origin nameserver it burns all of the reasonable retry intervals. After a few seconds its link comes back up but the resolver is still waiting because of "backoff".
Exponential backoff is a really bad antipattern that harms users. There are other, better ways to shed load.
Also, there's no evidence that any DNS infrastructure was overloaded during this event, so what are we even discussing?
Not really; if I unplug a cable, connections break; but if I plug it back in they can and should resume much faster than 6 seconds.
And what about a five minute "blip"? Arguably it should resume proper operations as soon as the link goes back up, not some (unknown to user) time afterwards.
Yes but event driven is way different. That should be instant because the software can be told by the system that connectivity was restored and respond immediately since it’s aware of the hardware being plugged in.
This technology problem is a good metaphor for Facebook overall as a company. There is nothing fundamentally wrong with having your app regularly polling for DNS records when it can't find them, but that can be an actively harmful approach when you are the size of Facebook. Being that size comes with a whole swath of extra responsibilities to ensure that your behavior doesn't end up harming society as a whole.
Is it just me, or does the FB SDK absolutely damage any software that has it installed as a dependency every time there's an issue on Facebook's server side? The thought put into failover states under the hood is very lacking. It speaks to an ethos that if developers aren't a working data pipeline to Facebook services, they and their products can go pound sand.
Excellent opportunity to set up a DNS-by-mail service. Just send me a letter with the names you want and I’ll get back to you within 3 to 5 business days!
Lots of people are interested in the event, and the normal web response queries their SQL database for user data like upvotes, which is their limiter.
Logout, which I believe they just forced for all users, sends you to the cache, and it'll be fast. dang mentioned it during the giant S3 outage a few years ago.
I'd thank anybody who would post a tutorial/config file to set up a DNS server (dnsmasq?), forcefully caching even failed requests for a configurable timeout, with large cache sizes. We might need them in case DNS servers go down under the load of requests from "smart devices" :)
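For a home network, a minimal dnsmasq sketch might look like the following. The option names are real dnsmasq settings but the values are illustrative assumptions; note that as far as I know dnsmasq's negative cache only covers NXDOMAIN/NODATA answers (`neg-ttl` applies when they lack SOA timing), and it will not cache SERVFAILs like the ones this outage produced, so this only gets you part of the way:

```
# /etc/dnsmasq.conf (sketch; values are illustrative assumptions)
cache-size=10000      # default is only 150 entries
neg-ttl=600           # cache negative answers lacking SOA timing for 10 minutes
min-cache-ttl=300     # floor short TTLs at 5 minutes (dnsmasq caps this option at 1 hour)
```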
> forcefully caching even failed requests to a configurable timeout
I've been doing ~SRE work for 1.5 years and I've worked or helped on 3 outages related to negative DNS caching. Please don't use a negative cache if you don't know enough about DNS and can't monitor it.
I wouldn't suggest any ISP should do that (and I am not one), but it could probably be hosted for personal use/home networks. If recursive DNS servers go down under the load of "smart devices", having a local copy of a larger set of IPs I usually visit might come in handy (and none of my requests would worsen the server-overload issue).
This is 'cache', my OpenWRT router has this for thousands of records, but negative cache means: "remember this domain doesn't exist and don't retry asking other DNS providers". This is very dangerous.
Your browser AND operating system AND router already provide DNS caching; it's not something the average user should even think about. You might want to consider it when things at your ISP go wrong (hello BT), or when the majority of computers request the same domains frequently, but then again, your router should do that already.
Well, now in retrospect, Cloudflare claims to have enabled SERVFAIL caching to mitigate, and other commenters here say their ISPs did similar. You might want to reconsider the negative caching strategy.
From your linked SO post, the accepted answer concludes:
"In summary, SERVFAIL is unlikely to be cached, but even if cached, it'll be at most a double- or even a single-digit number of seconds."
That would be fatal right now, wouldn't it? It would mean every major ISP's DNS server is currently forwarding millions of identical DNS resolve requests to the (currently null-routed) Facebook DNS servers. These must be millions, as, heck, every large website uses FB tracking tools, "like buttons", etc. Are they at least smart enough to throttle based on a domain/IP hash? Otherwise the DNS servers of major ISPs could soon be overloaded, as the constantly failing (and thus uncached) requests to FB DNS would eat up all their bandwidth/resources.
Not that fatal. I think at least some recursive servers will do 'collapsed forwarding', where additional requests to resolve the same name while the first request is in progress will wait for the first request to finish and send the same results to all clients at that point. Although, perhaps that's just wishful thinking on my part.
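The "collapsed forwarding" idea (often called query coalescing) can be sketched in a few lines of Python. This is an illustrative toy, not how any real resolver is implemented: concurrent lookups for the same name share one upstream query.

```python
import threading

class CoalescingResolver:
    """Toy resolver: concurrent lookups for the same name share one upstream query."""

    def __init__(self, upstream):
        self.upstream = upstream   # function: name -> answer (possibly slow)
        self.inflight = {}         # name -> (done event, shared result holder)
        self.lock = threading.Lock()
        self.upstream_calls = 0    # how many queries actually went upstream

    def resolve(self, name):
        with self.lock:
            entry = self.inflight.get(name)
            if entry is None:
                # First asker becomes the "leader" and does the upstream work.
                entry = (threading.Event(), {})
                self.inflight[name] = entry
                leader = True
            else:
                leader = False
        event, holder = entry
        if leader:
            self.upstream_calls += 1
            holder["answer"] = self.upstream(name)
            with self.lock:
                del self.inflight[name]
            event.set()
        else:
            event.wait()           # followers wait for the leader's answer
        return holder["answer"]
```

With this in place, 100,000 simultaneous lookups for an expired facebook.com record turn into a single upstream query per resolver rather than 100,000.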
Then you have port limits, usually each request goes out on a new port, a recursive resolver can only have 64k requests outstanding to any given authoritative (or upstream) server IP for each IP the recursive uses. Facebook runs with 4 hostnames listed, so that's a limit of 256k requests outstanding, 512k if your recursive does IPv4 and v6 (and 1 M if they're also making whatsapp requests).
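The arithmetic behind those numbers, roughly (this treats the full 16-bit port space as usable, which real OSes restrict, and assumes 4 NS IPs for each of Facebook and WhatsApp as the comment above does):

```python
PORTS_PER_PAIR = 2**16   # ~64k source ports per (local IP, remote IP) pair
FB_NS_IPS = 4            # facebook.com lists 4 nameserver hostnames

per_family = PORTS_PER_PAIR * FB_NS_IPS   # outstanding queries over IPv4 alone
dual_stack = per_family * 2               # doing IPv4 and IPv6
with_whatsapp = dual_stack * 2            # if whatsapp's nameservers are also being hammered

print(per_family, dual_stack, with_whatsapp)  # 262144 524288 1048576, i.e. ~256k/512k/1M
```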
DNS services for both domains appear to be back up by the way.
On the authoritative side, it's not too hard to manage this load. If you can't handle the big crush to start with, drop all requests, and then accept all the requests from 1.0.0.0/8, and add one /8 at a time as CPU permits until you're allowing everything. Once you handle the initial crush from a resolver, it should go back to normal load, and there should be some distribution of load across the various /8s. I wouldn't expect it to be evenly distributed, but it should be even enough.
Disclosure: I worked at WhatsApp, but left August 2019. I don't know anything about this outage other than idle speculation. I don't know if FB has a procedure to slow start DNS, but the theory is simple; the practice is complicated by the DNS ips being used in Anycast.
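The slow-start theory described above could be sketched like this. It is a pure illustration under stated assumptions: elapsed time stands in for "as CPU permits", and the function names are hypothetical.

```python
import ipaddress

def allowed_slice(seconds_since_recovery, step_seconds=1.0):
    """How many /8s are currently admitted: one more per step since recovery began."""
    return min(int(seconds_since_recovery / step_seconds), 256)

def should_accept(src_ip, seconds_since_recovery):
    """Drop everything at first, then admit 0.0.0.0/8, 1.0.0.0/8, ... over time."""
    first_octet = int(ipaddress.ip_address(src_ip)) >> 24
    return first_octet < allowed_slice(seconds_since_recovery)
```

The idea is that once a resolver's initial crush is absorbed, it falls back to normal cached load, so each newly admitted /8 only adds a transient spike rather than a sustained one.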
> servers will do 'collapsed forwarding', [...] perhaps that's just wishful thinking on my part
I think it is wishful thinking, because that would basically be caching which is not allowed by the RFC. In 2017 the BIND implementation changed to a default cache time of 1s which would certainly ease the problem.
> then you have port limits, usually each request goes out on a new port, a recursive resolver can only have 64k
I'm unsure if this helps or worsens the situation, depending on whether the 'collapsed forwarding'/1s caching is in place. If it is not, ephemeral port exhaustion would kick in, at which point the DNS server will not be able to serve other requests.
> On the authoritative side, it's not too hard to manage this load
Of course not, all you need to do is just present any response which will be cached by downstream resolvers. No smartphone/end user device will query the authoritative side as long as there is just any (even stale) response.
> If this is not the case, ephemeral port exhaustion would kick in, at which point the DNS server will not be able to serve other requests.
You can use the same local ip/port to contact multiple server ip/ports, so filling up connections to FB ips shouldn't prevent you from connecting to others (but there are plenty of ways to do that wrong, I guess)
>> On the authoritative side, it's not too hard to manage this load
> Of course not, all you need to do is just present any response which will be cached by downstream resolvers.
You need to present a response before the resolver times out. One can certainly imagine a situation where the incoming packet processing results in enough delay that the responses arrive too late and are discarded. In the right conditions, this queuing delay would never clear and things just get worse. If it doesn't happen, great, but if it does, dropping most of the requests so you can timely handle the few you accept is a good way to get moving.
It’s not so much the load as the DNS servers having to maintain state for all those queries until they time out. Must consume tremendous RAM and servers that are not event-driven could also be generating large numbers of threads.
My pihole is rate limiting my partner's phone: 20k+ requests in the last hour to facebook et al. The rest of my non-standard DNS is holding up so far; time will tell though. If the big boys get overwhelmed, you might be correct.
Is this specifically an issue with the Facebook app? Or is it just a predictable consequence of DNS responses no longer being cached, due to query failures, for a site as popular as Facebook?
It is certainly not specific to Facebook, but the scale at which Facebook is referenced across websites and apps is pretty unique (I can only think of a few key players like Google who would cause a similar load.)
And to clarify a bit, the queries aren't "no longer being cached due to query failures", it's because their TTL expires and the resulting SERVFAIL from the next query (which fails) isn't cached at all.
I'm the sysadmin looking at a number, and I don't identify people individually unless I have a witness and a direct order to do so based on the policy manual. I take that very seriously. I will not break people's trust. I was asked about how it was affecting us.
This does not make sense to me. TTLs tend to be short-lived, but as I understand it, expiries are typically measured in days for this exact reason. If the TTL expires and the upstream cannot be contacted, retry every retry-interval seconds, and in the meantime, if you have records that have not expired (every ISP had records for FB before the outage), serve them.
Why did the public user DNS servers (ISPs/Google/Cloudflare/etc.) drop the Facebook records? DNS is designed to be able to handle an unreachable server. (Or was some mitigation for some sort of abuse added along the line somewhere that invalidates zones if they're associated with completely unroutable addresses?)
At this point people just hope it will restart so that we can all resume normal life. There are only two options: it will be fixed very soon, or it will be a hell of a night with a lot of coffee.
Yes and no. Yes they could technically hardcode an anycasted IP address, however it'd be less reliable. Also you'd run into issues with TLS certificates. It'd be very inflexible and would probably result in more outages.
But even if they did hardcode an IP, the underlying infrastructure for Facebook was also down not just DNS resolution of facebook.com. So even if the FB app didn't need to resolve a hostname, it would still be broken.
The internet phone book that converts .coms to IP addresses only scales to the load level of the internet because results are cached at multiple layers.
your browser asks your computer which asks your router which asks your isp which asks .com's dns servers which ask facebook's dns servers for facebook's ip address.
each layer will cache the results, so even if, say, 100,000 people in seattle want to know facebook.com's ip address, only the 5 or so isps who provide internet in seattle have to ask for it upstream: 100,000 requests, but only 5 actual requests.
even the per-device and per-home cache is helpful, because 500 page loads in 15 minutes still only results in 1 actual dns request.
Here's the issue:
Failures aren't cached.
so while 100,000 people in 1 second trying to get facebook's ip used to result in only 5 requests to the core dns servers, it now results in 100,000 requests per second going to the core servers.
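A toy model of that amplification (collapsing the city's ISPs into a single caching layer for simplicity, with hypothetical numbers matching the example above):

```python
class CachingLayer:
    """Toy ISP resolver: caches successful answers, but never failures (SERVFAIL)."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}
        self.upstream_queries = 0

    def resolve(self, name):
        if name in self.cache:
            return self.cache[name]
        self.upstream_queries += 1
        answer = self.upstream(name)
        if answer is not None:      # only successful answers get cached
            self.cache[name] = answer
        return answer

def simulate(clients, upstream):
    isp = CachingLayer(upstream)
    for _ in range(clients):
        isp.resolve("facebook.com")
    return isp.upstream_queries

healthy = simulate(100_000, lambda name: "192.0.2.1")  # illustrative answer, cached after 1st query
outage = simulate(100_000, lambda name: None)          # every query falls through upstream
print(healthy, outage)  # 1 100000
```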
It's still a DDoS, just not an attack. Slashdotting a site is a DDoS, but usually not intended as a deliberate attack. Now take every WhatsApp, Facebook, Instagram, and Facebook Messenger user, every app that uses FB for user authentication, every site that does the same, every app and site that serves FB ads, every app and site that uses FB for metrics, and we have an unintentional DDoS just waiting to happen.
If you're an ISP that is anything bigger than a mom-and-pop operation, you should have at least 3 or 4 geographically distributed anycast recursive resolvers.
Recursive DNS is pretty easy to do for really large volumes on a $600 1U server. It's not like the days of 15 years ago...
Ah, but what about all the marketing analytics ISPs deploy on their DNS servers so that if your browser ever looked up viagra.com, forever will it grace your web browsing retargeting ad units? /s