Today's Outage Post Mortem

druiid · on March 3, 2013

As always I'm glad to see Cloudflare post such detailed outage reports. They are one of the few providers I know of that is willing to go into such depth and that is one of the things I appreciate about them. That said, the outage that occurred was one that was indeed fully preventable. We don't exactly have as many locations as they do, but for internal resources at least, not pushing configuration changes to all devices (network included) is pretty standard practice. Basically I imagine for them a good routine to follow might be to script changes so that they are 'rolled out', something along the lines of push manual changes to a scripted 'random' router set (one in country A,B,C), wait 15 minutes and then push to the remaining router sets. That wouldn't work for all situations, such as if the entire network is seeing a DDoS or what have you, but I imagine they could adapt a routine that would prevent this particular scenario.

With all of that said, as a Cloudflare customer and also having a call with them tomorrow scheduled already over the WAF stuff, I find it a bit... frustrating that this is occurring now and such a kind of mistake.

Edit: As an aside, I wonder if the Puppet module for Junos will be extended to support route statements. That would make this kind of deployment much easier.

alexchamberlain · on March 3, 2013

Although I agree, it would be rather hard to fight an attack if you didn't roll out fairly quickly...

saurik · on March 4, 2013

The article, however, actually states that the expectation should be for the rule to do nothing, as the packets in question were much larger than the maximum packet size. You thereby have to examine this as some kind of rushed "let's try something, anything" reaction to a situation where an engineer didn't actually understand what was happening enough to make such a call: it is not surprising that the result was that they ended up landing squarely in "something even more confusing has now happened and everything is offline" territory.

eastdakota · on March 3, 2013

You've identified the rock and the hard place.

ryguytilidie · on March 3, 2013

This is pretty impressive. Keep in mind most of the team is on the west coast so this happened at 1am on a Sunday and they put up a post mortem within hours. Obviously you would prefer it not happen at all, but that is a great response imo.

jgrahamc · on March 3, 2013

Everyone who was needed (management, network operations, and the technical support team) were woken up as soon as this happened. The small team who monitor things during the night were quickly calling people. Some folks physically went to the office to help answers phones; others drove to one of our data centers.

larrys · on March 3, 2013

"The small team"

Didn't see this when I posted my comment. So there is a team monitoring not a single person?

TazeTSchnitzel · on March 3, 2013

Uptime is a fundamental part of their business, of course it's a team.

larrys · on March 3, 2013

What do you mean "of course it's a team" because "uptime is a fundamental part of their business"?

That's an assumption on your part. Many negative events happen because of assumptions people make and things they take for granted. I ask a valid question. Do you know for a fact how many people monitor their network or what systems they have in place? Or whether they even simulate events like this to see if other members of the team are even reachable? Have you ever seen the constant testing that goes on at large organizations (such as the military) to make sure battle systems are ready to deploy when necessary?

Note also that cloudflare charges $3000 per month for enterprise plans (in addition to free) and that their own website was taken offline.

david_shaw · on March 4, 2013

> Uptime is a fundamental part of their business, of course it's a team

Many businesses call roles a "team," even when it's just one person. I believe this is the case here, since the linked CloudFlare blog post states:

> Someone from our operations team is monitoring our network 24/7.

on March 3, 2013

[deleted]

Dylan16807 · on March 3, 2013

Well I was in the process of upvoting you to a less faded gray and suggesting you be less rude when you're making a valid point, but "edit: downvotes? truth hurts." tips the line to where you just go ahead and keep on keeping on and get yourself banned have a nice day see you never.

larrys · on March 3, 2013

But this is not impressive:

"Someone from our operations team is monitoring our network 24/7."

"Someone" seems to indicate "1 person". Not "people are monitoring" but "someone". That's it, one person monitors the network? Like the single night guard at the warehouse?

dandelany · on March 3, 2013

Why would you assume this and judge the company on it, based entirely on an offhanded comment? You couldnt have given them the benefit of the doubt long enough to find another of several comments which clearly indicate there was a team of people working on the issue?

larrys · on March 3, 2013

The question remains how many people actually are monitoring the network in order to call the first responders. Is it one person, two or five?

And my use of "not impressive" was in reply to someone who said "impressive" but more importantly thought it was "impressive" that they put up a post mortem within hours. That's nice but it doesn't answer the question that I had.

I stand behind my comment and re ask the question (since the info is ambiguous we have jgramhmc saying "small team who monitor things" and we have the blog post saying "Someone from our operations team is monitoring our network 24/7."

I don't think it's unreasonable (in the interest of transparency) to know exactly the structure and # bodies of who monitors the network at any given time. What is the human point of failure in the system?

I don't depend on cloudflare. But if I was running a mission critical operation and depended on them I might setup a site visit to actually get a feel of what is going on.

As an aside back when the .org registry got started one of the dns servers sat in an open unguarded office under a desk accessible by the cleaning person. I saw it when I did a site visit. And of course if you've been around long enough you know there was a time when the root dns servers sat unguarded in university offices.

dubcanada · on March 3, 2013

You seem to be looking for "problems" where non exist. There could be 15000000 people monitoring it. It can still go down.

seanp2k2 · on March 4, 2013

This.

It seems that you care more about how many people are actively waiting for something to break vs. how long their response takes. Also, (those people | that person) probably (is | are) the first responder. I really believe that one "first mate" watching the automated ship sail at night ready to triage a technical problem is better than 10 guards who will promptly fall all over themselves when the bits hit the fan.

DigitalSea · on March 4, 2013

That is a little harsh. What makes you think it's only one person monitoring the entire network? And even if it is, what's the problem with that as long as they do their job correctly?

seanp2k2 · on March 4, 2013

It's not usually necessary to have more than one person working overnight shifts with good automated monitoring in place. That person is tasked with "watching the watchman" as it were.

Paying 5 people to be up all night staring at dashboards isn't generally (required|a good use of people).

on March 3, 2013

[deleted]

larrys · on March 3, 2013

It's not an issue of whether they can get backup or not.

(And I don't agree with that anyway a night guard can call 911 and get the police pretty quick.)

The issue is whether there is a single person monitoring the network or several or even two. And is the coverage different during "working" hours? And what about the skills of the person monitoring at 1am vs. during the day?

Seems that after something happens people wise up to the weak points. Remembering the case of a single air traffic controller in some towers and after something went wrong there was such shock that only one person was on duty with no backup.

manys · on March 4, 2013

You're being That Guy.

powertower · on March 3, 2013

> CloudFlare currently runs 23 data centers worldwide.

Shouldn't that always say - CloudFlare currently runs in 23 data centers worldwide?

Or is that just how one would phrase that if you rent multiple racks or a cage in a datacenter? ...because I've seen that a bunch of times before from just about everyone.

Just curious.

ChuckMcM · on March 3, 2013

A bit nit picky yes? Even Google mentions places as "data centers" which really are co-location facilities hosting a bit of gear. It has come to be a way to identify unique numbers of building rather than ownership in most papers or articles that I've read.

To be clear I share your desire for precise speech, perhaps the could have said 'Cloudflare currently uses 23 data centers worldwide' but given that the person writing this is doing it on a Sunday, probably after a long night and having missed all the things they normally would have been doing on Sunday, I'm willing to cut them some slack.

DASD · on March 3, 2013

You are correct. However, it is not as bad as the egregious use of the word "unlimited" by hosting providers.

tlrobinson · on March 3, 2013

The distinction is a bit arbitrary. As a customer you should care that their service is geographically distributed, not whether they own the buildings where the servers are kept.

larrys · on March 3, 2013

"As a customer you should care that their service is geographically distributed"

Don't agree. If you own the data center you have more control over it. We had a case where the UPS systems in a data center had bad batteries and equipment went down because the batteries failed to kick in. Since we don't own the data center we have no realistic way to make inspections and make sure the right thing happens or that the batteries (or the generators) are cycled and maintained. We just have to trust. [1]

Now this may or may not matter with the way they have their redundancy setup. But owning a data center does give you more control over more things.

"not whether they own the buildings where the servers are kept."

Owning the data center and owning the buildings are two different things. Owning the building is owning real estate. Owning the data center is owning the security setup, backup systems etc. Two different things.

[1] So as not to contradict myself with other things I have said I should not say "trust" because you can always put some things in place to verify the right thing is happening (inspections, logs etc.) if you want. But if you own the place it's easier. If I own my home I can decide when to replace the HVAC so it doesn't fail in the middle of the summer. If I rent that's up to the landlord.

acdha · on March 3, 2013

The downside is that this obligates you to do a wider variety of tasks. I'd be surprised if CloudFlare has enough profit margin to afford many redundant electrical engineers, physical network & power techs, etc. in 23 widely dispersed locations. It's obvious that the extra expense wouldn't have helped in this case but it certainly would consume a big chunk of money and management time.

Part of being a business is that you have to make tradeoffs in the real world rather than game-theoretical perfect moves. In many cases this means carefully writing contracts because you can't afford the certain expense and distraction of doing it in-house in the hope that the results might be slightly better.

oijaf888 · on March 4, 2013

Or you could just engineer your systems to take that into account and realize that you can't trust the data center's power so you have to be able to quickly and easily shift all the load to another one.

23david · on March 3, 2013

Gotta disagree. Owning a datacenter doesn't mean that they own the building.

The difference between renting space in a datacenter versus running an entire datacenter is very big, and has ramifications for their uptime, security of their data and disaster recovery. Not sure why they aren't clearer about this.

yRetsyM · on March 3, 2013

I'm not very educated on this end of the spectrum - but I wonder if a process is possible where a rule or router update of some description is applied to one router only, testing the specific schema before pushing to the rest of the routers, thereby failing one router and not failing the rest? I understand the need to respond as quickly as possible - but as stated in this case this was already a manual response.

It appeals to my limited knowledge and non-existant experience that this would be a solution to the prevention of this occurring again in the future?

eastdakota · on March 3, 2013

Don't sell yourself short: our ops team has been on our internal chat talking about how to do something exactly like this for the last hour or so. It's difficult at our scale to truly simulate traffic, but we should be able to roll rules out to just subsets of our network. That's already how we handle router OS upgrades. If a small handful of data centers had crashed, likely no one would have noticed because we've designed that fault tolerance in. This was a problem because the crashes happened system-wide. In the end, we hadn't anticipated that a simple filtering rule like this would cause such a router crash, which was a bad assumption on our part.

yRetsyM · on March 3, 2013

Thanks for responding Matthew, and I can imagine the sense of frustration you're all feeling.

My 2cents:

Any change should be considered dangerous, and be tested first, time weighted to it's level of change. (An internal policy that could be communicated publicly. One that I apply to all my staff)

It would be also good to have data center clusters (preferably a datacenter clusters are sharded among regions) which would allow these changes to happen as necessary. A random cluster being the "first cluster" with a roll back in place if fails, or proceed through to other clusters progressively until all are live.

The sharding should hopefully alleviate any corner of the world taking any massive hit due to degraded performance.

nathannecro · on March 3, 2013

Similar to yRetsyM, my domain knowledge doesn't extend to this side, but if your network is undergoing a DDOS or some other form of attack, taking the time to test rules in a pre-production/test server seems to be quite dangerous. What does the ops team think about using multiple hardware vendors?

toast0 · on March 3, 2013

Doing the wrong fix is at least as dangerous as not handling the DDoS (as shown in this case). Based on their general network architecture, I would think a prudent thing would be a quick sanity test on a pre-production system if available, then deploy to the various colos in groups at intervals that seem appropriate given the nature of the change. If pre-production isn't available, then having the first group be one colo limits the production impact.

If they did some colos as vendor J and some colos as vendor C, I think it would be manageable, but I don't really know how much of the cross colo traffic is actually their routers talking to their routers. Homogeneity in networks makes things easier to manage, until a platform fault breaks everything at the same time. In this case, at least it was related to a change they had made and happened quickly, so it was easy to determine the cause; other platform faults may not be as easy to determine, but if only your vendor X colos fell over, at least you'd have your vendor C colos up and something to look for.

jonaslejon · on March 3, 2013

Good idea. There might be a number of rules that can limit their communication with the routers (and even OOB)

jcr · on March 3, 2013

To the couldflare folks; It's refreshing to see you take responsibility, but I think you've been a bit too hard on yourselves by taking all the blame. First of all, what you hit was a unknown bug in JunOS, and Juniper is to blame for their part. Using some form of staging to slow roll-out of rule changes might have saved you from a full meltdown, but when you're getting attacked, every second counts. Slow versus fast roll-out is one of those really tough balancing acts in your situation. You did a great job with it; by the time I saw the "cloudflare is down" post in the newest queue, it was already back up running again.

MichaelApproved · on March 3, 2013

"by the time I saw the "cloudflare is down" post in the newest queue, it was already back up running again."

Not sure what your timeline shows but the "cloudflare is down" post hit the #1 spot on the front page just a few minutes after they went down. About 40 minutes after that, the services started to come back online for me.

That's a significant outage. That's not reflecting on the job they did bringing things back online but your statement made it seem like a minor outage.

druiid · on March 3, 2013

Indeed. This was no few minute outage. There was very close to a full hour of zero traffic according to our MRTG.

gwern · on March 3, 2013

Pingdom reported an hour downtime on my site too.

GigabyteCoin · on March 3, 2013

Is junpier to blame for the bug in their OS? Or is cloudflare to blame for not testing JunOS enough before relying on that OS?

eastdakota · on March 3, 2013

Buck stops with us. We choose the hardware and software that runs on our network. We test and work around thousands of bugs in it. It was up to us to check range limits before applying them. While we'll never be perfect, one of the things I am most proud of with the CloudFlare team is how quickly we do learn from mistakes.

cbsmith · on March 3, 2013

There is more to this story than meets the eye. This had to be an IPv6 fragment attack. Why weren't you already advertising rejection of such packets, at least for DNS? Why would your analysis software and procedures not already be checking for memory problems with rules that would need yo assemble all the fragments before matching? Seems like there is more to this story than meets the eye.

on March 3, 2013

[deleted]

eastdakota · on March 3, 2013

To be clear, while I wrote out the effective command that was pushed to the router, it was a script that generated it and the fault lies with the script not having adequate error handling.

dubcanada · on March 3, 2013

One would assume paying a company for an OS should be tested via the developers. Juniper SHOULD have tested that route scheme since they sell mission critical architecture.

However you are most likely right, CloudFlare should have test ed it before rolling out.

packetslave · on March 3, 2013

Well-known Google SRE motto: "hope is not a strategy".

See also R.Reagan: "trust, but verify"

rainsford · on March 3, 2013

That was a pretty interesting writeup and I always like it when companies are totally (and quickly) upfront about negative events.

One thing that occurred to me though is that performing a hard reboot of the routers required calling people to physically access the devices and took some time to perform (as you would expect). Although I wouldn't expect it to be needed very often, I'm sort of surprised CloudFlare doesn't have out-of-band remote power cycle capabilities.

There may be some factor I'm not considering that would make that an unattractive option, but it does seem like it could cut down an already quick response time even further for any similar events in the future.

rdl · on March 3, 2013

I've never seen remote power cyclers on big routers in major facilities which have on-site remote hands, even when servers all get both IPMI/LOM board cyclers and physical external cyclers. At most, the routers get a serial port connected to a serial port console server or directly to a modem, and/or an admin ethernet network.

I've seen smaller routers, CSU/DSU, etc. type devices in branch offices on cyclers, though.

I think it's mostly that the routers usually have both good OOB management and good watchdog (reboot on freeze) behavior, and that the PSUs in the bigger routers tend to exceed the per-port power limits of most of the external power cyclers.

It may be a good idea, though.

packetslave · on March 3, 2013

A lot of big routers run (or have the option to run) on DC power, too, which makes it harder. Much of the remote power control gear out there is AC

justincormack · on March 3, 2013

You would need another network (not just a vlan) to run this as well, if you are going to try to reach it when nothing else is working.

mprovost · on March 3, 2013

We just hook up a DSL modem to the OOB network or plug it straight into the OOB interface on a core router. You used to do this with actual modems but it's cheap enough to do it with DSL these days, then you're not dependent on any of your own network to access the device in case of failure.

macros · on March 3, 2013

We've been doing this with mikrotik boxes with either wifi or usb gsm modems depending on the what is available in the location.

rdl · on March 3, 2013

Yeah, I've seen a lot of great options for OOB access: 1) At carrier hotels, wifi (heh) 2) Cellular modems (ideal for branch offices; a lot of datacenters have bad cell coverage inside the racks/cabinets/floor though) 3) Cross-connect (in places with free/cheap cross connects) to someone you don't use for transit. Can be mutual 4) Some facilities give you an OOB network, although this often has issues (if you buy transit from them, it's possible your outage is due to something going wrong with them, and it might take out your OOB access)

I'm looking at the Verizon Private-IP thing (an outsourced private network over Verizon's cell infrastructure) for OOB management of lots of CPE; the cost per device per month is low, and then you pay for bandwidth across all of them. Makes initial provisioning easier, plus ongoing monitoring/maintenance.

dododo · on March 3, 2013

if you want to build a reliable system, one useful thing to do is use equipment from multiple vendors. sure it's inconvenient, but by doing this you can often de-correlate failures. especially if you want to improve someone else's reliability.

e.g., from simple things like hard drives in a raid from different vendors, to n-version programming in safety critical systems (like airplanes).

rdl · on March 3, 2013

That works when the interfaces are totally standard, but edge/core routers are not like that. Cisco supports one set of protocols for talking to other Cisco products; another set for talking to everything else. The "everything else" protocols suck in a lot of ways (they're ok inter-site, but not really so great intra-site).

Same with Juniper. (there aren't really other viable options besides those two)

You could build the same site fully independently with all-Cisco on one, and all Juniper on another, and potentially get some better isolation from vendor faults, but at very high expense.

You end up with much worse reliability if you have a mixed Cisco/Juniper network without a lot of additional isolation otherwise.

windexh8er · on March 3, 2013

Re: "there aren't really any viable options..."

Total misconception. BGP, OSPF, ISIS, LISP, etc. are all non proprietary. Sure, the root cause of this particular problem is that CF is using something specific to Juniper, however router interoperability is not predicated on components like that. This example was a tool CF operationalized, and likely had little to do with their routing with the exception of it being a metric they may have influenced routes with.

People who have all Cisco or all Juniper shops namely do it from a cost perspective. Sure, there are some reasons outside of that but it's likely the big driver. The more you buy, the more you save. And the network sales realm is royally messed up to begin with. I've seen Juniper give 90% discounts on hardware just to break into a Cisco shop. But, the reality of the situation is that all of this gear is marked up well into the thousands of percent. So if you're not getting, minimally 50% then your probably not doing yourself due diligence.

rdl · on March 3, 2013

"there aren't really any viable options" to juniper or cisco for core/edge routers.

There are some routing protocols which interoperate (which is how different sites on the Internet can talk to each other), but most of the protocols used for HA or management of a given set of routers, or, more importantly, most tested/debugged implementations of HA and device management, are Cisco or Juniper specific.

No big deal announcing routes to your upstream if you use Juniper and they use Cisco. Big deal if you have Cisco+Juniper and want to do HSRP (Cisco-only).

windexh8er · on March 4, 2013

Well, no.

I've been in network engineering for 12+ years and I fundamentally disagree with a lot of what is said about "networking" and interop by many programmer-types (not casting here, but) on HN. Yes, yes, you may understand system DevOps to a point, however I'm not sure you've spent a significant amount of time studying Dijkstra's algorithm or truly have an idea of how to deploy a global IPv6 overlay. I'm also not trying to be snide here but I feel that, often times, many things that come up on HN are just fundamentally designed wrong from PHY all the way up until the devs get a hold of the rest. I've been in a very successful startup (think one of the top online backup services) wherein their network was run on commodity junk hardware. They were asking me how I'd troubleshoot this, that and the other thing - obviously with no debug (this guy said that with a grin). First and foremost, you designed it wrong - I can show you inefficiency in about 10 minutes of performance engineering that I would have designed around without thinking about those things. So, yes, I can waste time tracking down a bad NIC on your network, but if you feel that you've earned geek cred because you fired up Wireshark and parsed through a few simplistic ARP tables - you haven't impressed anyone but yourself. That's when I realized I was working with professional developers, and not network architects.

Your simplistic view of FHRP is trivial at best. Maybe if you were talking about how you'd design fault tolerance into a virtual link, say an LSP, with something like BFD in your design I'd be more impressed than conversations about proprietary redundancy protocols of which most network engineers won't touch for a variety of other reasons than the big "C".

</endrant>

rdl · on March 4, 2013

Virtually no network engineers (by percentage) have to do anything other than worry about what their vendor supports for a given configuration (and usually a fairly small set of configurations, too); it's much more about policy and operations.

Similarly very few developers have to solve open CS problems in writing a CRUD application (or I guess more comparably to ops, come up with a novel implementation of a complex algorithm).

This is progress, though.

windexh8er · on March 4, 2013

"Virtually no network engineers (by percentage) have to do anything other than worry about what their vendor supports for a given configuration" - this statement puts a perspective on your thinking. And then I read your information on the services your company offers, and I realize that it's not worth having a discussion.

"<redacted> takes your security very seriously." - right. That's a statement, not information regarding the thought or implementation. There's not even a mention of technology. <sigh>

JoachimSchipper · on March 4, 2013

Be nice.

justincormack · on March 3, 2013

Thats why software defined networking, Openflow etc, are going to take off, as you can get back control of the protocols and what is going on, and avoid the vendor lockin.

sneak · on March 3, 2013

> Thats why software defined networking, Openflow etc, are going to take off

I've been hearing this for a decade. It's still not true. I'm not sure why, either.

rdl · on March 4, 2013

What do you call Arista? Also there is a lot of interesting "virtual appliance" networking going on.

I agree the right choice today is almost certainly a C or J router and probably C or A switches, but e.g. hardware load balancers like F5 seem to be losing out to software in most deployments (increasingly).

I built a decent sized network with Zebra 15y ago, which was pretty obviously the wrong tech, but interesting.

FlukeATX · on March 3, 2013

Cisco + Juniper environment and you want gateway redundancy?

Hello, VRRP!

There are open standards for pretty much every Cisco proprietary protocol. Even EIGRP is now available as an informational RFC.

pyvpx · on March 3, 2013

you can't utilize OSPF, ISIS, BGP, et. al. for high availability? You do realize HSRP is merely for a redundant gateway address, right?

most service provider networks manage via the CLI (generally scripted, for better or worse) and occasionally a vendor-specific API.

while I will agree juniper and cisco are generally the best choice for core/edge routers, there are other 'viable' options depending on your requirements. if you need in excess of 2500 BGP sessions on a single chassis, there aren't many viable options besides a Cisco 7600.

I certainly do not mean to be rude, but I feel you're attempting to speak from experience you don't fully have (yet, hopefully!)

windexh8er · on March 4, 2013

Wow, just... Wow. So. Far. Off. Base. (see my post above)

grogers · on March 3, 2013

I view it less as a cost factor and more as a convenience. Developing expertise with juniper and Cisco takes a lot longer. Each router vendor has its own quirks. Even just buolding software to monitor routers is basically a full time job since its a constantly moving target. New bugs are always coming up....

scoot · on March 3, 2013

"But, the reality of the situation is that all of this gear is marked up well into the thousands of percent."

You seem to be confusing hardware with software. Juniper's gross margin was 64.25% for the quarter ending Dec. 31, 2012, and in that ballpark for previous quarters back to inception.

packetslave · on March 3, 2013

You'd be amazed how often "standard" network protocols behave subtly different between vendors. You have to exhaustively test interoperability for every single feature and config option if you want assurance that it isn't going to break in some bizarre way.

rdl · on March 4, 2013

It is also really nice to be able to call one TAC and have them devote effort to fixing it. If you have a heterogenous network, they can pass the buck, or even if they are awesome and try to help you out, there is no way Cisco's TAC knows as much about Juniper stuff as they do about Cisco, it is harder for them to put together a duplicate config, etc.

Back around 2000 this was a big deal. Cisco slacked on gigabit routers, and Juniper didn't have a comprehensive product portfolio, so while SP networks could be all J (but maybe with some switches from Extreme, etc,), enterprise networks were a lot more likely to have juniper and Cisco mixed, if they needed juniper performance in the core. Juniper ended up broadening their portfolio and Cisco improved their high performance offerings a few years later.

DASD · on March 3, 2013

http://www.zdnet.com/uk-internet-hit-by-linx-router-failure-...

Here's an example of a large provider with parallel infrastructure each powered by a different hardware provider(Brocade/Extreme). One failed and one kept working. I seem to recall a more detailed RFO but my Google-fu has failed me this morning.

rdl · on March 3, 2013

LINX just runs inter-provider switch fabric, though, which is vastly simpler, and just runs two separate switch fabrics for customers to plug into.

Running an anti-DDoS/CDN service which handles traffic like Cloudflare does would be vastly more difficult.

It's certainly possible to do, but I think the given ~reasonable engineering resources, the net reliability of a heterogenous J/C version of CloudFlare would be less, and performance worse, than what they have now.

Switch fabric is a lot closer to the "run different models of hard drives" (although, you don't do that WITHIN a RAID group either -- you do it on separate RAIDs and possibly separate chassis), than routing infrastructure (which is like running a 777 with 1 GE engine and 1 RR engine. At best, you can turn it back into a 747 and run 2 GE engines and 2 RR engines.)

DASD · on March 3, 2013

I'm not going to go much further because this debate is useless without a context of limits and expectations. No one is discussing simplicity. LINX's operation is not simple. A specialized provider is offloading a difficult function as a core competency in return for simplicity. As an end-user, the difficulty is a non-factor. Just make it happen. Couple in "reasonable" with expectations and then we know what to expect. If it costs the moon to never make this happen again, then charge accordingly. If this happens once in a blue moon, then charge a lesser price.

inopinatus · on March 3, 2013

I'm afraid this perspective is misconceived. I, too, used to believe it. After all, in any portfolio, risk is reduced by diversification, right?

Unfortunately, amortized over the lifetime of a computer system, risk is not reduced in this manner. There is no hedging of vendor vs vendor in a technical portfolio; what happens instead, for any tech of significance, is the internal development of an abstract control plane that can communicate with both, and that control plane is then the single source of defects. In the meantime your engineers have to become world-class experts on two platforms rather than one. In practice the divided loyalties will turn one world-class engineer into two half-assed ones.

Domino-effect failures, or global misconfiguration failures like those experienced by Cloudflare are edge cases in my experience and not something you should optimise for. When they happen they tend to be catastrophic, but worse is the insidious decline in quality caused by carrying too much technical debt.

Cloudflare's scenario is not comparable to the installation of a RAID set. They are more comparable to a developer of RAID controllers. The experience curve for such is very, very long.

Not saying they couldn't have done other things to make this situation less catastrophic, but diversity of core technology portfolio isn't a winning ticket.

dageshi · on March 3, 2013

Will customers be willing to pay for the additional costs incurred by that inconvenience? I think there are a whole lot of different things to try before you start introducing different routers with different os's/quirks/capabilities into the mix. Frankly that sounds like a recipe for not just inconvenience but chaos.

senthilnayagam · on March 3, 2013

So as far the DOS attack was very successful. It took down the site which it intended to and take down the network with lots and lots of the sites.

Hope lessons are learnt and your next generation is less prone to these attacks

packetbeats · on March 3, 2013

If the attacker knew about the Juniper bug and thought about a way to convince the network operators to introduce the rule of death themselves, then this is a nice hack indeed. It won't be easy for CloudFare to generically protect against these types of attacks. They could either have mechanisms to revert configurations faster or a way to test new configurations on a single router.

rurounijones · on March 3, 2013

The idea that the attacker knew about the bug is, I think, a remote but intriguing idea.

Wonder if Cloudflare need to do some tests along the lines of:

A) List up all the types of rules we usually use to mitigate these situations. B) Run those rules on a test router with wildly unusual input values, as was the case in this situation. C) Send test traffic using that wildly unexpected input to see what happens.

Basically a bit of manual fuzzing

Time-consuming and maybe not worthwhile, but it could save against another full system death.

senthilnayagam · on March 3, 2013

even according to their admission, if they had not made any change the apps would all have run, just possibly some lag, but making a change for malicious user without knowing the consequence lead to this scenario.

DigitalSea · on March 4, 2013

Rather unfortunate for the credibility of Cloudfare as a network provider, but you've got to admire them for their honesty and it'll work out better for them in the end. It's amazing how a few lines of code managed to bring down Cloudfare, they could have told us anything and nobody would have been able to question it; instead they gave us the truth and I really respect that. They didn't blame the intern, they didn't blame their hardware or make an excuse about a power outage. In terms of honesty Cloudfare seems to be leading the way regardless of their public credibility or image being tainted. Very impressive response time and resolution of the issue as well, good job Cloudfare!

rdl · on March 3, 2013

Wow, that's pretty fast turnaround for a post-mortem (although it looks to have been a simple problem, so easier to figure out what to write)

jgrahamc · on March 3, 2013

Our customers deserve to know what happened as quickly as possible.

BoyWizard · on March 3, 2013

Two things:

1. That video of the BGP routes disappearing is awesome, and

2. A 40 minute outage sounds bad, but consider the following timeline (based on the writeup):

> T+0: route change made, propagates

> T+10: Response team online, attempting local fixes

> T+30: Routers across 23 data centres in 14 countries hard reset and networks coming back up.

DoubleMalt · on March 4, 2013

Funny that now the post mortem is down ...

jgrahamc · on March 4, 2013

Yes, ironic. Unfortunately, the CloudFlare blog is hosted on posterous and they seem to be down.

rdl · on March 4, 2013

You may want to move off posterous before it goes down for good in a month, too :)

tomvo · on March 4, 2013

same here, ironic indeed.

Igalze · on March 4, 2013

Ironic is one way of putting it... :)

onemorepassword · on March 3, 2013

> Even though some data centers came back online initially, they fell back over again because all the traffic across our entire network hit them and overloaded their resources.

I know very little of networking, but this seems to be a recurring pattern that aggravates many major outages. What surprises me is that this so often seems to be a scenario not accounted for.

jonknee · on March 3, 2013

You can only account for it by having more hardware and then it's possible more of your hardware will fail which puts you right back to where you started.

Dylan16807 · on March 3, 2013

I don't think that's the only solution. I would be willing to bet that outside of heavy-DDoS conditions that even a tiny fraction of Cloudflare's network could handle the incoming tcp connections and deny all of them. At that point you don't have to worry about traffic collapsing anything. You can wait to bring up more equipment. You can send a tiny error page. You can let X% of requests get through and be fully served.

I bet that most of the time the domino effect happens to internet services in general it's with nodes that are accepting most requests. They allow themselves to be overloaded. An active HTTP session uses orders of magnitude more resources than simply denying the initial packet and forgetting about it forever.

seanp2k2 · on March 4, 2013

You're vastly oversimplifying the problem here by only accounting for one class of problems.

>". I would be willing to bet that outside of heavy-DDoS conditions that even a tiny fraction of Cloudflare's network could handle the incoming tcp connections and deny all of them." depends on the attack.

>"You can send a tiny error page. You can let X% of requests get through and be fully served." Not usually that easy.

Dylan16807 · on March 4, 2013

I said outside of attacks.

I call BS on saying it's not easy to limit the number of served connections and RST the rest. Isn't this something every web server can do by itself it's so easy?

jaequery · on March 3, 2013

this is the type of reason why i stopped using cloudflare. there are just too many eggs in one basket. it's as if their entire service becomes a SPOF to your infrastructure.

driverdan · on March 3, 2013

You could say the same thing about almost any of your service providers. Your DNS provider goes out, everything goes out. The routers at the data center with your servers go out, all your servers go out. Your CDN goes out, all of the static assets on your site go out.

There will always be potential SPOF.

saurik · on March 4, 2013

While your example with the routers at your backend is truly problematic, DNS is designed with built-in redundancy and CDNs (which CloudFlare should not really count as) having a world-wide outage (as opposed to "people accessing from New York are currently having issues, as we lost one PoP") is nigh-unto unheard of... can you imagine Akamai (or CDNetworks or EdgeCast or even Amazon) saying "doh, all of our infrastructure everywhere just disappeared"?

The core problem with CloudFlare is that they seem to have a highly-centralized take on what is normally a massively-decentralized solution-space, with large numbers of value-adds they encourage customers to use without making it clear that they treat in a haphazard manner, doing very little testing before deploying pushing-the-envelope features while simultaneously having very little in-house debugging expertise to handle serious issues.

(As a concrete example of that last complaint, Cydia was crippled for an entire day due to ModMyi turning on CloudFlare's "preloader" transformation, which apparently caused many WebKit-based browsers--including both MobileSafari and Cydia--to entirely lock up; CloudFlare seemed to go the entire day without noticing, which I continue to be utterly shocked by, and it was only after I told them how to fix it that they were able to acknowledge the issue.)

http://www.saurik.com/id/14 <- When "Dumb Pipes" Get Too Smart, an extensive analysis of this bug

danielpal · on March 3, 2013

What is the alternative? Are you saying you can run a service without a DNS provider? You can always have multiple DNS service providers and CloudFlare is probably one of the best ones.

frankacter · on March 4, 2013

As of January 2013, CloudFlare is number two, second only to Dyn:

http://blog.cloudflare.com/cloudflare-fastest-free-dns-among...

Flow · on March 3, 2013

I think you should have investigated why you got ~90kb packages despite having a max pkg size of ~4kb instead of putting in that rule. :)

dubcanada · on March 3, 2013

I was thinking the exact same thing.

noselasd · on March 3, 2013

> attack packets were between 99,971 and 99,985 bytes long.

This should raise a red flag, as it must be impossible. Ethernet NICs would just bail out on packets longer than what you've set the MTU to, and ethernet frames would just come from the next hop in most cases. And IP packets have a max length field of 16 bit.

naww · on March 3, 2013

http://en.wikipedia.org/wiki/Jumbogram

> An optional feature of IPv6, the jumbo payload option, allows the exchange of packets with payloads of up to one byte less than 4 GiB

cbsmith · on March 3, 2013

People are praising the transparency of this report, but I am not sure I agree because of this point. when I read the report, I had to stop to think when I read the part about packet size to conclude that they had to be talking about an IPv6 packet using the hop-by-hop extension for fragmented packets. That is a special case, because you don't actually know the length of the packet until you receive the last fragment.

As a consequence, fragmented ipv6 packets are error for use in DoS attacks. This is not a "weird" occurrence, but rather an expected one, and since end points are not required to accept such huge packets, I am surprised Cloud flare want already doing all it could to advertise to upstream sources that IPv6 fragments longer than a much smaller than 90K should be dropped, at least if rooted to their DNS. I am also surprised that when their software came up with that kind of a response without first validating that it wouldn't cause the exact memory problem it did. Rules on v6 fragmented packets that can't match on a single fragment are inherently dangerous. It is only reasonable to have safe guards already in place for them.

I am also not sure this is really a bug in Juniper software. I imagine the memory problem only shows up with high traffic and in the midst of a DoS attack. That is kind of a given when you put a rule like that in that kind of a situation.

mprovost · on March 3, 2013

Yes but they were still seeing packets bigger than the MTU of Ethernet (or Sonet or whatever other layer 1/2 tech they're connected to the rest of the net with). It doesn't matter what higher level protocols can handle.

bdonlan · on March 3, 2013

They could've been fragmented IPv6 packets. Or it could've been a bug in their profiler.

pyvpx · on March 3, 2013

which is precisely why it seems like lunacy to roll out such an asinine firewall rule to every router. if there was ever a time to "spot check" a change, this was it.

they didn't. and they paid the price. good on 'em for the quick and honest post-mortem. regardless, it was a dumb move.

cbsmith · on March 3, 2013

You are joking right? The packet size at the higher layer is what they were matching against. The size of the layer 2 packets is irrelevant.

noselasd · on March 4, 2013

Maybe, but nothing in the the rule they showed hinted it was not at layer 3 (For IPv4 )

cbsmith · on March 4, 2013

It is at layer 3. IPv6 is layer 3.

noselasd · on March 4, 2013

If it was IPv6, I'd assume the routing rule on their blog contained IPv6 addreses, not IPv4 addresses, even if the blog faked the IP addresses.

cbsmith · on March 5, 2013

Perhaps then you aren't aware that IPv6 stacks can reach IPv4 addresses, nor that IPv6 packets are a popular way to compromise systems that support both IPv6 and IPv4, because the IPv6 stacks are not as well hardened.

brokentone · on March 3, 2013

Impressive response. 30 minute outage for something most of the hosts I've worked with in the past would have been mystified about for hours. Then a quick RFO and promise of proactive SLA adjustments? Next time I need a CDN or attack mitigation I'll be talking to Cloudflare

tedchs · on March 4, 2013

What I don't understand is why Cloudflare is making changes to their border routers in the process of protecting their customers. I am a network engineer and I love Juniper, but the reality is with any complex system, every change you make has a possibility of inducing an unexpected failure. I would think Cloudflare would have increased stability by using an architecture where the border routers have a mostly static config, and there is a set of firewalls (e.g. Juniper SRX 5800) behind them that are doing the actual filtering and changing configs in response to threats.

seanp2k2 · on March 4, 2013

So now you have two pieces of gear to test changes on and another interaction where stuff could break / go weird.

I don't see how that would solve anything here.

tedchs · on March 4, 2013

The thing it would solve is risking all their BGP peerings going down as a result of day-to-day service operations (i.e. every time they add a filter).

random42 · on March 3, 2013

OT: I want to pitch cloudflare for our CDN needs. Can someone estimate the scale of cloudflare wrt. akamai (current provider), in terms of operations, consumers etc.?

acdha · on March 3, 2013

Akamai is much larger – 100+K edge nodes, many collocated in ISP's local facilities – and has much, much better global coverage, particularly if you care about Asia or South America.

Downside: priced accordingly.

pyvpx · on March 3, 2013

akamai is about 100x the size and probably 200x the price.

saurik · on March 4, 2013

If you are going to go for the $3k/mo CloudFlare plan, you are already nearing the ballpark of Akamai, and would do good to look at the many CDNs that sit in the middle of that scale (such as CDNetworks or EdgeCast).

contingencies · on March 3, 2013

Developing good software comes down to consistently carrying out fundamental practices (regardless of the technology) - Paul M. Duvall

In this case: Development. Versioned change. Test or staging environment. Tests pass. Production.

rurounijones · on March 4, 2013

Meanwhile your customers are getting DDOS'ed while you are faffing about.

Yes, I fully agree that for things like software and standard network maintenance the above is good. But as someone else mentioned in this thread. DDoSes that require quick resolution put you between a rock and a hard place in terms of doing things "right"

contingencies · on March 4, 2013

That's true. However, look at what happens to all of your customers when you fail to test. If you haven't limited, or at least tested the extreme ranges of allowable input to a system automatically pushing out live configuration to all of your routers, there's nobody else to blame but yourself. Sorry. Are most people this diligent? No. Should we be? Yes.

rurounijones · on March 4, 2013

I actually wondered about testing with extreme ranges in another comment, but this testing is done "offline" (Not on live routers and not in response to current circumstances).

However, at least how I read it, your comment was about testing rules in consistently in dev -> staging -> prod when you create one which I think is not viable in this situation since you are on a very tight deadline with immediate impact on your customers.

lazyjones · on March 3, 2013

So what are they going to change as a consequence? It seems logical to not rely on a single router vendor anymore, or to test new rules on a staging setup at least for a very short time before pushing them to all routers.

rdl · on March 3, 2013

Running Vendor J and Vendor C routers together means you get exposed to the weird bugs in either's open/interoperability code, and lose out on all the advanced features (since most of the good stuff isn't well supported in true cross-platform vendor independent fashion).

It's probably more reasonable to split your network into a few more independent sections and never do updates which affect everything, but unless you're building the space shuttle (and can accept vastly higher costs and lower performance), it's probably better to pick one hardware platform, at least now.

eastdakota · on March 3, 2013

While we'll discuss it more at length and after a bit of sleep, my hunch is this will be closer to our approach.

rdl · on March 3, 2013

The thing which annoyed me the most was losing all DNS. You really need to have the DNS servers in separate infrastructure (ASN, netblock, while anycasted) so there is never a case where both of your DNS are out for a customer domain. The "CNAME" product looks pretty kludgey.

ams6110 · on March 3, 2013

By the same token you (the customer) should not have all your DNS eggs in one basket.

rincebrain · on March 3, 2013

CloudFlare's CDN bits require you to give them DNS delegation of your stuff, last I looked.

rdl · on March 3, 2013

There is a way around it with some of the premium accounts, but it kind of seems like a hack.

jacques_chester · on March 3, 2013

Adding different vendor types creates a cartesian explosion of possible combinations of bugs.

For standalone units that don't interact, n-version redundancy is good.

But if they have to interact and somebody has to troubleshoot, n-version redundancy is a nightmare.

The only reason it got fixed this quickly is because they had an intimate knowledge of a single vendor's products. That would be much more difficult with multiple vendors.

TranceMan · on March 3, 2013

Just wondering if the source of the large packets were from a [large] range of hosts or maybe a single host?

Ouch if a single host activity took down ~750k websites - whether deliberate and direct or not.

dubcanada · on March 3, 2013

It's probably even more then that. CloudFlare hosts cdnjs which a ton of people use. It could have "taken down" millions of sites. And by taken down I mean rendered unusable.

ralph · on March 3, 2013

Presumably Jupiter's Junos is closed-source, making investigation more difficult? Do they provide it to some of their bigger clients under an agreement?

Ecio78 · on March 4, 2013

I got a "Oops there was a problem" page from Postereous trying to open the blog page/site...

rschmitty · on March 4, 2013

Now we need a Post Mortem on the Post Mortem, as it is now down

"Oh noes! Something went wrong."

newman314 · on March 4, 2013

Posterous seems to be down.

graycat · on March 3, 2013

Yes, case number 384,449,194 of systems management causing a system problem. Also case number 439,224 of what looked like a localized problem quickly causing a huge system, e.g., all 23 data centers around the world, to crash.

They have my sympathy: So, they typed in a 'rule'. At one time I was working in 'artificial intelligence' (AI), actually 'expert systems', based on using 'rules' to implement real time management of server farms and networks. Of course, in that work, goals included 'lights out data centers', that is, don't need people walking around doing manual work but not the case of 'lights out' as in the CloudFlare outage, and very high reliability.

Looking into reliability, that is, putting into a few, broad categories the causes of outages, a category causing a large fraction of the outages was humans doing system management, or as in the words of the HAL 9000, "human error". Yup.

And the whole thing went down? Yup: One example we worked with was system management of a 'cluster'. Well, one of the computers in the cluster "went a little funny, a little funny in the head" and was throwing all its incoming work into its 'bit bucket'. So, the CPU busy metric on that computer was not very high, and the load leveling started sending nearly all the work to that one computer and, thus, into its bit bucket and, thus, effectively killed the work of the whole cluster.

As one response I decided that real time monitoring of a cluster, or any system that is supposed to be 'well balanced' via some version of 'load leveling', should include looking for 'out of balance' situations.

So, let's see: Such monitoring can have false positives (false alarms) and false negetives (missed detections). So, such monitoring is necessarily essentially a case of some statistical hypothesis testing, typically with the 'null hypothesis' that the system is healthy, applied continually in near real-time. So, for monitoring 'balancing', we will likely have to work with multi-dimensional data. Next, our chances of knowing the probability distribution of that data, even in the case of a healthy system, is from slim down to none. So we need a statistical hypothesis test that is both multi-dimensional and distribution-free.

So, CloudFlare's problems are not really new!

I went ahead and did some work, math, prototype software, etc. and maybe someday it will be useful, but it wouldn't have helped CloudFlare here if only because they needed no help noticing that all their systems around the world were crashing.

In our work on AI, at times we visited some high end sites, and in some cases we found some extreme, high up off the tops of the charts, concern and discipline for who, what, or why any humans could take any system management actions. E.g., they had learned the lesson that can't let someone just type in a new rule in a production system. Why? Because it was explained that one outage in a year, and the CIO would lose his bonus. Two outages and he would lose his job. Net, we're talking very high concern. No doubt CloudFlare will install lots of discipline around humans taking system management actions on their production systems.

Net, I can't blame CloudFlare. If my business gets big enough to need their services, they will be high on the list of companies I will call first!

scoot · on March 3, 2013

'in the words of the HAL 9000, "human error".'

Except that it wasn't human error, at least not in the sense that the decision to enter the rule, or the rule itself was in error. The human error was with the bug in the Juniper firmware that caused this rule to crash this router, and arguably with the CloudFlare process that allows rules to be propagated to all routers concurrently, rather than segmenting the network and testing for success before further deployment.

kisielk · on March 3, 2013

Actually this outage report is a good example of compounding systematic errors. A among the things that went wrong: incorrect and impossible packet sizes were detected, the rule generator generated rules matching the impossible packet sizes, the human operator who looked at the rules and entered them in to the router didn't notice any problems, and finally the routers responded to the incorrect rules by starting to crash.

Had any of these steps not gone wrong there likely would not have been an outage. It was a combination of failures that caused it.

scoot · on March 4, 2013

I don't disagree with you, but I was calling in to question the suggestion that the creation of the rule was the specific human error. By definition, every error you listed is a human error, as even if ultimately carried out but computers (routers), they were designed by humans.

graycat · on March 3, 2013

"or the rule itself was in error"

I thought that the rule was in error: I couldn't read the rule clearly on the screen, but it seemed, or I guessed, that the problem with the rule was that the "humans" omitted the decimal points and, thus, asked for blocking packets with lengths 1000 larger than intended. The Juniper software got sick, i.e., allocated too much memory, only because it was trying to swallow working with such absurdly large packet sizes. But, then, I couldn't clearly read the screen capture with the rule.

scoot · on March 3, 2013

The rule matched the output of the profiler, so in that sense was correct. It wasn't clear from the article whether the profiler output was correct, a result of intentionally malformed packets, or otherwise.

Regardless, the Juniper should either have rejected the rule, or accommodated it.

imbriaco · on March 3, 2013

It's pretty clear that it was a failure of CloudFlare's profiler in generating an obviously impossible rule, the engineer who attempted to apply that rule, and Juniper for allowing it. There's no one place to lay blame, nor is it relevant.

I would imagine that while we have seen a public reason for outage statement that there is a lot more work going on inside CloudFlare as far as the post-mortem is concerned. There are a lot of angles to cover here to really understand how the _system_ failed.

eastdakota · on March 4, 2013