But if they fetch a lot of URLs from cloudflare and each one takes 30 seconds to time out instead of returning an HTTP 200 in under 20 ms, then some architectural decisions that were sound in the latter case may make the whole system slow in the former one.
People have started testing network failures but so often fail to test slow network failures. All of a sudden your code has 500x more pending open sockets, both in and out, and the memory spikes and it all goes to hell. Even if the fast fail code path is indeed best effort.
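A minimal sketch of what that fast-fail bound might look like, assuming Python's `requests` library and an illustrative URL list (the timeout values are placeholders, not a recommendation):

```python
import requests

# Illustrative URLs; in the scenario above this would be a large crawl list.
urls = ["https://example.com/a", "https://example.com/b"]

for url in urls:
    try:
        # timeout=(connect, read): give up within ~3s if the handshake stalls,
        # and within 5s if the server accepts but never answers, instead of
        # letting the socket sit open for 30+ seconds.
        resp = requests.get(url, timeout=(3.05, 5))
        resp.raise_for_status()
    except requests.exceptions.Timeout:
        # Slow failure: surface it immediately rather than piling up sockets.
        print(f"gave up on {url} (slow failure)")
    except requests.exceptions.RequestException as exc:
        # Fast failure (connection refused, DNS error, 5xx, ...).
        print(f"failed fast for {url}: {exc}")
```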
I would love to see graphs showing "length of reply that eventually succeeded" - I suspect that most networks today, if you don't get a response in 5 seconds, you ain't never gonna get anything useful.
In other words, I wonder if moving to fail-fast behavior would help the health of the Internet more than wait-forever timeouts. Might reduce DDoS effects, too.
> When we plotted the data geographically and compared it to our total numbers broken out by region, there was a disproportionate increase in traffic from places like Southeast Asia, South America, Africa, and even remote regions of Siberia. Further investigation revealed that, in these places, the average page load time under Feather was over TWO MINUTES! This meant that a regular video page, at over a megabyte, was taking more than TWENTY MINUTES to load! This was the penalty incurred before the video stream even had a chance to show the first frame. Correspondingly, entire populations of people simply could not use YouTube because it took too long to see anything. Under Feather, despite it taking over two minutes to get to the first frame of video, watching a video actually became a real possibility. Over the week, word of Feather had spread in these areas and our numbers were completely skewed as a result. Large numbers of people who were previously unable to use YouTube before were suddenly able to.
If you're building complicated systems, you should probably reduce traffic to sites that are failing to respond, rather than continuing to send the same traffic with faster timeouts. Depending on what stage of the exchange the request fails at, it might not make a big difference: in a typical HTTPS exchange the cost ramps up the farther you get (processing a SYN < processing a ClientHello < processing a complex request; for a simple request, processing the ClientHello is more expensive than the request itself, of course).
If you send all the same traffic, and probably more because of retries with shorter and shorter timeouts, chances are you'll keep the system overloaded, never detect success, and never return to default timeouts. Dropping most of the traffic and then turning it back on when the system recovers can lead to oscillation, where the system recovers just enough to attract more traffic that overloads it again, and so on, but at least you're getting some processing done.
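One hedged way to express "reduce traffic rather than just shortening timeouts" is a small circuit breaker in front of the flaky dependency; the names, thresholds, and cooldown below are made up for illustration:

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: stop sending after repeated failures,
    then let probe requests through after a cooldown to detect recovery."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown_seconds = cooldown_seconds    # how long to shed traffic before probing
        self.failures = 0
        self.opened_at = None  # None means closed (traffic flows normally)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Open: shed traffic until the cooldown elapses, then allow probes.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# Usage sketch (send_request is hypothetical):
# breaker = CircuitBreaker()
# if breaker.allow_request():
#     ok = send_request()
#     breaker.record_success() if ok else breaker.record_failure()
# else:
#     pass  # drop or queue the work instead of piling onto the struggling backend
```

A fuller implementation would let only a trickle of probes through after the cooldown and ramp traffic back gradually, which is what dampens the oscillation described above.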
Well, you do need jitter, exponential backoff, caching, black- and whitelisting, a stats-based decision tree, etc. That's why it's a complicated and costly problem.
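For the backoff-plus-jitter part, a rough sketch (constants are illustrative only):

```python
import random
import time

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Yield 'full jitter' delays: a random wait in [0, min(cap, base * 2**attempt)],
    so a crowd of clients doesn't retry in lockstep."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

# Example: pause between retries of some failing call.
for delay in backoff_delays():
    time.sleep(delay)
    # ... retry the request here, break on success ...
```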
But if you are consuming a lot of API content, running crawlers, or providing features like "get article title/summary/image/thumbs", then at some scale it's an important decision to make.
I guess that depends on the type of endpoint on that network. For a typical website, absolutely, yes, I'd agree. For an API endpoint serving requests over large data pools, a response that takes a few seconds to generate, as long as it doesn't run all the way to a timeout, would be acceptable.
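For example (names and numbers purely illustrative), that distinction could be as simple as per-endpoint timeout budgets:

```python
# (connect seconds, read seconds) per kind of endpoint -- illustrative values only.
TIMEOUTS = {
    "website": (3, 5),    # a normal page should answer quickly or be treated as down
    "bulk_api": (3, 20),  # a big data-pool query may legitimately take several seconds
}
```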