At no point does the author elaborate on why failing to start if the environment...

fauigerzigerk · on Dec 24, 2015

I agree. Things become incredibly murky if preconditions are not clearly separated from optional settings with sensible defaults. For a server-side multi-user application to just go off hunting for data stores or even create new ones whenever configuration settings are missing seems like a security and data integrity nightmare.

If something is a precondition then the app shouldn't act like it wasn't just to make Docker configuration easier. It needs to fail fast.
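
A minimal sketch of that separation, in Python (the thread names no language, so that choice is an assumption). The setting names `DATABASE_URL`, `PORT`, and `LOG_LEVEL` are hypothetical, chosen only for illustration: the first is treated as a precondition, the other two as optional settings with defaults.

```python
class MissingSetting(Exception):
    """Raised when a required (precondition) setting is absent."""


def load_config(env):
    """Build a config dict from a mapping such as os.environ."""
    # Preconditions: there is no safe default, so refuse to continue.
    required = ["DATABASE_URL"]  # hypothetical setting name
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingSetting("missing required settings: " + ", ".join(missing))

    return {
        "database_url": env["DATABASE_URL"],
        # Optional settings: falling back to a default is harmless here.
        "port": int(env.get("PORT", "8080")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }
```

At startup this would be called as `load_config(os.environ)`, with the process exiting non-zero if `MissingSetting` is raised, so a missing data store address is never silently guessed.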

Retrying database and other connections is sometimes the right thing to do in long-running applications. But I don't think application launch is the right time for it. Application launch is an opportunity to make sure that all dependencies were in place at some point. If things break later on, the odds of it being a temporary issue are much better.

parasubvert · on Dec 24, 2015

"If something is a precondition then the app shouldn't act like it wasn't just to make Docker configuration easier. It needs to fail fast."

Fail fast doesn't mean "crash completely". It means "fall back to the next sensible approach".

"But I don't think application launch is the right time for it. Application launch is an opportunity to make sure that all dependencies were in place at some point. If things break later on, the odds of it being a temporary issue are much better."

This is presuming you or some human has control over the lifecycle of an individual process.

The trend in both mobile and cloud native is the process model, which says the opposite: your app process can and will be killed or relaunched at any time by the underlying OS. It may do so out of sequence with backing service availability. Thus, retries (with a time or count bound, perhaps) are a sensible default.

fauigerzigerk · on Dec 26, 2015

>This is presuming you or some human has control over the lifecycle of an individual process.

No, that doesn't matter at all. It's like with DBMS transactions. I want a defined point at which the system is in a known good state or fail in some detectable way.

For long running processes that get bounced automatically, there needs to be some sort of monitoring anyway. Monitoring is easier if the application does not linger endlessly in an inconsistent state.

parasubvert · on Dec 27, 2015

"I want a defined point at which the system is in a known good state or fail in some detectable way."

I think we're (as an industry) getting to a scale and complexity of systems that warrants having systems heal themselves for a range of predictable and well-understood failure modes, in a way that doesn't require my manual intervention.

davidbanham · on Dec 29, 2015

Absolutely! But that doesn't need to be the app's job. It's the role of an orthogonal process, the monitoring or init daemon, to say "Hey, this process bailed. I should restart it."

That way all your app has to worry about is "Every time I get started, I should try and connect to my dependent services. If it doesn't work, bail."

And the monitoring process gets to worry about things like "I should retry X times before giving up. There need to be at least Y instances of this process running."
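
A rough sketch of that split, assuming Python and a hypothetical `supervise` helper (not taken from any real init daemon): the app simply exits non-zero when a dependency is unreachable, and a separate loop decides how many restarts to allow. The "at least Y instances" part is left out to keep it short.

```python
import subprocess
import time


def supervise(command, max_restarts=3, delay_seconds=1.0, run=subprocess.call):
    """Run command; restart it on a non-zero exit, at most max_restarts times.

    Returns the exit code of the last run. `run` is injectable so the
    loop can be exercised without spawning real processes.
    """
    restarts = 0
    while True:
        exit_code = run(command)
        if exit_code == 0:
            return 0  # the process finished cleanly, nothing to restart
        if restarts >= max_restarts:
            return exit_code  # "retry X times before giving up"
        restarts += 1
        time.sleep(delay_seconds)
```

Something like `supervise(["python", "app.py"], max_restarts=5)` would then wrap an app whose only startup rule is "connect or bail"; `app.py` is a placeholder name.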

fauigerzigerk · on Dec 27, 2015

Agreed, but I don't think we can do that without having transactional boundaries. When a transaction fails, it doesn't necessarily mean that a human has to intervene. It just means that we have a reduced set of possible states that are known to be consistent. I don't see how we could ever hope to define correct self healing algorithms without reducing the number of possible states a system can be in.

parasubvert · on Dec 29, 2015

Yes. Consistency guarantees in the face of distributed failure is a popular topic. (CAP theorem, etc). The whole "cloud native" (12 factor apps, microservices, immutable or disposable infrastructure) movement is also trying to describe ways to simplify most codebases so that you keep as much of the system as stateless/ephemeral as possible.

What is interesting is that, for the stateful/persistent data processing, most large-scale systems are rejecting transaction boundaries as we know them (fully serializable isolation and consistency) in favour of relaxed consistency. There are some good articles and papers on how programming needs to change to enable better self-healing / "recoverable to a known state" behaviour, such as CRDTs.
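
For a concrete feel of what a CRDT looks like, here is a small Python sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are illustrative, not from any particular library. Each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge to the same value regardless of the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count contributed by that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Max is commutative, associative and idempotent, so merges can be
        # applied in any order, and repeated, without changing the result.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)
```

Two replicas that increment independently and later merge in either direction end up with the same total, which is the "recoverable to a known state" property without a transaction spanning both.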

ozim · on Dec 24, 2015

I think that author just does not have experience outside of what he does. Maybe his systems can fall back to sane defaults. But what is sane default if you have to communicate with 3rd party server and your system is worthless when connection is not there? You have to have an IP address or domain name configured.

For cars, if something is wrong, then in some cases you can start and even drive, but the user gets a warning. If something is really wrong, the car will not start.

So I think what the author suggests is, at the very least, asking for trouble.

Almost everywhere as quoted:

"Everything in this post is about improving the deployment process for your applications, specifically those running in a Docker container, but these ideas should apply almost anywhere."

parasubvert · on Dec 24, 2015

"I think that author just does not have experience outside of what he does."

Kelsey's recommendations aren't that different from general resilient systems guidelines in the Erlang community, or any of the many notes on how distributed system development is Different.

"But what is sane default if you have to communicate with 3rd party server and your system is worthless when connection is not there? "

The sane default is to wait and keep trying to connect for a bounded period of time (with a sensible approach to backoff).

The point is that, if you're building distributed systems, you need to account for partial failure. One of the 12 factors is the process model: your app should be considered a process that will be killed and/or restarted at will. It might do so when your backing services are currently unavailable.

The sensible thing is to retry a bounded number of times (with backoff) before giving up. Sometimes the underlying application platform also does this for you (by killing your app process and rescheduling it, if it doesn't come up after a certain time period).

Baking in timing dependencies as a failure mode makes software less resilient.
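
The bounded retry with backoff described above might look roughly like this in Python. It is only a sketch: `connect_with_retry` and its parameters are hypothetical names, and `connect` stands in for whatever call opens the connection to the backing service.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is used up.

    The delay doubles after each failure, capped at max_delay. The last
    failure is re-raised so the process can still exit in a detectable way.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # bound reached: give up instead of lingering
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Because the bound is explicit, this sits between the two positions in the thread: the app tolerates a backing service that comes up a little late, but it still fails visibly once the attempts run out.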

lwf · on Dec 24, 2015

The author is a she.

parasubvert · on Dec 24, 2015

Kelsey isn't a she :)

lwf · on Dec 25, 2015

Oh, welp. That's what I get for gendering names. My apologies, to both you and Kelsey.