
Fault tree analysis of the September 19 downtime

A few weeks ago we had another downtime. This time the causes I could see were a bit richer than for the previous one, which was directly caused by a network outage at our provider, so I thought it would be fun to apply a risk management technique to it: a fault tree analysis, which is my favorite method because I just like the concept of “why why why” 😀

Note that after some previously undetected downtimes, I had taken steps to improve downtime detection. They proved useful: this time the downtime was detected within a few minutes of onset.

The tree follows below, in PNG for the preview and in SVG for the zoomed version:
[Figure: fault tree analysis of the September 19 downtime]

I put the root causes linked to my now former host, 1&1, in red. It’s quite obvious that most of the causes are linked to them, particularly the huge 6-hour delay to process the payment, which in this day and age is just inconceivable… As for the “set it and forget it” part, that’s something I’ve always disliked about 1&1: they force you to let them store your credit card information (a bit like Amazon, except that Amazon lets you delete it) so that they can renew automatically. That makes it easy to forget: my other hosts have manual renewal, and I have never forgotten to renew there…
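For readers who prefer code to diagrams, here is a minimal sketch of how a fault tree can be written down and evaluated in Python. Everything in it is an assumption made for illustration: the event names are loose paraphrases of the causes discussed above, not the exact nodes of the tree, and the AND gate structure is hypothetical. The idea is only to show the “why why why” principle: each event is explained by its children until you reach leaves, which are the root causes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One node of a fault tree (hypothetical structure, for illustration)."""
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF" for a root cause
    children: List["Event"] = field(default_factory=list)
    occurred: bool = False  # only meaningful for leaves


def happens(event: Event) -> bool:
    """Evaluate whether an event occurs, given the state of its leaves."""
    if event.gate == "LEAF":
        return event.occurred
    results = [happens(child) for child in event.children]
    return all(results) if event.gate == "AND" else any(results)


def root_causes(event: Event) -> List[str]:
    """List the leaves that occurred under branches that actually happened."""
    if event.gate == "LEAF":
        return [event.name] if event.occurred else []
    causes: List[str] = []
    for child in event.children:
        if happens(child):
            causes.extend(root_causes(child))
    return causes


def build_example_tree() -> Event:
    """Build an illustrative tree; names and gates are assumptions."""
    suspended = Event(
        "Hosting suspended",
        gate="AND",
        children=[
            Event("Automatic renewal did not go through", occurred=True),
            Event("Nobody renewed manually in time", occurred=True),
        ],
    )
    slow_payment = Event("Payment took hours to be processed", occurred=True)
    return Event(
        "Site down for hours",
        gate="AND",
        children=[suspended, slow_payment],
    )


if __name__ == "__main__":
    tree = build_example_tree()
    print(tree.name, "->", happens(tree))
    for cause in root_causes(tree):
        print("root cause:", cause)
```

With an AND gate, removing any single leaf (for instance, a payment processed immediately) is enough to make the top event disappear, which is exactly what makes this kind of tree useful for deciding where to act.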

I find this risk analysis method really straightforward. If you’re interested in further reading, some more links (the first one is in English, but the others are in French):

Posted in security, servers.
