A few weeks ago we had another downtime. The previous one was directly caused by a network outage at our provider, but this time I had access to a richer set of causes, so I thought it would be fun to apply a risk management technique to it: fault tree analysis, which is my favorite method because I just like the concept of “why why why” 😀
Note that after some earlier downtimes went undetected, I had taken steps to improve downtime detection. They proved useful: this time the downtime was detected within a few minutes of onset.
I put the root causes linked to my now-former host, 1&1, in red. It’s quite obvious that most of the causes are linked to them, particularly the huge 6-hour delay to process the payment, which in this day and age is just inconceivable… As for the “set it and forget it” part, that’s something I’ve always disliked about 1&1: they force you to let them store your credit card info (a bit like Amazon, except that Amazon lets you delete that info) so that they can renew automatically. This makes it easy to forget: my other hosts have manual renewal, and I never forgot to renew there…
I find this risk analysis method really straightforward. If you’re interested in further reading, here are some more links (the first one is in English, the others are in French):