Following the introductory cloud post a few days ago, and to avoid losing momentum, we are going to keep talking about the cloud, in an area where it seems particularly useful: business continuity. Along with other measures, it is clear that globally distributed datacenters (did someone say GDPR?), flexible system scaling and almost instantaneous deployment make a cloud infrastructure (all else being equal) more resilient to outages than an on-premise one. Of course, availability is not the only factor to consider, but we’ll talk about that another day.
However, when it comes to praising the benefits of the cloud, the providers already do a pretty good job themselves. What I want to talk about is some of the issues that must be considered before migrating an infrastructure to the cloud (although some of these points also apply to PaaS and SaaS). That is: the problems.
1. The cloud doesn’t fail
This is the myth I like the most, partly promoted by the big cloud providers, who tend to project a false image of total availability, and bought without too many reservations by IT professionals. The reality is that the cloud does indeed go down. We’ve seen it in the past, and we will see it again in the future; it’s called Murphy. Whether it’s Amazon AWS, Google Cloud, Microsoft Azure, Oracle Cloud or IBM Cloud, sooner or later there will be outages. Because, I am so sorry, total availability does not exist. Although, you know what? From the point of view of business continuity it doesn’t matter whether there have been outages in the past or not. We must assume that there will be, and prepare for them (although yes, you can also just accept the risk and do nothing; that is always an alternative).
2. SLAs are 99.99999… whatever
It is true that the SLAs guaranteed by CSPs for data durability are a long string of nines, but we had better not confuse durability with availability, whose guaranteed SLA sits between 99.9% and 99.99%, depending on the CSP and the particular service. In practice, this means that the probability of the provider losing our data is almost negligible, but the probability of a service not being available at a given time is somewhat higher (always within the ranges we are talking about). Actually, 99.99% doesn’t really seem that bad, but let’s move on to the next point.
3. If the SLA is not met, the CSP compensates financially
It is logical to think that even an availability in the range of 99.9% (yeah, I removed a 9) is perfectly good for most cases, especially if the outage is spread over shorter downtimes throughout the year (99.9% is about 8.8 hours of unavailability per year). But you know, again, how this Murphy thing goes: things happen, and outages that should last a couple of hours magically turn into incidents of a couple of days. But yes, it is true, of course, that the CSP will compensate for the inconvenience by refunding part (or all) of the monthly invoice if the SLA falls below a certain level. But you know what? Maybe the compensation will not be enough to cover the operational impact of the outage, and that is something to take into account.
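To put those percentages in perspective, here is a quick back-of-the-envelope calculation (a minimal Python sketch, assuming a 365-day year and an SLA measured over the whole year) of the downtime each common availability level allows:

```python
# Rough downtime allowed by common availability SLAs (365-day year assumed).
HOURS_PER_YEAR = 365 * 24  # 8760 hours

for sla in (99.9, 99.95, 99.99, 99.999):
    downtime_hours = HOURS_PER_YEAR * (1 - sla / 100)
    downtime_min_month = downtime_hours * 60 / 12
    print(f"{sla}% availability -> {downtime_hours:.1f} h/year "
          f"({downtime_min_month:.0f} min/month)")
```

The jump from 99.9% to 99.99% is the difference between almost a full working day of downtime per year and well under an hour; that number, rather than the eventual invoice credit, is what should drive the decision.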
4. The cloud simplifies business continuity
Okay. Yes… but no. There is no doubt that the technical capabilities and virtually unlimited resources of the cloud facilitate disaster recovery. But that is only the visible part of the iceberg, and there is much more underneath (which is the whole point of the analogy). The cloud will not save us from identifying processes and activities, their RTOs, RPOs, MTPDs and MBCOs, their minimum resources and critical staff, or from identifying the existing risks. Nor will it save us from designing and implementing the Crisis Management Plan, the Operational Plans or the Communication Plan. Nor from developing business continuity awareness activities, maintaining the (management) system or running regular continuity tests (more on that a little further on). So, yes, we can say that the cloud simplifies business continuity… but a lot less than we would like to believe.
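Purely as an illustration of what "identifying processes and their parameters" means in practice (a hypothetical process with made-up names and values, not taken from any real analysis), this is the kind of record a business impact analysis has to produce before any cloud architecture decision makes sense:

```python
from dataclasses import dataclass

@dataclass
class ProcessContinuityProfile:
    """Continuity parameters identified for a single business process."""
    name: str
    rto_hours: float     # Recovery Time Objective: max time to restore the process
    rpo_hours: float     # Recovery Point Objective: max tolerable data loss window
    mtpd_hours: float    # Maximum Tolerable Period of Disruption
    mbco: str            # Minimum Business Continuity Objective (degraded service level)
    critical_staff: int  # People needed to run the process in degraded mode

# Hypothetical example entry; real values come out of the business impact analysis.
invoicing = ProcessContinuityProfile(
    name="Customer invoicing",
    rto_hours=4,
    rpo_hours=1,
    mtpd_hours=24,
    mbco="Issue invoices manually for the top 20 customers",
    critical_staff=2,
)
```

Only once these numbers exist can we decide whether a multi-region deployment, a warm standby or a simple nightly backup is the right (and affordable) answer.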
5. The cost of disaster recovery solutions is low
Frankly, I don’t have the data to state that the cost of implementing disaster recovery measures is higher on-premise than in the cloud, but I would venture to say that, in general, it is. In the end, moving CapEx to OpEx has its advantages. However, the immense range of services offered by CSPs and the ease of spinning up new ones can quite easily lead to oversized resources, and to mechanisms being set up (that is, contracted) to deal with a potential disaster whose need has not always been analyzed in depth, simply “because they are there and it is so easy”.
6. The provider takes care of the backups, and the continuity tests are not really that necessary
Really, if our CSP guarantees 99.9999999… whatever data durability, why are backups needed? I want to think the answer is obvious, but in case it isn’t: because the cloud provider may not lose our data, but day-to-day operation involves mistakes, and those mistakes mean data loss. Okay, so we need backups. But why test them? Again, because making backups doesn’t automatically mean making them right. And we can extend and multiply this for disaster recovery testing. In a crisis situation, we’re going to need the technical measures we’ve implemented (and that we’ve been paying for for months or years) to work, and to work well. Exactly the same as in an on-premise infrastructure. And that’s what disaster recovery tests are for. Yeah, pretty obvious.
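As a trivial illustration of the principle (a minimal sketch with hypothetical paths; a real restore test goes much further, checking application consistency, RTO and RPO), even the most basic backup test means actually restoring the data somewhere and checking that what comes back matches what went in:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(original_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file in the original tree against its restored copy."""
    problems = []
    for src in original_dir.rglob("*"):
        if not src.is_file():
            continue
        dst = restored_dir / src.relative_to(original_dir)
        if not dst.exists():
            problems.append(f"missing after restore: {dst}")
        elif sha256(src) != sha256(dst):
            problems.append(f"content mismatch: {dst}")
    return problems

# Hypothetical paths: live data and a directory restored from last night's backup.
issues = verify_restore(Path("/srv/data"), Path("/mnt/restore-test/data"))
print("restore OK" if not issues else "\n".join(issues))
```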
In conclusion, despite all the nuances that can be made, there is no doubt that the cloud provides multiple technical mechanisms that make disaster recovery easier. However, blindly trusting that migrating our infrastructure to the cloud will solve all our problems is an invitation to disaster, never better said. (Good) business continuity, here and on Mars, on-premise and in the cloud, is a complex process involving multiple variables that cannot be reduced, for better or worse, to a migration to the cloud.