
How to Lower Cloud Disaster Recovery RTO & TCO


How do you feel when you hear this from a program manager – “You can either get the extra feature or a timely release. Which do you prefer?”

It’s a similar feeling for cloud architects: you can pick either a low Recovery Time Objective (RTO) or a low Total Cost of Ownership (TCO), but not both. To understand why you have to sacrifice either RTO or TCO, let’s start with cloud storage costs.

Cloud Storage Costs

The following table shows the cost comparison of object storage vs block storage in various clouds:

Cloud block storage, on average, is 10x more expensive than object storage. Thus, any architecture which leverages 24×7 block storage will significantly increase the TCO.

The Low RTO & High TCO Problem

Almost all on-premises applications like file servers, databases, app servers, and web servers that are running in VMs or physical servers use primarily block storage and file storage. These apps are not designed to use object storage natively.

So, when these applications need to be recovered in the cloud, they need to use block storage in the cloud. Therefore, if the backups are stored in the native application format in block storage, recovery is fastest, i.e. RTO is lowest.

However, the low RTO achieved here comes at a high cost because it relies on block storage. For example, suppose that for low RTO reasons you replicate 100 TB of on-premises data to AWS EBS SSD storage. You would incur $100/TB x 100 TB = $10,000 per month, i.e. $120K per year, just for the cloud storage costs.

The Large RTO & Low TCO Problem

On the other hand, if you were to store 100 TB in S3 Infrequent Access (S3 IA), you would incur $12.50/TB x 100 TB = $1,250 per month, i.e. just $15K per year (for cloud storage costs). This reduces the costs dramatically: by 8x in AWS, 13x in Azure, 17x in GCP, and 12x in IBM.
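The arithmetic behind the two AWS figures can be reproduced in a few lines. This is only a sketch using the per-TB prices quoted in this post; actual prices vary by region and change over time, and the function name is just for illustration.

```python
# Per-TB monthly prices as quoted in this post (assumed, not current list prices).
BLOCK_USD_PER_TB_MONTH = 100.00   # AWS EBS SSD figure used above
OBJECT_USD_PER_TB_MONTH = 12.50   # S3 infrequent-access figure used above


def storage_cost(tb, usd_per_tb_month):
    """Return (monthly, yearly) storage cost in USD for `tb` terabytes."""
    monthly = tb * usd_per_tb_month
    return monthly, monthly * 12


block_monthly, block_yearly = storage_cost(100, BLOCK_USD_PER_TB_MONTH)
object_monthly, object_yearly = storage_cost(100, OBJECT_USD_PER_TB_MONTH)
ratio = block_yearly / object_yearly

print(f"block:  ${block_monthly:,.0f}/month, ${block_yearly:,.0f}/year")
print(f"object: ${object_monthly:,.0f}/month, ${object_yearly:,.0f}/year")
print(f"block storage costs {ratio:.0f}x more")
```

Swapping in the prices for another cloud gives the corresponding ratio for that provider.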

However, when data is kept in object storage to lower costs, it has to be restored to block storage at the time of a disaster or a DR test. This could take quite some time, depending on the amount of data being restored.

For example, restoring 100 TB of data from object storage to block storage might take 1-2 days, thus increasing your Recovery Time Objective (RTO).
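To see what that 1-2 day figure implies, here is a rough back-of-the-envelope sketch. It assumes a single sustained copy rate and decimal units (1 TB = 1,000,000 MB); the throughput numbers are derived from the example above, not measured on any cloud.

```python
def restore_hours(data_tb, throughput_mb_per_s):
    """Hours needed to copy `data_tb` terabytes at a sustained throughput."""
    total_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / throughput_mb_per_s / 3600


def required_throughput(data_tb, hours):
    """Sustained MB/s needed to copy `data_tb` terabytes within `hours`."""
    return data_tb * 1_000_000 / (hours * 3600)


# Throughput implied by restoring 100 TB in one day and in two days.
print(f"1 day : {required_throughput(100, 24):.0f} MB/s")
print(f"2 days: {required_throughput(100, 48):.0f} MB/s")

# Conversely, restore time for 100 TB at a hypothetical 1,000 MB/s.
print(f"at 1,000 MB/s: {restore_hours(100, 1000):.1f} hours")
```

Even at a sustained rate above 1 GB/s, the restore alone takes roughly a day before any application can start.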

Using Object Storage, Block Storage, and Cloud Snapshots

Some backup vendors have implemented an architecture where the backups are first stored in object storage and then, to reduce RTO, the latest backup is restored from object storage to block storage.

This ensures that the data is in native application format available in block storage, thus allowing you to recover quickly and lower your RTO. 

But what if you want the ability to recover from an older point in time because of data corruption or ransomware attack? 

To meet this requirement, these backup vendors take a cloud storage snapshot of the block device after each restore. Therefore, if you want to recover from an older point in time, you would recover from the snapshot. But recoveries from cloud snapshots are slow because they involve transferring data from the snapshots, which are typically stored in object storage, to block storage, and thus deliver a large RTO.

This architecture is not only complex with too many moving parts but also introduces the highest costs as it involves the cost of object storage to store daily backups + temporary cost of compute to restore from object to block storage + 24×7 cost of block storage where data is restored after each backup + storage snapshot costs.

An Ideal Cloud DR Solution

Obviously, an ideal solution should lower both your RTO and your TCO. To deliver low TCO, the architecture has to store backups in object storage and avoid using block storage as much as possible. When backups are stored in native application format in block storage, vendors like Actifio can present the backup as an iSCSI block device to a recovery server and recover the database or file system online. Thus, if there were a way to mount a backup stored in object storage as an iSCSI block device and recover the file system, database, etc., this would lower the RTO to minutes.

And this is exactly what Actifio delivers. Actifio can also recover an entire on-premises physical server, VM, multi-TB file system, or database using this approach. Here, Actifio serves all block read requests from object storage, and all block writes by the application go to block storage that serves as a cache. This ensures that the original backup image in object storage is immutable.
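The idea of reading from an immutable image while redirecting writes elsewhere can be illustrated with a toy overlay. This is a conceptual sketch only, not Actifio's implementation: the class name is made up, and plain dictionaries stand in for object storage and block storage.

```python
class OverlayVolume:
    """Toy block device: reads fall through to an immutable backup image,
    writes land in a separate layer that stands in for block storage."""

    def __init__(self, backup_image):
        self._backup = backup_image  # block number -> bytes; never modified here
        self._writes = {}            # stand-in for the block-storage write layer

    def read(self, block_no):
        # Prefer data the application has written since the mount.
        if block_no in self._writes:
            return self._writes[block_no]
        return self._backup[block_no]

    def write(self, block_no, data):
        # Writes never touch the backup image.
        self._writes[block_no] = data


backup = {0: b"boot", 1: b"data-v1"}
vol = OverlayVolume(backup)
vol.write(1, b"data-v2")

print(vol.read(0))   # unchanged block, served from the backup image
print(vol.read(1))   # rewritten block, served from the write layer
print(backup[1])     # the backup image itself is untouched
```

Because the backup is never written to, the same image can be mounted again later from the same point in time.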

But What About The Read Performance?

In the above approach, write performance is good because writes go to block storage, but read performance, especially small-read IOPS, would suffer because every read has to fetch its data from object storage.

To solve this problem, Actifio 10c introduced a read cache in block storage. The first read of a block is fetched from object storage and cached in block storage. All future reads of that block are served from the block storage cache, thus ensuring SSD-like performance. The net effect of this architecture is that you can get up to SSD performance at the cost of object storage! This architecture is especially well suited to DR tests, where the focus is on application recoverability and on testing end-to-end processes, networking, firewall rules, etc., rather than on performance testing.
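A read-through cache of this kind can be sketched as follows. Again, this is an illustration under simple assumptions (no eviction, a dictionary as the cache, a hypothetical fetch function standing in for object storage), not a description of how Actifio 10c is built.

```python
class ReadCachedStore:
    """Toy read-through cache: the first read of a block goes to object
    storage, repeat reads are served from a local cache."""

    def __init__(self, fetch_from_object_storage):
        self._fetch = fetch_from_object_storage  # callable: block number -> bytes
        self._cache = {}                         # stand-in for the block-storage cache
        self.object_reads = 0                    # how many reads hit object storage

    def read(self, block_no):
        if block_no not in self._cache:
            self.object_reads += 1
            self._cache[block_no] = self._fetch(block_no)
        return self._cache[block_no]


# A dictionary lookup stands in for an object-storage GET.
image = {n: f"block-{n}".encode() for n in range(4)}
store = ReadCachedStore(image.__getitem__)

for n in (2, 2, 3, 2):
    store.read(n)

print(store.object_reads)  # only the first read of each distinct block is slow
```

Four reads touched two distinct blocks, so only two of them paid the object-storage latency.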

“Storage v-Motion like” feature in the cloud for Real DR

In a real disaster recovery scenario, however, you would want higher performance for Tier 1 and mission-critical applications. For such a scenario, Actifio offers you another option: Mount & Migrate.

Using this feature, you can mount a multi-TB VM or file system or database in minutes from object storage and bring the system online. Thus users are productive. In the background, Actifio has a “storage v-motion” like capability in the cloud where it can copy the data from object storage and its cache to block storage. Once the copy is done, all I/O happens to block storage instead of object storage. More importantly, this happens “live” i.e. there is no downtime. Thus, you get the best of both worlds i.e. 1) low RTO with instant mount and recovery straight from object storage to keep low TCO, 2) zero downtime migration to block storage for better performance and incur block storage costs only for the duration of DR.
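Conceptually, a live migration like this keeps serving I/O while blocks are copied in the background. The sketch below is a simplified, single-threaded illustration with made-up names; it is not Actifio's algorithm, and dictionaries stand in for object and block storage.

```python
class MigratingVolume:
    """Toy volume that stays usable while its blocks are copied from
    object storage to block storage a few at a time."""

    def __init__(self, object_image):
        self._object = object_image            # stand-in for object storage
        self._block = {}                       # stand-in for target block storage
        self._pending = sorted(object_image)   # block numbers not yet copied

    def read(self, block_no):
        # Serve from block storage once the block is there, else from object storage.
        if block_no in self._block:
            return self._block[block_no]
        return self._object[block_no]

    def write(self, block_no, data):
        self._block[block_no] = data           # new writes go straight to block storage

    def migrate_step(self, batch=1):
        """Copy up to `batch` pending blocks; return True once nothing is pending."""
        for _ in range(min(batch, len(self._pending))):
            block_no = self._pending.pop(0)
            # Do not overwrite a block the application has already rewritten.
            self._block.setdefault(block_no, self._object[block_no])
        return not self._pending


vol = MigratingVolume({0: b"a", 1: b"b", 2: b"c"})
vol.write(1, b"B")                  # the application keeps writing during migration
while not vol.migrate_step():
    assert vol.read(0) == b"a"      # reads keep working mid-migration

print(vol.read(1), vol.read(2))     # all I/O is now served from block storage
```

The key design point is the `setdefault` call: a block rewritten by the application during the copy must not be replaced by its older version from the backup.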

What about Recovering at Scale i.e. 100s or 1000s of machines in the Cloud?

You might be thinking: “Mount and Migrate for one machine is great. Would it scale for 100s or 1000s of machines? Does somebody have to baby-sit this for each machine recovery? What is the burden on operations?”

Actifio makes it simple with a scale-out one-click DR orchestration feature named Actifio Resiliency Director. You can set up one or more DR plans. In each DR plan, you can specify: 

  1. The physical servers, on-premises VMs or cloud VMs you want to recover
  2. Whether to use the source machine’s default CPU and memory or customized CPU and memory resources
  3. Network customization, even for multi-homed VMs
  4. The order of recovery between the servers
  5. Pre- and post-scripts for scenarios where certain actions have to be taken before the recovery of the next VM starts. For example, you may want to invoke a script to set a firewall rule, NAT, or DNS redirection before a VM is recovered in a specific network
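To make the shape of such a plan concrete, here is a hypothetical sketch of the five items above as a small data structure with an ordered runner. The field names, server names, and the `run_plan` function are invented for illustration and are not the Resiliency Director format or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RecoveryStep:
    server: str                                  # item 1: machine to recover
    order: int                                   # item 4: position in the recovery order
    cpus: Optional[int] = None                   # item 2: None means "use the source value"
    memory_gb: Optional[int] = None
    networks: List[str] = field(default_factory=list)   # item 3: one entry per NIC
    pre_script: Optional[Callable[[], None]] = None     # item 5
    post_script: Optional[Callable[[], None]] = None


def run_plan(steps, recover):
    """Recover servers in ascending `order`, running pre/post scripts around each."""
    recovered = []
    for step in sorted(steps, key=lambda s: s.order):
        if step.pre_script:
            step.pre_script()
        recover(step)
        if step.post_script:
            step.post_script()
        recovered.append(step.server)
    return recovered


log = []
plan = [
    RecoveryStep("app-01", order=2, cpus=8, memory_gb=32, networks=["app-net"]),
    RecoveryStep("db-01", order=1, networks=["db-net", "backup-net"],
                 post_script=lambda: log.append("dns redirected")),
]
order = run_plan(plan, recover=lambda step: log.append(f"recovered {step.server}"))

print(order)
print(log)
```

The database is recovered first and its post-script runs before the application server starts, which is the kind of dependency the recovery order and scripts are meant to express.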

At the time of the DR or DR test, you just have to select a DR plan and run it. Go to Starbucks. Have a coffee. Come back and see 100s of VMs already recovered within an hour.

Here is a short 3 minute video on cloud DR orchestration and a video on how it works.

What about Recovering every Friday night while I am dining at TGI Fridays?

Yes, you can schedule a DR plan to be executed every day or every week. You could specify a post-script in the DR plan to run data integrity checks or a multi-tier application transaction. The DR plan then executes, recovers the VMs, runs DB checksums, integrity checks, or 3-tier transactions, and shuts down the test recovery machines. You could also be more aggressive and run ransomware and malware scans after auto-recovering your critical servers in the cloud every night.

Summary

The cloud offers great flexibility, and each element has its pros and cons: block storage is high-performance but expensive; storage snapshots are simple but slow to restore from; object storage is cheap but can’t be used directly by enterprise applications. You want to use the cloud for low RTO and low TCO, with a low burden on operations through fully automated 1-click DR orchestration, and with high performance after a real DR situation.

Actifio delivers all of this by keeping object storage at the center of our design and architecture. Our world-class engineers in Boston treat object storage as a first-class citizen and not as a graveyard where backups go to die. This has given us a true architectural advantage that lets you achieve both low RTO and low TCO, something that didn’t exist until Actifio 10c.

Here is a great blog from our customer Hughes Networks which talks about various cloud DR challenges and the critical capabilities needed for cloud DR.

You can learn more about the full list of Actifio 10c capabilities or request a conversation with our Cloud architects anytime.