Cloud Disaster Recovery Critical Capabilities

In this Q&A session, Chandra Reddy of Actifio shares Sanjay Taneja of Hughes Networks' perspective and expertise in the area of multi-cloud disaster recovery.

Sanjay Taneja is the Advisory Engineer heading cloud disaster recovery architecture and service delivery at Hughes Networks. Learn the pros and cons of various DR architectures, and the critical capabilities for multi-cloud disaster recovery.


Cloud DR for on-premises workloads is something every enterprise on every continent is looking to leverage. From your point of view, what are the critical business outcomes you expect from a DR solution?

Sanjay: From the business side, the goal is Business Continuity: the business needs to be up and running with the least amount of downtime or impact. The DR objectives are therefore straightforward:

  1. Reduce Data Loss (low RPO)
  2. Reduce Application Downtime (low RTO)
  3. Reduce Total Cost of Ownership (low TCO)

Could you elaborate a bit more on the factors that impact those business outcomes?

Sanjay: There are three factors to consider:

  1. Reduce Data Loss (low RPO): A low RPO shrinks the window of data loss. Ideally, I would like an RPO of one minute for all apps, but that is impractical because it drives up infrastructure, software, and operations costs steeply. It is therefore essential to tier RPO across applications. For example, we wanted an RPO of 15 minutes for databases. For other VMs, it varied from 1 hour to 24 hours because those applications have mostly static content: the software version of a web application is updated perhaps once or twice a month. Expecting a low RPO for everything increases infrastructure, software, and operations costs significantly.
  2. Reduce Application Downtime (low RTO): Low RTO is highly desirable to reduce application downtime. We wanted the entire application stack to be up and running within an hour.
  3. Reduce Total Cost of Ownership (low TCO): We wanted a low Total Cost of Ownership (TCO). Thus, it was important for us to consider an architecture that reduces 24×7 compute and storage consumption in the cloud to a minimum. We wanted to reduce the operational burden by looking for a one-click DR orchestration. And lastly, we wanted to keep license and infrastructure costs low by considering one platform that could take care of backup and low RTO DR needs in hybrid cloud. A reliable, readily available local backup that can be accessed and deployed instantly while reducing TCO is an added bonus.  
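The RPO tiering described above can be captured in a simple policy table. This is an illustrative sketch only; the tier names and RPO values are assumptions drawn from the examples in the answer, not a real product configuration:

```python
from datetime import timedelta

# Illustrative RPO tiers, based on the examples above: databases get a
# tight RPO, mostly-static web/app VMs get a looser one.
RPO_TIERS = {
    "database": timedelta(minutes=15),
    "app_vm": timedelta(hours=1),
    "static_web_vm": timedelta(hours=24),
}

def backups_per_day(tier: str) -> float:
    """How many backup/replication cycles per day a tier's RPO implies."""
    return timedelta(days=1) / RPO_TIERS[tier]

# A 15-minute RPO implies ~96 cycles/day; a 24-hour RPO implies just 1,
# which is why a blanket low RPO multiplies infrastructure and ops cost.
print(backups_per_day("database"))       # 96.0
print(backups_per_day("static_web_vm"))  # 1.0
```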

Before we get started with cloud DR, can you list the various DR approaches used in data centers?

Sanjay: There are at least three architectures for Disaster Recovery:

  1. DR from backups
  2. DR from storage-based or host-based replication
  3. DR from native database replication tools

Let’s dive into the backup architecture. Almost all backup products write backups in their own proprietary format. With such an architecture in the cloud, what would be the impact on RTO? What is your recommendation to reduce RTO?

Sanjay: A proprietary backup format would mean rehydrating data back to application native format at the time of recovery. Such a process needs every byte of the backup image to be read, converted, and restored, which increases RTO. 

You can lower RTO by eliminating the need to read-convert-restore from the proprietary backup format. Instead, you can store the backups in native application format and present an instant mount of the backup at the time of the recovery in the cloud.
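A rough back-of-envelope calculation shows why full rehydration dominates RTO. The aggregate restore throughput here is an assumed figure for illustration, not a measured number:

```python
# Back-of-envelope: time to read-convert-restore a full backup image,
# versus an instant mount that avoids the bulk copy entirely.
def rehydration_hours(data_tb: float, throughput_gbps: float = 1.0) -> float:
    """Hours to rehydrate data_tb terabytes at an assumed throughput in GB/s."""
    seconds = (data_tb * 1024) / throughput_gbps  # TB -> GB, then GB / (GB/s)
    return seconds / 3600

# Restoring 100 TB at an assumed 1 GB/s takes ~28 hours of data movement
# alone, before the application even starts coming up.
print(round(rehydration_hours(100), 1))  # 28.4
```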

It’s also important to be able to convert the backup of the boot volume of an on-premises VM or physical server, within minutes, into a cloud-native VM in multiple clouds such as AWS, Azure, and GCP.

What if a vendor has an architecture where they auto-restore after each backup? For example, the backup software could restore from the latest backup in object storage such as AWS S3 or GCP Nearline to block storage such as AWS EBS or GCP Persistent Disk (PD), and take a native storage snapshot. Since the data is already in the native format, wouldn’t this reduce RTO?

Sanjay: Such an approach will reduce the RTO because the data is already pre-converted into native application format. But the side effect is that the Total Cost of Ownership (TCO) will be high. 

TCO = storage costs for backup (most likely object storage such as AWS S3 or GCP Nearline) + Block Storage costs (such as AWS EBS or GCP PD where it stores data in native app format) + snapshot costs (such as AWS EBS or GCP PD snapshots so that you can go back in time for recoveries)

For example, if you need low RTO for 100 TB of source data with two weeks of retention, you would need to pay 24×7 costs for 1) at least 100 TB of object storage, and 2) 100 TB of block storage such as AWS EBS: $100/TB/month x 100 TB = $10,000 per month, and that doesn’t even include snapshot costs.
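Plugging the example figures into the TCO formula above makes the cost stack concrete. The block storage rate is the $100/TB/month from the example; the object storage rate is an assumed illustrative figure, and snapshot costs are omitted, as in the original example:

```python
# Monthly TCO for the auto-restore architecture, using the example figures.
SOURCE_TB = 100
OBJECT_STORAGE_PER_TB = 23   # assumption: ~ $23/TB/month for object storage
BLOCK_STORAGE_PER_TB = 100   # $100/TB/month, as stated in the example

object_cost = SOURCE_TB * OBJECT_STORAGE_PER_TB  # backups at rest in S3/Nearline
block_cost = SOURCE_TB * BLOCK_STORAGE_PER_TB    # pre-converted native-format copy

print(block_cost)                # 10000 -> the $10,000/month from the example
print(object_cost + block_cost)  # 12300 -> total, still excluding snapshot costs
```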

Also, you will have a large RPO, because the backup to object storage, the copy to block storage, and the snapshot cannot all be completed quickly.

What about DR using storage-based replication?

Sanjay: Some storage vendors have implemented replication from their on-premises storage arrays to cloud object storage, such as AWS S3, to reduce 24×7 block storage costs in the cloud. But such an architecture requires data rehydration from cloud object storage such as AWS S3 or GCP Nearline to block storage such as AWS EBS or GCP PD at the time of recovery, which increases RTO. Other storage vendors have implemented replication to cloud block storage, such as AWS EBS or GCP PD volumes. But this increases TCO because 24×7 block storage usage is costly in the cloud.

It looks like you either get low RTO or low TCO, but not both. What about host-based replication?

Sanjay: vSphere replication is possible from on-premises VMware to VMware running in AWS, or to CloudSimple-managed VMware in GCP. But this requires expensive vSAN storage and a minimum three-node VMware cluster running 24×7 in the cloud, which increases TCO significantly. Some replication products can replicate to object storage, thus lowering TCO, and they also offer one-click DR orchestration, reducing operational complexity. But they cannot do instant mount and recovery from object storage; they have to restore data from object storage to block storage in the cloud, thus increasing RTO.

Again, the tradeoff between low RTO and low TCO continues. What about native database replication?

Sanjay: While native database replication can deliver low RTO, it has the following consequences:

How will you recover the rest of the application and web servers that work with the database? You still need to invest in another backup and DR tool to protect all those app and web servers. Moreover, recovering databases from older points in time is very time-consuming, depending on how many logs need to be applied, thus increasing the application downtime.

And lastly, native database replication requires 24×7 compute running in the cloud while consuming expensive block storage, not to mention costly database license, operations, and maintenance costs. While this might be acceptable for small shops, it will be extremely expensive for enterprises that typically have hundreds of database servers.

But the advantage of native DB replication, along with low RTO, is near zero RPO. So I would recommend using native DB replication for the most critical databases instead of every database.

Thank you for the great insight into the pros and cons of the various architectures out there. Shifting gears now, can you describe your environment? What critical capabilities were you looking for when considering a platform for backup and cloud DR?

Sanjay: We had approximately 1200 VMs, along with several Oracle databases, running in our on-premises data center. We had a high concentration of small Oracle databases because of our on-premises application design and testing requirements.

The critical capabilities I was looking for bring together the best of the various architectures I mentioned above:

  1. Reduce RPO with incremental-forever backups of VMware VMs, physical servers, and Oracle databases.
  2. Application consistent backups of Oracle databases and any other databases. DBAs would never agree to a crash-consistent backup or replication.
  3. Flexibility to store backups in multiple clouds and no lock-in to a single cloud vendor.
  4. Reduce TCO by storing backups in native application format in S3 compatible object storage such as AWS S3 or GCP Nearline.
  5. For DR tests, reduce RTO with the ability to instant mount and recover straight from the object storage and, at the same time, reduce TCO with no 24×7 block storage costs.
  6. For a real DR, have the flexibility to instant mount and recover from object storage to spin up VMs and DBs in the cloud, and, in the background, have a Storage vMotion-like feature to get the data back to block storage with no app downtime.
  7. And lastly, one-click DR orchestration is essential. We want to push a button, go to Starbucks and have a coffee, and by the time we come back in 30 minutes, expect to recover 1200 VMs in the cloud. It should be that simple.
  8. Option to consume the platform as a product or as a SaaS offering. Many enterprises are planning to consume any cloud-related offering in a SaaS pay-as-you-consume model. 
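The "push a button, go to Starbucks" expectation boils down to recovering many VMs concurrently rather than one at a time. A minimal sketch of what such fan-out orchestration might look like, where `recover_vm` is a hypothetical stand-in for a real instant-mount-and-boot API call:

```python
from concurrent.futures import ThreadPoolExecutor

def recover_vm(vm_name: str) -> str:
    # Hypothetical placeholder for a real "instant mount + boot" API call.
    return f"{vm_name}: recovered"

def run_dr_plan(vms: list[str], parallelism: int = 50) -> list[str]:
    """Fan out recoveries: with 50 workers, 1200 VMs take roughly the
    wall-clock time of 24 sequential recoveries, not 1200."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(recover_vm, vms))

results = run_dr_plan([f"vm-{i:04d}" for i in range(1200)])
print(len(results))  # 1200
```

A real orchestrator would also encode boot order (databases before app servers before web servers), but the parallel fan-out is what makes the 30-minute target plausible.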

These critical capabilities ensure that I have the desired business outcomes of low RPO, low RTO, and low TCO.

Would you mind sharing the backup and DR platform that you picked for your environment of 1200 VMs and 100+ Oracle databases?

Sanjay: We evaluated and picked Actifio, as it satisfied all the critical capabilities I listed above. We didn’t have to choose between low RTO and low TCO; we got both outcomes, since we could get instant mount and recovery straight from object storage.

I am also excited about a new feature in Actifio 10c, which accelerates IO performance after recovery from instant mount from object storage by using small SSD block storage as a cache for read and write IOPS.

Thank you for using Actifio. I am glad to hear that it satisfies your critical requirements. What is next? Are you also planning to leverage Actifio’s functionality to reuse backups to provision thin Database clones within minutes for test/dev purposes?

Sanjay: Absolutely, we intend to leverage these key features that Actifio provides in addition to data masking.

What does the future of your environment look like in the cloud? Are you planning any migrations to the cloud? Any plans to use containers in the cloud for test/dev?

Sanjay: I have chosen a hybrid model of deployment where on-prem datacenter and multiple-cloud deployments integrate seamlessly. This way, we can cherry-pick and leverage the best capabilities of each cloud vendor to complement the on-prem datacenter.

And my last question, in the next 12 months, what are the top 3 services that your enterprise will look to consume in the cloud?

Sanjay: DR, AI/ML, Test/Dev, Kubernetes, Postgres.

Thank you so much for sharing your deep expertise in this space. 
