Managing Cloud Snapshots at Scale in a Multi-Cloud World

At AWS re:Invent 2019 in Las Vegas, I had a fascinating discussion with Shawn, Director of Cloud Architecture at a very large enterprise that specializes in online payment processing. His team ran a massive fleet of 10,000+ VMs across GCP and AWS. Like many others, Shawn’s team was under the impression that you don’t need data protection in the cloud. The cloud, after all, is highly resilient and offered as a managed service. After a few accidental data loss incidents, they quickly realized that you can’t just go back in time and recover cloud VMs.

Then they started kicking the tires on native cloud snapshots. Very quickly, Shawn’s team ran into the following challenges:

The 1st challenge was keeping an inventory of the total number of snapshots. Assuming 2 disks per VM and a snapshot retention of 30 days, the inventory was 10,000 VMs x 2 disks per VM x 30 snapshots = 600,000 snapshots.
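The arithmetic above is easy to replay for your own environment. Here is a minimal Python sketch; the function names are my own, and it assumes one snapshot per disk per day, which is what 30 snapshots over 30 days implies.

```python
def snapshots_retained(retention_days: int, snapshots_per_day: int = 1) -> int:
    """Snapshots kept per disk once the retention window is full."""
    return retention_days * snapshots_per_day


def snapshot_inventory(vm_count: int, disks_per_vm: int, retained: int) -> int:
    """Total snapshots to track across the whole fleet."""
    return vm_count * disks_per_vm * retained


# Shawn's numbers: 10,000 VMs, 2 disks each, 30 days of daily snapshots.
total = snapshot_inventory(10_000, 2, snapshots_retained(30))
print(f"{total:,}")  # 600,000
```

Taking snapshots more often than once a day multiplies the inventory by the same factor.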

When you try to assign a backup plan, you do not see EC2 instance names. You just see EBS, EFS, and DynamoDB resource types. And when you select the EBS resource type, you then need to specify a resource ID, which is not at all human-friendly, as shown in the screenshot below. How is Shawn’s team supposed to remember the resource IDs of the volumes belonging to one specific EC2 instance out of 10,000?

What about restores from the snapshots? See the screenshot below.

Imagine staring at this screen with 600,000 snapshots. There is no way to specify an EC2 instance named, for example, “Shawn”. Instead, you have to remember the resource ID of the EBS volumes belonging to the “Shawn” EC2 instance. Rinse and repeat the process for all the volumes from the same point-in-time (wonder how to find them??). And then rinse and repeat the same for all other EC2 instances that you want to recover.
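To make the lookup problem concrete, here is a rough Python sketch of the bookkeeping a team would have to script for itself: an index from instance name to the set of snapshots taken at each point in time. The record layout, the IDs, and the function name are hypothetical, not the output of any cloud API, and the sketch assumes snapshot timestamps have already been normalized to the scheduled backup time so that volumes from the same run line up.

```python
from collections import defaultdict
from datetime import datetime


def index_by_instance(volume_owner, snapshots):
    """Group snapshot IDs by instance name, then by the time they were taken."""
    index = defaultdict(lambda: defaultdict(list))
    for snap in snapshots:
        instance = volume_owner.get(snap["volume_id"])
        if instance is None:
            continue  # volume is not attached to any known instance
        index[instance][snap["taken_at"]].append(snap["snapshot_id"])
    return index


# Hypothetical inventory: which instance owns which volume.
volume_owner = {"vol-aaa": "Shawn", "vol-bbb": "Shawn", "vol-ccc": "other-vm"}

# Hypothetical snapshot records.
snapshots = [
    {"snapshot_id": "snap-1", "volume_id": "vol-aaa", "taken_at": datetime(2019, 12, 1)},
    {"snapshot_id": "snap-2", "volume_id": "vol-bbb", "taken_at": datetime(2019, 12, 1)},
    {"snapshot_id": "snap-3", "volume_id": "vol-ccc", "taken_at": datetime(2019, 12, 1)},
    {"snapshot_id": "snap-4", "volume_id": "vol-aaa", "taken_at": datetime(2019, 12, 2)},
]

index = index_by_instance(volume_owner, snapshots)
print(index["Shawn"][datetime(2019, 12, 1)])  # ['snap-1', 'snap-2']
```

Even in this toy form, the volume-to-instance mapping has to be kept current by hand, which is exactly the burden Shawn's team was describing.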

Shawn put it very bluntly, “While this might be ok for a small startup with 10 to 15 instances, it’s not really usable for an organization with 100+ cloud VMs.”

The 2nd challenge was that the mechanics of snapshot creation and recovery were different in AWS & GCP. This increased the complexity of the custom scripts they were developing to automate cloud snapshot management. 

The 3rd challenge was that they also had on-premises workloads that needed to be protected and replicated to multiple clouds for disaster recovery. Shawn didn’t want to burden their operations team with multiple point tools.

The 4th challenge was that about 5% of their VMs (roughly 500 VMs) ran MS SQL, MySQL, Oracle, and SAP HANA databases. Most of the databases were mission-critical and multiple TB in size, so they wanted an RTO measured in minutes. Their DBAs highlighted several problems:

  • The DBAs felt that snapshots are volume/storage-centric and not database-aware. Recovering a database spread across multiple volumes was therefore too much of a burden on them. Finding which volume(s) to recover for a DB out of 60,000 snapshots was like finding a needle in a haystack.
  • More importantly, they did not feel confident about the cloud snapshots because of the lack of database consistency and log management.
  • And lastly, database recoveries from snapshots were taking many hours per database.

They came to the conclusion that while cloud snapshots are fine for cloud VMs with no databases, they need a different approach for cloud VMs with mission-critical databases.

Actifio 10c Multi-Cloud Snapshot Management

The challenges that Shawn shared became a great set of requirements for our engineering team at Actifio to consider incorporating in our latest Actifio 10c release. Here are some key benefits and capabilities of Actifio 10c Multi-Cloud Snapshot Management.

Reduce Operational Burden

  • Agentless snapshot management means you don’t have to install and upgrade agents for backup and recovery.
  • Automated multi-cloud VM discovery means auto-discovery of cloud VMs in one or more regions, and even across multiple cloud vendors like AWS & GCP.
  • Flexible, easy SLA setup enables you to specify how often to take snapshots (like every 8 hrs or 24 hours or any period in between) and how long to retain the snapshots.
  • Multi-cloud snapshot management means you can set up central, common SLAs, apply SLAs to multi-cloud VMs, and recover cloud VMs from snapshots.
  • Easy on-premises and cloud VM data management with a single pane of glass. It allows you to protect on-premises workloads running in VMs and on physical servers, replicate them to multiple clouds, and deliver one-click DR orchestration. You can also use the same user interface to discover, protect, and manage cloud VM snapshots.

Lower RPO and Rapid Recoveries from Snapshots

  • Low RPO with cloud snapshots every few hours.
  • Easy Rewind and Recover, like a DVR, lets you specify any point in time and recover a cloud VM from the nearest cloud-native snapshot with full automation.
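The "nearest snapshot" idea in the last bullet can be sketched in a few lines of Python. This is an illustration only, not Actifio's implementation: the function name is mine, and reading "nearest" as the smallest absolute time difference, before or after the requested time, is an assumption.

```python
from datetime import datetime


def nearest_snapshot(snapshot_times, target):
    """Return the snapshot time closest to the requested point in time."""
    if not snapshot_times:
        raise ValueError("no snapshots available")
    # abs() on a timedelta gives the distance regardless of direction.
    return min(snapshot_times, key=lambda t: abs(t - target))


# Snapshots every 8 hours; the user asks to rewind to 11:00.
times = [datetime(2020, 1, 1, hour) for hour in (0, 8, 16)]
print(nearest_snapshot(times, datetime(2020, 1, 1, 11)))  # 2020-01-01 08:00:00
```

If you must never restore data newer than the requested time, filter the list to snapshots at or before the target first.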

The screenshot below shows the GCP VM that needs to be recovered. You have the option to recover all or some of the volumes. Note that the user interface is VM-centric, not disk- or snapshot-centric.

The screenshot below shows how you can mount one or more volumes to an existing GCP VM instance.

The screenshot below shows how you can recover as a new GCP instance with network customization.

Low RPO & RTO for Cloud VMs with Databases

For cloud VMs with databases, given the shortcomings described earlier in this post, we don’t recommend using cloud-native snapshots. Instead, Actifio recommends using Actifio snapshots, which are integrated with each database for database consistency. More importantly, they deliver the following benefits, which cloud snapshots cannot.

  • Database-consistent, incremental-forever backups using Actifio connectors, which have the unique capability of change block tracking for all critical databases such as SAP HANA, Oracle, MS SQL Server, Db2, Sybase ASE, MaxDB, MySQL, PostgreSQL, and MongoDB.
  • Automated log handling and backup, with point-in-time recovery by rolling logs forward
  • Recover & Clone multi-TB Databases in minutes with Actifio’s instant mount and recovery from Actifio backups using database connectors. Note that Actifio has the unique capability to deliver instant mount and recovery from backups stored in both block storage and object storage.
  • Granular recovery of databases, where you can select individual databases to recover from Actifio snapshots instead of being forced to recover the entire database instance from a cloud snapshot.
  • Enhanced Security with the ability to discover cloud VMs and manage cloud snapshots from the same or different cloud account using role-based access control.
  • Flexible Software Consumption Model gives you a choice to consume Actifio with perpetual product licenses or in a Software-as-a-service pay-as-you-consume model in multiple clouds such as AWS and GCP.
  • Reduce Total Cost of Ownership with a single pane of glass for data management of not just cloud VMs but also on-premises workloads, databases, and file servers, while letting you reuse backups to provision rapid database clones for Dev/QA/UAT testing.
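For readers less familiar with the point-in-time recovery mentioned in the list above, the general idea of rolling logs forward can be sketched as follows. This is a generic illustration in Python, not how Actifio connectors work internally. The function name is hypothetical, and the sketch assumes each log entry is a single record with its own timestamp.

```python
from datetime import datetime


def recovery_plan(backup_times, log_record_times, target):
    """Pick the latest backup at or before target, plus the log records to replay."""
    candidates = [b for b in backup_times if b <= target]
    if not candidates:
        raise ValueError("no backup exists at or before the requested time")
    base = max(candidates)
    # Replay only records written after the base backup, up to the target time.
    to_replay = sorted(t for t in log_record_times if base < t <= target)
    return base, to_replay


backups = [datetime(2020, 1, 1), datetime(2020, 1, 2)]
log_records = [
    datetime(2020, 1, 1, 6),
    datetime(2020, 1, 2, 3),
    datetime(2020, 1, 2, 9),
]

base, to_replay = recovery_plan(backups, log_records, datetime(2020, 1, 2, 5))
print(base)       # 2020-01-02 00:00:00
print(to_replay)  # [datetime.datetime(2020, 1, 2, 3, 0)]
```

A volume-level snapshot only ever gives you the first half of this plan, which is why the DBAs were uneasy about recovering databases from it.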

With such comprehensive multi-cloud data management capabilities, enterprises are able to protect, recover, and clone their on-premises and cloud VMs from a single pane of glass.

For a high-level overview, watch this 3-minute Cloud Snapshot Management video. There is also a comprehensive list of the benefits of Actifio 10c. If you’d like to connect with a cloud architect, you can request a conversation anytime.

Want to check under the hood? Access the How Actifio Works whitepaper.