How to Recover 88 VMs From a Starbucks Cafe

Your data center is on the East Coast. It’s October. You know hurricanes are churning somewhere in the Atlantic. It’s not a question of “if” one will strike but of “what category” it will be.

So you make it a high priority to talk to your backup and disaster recovery (DR) teams. The backup team tells you what they are doing: “We do daily backups. We have 30 to 60 day retention. We use deduplication to reduce storage and bandwidth. We have a 98% backup success rate. And we have the capability to do instant recoveries.”

You talk to your DR planning team and find out that they keep a DR run-book in a spreadsheet. The spreadsheet captures information such as the order in which VMs need to be recovered, the network configuration, the firewall rules, and DNS redirection needs.

They tell you that their most recent DR test was done last year around the same time, and that they were able to recover all VMs. But it required significant coordination between the DR, backup, storage, VMware, and network teams. Internal users were not able to start DR testing until 15 hours had gone by, and those 15 hours took a huge cross-functional project management effort. Bottom line: the 4-hour RTO (Recovery Time Objective) goal was not met.

After doing a root cause analysis, I won’t be surprised if you come to the following conclusions, because I have seen this pattern again and again with many customers:

  1. Instant recovery is not really scalable. It gives the illusion that you can power on VMs quickly, but in reality you have to recover them in batches and then run Storage vMotion. The perils of not having a “scalable” instant recovery are detailed in this blog.
  2. A lot of time is spent on manual, sequential steps. People have to look at the spreadsheet and identify which VMs need to be recovered in what order. For example, the database VM needs to be recovered first, then a set of application server VMs, and then a set of web server VMs.
  3. More manual steps, such as assigning the IP address, VLAN port group, and DNS server for each VM, not only take time but are also error prone. Typically a lot of troubleshooting is needed before someone figures out that some VMs were placed in an incorrect network.
  4. It takes a significant number of man-hours from many different teams to complete the DR test. It is stressful and error prone, an experience that everybody wants to forget and hopes never to repeat.

You conclude that you need a solution that can deliver the following outcomes:

  1. Deliver stress-free, automated, reliable disaster recovery with a push-button or 1-Click Disaster Recovery solution.
  2. Maybe even go a step further and have scheduled, unattended, fully automated recovery testing that happens once every month or quarter.
  3. Show a compliance report that proves DR testing is repeated every month or quarter. This improves confidence that the business can indeed be restored within a guaranteed time frame, such as 4 hours.
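
To make the third outcome concrete, here is a minimal Python sketch of what a compliance check against an RTO could look like. It is an illustration only, not a report format from any product: the names `DrTestRun` and `compliance_report`, the report layout, and the sample dates are all assumptions. The only figures taken from this post are the 4-hour RTO and the 15-hour test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrTestRun:
    """One DR test (hypothetical record layout)."""
    started: datetime           # when recovery was kicked off
    users_could_test: datetime  # when internal users could start testing


def compliance_report(runs: List[DrTestRun], rto: timedelta) -> List[str]:
    """Return one line per DR test: start date, elapsed time, PASS or FAIL."""
    lines = []
    for run in runs:
        elapsed = run.users_could_test - run.started
        status = "PASS" if elapsed <= rto else "FAIL"
        lines.append(f"{run.started:%Y-%m-%d}  {elapsed}  {status}")
    return lines


# Sample data (dates are made up): a 15-hour test measured against a 4-hour RTO.
last_year = DrTestRun(
    started=datetime(2015, 10, 1, 8, 0),
    users_could_test=datetime(2015, 10, 1, 23, 0),
)
for line in compliance_report([last_year], rto=timedelta(hours=4)):
    print(line)
```

A scheduled monthly or quarterly test would simply append another `DrTestRun` to the list, so the report grows into the evidence trail described above.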

But to deliver these outcomes, you need to pick a solution that has the following capabilities:

  • Ability to create DR plans in a web interface instead of a spreadsheet.
  • For each DR plan, specify multiple logical application groups. A logical application group is just a logical collection of VMs. Each application group can be recovered independently of the others, and preferably simultaneously, to keep the RTO low.
    1. Within each logical application group, specify the order in which VMs should be recovered. For each VM, specify the vCPU, vMemory, and network information such as IP address, DNS server, and VLAN port group.
    2. Flexibility to specify pre- and post-scripts that run before and after each VM or application group is recovered. This gives you a chance to set external firewall rules, stop or start services inside the VMs, or apply any other customization. You could also use these scripts to run data integrity checks in automated, scheduled DR tests.
  • Once you have defined and saved these DR plans, at the time of a DR or DR test all your operator needs to do is log in to the web interface, select one or more application groups to recover, specify whether this is a true DR or a DR test, and push a button. From there the entire DR should be automated: it should orchestrate the recovery of VMs, the assignment of resources, the assignment of networks, and the invocation of pre- and post-scripts.
  • Reports that document the VMs that were recovered, the user who invoked the recoveries, the time it took, and the reasons for failures (if any).
  • Role-based access control, so admins can create DR plans and application owners can run DR tests in a self-service manner.
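
The capabilities above can be pictured as a small data model plus an orchestration loop. The Python sketch below is a hypothetical illustration, not Actifio Resiliency Director's actual API or data format: every class, function, and field name (`VM`, `AppGroup`, `recover_vm`, `run_plan`) is an assumption, and `recover_vm` is a placeholder that only returns a log line instead of talking to a hypervisor.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class VM:
    """Per-VM settings that would otherwise live in the run-book spreadsheet."""
    name: str
    order: int        # position in the group's recovery sequence
    vcpu: int
    memory_gb: int
    ip_address: str
    dns_server: str
    port_group: str   # VLAN port group


@dataclass
class AppGroup:
    """A logical collection of VMs with optional pre- and post-scripts."""
    name: str
    vms: List[VM]
    pre_script: Optional[Callable[[], None]] = None
    post_script: Optional[Callable[[], None]] = None


def recover_vm(vm: VM, is_test: bool) -> str:
    """Placeholder: a real implementation would call the recovery platform."""
    mode = "TEST" if is_test else "DR"
    return f"{mode} {vm.name} -> {vm.ip_address} on {vm.port_group}"


def recover_group(group: AppGroup, is_test: bool) -> List[str]:
    """Run pre-script, recover VMs in their specified order, run post-script."""
    log = []
    if group.pre_script:
        group.pre_script()
    for vm in sorted(group.vms, key=lambda v: v.order):
        log.append(recover_vm(vm, is_test))
    if group.post_script:
        group.post_script()
    return log


def run_plan(groups: List[AppGroup], is_test: bool = True) -> Dict[str, List[str]]:
    """Recover the selected groups simultaneously; return a per-group log."""
    with ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
        logs = pool.map(lambda g: recover_group(g, is_test), groups)
        return {g.name: log for g, log in zip(groups, logs)}


# Example plan (all values are made up): database first, then app, then web.
crm = AppGroup("crm", [
    VM("web01", 3, 2, 4, "10.0.1.30", "10.0.1.2", "VLAN-101"),
    VM("db01", 1, 8, 32, "10.0.1.10", "10.0.1.2", "VLAN-101"),
    VM("app01", 2, 4, 16, "10.0.1.20", "10.0.1.2", "VLAN-101"),
])
for group_name, lines in run_plan([crm], is_test=True).items():
    print(group_name, lines)
```

In this sketch, ordering is enforced inside a group while separate groups run in parallel threads, which mirrors the requirement that groups recover independently to keep the RTO low. The returned log is the raw material for the reports mentioned in the list.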

Actifio delivers all of this functionality in a very scalable manner for 100, 300, 500, 1000+ VM environments. Actifio delivers 1-Click orchestrated DR that satisfies the above requirements using Actifio Resiliency Director. For more details, please look at this data sheet for Actifio Resiliency Director.

Actifio is the only platform that can

  1. Deliver backups with flexible retention for days, weeks, months, years, decades
  2. Deliver scalable instant recovery
  3. Deliver 1-Click DR orchestration with all the requirements mentioned above
  4. Deliver a flexible RPO of 1 hour to 24 hours
  5. Deliver all of this functionality on any storage, thus giving you a completely storage independent solution

Gone are the days when 5 to 10 people need to be involved in doing DR or DR tests. You can truly get scheduled, automated, unattended DR testing done every month. Or you can initiate ad-hoc 1-Click DR testing from your iPad while having a coffee at your favorite Starbucks café.

If you want to see a demo or learn more about 1-Click DR using Actifio Resiliency Director please contact us at info@Actifio.com.