
Is Your “Instant Recovery” Scalable?

Imagine this: you have a production environment with 100+ VMs, a problem hits a subset of that environment, and you now need to recover 30 VMs. Chances are high that your “instant recovery” (using a non-Actifio solution) will not scale, and the end-to-end process will be very stressful. Here is how …

You recover 5 to 7 VMs quickly using the “instant recovery” feature of your backup product and let your users know that they can log in to those recovered VMs while you recover the rest. But while you are busy recovering the other VMs, those users start complaining about poor performance on the recovered VMs.

Before you can deal with those problems, you notice that some active backup jobs are running very slowly. And as those backup jobs finish, you notice alerts indicating that production VMs are having “VM STUN” issues.

You see what happened: recovering 5 VMs “instantly” triggered a chain reaction in your environment. You find yourself in a very unpleasant situation where the VM users and the VMware admin are all yelling at you.

If you do these “instant recoveries” in a multi-tenant service provider environment, the negative impact can be even greater. Recovering VMs for one customer could affect all the other customers protected by that same appliance.

You decide to skip lunch, take a few deep breaths, meditate for 3 minutes, and focus on one problem at a time. You start combing through the documentation of your backup product and realize that instant recovery runs directly off the deduplicated backup images. Read I/O in a powered-on VM running off deduplicated backup images is very slow, because even sequential reads turn into heavily random I/O inside the deduplicated storage system. That explains why the users of the 5 VMs that were recovered “instantly” were complaining about performance.
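To make the random I/O point concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: the function names, the store size, and the idea that every logical block resolves to an arbitrary position in a shared chunk store are illustrative, not a description of any particular backup product.

```python
import random


def count_seeks(physical_locations):
    """Count reads whose physical location does not directly follow the previous one."""
    seeks = 0
    for prev, cur in zip(physical_locations, physical_locations[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks


def native_layout(num_blocks):
    # Native format: logical block i is stored at physical offset i.
    return list(range(num_blocks))


def deduplicated_layout(num_blocks, store_size, seed=0):
    # Toy assumption: each logical block points at a chunk located
    # anywhere in a large chunk store shared by many backups.
    rng = random.Random(seed)
    return [rng.randrange(store_size) for _ in range(num_blocks)]


blocks = 1000  # one sequential read of 1000 logical blocks
native_seeks = count_seeks(native_layout(blocks))
dedup_seeks = count_seeks(deduplicated_layout(blocks, store_size=1_000_000))

print(f"native layout seeks:       {native_seeks}")
print(f"deduplicated layout seeks: {dedup_seeks}")
```

In this model the same sequential read needs no seeks on the native layout, while nearly every block requires a seek on the deduplicated layout. That is the pattern a running VM would experience as slow disk performance.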

But wait a second: why were the backups of other VMs slowing down? After a few seconds you realize that simultaneous writes and reads on a deduplication appliance are a big performance problem. So when you do instant recoveries while backups are still running, both suffer: the backups as well as the VMs that were recovered.

Moreover, you now find out that you need to Storage vMotion these VMs from the backup appliance to production storage. This means coordinating with the VMware and storage admins. You are now the project manager who has to find out where they are, pull them out of conference rooms, and drag them into your recovery issues. You discuss all the problems with them, and to minimize the performance impact you decide to recover and Storage vMotion in batches: do an “instant recovery” of 5 VMs, Storage vMotion them, and once that is done, rinse and repeat for the rest of the 30 VMs. Finally, by 6 pm, after baby-sitting the process for the whole day, you have managed to recover all 30 VMs.
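The batching arithmetic can be sketched in a few lines of Python. The VM names and the per-batch durations below are hypothetical placeholders, not measurements; the point is only that total time grows with the number of batches, because each batch has to finish before the next one starts.

```python
def plan_batches(vms, batch_size):
    """Split the VM list into consecutive batches of at most batch_size."""
    return [vms[i:i + batch_size] for i in range(0, len(vms), batch_size)]


def serialized_minutes(batches, recover_minutes, vmotion_minutes):
    """Total elapsed time when each batch must finish before the next starts."""
    return len(batches) * (recover_minutes + vmotion_minutes)


vms = [f"vm-{n:02d}" for n in range(1, 31)]  # hypothetical VM names
batches = plan_batches(vms, batch_size=5)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: instant-recover, then Storage vMotion: {', '.join(batch)}")

# Placeholder durations in minutes, chosen only for illustration.
total = serialized_minutes(batches, recover_minutes=10, vmotion_minutes=60)
print(f"{len(batches)} batches, about {total} minutes end to end")
```

With 30 VMs and a batch size of 5 this yields 6 serialized rounds, which is why the recovery in the story takes the whole day.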

How useful was this “instant recovery”, which wasn’t scalable at all? It was unusable and stressful. Such recoveries are useful for data integrity checks, but are not of much use for real-world disaster recoveries.

Unfortunately, most backup products claim instant recovery on top of such an architecture, which makes it unscalable and unusable. Some have not even graduated to offering instant recovery through a UI and are still stuck with a CLI.

Fortunately, there is a simple, elegant, and scalable solution from Actifio. The engineering team at Actifio had the following design principles in mind while designing and implementing its “scalable instant recovery”:

  1. Deliver instant recovery not just for VMware, but for everything else as well, including physical servers and databases running in VMs or on physical servers. For very large databases of 10, 30, 50, or 100 TB, see this blog and white paper for more details.
  2. Store the data in such a way that, after an instant recovery, users get I/O performance at 95% of the performance of the underlying storage. This is what makes the instant recovery scalable.

Here is how Actifio works and why it’s the only solution that can help solve such problems:

  1. Actifio protects data in an incremental-forever manner. This is true for VMware VMs, Hyper-V VMs, volumes and file systems on physical servers, and databases on any system.
  2. Actifio stores this data in its native, pristine application block format on any storage that you assign to the Actifio software. This is called Actifio’s snapshot pool.

Obviously you will have data retention needs spanning many days, weeks, months, or even years. When you set up a longer retention SLA in Actifio, it copies the changed blocks to a deduplication pool. Thus you get the best of both worlds: a dedup pool for long-term retention and a snapshot pool for instant recoveries.

  1. After each incremental backup, Actifio synthesizes a virtual full backup image by manipulating its metadata. This gives you the illusion that Actifio always does a full backup.
  2. When you do an instant recovery, Actifio powers on a VM, or mounts a volume or file system on a physical server, instantly. Once the VM is powered on, or I/O starts flowing from the VM or physical server, performance scales very well because the underlying data in the Actifio snapshot pool is NOT deduplicated. Instead, it is stored in its native application format. Thus “instant recoveries” are truly scalable.
  3. Another important differentiator is that you can spin up 1 to 40 virtual copies of the same VM, volume, or database on many different machines and work on them simultaneously.
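The idea of an incremental-forever backup that still presents a full image each time can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Actifio’s implementation: the `SnapshotPool` class, its methods, and the block-map layout are hypothetical names introduced only for illustration.

```python
class SnapshotPool:
    """Toy incremental-forever store that keeps blocks in native (non-deduplicated) form."""

    def __init__(self):
        self.blocks = []  # stored block payloads, append-only
        self.images = []  # one block map per backup: {block_number: index into self.blocks}

    def backup(self, changed_blocks):
        """Ingest only the changed blocks, then synthesize a virtual full image."""
        # Start from the previous image's map: this copies pointers, not data.
        block_map = dict(self.images[-1]) if self.images else {}
        for block_number, payload in changed_blocks.items():
            self.blocks.append(payload)
            block_map[block_number] = len(self.blocks) - 1
        self.images.append(block_map)
        return len(self.images) - 1  # image id

    def read_full(self, image_id):
        """Present any backup as a complete image by following its block map."""
        block_map = self.images[image_id]
        return {n: self.blocks[i] for n, i in sorted(block_map.items())}


pool = SnapshotPool()
day0 = pool.backup({0: b"boot", 1: b"data-v1", 2: b"logs-v1"})  # first full ingest
day1 = pool.backup({1: b"data-v2"})                             # only block 1 changed
day2 = pool.backup({2: b"logs-v2"})                             # only block 2 changed

print(pool.read_full(day2))
print(f"blocks stored: {len(pool.blocks)}")
```

Three backups of a three-block volume store only five blocks in total, yet every backup can be read back as a full image. Only metadata (the block map) is rewritten after each incremental, which is the sense in which the full image is virtual.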

Thus Actifio’s instant recovery is writable and scalable, so it can be used not just for recovery but also to provision test data for test/dev and DevOps teams.

So when you use Actifio’s “scalable instant recovery” you don’t have to baby-sit the process and recover in batches of 5 VMs. You just recover all 100+ VMs and then, at your convenience, Storage vMotion them to production storage. This behavior is not limited to virtual machines. If the recovered system happens to be an Oracle Database using ASM, you can run an ASM rebalance and the blocks will automatically move from the Actifio snapshot pool to production storage.

Lastly, there is an important aspect that we did not discuss. When you are recovering 100, 300, 500, or 1000 VMs, you need capabilities beyond scalable instant recovery. You need 1-click orchestration of the complete disaster recovery. That will be the topic of my next blog.

The “scalable instant recovery” technology benefits everyone: an enterprise with 1000s of VMs, a midmarket organization with 50 to 100 VMs, or a small or large managed service provider looking for multi-tenant self-service recoveries. If you have any questions, please contact us at info@Actifio.com.
