How to Use Cloud Backup Data for Analytics in Google Cloud Platform

Most organizations hope that they never have to access or restore their backup data in the cloud. Hence they view cheap cloud storage as the graveyard…where backup data goes to die.

So how do you turn a backup graveyard into an analytics playground? What if you could “reuse” the backup data in the cloud for business analytics or Machine Learning using Google BigQuery? How labor-intensive, time-consuming, and costly would that be?

This blog compares a traditional backup solution with a modern approach. So let’s dig into how you can use cloud backup for analytics in Google Cloud:

Traditional Approach

Traditional backup products rely on deduplicating backups to reduce storage. In GCP, the backups could be stored in block storage such as Persistent Disk (PD), or in object storage such as Google Cloud Storage (GCS), Google Nearline (GNL), or Google Coldline (GCL).

The costs per TB per month of these storage tiers are $170 for PD, $26 for GCS, $10 for GNL, and $7 for GCL.

Clearly, the object storage medium is a much cheaper option to store backups.

To use Google BigQuery, you need to load data into BigQuery from GCS/GNL/GCL. Moreover, the data needs to be in a supported format such as comma-separated values (CSV) or newline-delimited JSON.
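To make the format requirement concrete, here is a minimal sketch of what newline-delimited JSON looks like: one JSON object per line. The `to_ndjson` helper and the sample rows are hypothetical and are not part of any backup product or Google API.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON (one object per line)."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical rows standing in for data recovered from a backup.
rows = [
    {"order_id": 1, "region": "us-east1", "amount": 19.99},
    {"order_id": 2, "region": "europe-west1", "amount": 5.0},
]

ndjson_text = to_ndjson(rows)
print(ndjson_text, end="")
```

Each line parses as a JSON object on its own, which lets a loader split the file on newlines and process records independently.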

Thus, as shown in Picture 1, backups have to be restored to PD, and then copied to GCS before Google BigQuery can access the data from GCS.

With traditional backup vendors, the process and its associated costs for 1 TB of data look like the following.

  1. Assume that the backup data needs to be accessed every day for analytics, so consider using GCS. Steady-state GCS costs = $26 per month. Note that data that doesn’t need to be accessed for analytics can be stored in GNL/GCL, reducing the backup storage costs from $26 per TB to $10/$7 per TB.
  2. The backup data has to be restored to Persistent Disk (PD). 
    1. Data retrieval costs from GCS = $0 per TB.
    2. 1 TB PD costs = $170 per month.
  3. If the restored data is already in CSV or JSON format, a simple command such as “gsutil cp” copies the data from PD to GCS. Otherwise, the data needs to be transformed into CSV or JSON format and then copied to GCS.
    1. 1 TB GCS costs = $26 per month.
  4. And finally, Google BigQuery would be able to leverage the data in GCS for analytics purposes.
  5. Total Costs (including storage for backup) = $222 per TB per month.
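The total in the last step is just the sum of the three storage charges listed above. A small sketch of that arithmetic, using only the per-TB prices quoted in this post (actual pricing may differ, and the function name is illustrative):

```python
# Monthly prices per TB (USD) as quoted in this post; real pricing may differ.
PRICE_PER_TB = {"PD": 170, "GCS": 26, "GNL": 10, "GCL": 7}


def traditional_monthly_cost(tb=1):
    """Sum the storage charges for the traditional restore-then-copy workflow."""
    backup_copy = PRICE_PER_TB["GCS"] * tb     # step 1: backup kept in GCS
    restore_disk = PRICE_PER_TB["PD"] * tb     # step 2: restore target on PD
    analytics_copy = PRICE_PER_TB["GCS"] * tb  # step 3: CSV/JSON copy in GCS
    return backup_copy + restore_disk + analytics_copy


total = traditional_monthly_cost(1)
print(f"Traditional approach: ${total} per TB per month")
# prints "Traditional approach: $222 per TB per month"
```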

Modern Approach

The biggest disadvantage of the above approach is the amount of time it takes to restore from the proprietary deduplicated backup format. And it gets worse with large data sets. 

A better approach would be to store backups in a native application format and present the data within minutes directly from GCS/GNL/GCL, as shown in Picture 2.

The instant mount presents a block device but does not consume any storage. You can see and access all the files in the mounted volume. When you access files, the data is transferred from object storage straight into memory without using any PD storage.

This approach eliminates expensive PD costs, as shown in the picture below. It also cuts access time by up to 80%, because a multi-TB volume is mounted within minutes and the data is never “restored.”

Total costs with this approach = $52 per TB per month, which is roughly 4x lower than the traditional $222, or a 77% savings. Such substantial savings are possible because PD is no longer needed.
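The savings figures follow directly from the two monthly totals quoted in this post. A quick sketch of the calculation (the variable names are illustrative):

```python
traditional = 222  # USD per TB per month, traditional approach
modern = 52        # USD per TB per month, modern approach

ratio = traditional / modern                    # how many times cheaper
savings_pct = (1 - modern / traditional) * 100  # percentage saved

print(f"{ratio:.1f}x lower, {savings_pct:.0f}% savings")
# prints "4.3x lower, 77% savings"
```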

Comparison

Here is a side-by-side comparison of the traditional and modern approaches:

| | Traditional Backup | Modern Backup | Business Impact |
|---|---|---|---|
| Time to access data | Large. Typically hours or days. | Near-instant. Typically minutes. | Reduce time to market by up to 25% with faster analytics |
| Cloud infrastructure costs | Large. $222 per TB per month | 4x lower. $52 per TB per month | Reduce total cost of ownership by 77% |

Summary

Backups don’t have to stay dormant in the cloud. Near-instant access to native data opens up opportunities to use backup data for on-demand analytics and for Dev / QA / UAT / Security testing in the cloud, thus accelerating time to market for new business insights and features.

The critical capabilities discussed here were 1) storing backups in native application format, and 2) near-instant access to backups straight from cloud object storage. Check out our e-book for many more critical capabilities desired for modern backup and DR in the cloud.