The Snowflake of Data Management

Snowflake is one of the most popular data warehouse solutions on the market today. Its popularity stems from low costs, strong performance and scalability, ease of use through ANSI SQL queries, support for structured and semi-structured data, a fully managed service available on all three major cloud platforms (AWS, Azure, and GCP), and a pay-per-use pricing model.

Many of these benefits can be traced back to Snowflake’s core architecture, which lets it scale out storage and compute independently of each other. You could hold petabytes of structured or semi-structured data with ZERO compute one day, and then spin up as many small, medium, and large compute instances as needed on demand, based on query and analytics load.

Snowflake Architecture

One of the basic characteristics of a data warehouse is that it can grow from 10s of TB to 100s of TB or even PB very quickly, and then shrink back. This requires a scalable storage layer. Snowflake chose cloud-based object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage. The benefits of choosing S3-like object storage are:

  • It can grow on demand at very low cost
  • High throughput and, theoretically, infinite capacity
  • Delivers 11 9s of durability
  • Parallel access by multiple compute units
  • Partial reads of objects when scanning data during queries

A detailed analysis of why Snowflake picked S3 over HDFS can be found here.

But if you are a DBA or a data scientist running analytics, accustomed to running complex queries, you might be thinking: “Wouldn’t the latency introduced by object storage for small IOs reduce the overall performance?”

Snowflake solves this problem by building the following into its architecture:

  • Scale-out compute executes parts of a query in parallel
  • Each compute instance, known as a virtual warehouse (VW), caches data in its flash storage and memory as it reads from object storage. This ensures fast local data access with high IOPS for query execution.
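
To make the caching idea concrete, here is a minimal sketch of a read-through cache sitting in front of a slow object store. It is an illustration only, not Snowflake's implementation: the class names ObjectStore and CachedWarehouse, the LRU eviction policy, and the block keys are all assumptions made for the example.

```python
from collections import OrderedDict


class ObjectStore:
    """Hypothetical stand-in for a remote object store; counts remote reads."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def get(self, key):
        self.reads += 1  # each call models one slow remote round trip
        return self.blocks[key]


class CachedWarehouse:
    """Hypothetical compute node with a small LRU cache in front of the store."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        value = self.store.get(key)  # cache miss: fetch from object storage
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block
        return value


store = ObjectStore({f"block-{i}": f"data-{i}" for i in range(4)})
vw = CachedWarehouse(store, capacity=2)
for key in ["block-0", "block-1", "block-0", "block-0", "block-2", "block-1"]:
    vw.read(key)
print(store.reads)  # remote reads actually issued for the six requests
```

Repeated reads of a hot block are served locally, so only cache misses pay the object storage latency.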

Thus, this architecture ensures that:

  • The steady-state costs of storage are low because object storage such as AWS S3 is used
  • Very high performance is delivered by caching reads from and writes to AWS S3 in an SSD flash cache and memory inside a virtual warehouse (VW)
  • Many such VWs can run in parallel, further increasing overall performance and reducing response times for complex queries

Let’s see how such an architecture applies to enterprise data management for backup, disaster recovery, and database cloning.

Legacy Data Management

Legacy data management vendors store backups of on-premises and even cloud workloads in a proprietary format, typically deduplicate the data, and store it in cloud storage such as AWS S3. While this reduces storage consumption, it leads to the following challenges:

  • Very long restore times, because data has to be rehydrated from AWS S3 to AWS EBS.
  • If Test/Dev/Analytics teams want 10 copies of a 5 TB SQL Server or any other database from the backup image, they face long wait cycles because 5 TB has to be restored 10 times. It also consumes a lot of expensive EBS storage: 50 TB in this example.

Hyperconverged Infrastructure Backup Appliances

Some vendors borrowed the “shared nothing” architecture and built inexpensive bricks, or blocks, of compute and storage appliances, designed from the ground up to be deployed in data centers. As more capacity is needed, more such blocks of compute AND storage are added. However, this approach has the following challenges:

  • Compute and storage cannot scale independently of each other. If you run out of storage, you can’t add just storage, and vice versa. This increases costs.
  • Dedup requires a lot of compute and memory. Such an architecture typically requires a minimum of 4 powerful compute nodes, which leads to very high 24×7 compute costs when deployed in the cloud.
  • Each of these nodes also needs SSD-class storage 24×7 to host dedup metadata. For example, a 4-node cluster of M4.4xlarge instances costs $16K per year to run compute, and $20K for 4x400GB SSD storage.
  • Recovery times are very long because the backups in S3 have to be restored to an AWS EBS volume.
  • Cloning databases from backups takes a long time because of the restore from S3 to EBS, and it incurs the associated EBS storage costs.

Low RTO, but High TCO Option

Some of these vendors offer the option of rehydrating, or restoring, data from their backups in S3 to an EBS volume on a schedule after each backup completes. This keeps the RTO low because the data is already restored in an EBS volume.

But the problem with this approach is that it leads to high costs. 

For example, assume a 100 TB backup environment in which you enable this low-RTO option. It will consume 100 TB of EBS storage 24×7. At $100 per TB per month, that costs $10,000 per month for EBS volumes alone, which gets very expensive.
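
The arithmetic is simple enough to check in a few lines. The sketch below uses only the two figures from the example above (100 TB and $100 per TB per month); the function name and the yearly projection are added for illustration.

```python
def monthly_block_storage_cost(capacity_tb, price_per_tb_month):
    """Cost of keeping capacity_tb of block storage provisioned for one month."""
    return capacity_tb * price_per_tb_month


backup_tb = 100   # size of the backup environment, from the example
ebs_price = 100   # USD per TB per month, from the example

monthly = monthly_block_storage_cost(backup_tb, ebs_price)
yearly = monthly * 12  # the volumes stay provisioned all year
print(f"monthly: ${monthly:,}  yearly: ${yearly:,}")
```

Because the volumes are provisioned around the clock whether or not a recovery ever happens, the cost scales with the full backup size rather than with actual recovery activity.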

Ideal Solution

What if there were a data management solution with an architecture very similar to Snowflake’s?

In other words, one that uses cloud object storage with ZERO compute to store backups. You can scale your backups from 20 TB today to 200 TB in a week to 2 PB in 3 weeks with zero effort. The backup software simply scales out by using cloud object storage such as AWS S3.

Recoveries: At the time of recovery or DR, you can spin up on-demand compute instances running the data management software. Many such instances can be provisioned on demand, ONLY at the time of DR, keeping the steady-state cost of compute at ZERO. This is again very similar to what Snowflake does when it needs to scale out.

Instant Mount & Recovery: Now imagine that the data management solution can present the backup image in AWS S3 as a virtual block device to an on-demand AWS EC2 machine. The data is not restored at all; it is just presented, in minutes, as a virtual block device to the EC2 instance. A 50 TB file system or database can therefore be mounted in just minutes, which lowers the RTO. And this does not consume any block storage, even after the recovery!

Performance: But what about the I/O performance of the virtual block device mounted from AWS S3? Just like Snowflake, the data management software can use an SSD flash cache to intelligently cache reads from object storage, as well as any writes made after the recovery. You therefore incur SSD flash costs ONLY at the time of recovery, which could last just a few hours or a few days. And note that you pay for only a small amount of SSD storage, because it is used as a cache and not as target storage hosting 100s of TB of data.

Test/Dev and Analytics: What if you could mount the backup image from S3 to, for example, 10 on-demand AWS EC2 instances as “virtual database clones” for test/dev or analytics? Each virtual database clone works independently, without affecting the others.
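
One common way to get independent clones without copying the data is a copy-on-write overlay: every clone reads from the same read-only image and keeps only its own changed blocks. The sketch below illustrates that general technique under simplifying assumptions; the names BackupImage and VirtualClone and the three-block image are hypothetical, and this is not a description of any vendor's internals.

```python
class BackupImage:
    """Hypothetical read-only backup image in object storage (modelled as a dict)."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_no):
        return self._blocks[block_no]


class VirtualClone:
    """Hypothetical writable view of a shared image; stores only changed blocks."""

    def __init__(self, image):
        self.image = image
        self.overlay = {}  # block_no -> data written after the mount

    def read(self, block_no):
        if block_no in self.overlay:
            return self.overlay[block_no]  # this clone's own change wins
        return self.image.read(block_no)   # otherwise fall through to the image

    def write(self, block_no, data):
        self.overlay[block_no] = data  # the shared image is never modified


image = BackupImage({0: "header", 1: "orders", 2: "customers"})
clones = [VirtualClone(image) for _ in range(10)]

clones[0].write(1, "orders-masked")
clones[1].write(2, "customers-test")

print(clones[0].read(1))                     # sees its own write
print(clones[1].read(1))                     # still sees the original block
print(sum(len(c.overlay) for c in clones))   # blocks stored across all clones
```

Ten clones of the same image store only the blocks that were actually changed, instead of ten full copies.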

Actifio Copy Data Management

Actifio delivers all of the above data management capabilities for backup, disaster recovery, and database cloning.

Using the Actifio solution, you can set up SLAs that define what to back up, how often to back it up, and which cloud storage to use. Actifio takes care of the rest.

It does not need any 24×7 compute in the cloud to manage backup copies. All it consumes is cloud object storage such as AWS S3, Azure Blob, or GCS, thus keeping the 24×7 costs very small, like Snowflake.

At the time of recovery, DR, or test/dev/analytics, you can spin up multiple instances of Actifio Sky software in the cloud. Actifio can load balance the instant mount and recoveries of various workloads across these Sky appliances. This leads to low RTO and eliminates operational burden.
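
As an illustration of what balancing mounts across several appliances could look like, here is a minimal greedy sketch that always places the next workload on the least-loaded instance. This is not Actifio's actual algorithm; the function assign_mounts, the appliance names, and the workload sizes are all hypothetical.

```python
import heapq


def assign_mounts(workloads, appliance_names):
    """Greedily place each workload on the appliance with the least assigned TB."""
    heap = [(0, name) for name in appliance_names]  # (assigned TB, appliance)
    heapq.heapify(heap)
    placement = {name: [] for name in appliance_names}
    # Place the largest workloads first so the greedy choice balances better.
    for workload, size_tb in sorted(workloads.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)
        placement[name].append(workload)
        heapq.heappush(heap, (load + size_tb, name))
    return placement


# Hypothetical workloads (name -> size in TB) and two hypothetical appliances.
workloads = {"sap-hana": 8, "sqlserver-a": 5, "sqlserver-b": 4, "fileshare": 3}
print(assign_mounts(workloads, ["sky-1", "sky-2"]))
```

A real scheduler would also weigh factors such as network locality and cache capacity, which this sketch ignores.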

Each Actifio Sky appliance, like Snowflake, uses a small SSD flash cache to accelerate the IO performance post-recovery.

ESG (Enterprise Strategy Group) conducted recovery and performance benchmarks of TB-sized SAP HANA and SQL Server databases. They demonstrated an RTO of minutes, compared to hours with a legacy backup approach, and 80%+ cloud infrastructure savings compared to legacy approaches. The reports also conclude that you can get 80% of SSD flash performance at just 20% of the cost with this architecture.

Here are the benchmark reports for SAP HANA and SQL Server databases.

Conclusion

The ability to scale out storage and compute independently in the cloud delivers great benefits not just for data warehouse use cases, but also for enterprise data management use cases such as backup, disaster recovery, and database cloning.

The table below summarizes the similarities between Snowflake and Actifio, and the associated benefits.

| Capabilities | Snowflake | Actifio | Benefits |
|---|---|---|---|
| Scale-out object storage usage using AWS S3, Azure Blob, GCS | Yes | Yes | Delivers low costs, scale, and durability |
| Scale-out compute independently of storage, as needed | Yes | Yes | Delivers low cost & great performance |
| Use SSD flash and memory for cache inside each compute unit | Yes | Yes | Delivers very high performance within each compute unit |
| Simple and easy to use | Yes | Yes | Use ANSI SQL with Snowflake. 1-click Actifio orchestration for DR and DB clones |
| Available as SaaS | Yes | Yes | Pay as you go and grow |
| Multi-cloud support: AWS, Azure, GCP | Yes | Yes | Multi-cloud offers flexibility and eliminates vendor lock-in |

If you have questions, we’re happy to help. You can request a conversation with a database management consultant anytime.
