A couple of months ago, I wrote a blog post highlighting 4 Key Benefits of BigQuery. One of the areas that I mentioned was pricing. In this post, I want to compare the pricing models available from Google BigQuery, AWS RedShift and AWS RedShift Spectrum. Along the way, I will point out the strengths and weaknesses of each approach and hopefully help the reader better understand which option best aligns with their requirements.
The BigQuery pricing model has not changed since the previous blog. In short, Google charges primarily for storage used and for on-demand queries.
Let’s start with storage. BigQuery charges $20/TB/month for storage used (the first 10TB is provided for free). Interestingly, this fee drops by half to $10/TB/month for “...a table that has not been edited for 90 consecutive days”, which leads to a significant cost reduction for archival data. However, be aware that the minute you edit the table, the price reverts to $20/TB/month and the clock resets to zero days. The other primary fee is for on-demand queries. These cost $5/TB of data processed, with the first 1TB each month included. Finally, Google charges an additional fee of $50/TB/month when streaming data into BigQuery. It is interesting that this is two and a half times the standard storage cost.
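To make the arithmetic concrete, here is a minimal sketch of a monthly estimate built from the rates quoted above. The function and constant names are my own, the free storage allowance is deliberately left out to keep it short, and the streaming fee is treated as a charge per TB streamed in the month. Treat the rates as a snapshot from this post rather than current pricing.

```python
# Rates in USD as quoted in this post; a snapshot, not current pricing.
ACTIVE_STORAGE_PER_TB_MONTH = 20.0
LONG_TERM_STORAGE_PER_TB_MONTH = 10.0  # tables not edited for 90+ days
QUERY_PER_TB = 5.0
FREE_QUERY_TB_PER_MONTH = 1.0
STREAMING_PER_TB = 50.0


def bigquery_monthly_cost(active_tb, long_term_tb, queried_tb, streamed_tb=0.0):
    """Rough monthly estimate; ignores the free storage allowance."""
    storage = (active_tb * ACTIVE_STORAGE_PER_TB_MONTH
               + long_term_tb * LONG_TERM_STORAGE_PER_TB_MONTH)
    # Only the data processed beyond the included 1TB is billed.
    billable_query_tb = max(queried_tb - FREE_QUERY_TB_PER_MONTH, 0.0)
    queries = billable_query_tb * QUERY_PER_TB
    streaming = streamed_tb * STREAMING_PER_TB
    return {
        "storage": storage,
        "queries": queries,
        "streaming": streaming,
        "total": storage + queries + streaming,
    }


# Hypothetical workload: 30TB active, 20TB archival, 41TB processed by queries.
estimate = bigquery_monthly_cost(active_tb=30, long_term_tb=20, queried_tb=41)
print(estimate)
```

With these made-up volumes, storage comes to $800 and queries to $200. Note that if the 20TB of archival tables were edited, they would move back to the active rate and add another $200 per month.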
The benefit of this model is that customers do not need to worry about implementing and managing compute instances. They can simply run their queries on demand and benefit from the results. They never have to worry about over- or under-provisioning their environment, which can result in cost savings because users often over-provision out of concern about running out of resources.
RedShift Traditional Pricing:
The RedShift pricing model diverges significantly from BigQuery. AWS centers its costing model on compute instance usage. This provides significant added flexibility but also creates challenges from a management standpoint, as it can be difficult to “right-size” the RedShift environment. In practice, people often over-provision, resulting in higher costs than needed. Here is a complete list of RedShift compute/storage options.
|Node type|vCPU|ECU|Memory|Storage|I/O|Price|
|---|---|---|---|---|---|---|
|dc2.large|2|7|15 GiB|0.16TB SSD|0.6 GB/sec|$0.25 per Hour|
|dc2.8xlarge|32|99|244 GiB|2.56TB SSD|7.5 GB/sec|$4.80 per Hour|
|ds2.xlarge|4|14|31 GiB|2TB HDD|0.4 GB/sec|$0.85 per Hour|
|ds2.8xlarge|36|116|244 GiB|16TB HDD|3.3 GB/sec|$6.80 per Hour|
In this case, it is noteworthy that RedShift users pay for compute and storage together, so there is no separate storage line item as there is with BigQuery. AWS makes it simple to spin up compute/storage instances based on performance requirements, but the same ability can lead to over-provisioning and higher costs. To manage this effectively, users must be diligent about the resources they use and optimize their environment to balance performance and cost.
This approach provides more flexibility than BigQuery, but it can also increase overhead because provisioned capacity must be constantly assessed against required capacity. Google’s per-query model is much simpler in this area.
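The right-sizing problem can be sketched with a few lines of arithmetic. The example below uses the hourly prices and per-node storage from the list above. The 730-hour month, the function names, and the idea of sizing purely on raw storage (ignoring compression, replication and performance needs) are simplifying assumptions of mine, not AWS guidance.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average hours in a month

# (hourly price in USD, storage per node in TB), as listed in this post
NODE_TYPES = {
    "dc2.large": (0.25, 0.16),
    "dc2.8xlarge": (4.80, 2.56),
    "ds2.xlarge": (0.85, 2.0),
    "ds2.8xlarge": (6.80, 16.0),
}


def cluster_monthly_cost(node_type, node_count, hours=HOURS_PER_MONTH):
    """Cost of running node_count nodes for the given number of hours."""
    hourly, _ = NODE_TYPES[node_type]
    return hourly * node_count * hours


def nodes_for_storage(node_type, data_tb):
    """Smallest node count whose raw storage holds data_tb."""
    _, tb_per_node = NODE_TYPES[node_type]
    return math.ceil(data_tb / tb_per_node)


def cost_per_tb_month(node_type):
    """Monthly cost of one node divided by its storage."""
    hourly, tb_per_node = NODE_TYPES[node_type]
    return hourly * HOURS_PER_MONTH / tb_per_node


# Hypothetical 5TB data set on ds2.xlarge nodes.
nodes = nodes_for_storage("ds2.xlarge", 5)
capacity_tb = nodes * NODE_TYPES["ds2.xlarge"][1]
print(nodes, capacity_tb, round(cluster_monthly_cost("ds2.xlarge", nodes), 2))
```

In this made-up case, 5TB needs three ds2.xlarge nodes, so you pay for 6TB of capacity around the clock whether or not queries are running. That gap between provisioned and used capacity is the overhead described above.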
On the surface, RedShift Spectrum looks like a completely different product from RedShift. AWS talks about combining the massive scale of object storage with the powerful analytics of RedShift. Yet the reality is that Spectrum augments RedShift and is simply an added option that can be applied to existing instances. RedShift Spectrum enables users not only to run queries against data stored on local disk (as described in the RedShift section), but also to leverage S3 for larger static volumes. Thus, you still must acquire local storage and compute with Spectrum, but you augment that with S3 storage. Spectrum pricing is as follows:
$5 per terabyte of data scanned with a minimum of 10GB
In practice, Spectrum is a great option to augment RedShift infrastructure, and it can be used to store larger and less frequently accessed data sets. Another consideration is that data written to S3 for Spectrum usage needs to be in a format known to RedShift, so not just any data can be dumped there. RedShift provides functionality to write data to S3, but it can take some extra work to make this happen. The cost differential between RedShift and RedShift Spectrum depends primarily on the amount of inactive data included. If your dataset fits into the storage provided by your RedShift servers, then the benefit of RedShift Spectrum will be limited. Alternatively, if you have larger datasets with inactive data, then Spectrum can result in significant cost savings.
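A rough way to see where Spectrum pays off is to compare the cost of holding inactive data on extra RedShift nodes with the cost of leaving it in S3 and paying per scan. The sketch below uses the $5/TB scan price and the ds2.xlarge figures from this post. The S3 storage rate is not covered here, so it is a parameter you must supply yourself, and the value in the usage example is purely illustrative. The per-query minimum charge is ignored.

```python
import math

HOURS_PER_MONTH = 730            # assumption: average hours in a month
SPECTRUM_PER_TB_SCANNED = 5.0    # USD, as quoted in this post
DS2_XLARGE_HOURLY = 0.85         # USD per hour, from the node list
DS2_XLARGE_TB = 2.0              # TB of storage per node


def spectrum_monthly_cost(cold_tb, scanned_tb_per_month, s3_per_tb_month):
    """S3 storage plus scan charges; s3_per_tb_month is caller-supplied."""
    storage = cold_tb * s3_per_tb_month
    scans = scanned_tb_per_month * SPECTRUM_PER_TB_SCANNED
    return storage + scans


def extra_nodes_monthly_cost(cold_tb):
    """Cost of enough additional ds2.xlarge nodes to hold cold_tb locally."""
    nodes = math.ceil(cold_tb / DS2_XLARGE_TB)
    return nodes * DS2_XLARGE_HOURLY * HOURS_PER_MONTH


# Hypothetical: 10TB of inactive data, 20TB scanned per month,
# and an illustrative S3 rate of $23/TB/month (an assumption).
on_spectrum = spectrum_monthly_cost(10, 20, s3_per_tb_month=23.0)
on_nodes = extra_nodes_monthly_cost(10)
print(round(on_spectrum, 2), round(on_nodes, 2))
```

Under these assumptions Spectrum is far cheaper, but the gap narrows as the monthly scan volume grows, which is why it suits large, infrequently queried data sets best.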
In summary, Google and AWS have taken different approaches to pricing their analytics services. Google focuses on an analytics-as-a-service model, which simplifies management but provides less flexibility. AWS, on the other hand, focuses on per-instance pricing, which can provide more flexibility but can also increase management overhead, and operational cost when over-provisioned. In the end, there is no single right answer about which is better. The two approaches bring their own benefits and challenges, so the right solution will depend on the individual data set and use case.