Elastic resources are what make the cloud so compelling for data infrastructure. Scaling up is easy, but how far can you scale down? Setting storage costs aside, it's easy to assume that you can scale everything down and pay next to nothing to AWS. Unfortunately, cloud costs can be confusing, and the true "idle" cost of a data platform is a bit more complex.
Recently we received a question from a smaller-scale academic Databricks user. We have seen massive multi-workspace deployments of Databricks with six-figure monthly costs, but there is also a long tail of smaller organizations using Databricks to run their data science and engineering workloads. In the university context, there are very predictable busy periods: students onboarding onto the platform or rushing to complete final projects. There are also very predictable idle periods, such as between semesters. Our academic user asks:
Over the summer when nobody is using the Databricks environment, I would expect about $0 in cost, but every month these accounts each cost around $33, what's going on?
The AWS bill will give you a breakdown of which services are costing what, as will Cost Explorer. This academic user was seeing about $33 worth of spend under "Elastic Compute Cloud", despite zero workloads running in the Databricks environment! Something must be wrong!
Digging further into the bill, we see that the specific line item costing money is "Amazon Elastic Compute Cloud NatGateway". For those newer to AWS cost analysis, NAT Gateways are often necessary evils in the AWS bill. In short, they provide network address translation between resources in a Virtual Private Cloud (VPC) and the public internet. VPCs and NAT Gateways are frequently used to ensure that resources which shouldn't be exposed on the public internet, such as a valuable data platform, are simply not routable from it. As stated previously: necessary evil.
When Databricks provisions a workspace in AWS, it creates a VPC, a NAT Gateway, and a whole bunch of other resources. Many of those resources cost practically nothing, except the NAT Gateway, which costs $0.045/hour just to exist. Unfortunately this means that even if your Databricks environment has zero clusters turned on and zero workloads running, its mere presence will result in $0.045 charged every hour, for each of the 744 hours in a 31-day month. That works out to about $33.
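To make that arithmetic concrete, here is a minimal Python sketch. It assumes the $0.045/hour rate quoted above (rates differ by region), a single NAT Gateway, and no traffic through it; the function name `idle_nat_cost` is our own, purely for illustration.

```python
from calendar import monthrange

# Hourly rate quoted in this post; an assumption, since it varies by region.
NAT_GATEWAY_HOURLY_USD = 0.045


def idle_nat_cost(year: int, month: int, gateways: int = 1) -> float:
    """Hourly NAT Gateway charge for one full calendar month.

    Covers only the per-hour charge; data processed through the
    gateway is billed separately and is not modelled here.
    """
    days_in_month = monthrange(year, month)[1]
    hours = days_in_month * 24
    return round(hours * NAT_GATEWAY_HOURLY_USD * gateways, 2)


if __name__ == "__main__":
    # A summer with nobody logged in: June, July, August.
    for month in (6, 7, 8):
        print(f"2023-{month:02d}: ${idle_nat_cost(2023, month):.2f}")
```

A 30-day month comes out slightly lower than a 31-day one, which is why the bill hovers "around" $33 rather than landing on the same figure every month.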
Bummer.
If it is any consolation, every AWS user running a NAT Gateway in their VPCs is paying this same tax.
There is not much we can advise you to do. There's no optimizing the NAT Gateway out of the Databricks architecture. If a Databricks environment is only needed for a few months out of the year, the best we can suggest is to use the Databricks Terraform provider to create your workspace and the resources within it, and to destroy the environment during the months when it's not in use. A blunt tool, to say the least.
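For a rough sense of what the tear-down approach buys you, the sketch below sums the hourly NAT Gateway charge over only the months a workspace exists. It assumes the $0.045/hour rate from this post, one gateway, whole calendar months, and that destroying the environment also removes the gateway; `yearly_nat_cost` is a hypothetical helper, not part of any AWS or Databricks tooling.

```python
from calendar import monthrange

# Hourly NAT Gateway rate quoted in this post (assumed, region-dependent).
HOURLY_USD = 0.045


def yearly_nat_cost(year: int, idle_months: frozenset = frozenset()) -> float:
    """Hourly NAT Gateway charge for the months the workspace exists.

    Months listed in idle_months are treated as fully torn down.
    """
    hours = sum(
        monthrange(year, month)[1] * 24
        for month in range(1, 13)
        if month not in idle_months
    )
    return round(hours * HOURLY_USD, 2)


if __name__ == "__main__":
    always_on = yearly_nat_cost(2023)
    summers_off = yearly_nat_cost(2023, frozenset({6, 7, 8}))
    print(f"always on:   ${always_on:.2f}")
    print(f"summers off: ${summers_off:.2f}")
    print(f"saved:       ${always_on - summers_off:.2f}")
```

Under those assumptions the saving is modest per account, so it is worth weighing against the effort of recreating the workspace each term.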
Databricks is a powerful platform, and its own costs can be drawn down to practically zero. Databricks, however, is one of many tools at your disposal, especially if your data is stored as Delta tables in S3. For smaller use cases there may be more appropriate "serverless" options available for handling your data workloads, which we won't go into here.
The cheapest Databricks environment on AWS costs about $33 a month, all of which goes to AWS. On the bright side, anything more than that can be optimized!
There's always more to share on the topic of cost-effective data platforms. Please be sure to subscribe to our RSS feed, or if you're interested in learning more about what we can help you with, drop us an email!