Initial commit

December 18, 2022

by R. Tyler Croy

news

aws

deltalake

databricks

Welcome to the first post on the Buoyant Data blog! Our primary objectives are to help customers save on their infrastructure bills and to improve the efficiency of cloud-native data platforms. We provide a number of services to our customers, such as cost analysis and reporting, data architecture consulting, and infrastructure optimization. But as a small consultancy there are only so many customers we can work with at any one time, which is why we will use this blog to highlight good practices, tips, and newer products in the industry.

Our focus is on AWS-based data infrastructure, but many of the patterns and practices we will share can be thoughtfully applied to Azure- or Google Cloud (GCP)-based data and ML platforms. At the end of the day, the fundamental design of cloud-native data storage, processing, and reporting remains similar enough that we hope practitioners who focus on other cloud providers can find some benefit too!

The "reference architecture" that we encourage others to follow for best performance and cost is generally as follows:

  • Data Storage using Delta Lake. We are contributors to the Delta Lake project and believe strongly that the transactional nature of the protocol provides a best-in-class open source storage format for almost all types of data lakes and warehouses. Fundamentally, Delta Lake is JSON-based metadata wrapped around industry-standard Apache Parquet files, which provides customers with a lot of flexibility not seen in proprietary data storage tools (a short sketch of this follows the list).

  • Cloud-native ingestion using lightweight compute resources such as AWS Fargate and Lambda. With minimal or zero transformation, data ingestion can be very cost effective: by adopting cloud-native solutions for bringing data into the data lake, customers are charged only for the compute and network resources they are actively using, leading to lower overall cost (sketched after this list).

  • Compute/Query Infrastructure with the Databricks platform. This means utilizing collaborative notebooks for development work by data scientists, analysts, and engineers. We also strongly encourage the use of the Databricks platform for automated workloads built with Apache Spark, particularly when those workloads can be accelerated with Photon, which gives many DataFrame-based workloads an improved price/performance ratio. We should note that Photon is not a silver bullet, and some thoughtful analysis can help determine whether the total cost of a job, in compute hours and DBUs (Databricks Units), is lower with the new engine (a back-of-the-envelope example follows the list).

  • Streaming data processing. Much of the industry relies on batch data processing, with massive daily, weekly, or monthly processing tasks; the larger the data sets, the larger the compute resources needed. While "the cloud" offers a lot of options for vertical scaling, it also has its limits. We prefer to stream data for transformation with Structured Streaming in Apache Spark (a brief example follows the list).
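
To make the storage point a bit more concrete, here is a minimal sketch using the Python `deltalake` package (delta-rs) and pandas; the local path and column names are just placeholders. Each write lands standard Parquet data files alongside a JSON commit in the `_delta_log/` directory.

```python
# A minimal sketch using the Python `deltalake` package (delta-rs) and pandas.
# The local path and column names are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# A toy batch of records to land in the lake
df = pd.DataFrame({"sensor_id": [1, 2, 3], "reading": [0.4, 1.2, 0.7]})

# Each write produces standard Parquet data files plus a JSON commit in the
# _delta_log/ directory that records the transaction
write_deltalake("./example_table", df, mode="append")

dt = DeltaTable("./example_table")
print(dt.version())   # transaction log version of the latest commit
print(dt.files())     # the underlying Parquet files for that version
```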
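
For ingestion, the sketch below shows roughly what a lightweight Lambda-based writer could look like. It assumes the function is invoked with a JSON payload shaped like `{"records": [...]}` and that the table URI and AWS credentials come from the environment; the names are hypothetical and this is not a production-ready handler.

```python
# A minimal Lambda handler sketch, assuming the function is invoked with a
# payload of the form {"records": [...]} and that AWS credentials and a table
# URI are supplied through the environment. All names here are hypothetical.
import os

import pandas as pd
from deltalake import write_deltalake

TABLE_URI = os.environ.get("DELTA_TABLE_URI", "s3://example-bucket/raw/events")


def handler(event, context):
    # No heavy transformation: land the records as-is in the raw table
    records = event.get("records", [])
    if not records:
        return {"appended": 0}

    df = pd.DataFrame(records)
    # Note: concurrent writers against S3 need a locking mechanism configured
    # (for example the DynamoDB-based lock supported by delta-rs)
    write_deltalake(TABLE_URI, df, mode="append")
    return {"appended": len(df)}
```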
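
On the Photon question, the arithmetic is simple but worth doing explicitly. Every number below is a made-up placeholder; the point is only the shape of the comparison. Photon typically shortens the runtime but consumes DBUs at a higher rate, and the total of cloud compute plus DBUs is what matters.

```python
# Back-of-the-envelope comparison of the same job with and without Photon.
# All rates and runtimes are placeholders; substitute your actual instance
# rates, DBU rates, and observed runtimes.
def job_cost(runtime_hours, ec2_rate_per_hour, dbu_per_hour, dbu_rate):
    """Total cost = cloud compute cost + Databricks DBU cost."""
    return runtime_hours * (ec2_rate_per_hour + dbu_per_hour * dbu_rate)

# Hypothetical: Photon finishes in 60% of the time but consumes DBUs faster
standard = job_cost(runtime_hours=2.0, ec2_rate_per_hour=3.0,
                    dbu_per_hour=10.0, dbu_rate=0.15)
photon = job_cost(runtime_hours=1.2, ec2_rate_per_hour=3.0,
                  dbu_per_hour=20.0, dbu_rate=0.15)

print(f"standard: ${standard:.2f}, photon: ${photon:.2f}")
# Whichever total is lower for *your* workload is the right answer;
# Photon is not automatically cheaper.
```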
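
Finally, here is a minimal Structured Streaming sketch that incrementally picks up new JSON files and appends them to a Delta table, rather than reprocessing everything in a nightly batch. It assumes a Spark session with Delta Lake support, such as a Databricks cluster; the paths and schema are hypothetical.

```python
# A minimal Structured Streaming sketch, assuming a Spark session with Delta
# Lake support (such as a Databricks cluster). Paths and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("streaming-ingest").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("observed_at", TimestampType()),
])

# Incrementally pick up new JSON files as they arrive
events = (
    spark.readStream
    .schema(schema)
    .json("s3://example-bucket/raw/events/")
)

# Continuously append the stream to a Delta table, tracking progress in a
# checkpoint location so the job can restart where it left off
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start("s3://example-bucket/silver/events/")
)
query.awaitTermination()
```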

Many customers have different storage, compute, and orchestration technologies in their data platforms. While we have our preferences, our focus is not on advocating a specific stack but rather on identifying successful and cost effective patterns for whatever data infrastructure already exists.

We have a lot more to share on the topic of cost effective data platform infrastructure and the tools we use to build them. Please be sure to subscribe to our RSS feed, or if you're interested in learning more about what we can help you with, drop us an email!