Lessons learned building delta-rs

March 09, 2025

by R. Tyler Croy

deltalake

Building software, especially open source software, is a complex social and technological endeavor, and often the most important question we must answer for ourselves is: "what have you learned?" The delta-rs project has been a significant focus of mine over the last four and a half years and has benefited greatly from the lessons I learned stewarding other projects such as Jenkins.

This week I will be in Mountain View participating in the "Open Lakehouse Meetup" where I have been invited to share some lessons learned in the Delta Lake community. I will be joined by my compatriots: Denny, Scott, and Robert.

They’ll be sharing lessons learned from working on delta.rs and the Rust + Arrow ecosystem, including why they built delta-kernel-rs to abstract lakehouse format metadata.


My lessons can fit into two categories: community and code. As I mentioned, open source projects are social and technological endeavors and therefore cannot succeed on a technology basis alone. I argue that the community is more important than code, especially in the beginning. A group of highly-motivated and collaborative people can accomplish great things together, even if starting with a less-than-ideal technology base.

Community

The Delta Lake project was born out of Databricks, but it was not always the independent project it is today. My introduction to Delta Lake was via the sales team at Databricks. I was first a customer who needed an open table format, and part of my selection of Delta Lake was driven by its open source nature. However, open source is not enough! The OpenInfra Foundation's Four Opens is what I consider the gold standard for assessing community viability:

  • Open Source: the code must be open.
  • Open Design: the process for designing and improving the code must be open.
  • Open Development: the processes for integrating and modifying new code must be open.
  • Open Community: participation in the community of users, developers, and ecosystem must be open.

I partnered with Denny early on and chirped incessantly about the Four Opens and the evolution of Delta Lake inside and outside of Databricks. There seemed to be a desire inside Databricks to operate Delta Lake as an open core project, and the Delta Lake project has Denny to thank for much of its current success and independence.

Lesson #1: Projects born out of large organizations must have a passionate champion who is willing to advocate for the open source project!

Many companies who wish to foster an open source project tend to view their contributions through the lens of the technology itself. "We have created this incredible thing, why wouldn't people want to use and improve it?" They can fall into the trap of misunderstanding the motivation of contributors.

The co-founder of delta-rs, the fantastically talented developer QP Hou, recognized something happening in the data ecosystem during 2020-2021 when the project was created: Python was becoming the standard. He helped introduce the initial Python bindings for delta-rs, which exploded the possible user and contributor base. Since then we have seen a number of contributions from folks who were Python users, then tinkered with the Python code, and in some cases grew to understand and contribute to the underlying Rust code.

Lesson #2: creating an open source community is a sales process in that you need to consider what motivations or incentives might attract the types of developers you want to work with.

The final community lesson I will share is one I did not learn with delta-rs. Instead, I applied something I learned working with Kohsuke Kawaguchi while growing the Jenkins project: you can always revert code.

In the early days of Jenkins, we operated using Subversion (SVN) for source control. Kohsuke was very liberal with granting commit access to new contributors. "I can always revert commits," he would say. That early granting of access and trust to new contributors was a powerful signal: you are welcome here! In the delta-rs project I have taken to merging pull requests even if I don't entirely like them. In some cases, I will merge a pull request and follow up with tests or refactors myself rather than asking the contributor to make those changes. Instead of putting up gates, I will take a little more of my time to make new contributors feel welcome.

Lesson #3: contributors will only join if you make it easy for them to participate and value their contributions.

That brings me to... the code itself.

Code

When I first tried to convince Databricks to jointly develop delta-rs with me in April of 2020, I outlined the importance of Rust mostly from a technology standpoint. Rust is fast, efficient, and easily embeddable. At the time I was envisioning Delta Lake being used from Python, Node, Ruby, and so on. Having a portable and embeddable kernel (!) made Rust a clear winner to me.

Rust was critically important to our success for those reasons, but Rust is also fun to work with. Fish recently completed their Rust rewrite and shared the following, which says it best:

We need to get one thing out of the way: Rust is cool. It’s fun.

It’s tempting to try to sweep this under the rug because it feels gauche to say, but it’s actually important for a number of reasons.

For one, fish is a hobby project, and that means we want it to be fun for us. Nobody is being paid to work on fish, so we need it to be fun. Being fun and interesting also attracts contributors.

I don't want to pick on Apache Spark or the original Delta implementation too much, but at this point I believe their large Scala code bases are holding them back more than many people realize. When I look at statistics on GitHub, Apache Spark has had contributions from 70 authors in the last month. Over the same period, Apache DataFusion had 82 commit authors.

Apache Spark is much more mature, which will slow commit velocity, but Apache DataFusion and delta-rs have benefited both from Rust being a compelling and ascendant language and from people wanting to write Rust code!

Lesson #4: the code itself must be fun and interesting to work with.

If we hadn't chosen to use Rust, I guarantee we would not have had this success from either a community or a code standpoint.

There is a related, and often overlooked, characteristic of most open source contributors: they want to use the code they are contributing! That means you need to release it! Automating the release processes with GitHub Actions for the delta-rs Python releases has led to a much more rapid release cadence than I might have otherwise adopted. Ion has pushed a substantial number of Python releases because he wanted to use the code, and the easiest way to use the code is to release it!

Seems pretty simple? Sometimes it is easier said than done. This morning I created the deltalake 0.25.0 release for Rust users, which I was finally able to publish a month after the previous release (0.24.0). That is a long time for people to be waiting for bug fixes.

Lesson #5: automate releases! Make it easier for other contributors to release changes and for users to safely adopt new changes quickly. We are building code to use it!

I certainly need to take some of my own advice here, but the faster the release cycle, the tighter the feedback loop with users and contributors can become!

"Safely" is a pretty important word there, which leads me to the last code-related lesson I can share from building delta-rs. Delta Lake is used by serious businesses to do some pretty important things. I have worked with customers using delta-rs for fintech, health data, entertainment, and all sorts of other use-cases. Data corruption or failures are a pretty serious concern for projects like ours. Rust has testing built in as part of the ecosystem, making unit testing quite simple. As a consequence, delta-rs has had pretty decent test coverage since the very early days.
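To give a rough sense of what "built in" means, here is a minimal sketch of a Rust unit test living right next to the code it checks. This is an illustration only and not code from delta-rs: `commit_file_name` is a hypothetical helper, loosely modeled on the zero-padded file names Delta uses for commits in its transaction log.

```rust
/// Hypothetical helper (an assumption for this example, not a delta-rs API):
/// builds a commit file name by zero-padding the table version to 20 digits.
pub fn commit_file_name(version: u64) -> String {
    format!("{:020}.json", version)
}

// The test module is only compiled when running `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pads_version_to_twenty_digits() {
        let name = commit_file_name(7);
        // 20 digits plus the 5 characters of ".json"
        assert_eq!(name.len(), 25);
        assert!(name.starts_with("0000"));
        assert!(name.ends_with("7.json"));
    }
}
```

No extra test framework or configuration is needed here: `cargo test` discovers and runs any function marked `#[test]`, which keeps the cost of adding a test alongside a change very low.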

Lesson #6: the best time to write tests for your project is at its inception; the second best time is today.

Looking at our Codecov metrics, we currently have 72% code coverage. There is always room for improvement, but having good test coverage gives everybody more confidence in refactoring, adding new features, and creating releases. By investing in good test coverage, the project has been able to iterate rapidly and safely. Our tests are expansive enough that delta-rs has found bugs in a number of our dependencies over the years, like arrow-rs, Apache DataFusion, and delta-kernel-rs.


Making it easier for people to use, contribute, and extend delta-rs has been key to its success. Many of our patterns or habits are not accidental but the result of thoughtful and open discussion about our motivations. This leads me to the last, and arguably the most important lesson learned through the development of delta-rs and the Delta Lake project at large:

Lesson #7: collaborate. collaborate. collaborate. Success lies in working together with your users, contributors, and adjacent projects in the ecosystem around the project.

We are better together.