Automating credentials for Delta Lake on AWS

July 08, 2023

by R. Tyler Croy

rust

python

deltalake

aws

The first version of practically every new application on AWS starts on a developer's workstation and uses credentials hard-coded in the environment or in the application itself. The many reasons why hard-coding credentials is problematic are well discussed elsewhere. Still, it sometimes feels easiest to generate an IAM access key and secret and move on to the novel part of the script or job. In this blog post we will discuss how to authenticate your Delta Lake applications, whether they use the deltalake Python package or the Rust crate, using the "on-demand" temporary credentials from the AWS runtime environment itself.

In many deployments, especially those already using Databricks, IAM Roles will already be set up as Instance Profiles for accessing a data lake in S3. A Python or Rust application can re-use that same instance profile for its own purposes by assuming the appropriate role during initialization.

For this blog post, we're working on an EC2 instance that is acting as a workstation. The EC2 instance needs to have the appropriate roles assigned to allow it to assume the role specified as part of the instance profile. It must also have the instance metadata service enabled, which is the default for most EC2 deployments. Once booted, a Python or Rust application can be run on the instance which will:

  • Query the environment, instance metadata service, etc. for temporary credentials.
  • Use those credentials to contact the Security Token Service (STS) to assume the appropriate role.
  • Pass the assumed-role credentials on to the deltalake library as storage options.

Note: The credentials that the application retrieves are short-lived! Long-running applications (more than a few hours) may need to re-negotiate new credentials.
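One way to handle that expiry is to cache the credentials and fetch new ones shortly before they lapse. The following is only a minimal sketch under some assumptions: `RefreshingCredentials`, `fetch`, and `margin` are hypothetical names invented for this illustration, and `fetch` is assumed to return a dict shaped like the `Credentials` entry of an STS assume-role response, including a timezone-aware `Expiration` datetime.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional


class RefreshingCredentials:
    """Cache temporary credentials and re-fetch them shortly before expiry.

    Hypothetical helper: `fetch` is any callable returning a dict with an
    `Expiration` key holding a timezone-aware datetime.
    """

    def __init__(self, fetch: Callable[[], Dict],
                 margin: timedelta = timedelta(minutes=5)):
        self._fetch = fetch
        self._margin = margin
        self._current: Optional[Dict] = None

    def get(self, now: Optional[datetime] = None) -> Dict:
        # `now` can be injected to make the behavior easy to test.
        now = now or datetime.now(timezone.utc)
        expired = (
            self._current is None
            or now >= self._current["Expiration"] - self._margin
        )
        if expired:
            # Nothing cached yet, or too close to expiry: fetch a fresh set.
            self._current = self._fetch()
        return self._current
```

In a real application `fetch` would presumably wrap the STS call, and the long-running job would likely need to rebuild its storage options from `get()` before each batch of work rather than holding on to the first set forever.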


Python

The deltalake Python package is actually built on top of the Rust crate of the same name. As such they have similar configuration options, but we recommend using the boto3 package for fetching the temporary credentials needed.

In essence, the DeltaTable API needs an access key, secret key, and session token, but it doesn't actually care where those come from.
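To illustrate that point, here is a small sketch of a helper that turns credentials from any source into the storage options dictionary. The function name `build_storage_options` and its parameters are hypothetical, introduced only for this example; the dictionary keys are the same ones used in the full script in this post.

```python
from typing import Dict, Optional


def build_storage_options(
    access_key: str,
    secret_key: str,
    token: Optional[str] = None,
    region: str = "us-east-1",
) -> Dict[str, str]:
    """Build a storage_options dict from credentials obtained anywhere."""
    options = {
        "region": region,
        "access_key_id": access_key,
        "secret_access_key": secret_key,
    }
    # Temporary credentials carry a session token; only include it if present.
    if token:
        options["session_token"] = token
    return options


# Hypothetical usage with the credentials boto3 resolves on its own:
#   frozen = boto3.Session().get_credentials().get_frozen_credentials()
#   opts = build_storage_options(frozen.access_key, frozen.secret_key, frozen.token)
```

Keeping the credential lookup separate from the dictionary construction means the same helper could be fed by an assumed role, an instance profile, or a test fixture.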

#!/usr/bin/env python3

import os

from deltalake import DeltaTable
import boto3

def main():
    location = os.environ.get('DELTA_TABLE_URI')
    region = os.environ.get('AWS_REGION', 'us-east-1')
    role = os.environ.get('ASSUME_ROLE')

    print('Starting up..')
    stsClient = boto3.client('sts')
    assumed = stsClient.assume_role(RoleArn=role, RoleSessionName='python-deltalake')
    print('Role assumed successfully')

    storage_options = {
        'region': region,
        'access_key_id': assumed['Credentials']['AccessKeyId'],
        'secret_access_key': assumed['Credentials']['SecretAccessKey'],
        'session_token': assumed['Credentials']['SessionToken'],
    }
    dt = DeltaTable(location, storage_options=storage_options)
    print(dt.metadata())

    # If Pandas is installed in the environment, we can now create a DataFrame
    # df = dt.to_pandas(partitions=...)

if __name__ == '__main__':
    main()

With the simple script above as a guideline, Pandas users can iterate further on the script to do reads or writes to the Delta Lake table straight from their Python script without ever needing to hard-code credentials!

Rust

Out of the box the deltalake crate does not know anything about AWS' authentication mechanisms, roles, etc. In fact, it largely relies on the object_store crate's configuration options. As such, from reading the crate's documentation alone it is not obvious how to use dynamic credentials like those issued by AWS STS.

use std::collections::HashMap;

use aws_credential_types::provider::ProvideCredentials;
use aws_config::sts::*;
use aws_types::region::Region;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Required in the environment.
    //
    // These could be passed in as command line arguments, hard-coded, or
    // anything in between, but for demonstration's sake they're environment
    // variables
    let role = std::env::var("ASSUME_ROLE").expect("No ASSUME_ROLE was defined!");
    let location = std::env::var("DELTA_TABLE_URI").expect("Must define DELTA_TABLE_URI in the environment");

    // Optional environmental overrides
    let region = std::env::var("AWS_REGION").unwrap_or("us-east-1".into());
    let shared_config = aws_config::from_env().region(Region::new(region.clone())).load().await;

    // Communicate with IMDSv2 to get some instance credentials
    let instance_creds = shared_config.credentials_provider()
                            .expect("Failed to load credentials from the environment");

    // The instance credentials are used for the assume role
    let assumed_creds = AssumeRoleProvider::builder(&role)
        .region(Region::new(region.clone()))
        .session_name("buoyant-demo")
        .build(instance_creds.clone());

    let temporary_creds = assumed_creds.provide_credentials().await?;

    /*
     * The `storage_options` is a HashMap<String, String> which is why the
     * `.into()` calls are littering this code. There are a number of helpful
     * crates to make this form of HashMap creation prettier, but this example
     * keeps to the minimal required dependencies.
     */
    let mut storage_options = HashMap::new();
    storage_options.insert("region".into(), region);
    storage_options.insert("access_key_id".into(), temporary_creds.access_key_id().into());
    storage_options.insert("secret_access_key".into(), temporary_creds.secret_access_key().into());

    if let Some(session_token) = temporary_creds.session_token() {
        storage_options.insert("session_token".into(), session_token.into());
    }

    let table = deltalake::open_table_with_storage_options(location, storage_options)
        .await
        .expect("Failed to open the table!");
    Ok(())
}

The simple Rust application above does not do anything useful with the DeltaTable that is loaded, but if you enable the datafusion feature of the deltalake crate you can register the table and start querying with the little snippet below:

let ctx = SessionContext::new();
let table_name = "web_analytics_session_events";
ctx.register_table("demo", Arc::new(table))?;

// https://arrow.apache.org/datafusion/user-guide/dataframe.html
let df = ctx.table("demo").await?;
println!("Rows: {}",
    df.filter(col("ds").eq(lit("2023-07-08")))?
        .count().await?);

Whether building in Python or Rust, the deltalake packages allow for tremendous flexibility for Delta Lake readers and writers. Either code snippet above can be a great starting point for new applications deployed to AWS Lambda, AWS Glue scripts, or anywhere else in AWS. We have seen some really cool examples incorporating Delta Lake into existing Rust or Python web applications using these libraries!

If your organization needs help building lightweight, low-cost, and high-performance Delta Lake applications, we can help! Drop me an email and we'll chat!