The first version of practically every new application on AWS starts on a
developer's workstation and uses credentials hard-coded in the environment or
the application itself. The myriad reasons why hard-coding credentials is
problematic are well discussed elsewhere. Still, it sometimes feels easiest to
generate an IAM access key and secret and move on to the novel part of the
script or job. In this blog post we will discuss how to authenticate your
Delta Lake applications, whether using the deltalake
Python package or Rust crate, with "on-demand" temporary credentials from
the AWS runtime environment itself.
In many deployments, especially those already using Databricks, IAM Roles will already be set up for accessing a data lake in S3 as Instance Profiles. A Python or Rust application can reuse that same instance profile by assuming the appropriate role during initialization of the application.
For this blog post, we're working on an EC2 instance that is acting as a workstation. The EC2 instance needs the appropriate roles assigned to allow it to assume the role specified as part of the instance profile. It must also have the instance metadata service enabled, which is the default for most EC2 deployments. Once booted, a Python or Rust application run on the instance will:
- Query the environment, instance metadata service, etc. for temporary credentials.
- Use those credentials to contact the Security Token Service (STS) and assume the appropriate role.
- Pass those assumed-role credentials on to the deltalake library as storage options.
Note: The credentials that the application retrieves are short-lived! For long-running applications (more than a few hours), the application may need to negotiate new credentials, as sketched below.
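For example, a long-running job can compare the Expiration timestamp that STS returns against the current time and re-assume the role before the credentials lapse. The helper below is a minimal sketch of that pattern in Python; the fresh_credentials name and the five-minute margin are illustrative choices, not part of any library.

from datetime import datetime, timedelta, timezone
import boto3

def fresh_credentials(role_arn, cached=None):
    """Re-assume the role if the cached credentials are missing or
    within five minutes of their STS expiration."""
    if cached and cached['Expiration'] - datetime.now(timezone.utc) > timedelta(minutes=5):
        return cached
    sts = boto3.client('sts')
    assumed = sts.assume_role(RoleArn=role_arn, RoleSessionName='refresh-demo')
    # The returned dict carries AccessKeyId, SecretAccessKey, SessionToken,
    # and Expiration, ready to be mapped into storage options.
    return assumed['Credentials']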
Python
The deltalake
Python package is actually built on top of the Rust crate of
the same name. As such, they have similar configuration options, but we
recommend using the boto3 package for
fetching the temporary credentials needed.
In essence, the DeltaTable
API needs an access key, secret key, and session
token, but it doesn't actually care where those come from.
#!/usr/bin/env python3
import os

import boto3
from deltalake import DeltaTable


def main():
    location = os.environ.get('DELTA_TABLE_URI')
    region = os.environ.get('AWS_REGION', 'us-east-1')
    role = os.environ.get('ASSUME_ROLE')

    print('Starting up..')
    # boto3 discovers the instance credentials (environment, IMDS, etc.)
    # and uses them for the AssumeRole call.
    sts_client = boto3.client('sts')
    assumed = sts_client.assume_role(RoleArn=role, RoleSessionName='python-deltalake')
    print('Role assumed successfully')

    # Hand the temporary credentials to deltalake as storage options.
    storage_options = {
        'region': region,
        'access_key_id': assumed['Credentials']['AccessKeyId'],
        'secret_access_key': assumed['Credentials']['SecretAccessKey'],
        'session_token': assumed['Credentials']['SessionToken'],
    }

    dt = DeltaTable(location, storage_options=storage_options)
    print(dt.metadata())

    # If Pandas is installed in the environment, we can now create a DataFrame
    # df = dt.to_pandas(partitions=...)


if __name__ == '__main__':
    main()
With the simple script above as a guideline, Pandas users can iterate further to read from or write to the Delta Lake table straight from their Python script without ever needing to hard-code credentials!
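For instance, picking up inside main() after the table is opened, a write with the same temporary credentials might look like the following sketch. It assumes pandas is installed; the DataFrame contents and the append mode are purely illustrative.

import pandas as pd
from deltalake import write_deltalake

# Reuse the location and storage_options built from the assumed-role
# credentials in the script above.
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
write_deltalake(location, df, mode='append', storage_options=storage_options)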
Rust
Out of the box the deltalake crate does
not know anything about AWS' authentication mechanisms, roles, etc. In fact, it
largely relies on the object_store
crate's configuration
options.
As such, just reading the crate's documentation it is not obvious how to use
dynamic credentials like those issued by AWS STS.
use std::collections::HashMap;

use aws_credential_types::provider::ProvideCredentials;
use aws_config::sts::*;
use aws_types::region::Region;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Required in the environment.
    //
    // These could be passed in as command line arguments, hard-coded, or
    // anything in between, but for demonstration's sake they're environment
    // variables
    let role = std::env::var("ASSUME_ROLE").expect("No ASSUME_ROLE was defined!");
    let location = std::env::var("DELTA_TABLE_URI").expect("Must define DELTA_TABLE_URI in the environment");
    // Optional environmental overrides
    let region = std::env::var("AWS_REGION").unwrap_or("us-east-1".into());

    let shared_config = aws_config::from_env()
        .region(Region::new(region.clone()))
        .load()
        .await;
    // Communicate with IMDSv2 to get some instance credentials
    let instance_creds = shared_config
        .credentials_provider()
        .expect("Failed to load credentials from the environment");

    // The instance credentials are used for the assume role
    let assumed_creds = AssumeRoleProvider::builder(&role)
        .region(Region::new(region.clone()))
        .session_name("buoyant-demo")
        .build(instance_creds.clone());
    let temporary_creds = assumed_creds.provide_credentials().await?;

    /*
     * The `storage_options` is a HashMap<String, String>, which is why the
     * `.into()` calls are littering this code. There are a number of helpful
     * crates to make this form of HashMap creation prettier, but this example
     * keeps to the minimal required dependencies.
     */
    let mut storage_options = HashMap::new();
    storage_options.insert("region".into(), region);
    storage_options.insert("access_key_id".into(), temporary_creds.access_key_id().into());
    storage_options.insert("secret_access_key".into(), temporary_creds.secret_access_key().into());
    if let Some(session_token) = temporary_creds.session_token() {
        storage_options.insert("session_token".into(), session_token.into());
    }

    let table = deltalake::open_table_with_storage_options(location, storage_options)
        .await
        .expect("Failed to open the table!");
    Ok(())
}
The simple Rust application above does not do anything useful with the DeltaTable that is loaded, but if you enable the datafusion feature of the deltalake crate you can register the table and start querying with the little snippet below:
// Requires the `datafusion` feature; assumes `use datafusion::prelude::*;`
// and `use std::sync::Arc;` are in scope.
let ctx = SessionContext::new();
ctx.register_table("demo", Arc::new(table))?;

// https://arrow.apache.org/datafusion/user-guide/dataframe.html
let df = ctx.table("demo").await?;
println!("Rows: {}",
    df.filter(col("ds").eq(lit("2023-07-08")))?
        .count()
        .await?);
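The registered table can also be queried with plain SQL through the same SessionContext; a quick sketch, under the same assumptions as the snippet above:

// Count rows for the partition using SQL instead of the DataFrame API.
let df = ctx.sql("SELECT COUNT(*) FROM demo WHERE ds = '2023-07-08'").await?;
df.show().await?;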
Whether building in Python or Rust, the deltalake
packages allow for
tremendous flexibility for Delta Lake readers and writers. Building from either
code snippet above can be a great way to start building new applications
deployed as AWS Lambdas, AWS Glue scripts, or anywhere else in AWS. We have seen
some really cool examples incorporating Delta Lake into existing Rust or Python
web applications using these libraries!
If your organization needs help building lightweight, low-cost, and high-performance Delta Lake applications, we can help! Drop me an email and we'll chat!