Monday, May 12, 2025

Accelerate lightweight analytics using PyIceberg with AWS Lambda and an AWS Glue Iceberg REST endpoint

For modern organizations built on data insights, effective data management is crucial for powering advanced analytics and machine learning (ML) activities. As data use cases become more complex, data engineering teams require sophisticated tooling to handle versioning, growing data volumes, and schema changes across multiple data sources and applications.

Apache Iceberg has emerged as a popular choice for data lakes, offering ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema evolution, and time travel capabilities. Iceberg tables can be accessed from various distributed data processing frameworks like Apache Spark and Trino, making it a flexible solution for diverse data processing needs. Among the available tools for working with Iceberg, PyIceberg stands out as a Python implementation that enables table access and management without requiring distributed compute resources.

In this post, we demonstrate how PyIceberg, integrated with the AWS Glue Data Catalog and AWS Lambda, provides a lightweight approach to harness Iceberg's powerful features through intuitive Python interfaces. We show how this integration enables teams to start working with Iceberg tables with minimal setup and infrastructure dependencies.

PyIceberg's key capabilities and advantages

One of PyIceberg's main advantages is its lightweight nature. Without requiring distributed computing frameworks, teams can perform table operations directly from Python applications, making it suitable for small to medium-scale data exploration and analysis with a minimal learning curve. In addition, PyIceberg integrates with Python data analysis libraries like Pandas and Polars, so data users can use their existing skills and workflows.

When using PyIceberg with the Data Catalog and Amazon Simple Storage Service (Amazon S3), data teams can store and manage their tables in a completely serverless environment. This means data teams can focus on analysis and insights rather than infrastructure management.

Additionally, Iceberg tables managed through PyIceberg are compatible with AWS data analytics services. Although PyIceberg operates on a single node and has performance limitations with large data volumes, the same tables can be efficiently processed at scale using services such as Amazon Athena and AWS Glue. This allows teams to use PyIceberg for rapid development and testing, then transition to production workloads with larger-scale processing engines, while maintaining consistency in their data management approach.

Representative use cases

The following are common scenarios where PyIceberg can be particularly useful:

  • Data science experimentation and feature engineering – In data science, experiment reproducibility is crucial for maintaining reliable and efficient analyses and models. However, continuously updated organizational data makes it challenging to manage data snapshots for important business events, model training, and consistent reference. Data scientists can query historical snapshots through time travel capabilities and record important versions using tagging features. With PyIceberg, they can achieve these benefits in their Python environment using familiar tools like Pandas. Because of Iceberg's ACID capabilities, they can access consistent data even when tables are being actively updated.
  • Serverless data processing with Lambda – Organizations often need to process data and maintain analytical tables efficiently without managing complex infrastructure. Using PyIceberg with Lambda, teams can build event-driven data processing and scheduled table updates through serverless functions. PyIceberg's lightweight nature makes it well-suited for serverless environments, enabling simple data processing tasks like data validation, transformation, and ingestion. These tables remain accessible for both updates and analytics through various AWS services, allowing teams to build efficient data pipelines without managing servers or clusters.

Event-driven data ingestion and analysis with PyIceberg

In this section, we explore a practical example of using PyIceberg for data processing and analysis using NYC yellow taxi trip data. To simulate an event-driven data processing scenario, we use Lambda to insert sample data into an Iceberg table, representing how real-time taxi trip records might be processed. This example demonstrates how PyIceberg can streamline workflows by combining efficient data ingestion with flexible analysis capabilities.

Imagine your team faces several requirements:

  • The data processing solution needs to be cost-effective and maintainable, avoiding the complexity of managing distributed computing clusters for this moderately sized dataset.
  • Analysts need the ability to perform flexible queries and exploration using familiar Python tools. For example, they might need to compare historical snapshots with current data to analyze trends over time.
  • The solution should have the ability to expand and become more scalable in the future.

To address these requirements, we implement a solution that combines Lambda for data processing with Jupyter notebooks for analysis, both powered by PyIceberg. This approach provides a lightweight yet robust architecture that maintains data consistency while enabling flexible analysis workflows. At the end of the walkthrough, we also query this data using Athena to demonstrate compatibility with multiple Iceberg-supporting tools and show how the architecture can scale.

We walk through the following high-level steps:

  1. Use Lambda to write sample NYC yellow taxi trip data to an Iceberg table on Amazon S3 using PyIceberg with an AWS Glue Iceberg REST endpoint. In a real-world scenario, this Lambda function would be triggered by an event from a queuing component like Amazon Simple Queue Service (Amazon SQS). For more details, see Using Lambda with Amazon SQS.
  2. Analyze the table data in a Jupyter notebook using PyIceberg through the AWS Glue Iceberg REST endpoint.
  3. Query the data using Athena to demonstrate Iceberg's flexibility.

The following diagram illustrates the architecture.

Overall Architecture

When implementing this architecture, it's important to note that Lambda functions can have multiple concurrent invocations when triggered by events. Concurrent invocations can lead to transaction conflicts when writing to Iceberg tables. To address this, you should implement an appropriate retry mechanism and carefully manage concurrency levels. If you're using Amazon SQS as an event source, you can control concurrent invocations through the SQS event source's maximum concurrency setting.
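As one illustration of such a retry mechanism, the following sketch wraps the append in a backoff loop. It assumes PyIceberg raises CommitFailedException on conflicting commits and that the table object exposes refresh(); verify both against your PyIceberg version before relying on it.

import random
import time

from pyiceberg.exceptions import CommitFailedException

def append_with_retry(table, arrow_table, max_attempts=5):
    """Append an Arrow table, retrying with backoff if a concurrent commit wins."""
    for attempt in range(1, max_attempts + 1):
        try:
            table.refresh()            # reload the latest table metadata
            table.append(arrow_table)  # commit the new data files
            return
        except CommitFailedException:
            if attempt == max_attempts:
                raise
            time.sleep(random.uniform(0.5, 2.0) * attempt)  # simple backoff with jitter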

Prerequisites

The following prerequisites are necessary for this use case:

Set up resources with AWS CloudFormation

You can use the provided CloudFormation template to set up the following resources:

Complete the following steps to deploy the resources:

  1. Choose Launch stack.

  2. For Parameters, pyiceberg_lambda_blog_database is set by default. You can also change the default value. If you change the database name, remember to replace pyiceberg_lambda_blog_database with your chosen name in all subsequent steps. Then, choose Next.
  3. Choose Next.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Submit.

Build and run a Lambda function

Let's build a Lambda function to process incoming records using PyIceberg. This function creates an Iceberg table called nyc_yellow_table in the database pyiceberg_lambda_blog_database in the Data Catalog if it doesn't exist. It then generates sample NYC taxi trip data to simulate incoming records and inserts it into nyc_yellow_table.

Although we invoke this function manually in this example, in real-world scenarios this Lambda function would be triggered by actual events, such as messages from Amazon SQS. When implementing real-world use cases, the function code must be modified to receive the event data and process it based on your requirements, as sketched below.
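A rough sketch of such a modification follows. The message format, the catalog and identifier variables, and the handler structure are assumptions for illustration, not the code in the downloadable lambda_function.py:

import json

import pyarrow as pa

def lambda_handler(event, context):
    # Collect trip records from the incoming SQS messages
    # (assumes each message body is a JSON object matching the table schema)
    records = [json.loads(message["body"]) for message in event.get("Records", [])]

    if records:
        # catalog and identifier are configured as shown later in this post
        table = catalog.load_table(identifier)
        table.append(pa.Table.from_pylist(records))

    return {"statusCode": 200, "inserted": len(records)}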

We deploy the function using a container image as the deployment package. To create a Lambda function from a container image, build your image on CloudShell and push it to an Amazon ECR repository. Complete the following steps:

  1. Sign in to the AWS Management Console and launch CloudShell.
  2. Create a working directory.
mkdir pyiceberg_blog
cd pyiceberg_blog

  3. Download the Lambda script lambda_function.py.
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/lambda_function.py .

This script performs the following tasks:

  • Creates an Iceberg table with the NYC taxi schema in the Data Catalog
  • Generates a random NYC taxi dataset
  • Inserts this data into the table

Let's break down the essential parts of this Lambda function:

  • Iceberg catalog configuration – The following code defines an Iceberg catalog that connects to the AWS Glue Iceberg REST endpoint:
# Configure the catalog
catalog_properties = {
    "type": "rest",
    "uri": f"https://glue.{region}.amazonaws.com/iceberg",
    "s3.region": region,
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": region
}
catalog = load_catalog(**catalog_properties)

  • Table schema definition – The following code defines the Iceberg table schema for the NYC taxi dataset. The table includes:
    • Schema columns defined in the Schema
    • Partitioning by vendorid and tpep_pickup_datetime using PartitionSpec
    • A day transform applied to tpep_pickup_datetime for daily record management
    • Sort ordering by tpep_pickup_datetime and tpep_dropoff_datetime

When applying the day transform to timestamp columns, Iceberg automatically handles date-based partitioning hierarchically. This means a single day transform enables partition pruning at the year, month, and day levels without requiring explicit transforms for each level. For more details about Iceberg partitioning, see Partitioning.

# Table Definition
schema = Schema(
    NestedField(field_id=1, name="vendorid", field_type=LongType(), required=False),
    NestedField(field_id=2, name="tpep_pickup_datetime", field_type=TimestampType(), required=False),
    NestedField(field_id=3, name="tpep_dropoff_datetime", field_type=TimestampType(), required=False),
    NestedField(field_id=4, name="passenger_count", field_type=LongType(), required=False),
    NestedField(field_id=5, name="trip_distance", field_type=DoubleType(), required=False),
    NestedField(field_id=6, name="ratecodeid", field_type=LongType(), required=False),
    NestedField(field_id=7, name="store_and_fwd_flag", field_type=StringType(), required=False),
    NestedField(field_id=8, name="pulocationid", field_type=LongType(), required=False),
    NestedField(field_id=9, name="dolocationid", field_type=LongType(), required=False),
    NestedField(field_id=10, name="payment_type", field_type=LongType(), required=False),
    NestedField(field_id=11, name="fare_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=12, name="extra", field_type=DoubleType(), required=False),
    NestedField(field_id=13, name="mta_tax", field_type=DoubleType(), required=False),
    NestedField(field_id=14, name="tip_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=15, name="tolls_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=16, name="improvement_surcharge", field_type=DoubleType(), required=False),
    NestedField(field_id=17, name="total_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=18, name="congestion_surcharge", field_type=DoubleType(), required=False),
    NestedField(field_id=19, name="airport_fee", field_type=DoubleType(), required=False),
)

# Define partition spec
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="vendorid_identity"),
    PartitionField(source_id=2, field_id=1002, transform=DayTransform(), name="tpep_pickup_day"),
)

# Define sort order
sort_order = SortOrder(
    SortField(source_id=2, transform=DayTransform()),
    SortField(source_id=3, transform=DayTransform())
)

database_name = os.environ.get('GLUE_DATABASE_NAME')
table_name = os.environ.get('ICEBERG_TABLE_NAME')
identifier = f"{database_name}.{table_name}"

# Create the table if it doesn't exist
location = f"s3://pyiceberg-lambda-blog-{account_id}-{region}/{database_name}/{table_name}"
if not catalog.table_exists(identifier):
    table = catalog.create_table(
        identifier=identifier,
        schema=schema,
        location=location,
        partition_spec=partition_spec,
        sort_order=sort_order
    )
else:
    table = catalog.load_table(identifier=identifier)

  • Data generation and insertion – The following code generates random data and inserts it into the table. This example demonstrates an append-only pattern, where new records are continuously added to track business events and transactions:
# Generate random data
records = generate_random_data()

# Convert to an Arrow Table
df = pa.Table.from_pylist(records)

# Write data using PyIceberg
table.append(df)
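The generate_random_data() helper is defined in the downloaded lambda_function.py. A simplified sketch of what such a generator could look like (field values and the subset of columns shown here are illustrative assumptions, not the actual implementation, which populates every column in the schema above):

import random
from datetime import datetime, timedelta

def generate_random_data(num_records=100):
    """Return a list of dicts shaped like the NYC yellow taxi schema above (subset of columns)."""
    records = []
    now = datetime.now()
    for _ in range(num_records):
        pickup = now - timedelta(minutes=random.randint(0, 60 * 24))
        dropoff = pickup + timedelta(minutes=random.randint(1, 90))
        records.append({
            "vendorid": random.choice([1, 2]),
            "tpep_pickup_datetime": pickup,
            "tpep_dropoff_datetime": dropoff,
            "passenger_count": random.randint(1, 4),
            "trip_distance": round(random.uniform(0.5, 20.0), 2),
            "fare_amount": round(random.uniform(3.0, 80.0), 2),
            "total_amount": round(random.uniform(5.0, 100.0), 2),
        })
    return records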

  4. Download the Dockerfile. It defines the container image for your function code.
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/Dockerfile .

  5. Download requirements.txt. It defines the Python packages required for your function code.
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/requirements.txt .

At this point, your working directory should contain the following three files:

  • Dockerfile
  • lambda_function.py
  • requirements.txt
  6. Set the environment variables. Replace the placeholder with your AWS account ID (a sample export block follows):
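The exact export commands were not preserved in this post; the following is an illustrative assumption based on the AWS_ACCOUNT_ID and AWS_REGION variables referenced by later commands:

# Example only: substitute your own 12-digit account ID and the Region where you deployed the stack
export AWS_ACCOUNT_ID=123456789012
export AWS_REGION=us-east-1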
  7. Build the Docker image:
docker build --provenance=false -t localhost/pyiceberg-lambda .

# Confirm the built image
docker images | grep pyiceberg-lambda

  8. Set a tag on the image:
docker tag localhost/pyiceberg-lambda:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest

  9. Log in to the ECR repository created by AWS CloudFormation:
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

  10. Push the image to the ECR repository:
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest

  11. Create a Lambda function using the container image you pushed to Amazon ECR:
aws lambda create-function \
  --function-name pyiceberg-lambda-function \
  --package-type Image \
  --code ImageUri=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/pyiceberg-lambda-function-role-${AWS_REGION} \
  --environment "Variables={ICEBERG_TABLE_NAME=nyc_yellow_table, GLUE_DATABASE_NAME=pyiceberg_lambda_blog_database}" \
  --region ${AWS_REGION} \
  --timeout 60 \
  --memory-size 1024

  12. Invoke the function at least 5 times to create multiple snapshots, which we will examine in the following sections. Note that we are invoking the function manually to simulate event-driven data ingestion. In real-world scenarios, Lambda functions would be invoked automatically in an event-driven fashion.
aws lambda invoke \
  --function-name arn:aws:lambda:${AWS_REGION}:${AWS_ACCOUNT_ID}:function:pyiceberg-lambda-function \
  --log-type Tail \
  outputfile.txt \
  --query 'LogResult' | tr -d '"' | base64 -d

At this point, you have deployed and run the Lambda function. The function creates the nyc_yellow_table Iceberg table in the pyiceberg_lambda_blog_database database. It also generates and inserts sample data into this table. We will explore the data in the table in later steps.

For more detailed information about building Lambda functions with containers, see Create a Lambda function using a container image.

Explore the data with Jupyter using PyIceberg

In this section, we demonstrate how to access and analyze the data stored in Iceberg tables registered in the Data Catalog. Using a Jupyter notebook with PyIceberg, we access the taxi trip data created by our Lambda function and examine different snapshots as new records arrive. We also tag specific snapshots to retain important ones, and create new tables for further analysis.

Complete the following steps to open the notebook with Jupyter on the SageMaker AI notebook instance:

  1. On the SageMaker AI console, choose Notebooks in the navigation pane.
  2. Choose Open JupyterLab next to the notebook that you created using the CloudFormation template.

notebook list

  3. Download the notebook and open it in a Jupyter environment on your SageMaker AI notebook.

upload notebook

  4. Open the uploaded pyiceberg_notebook.ipynb.
  5. In the kernel selection dialog, leave the default option and choose Select.

select kernel

From this point forward, you will work through the notebook by running cells in order.

Connecting Catalog and Scanning Tables

You can access the Iceberg table using PyIceberg. The following code connects to the AWS Glue Iceberg REST endpoint and loads the nyc_yellow_table table in the pyiceberg_lambda_blog_database database:

import pyarrow as pa
from pyiceberg.catalog import load_catalog
import boto3

# Set AWS region
sts = boto3.client('sts')
region = sts._client_config.region_name

# Configure catalog connection properties
catalog_properties = {
    "type": "rest",
    "uri": f"https://glue.{region}.amazonaws.com/iceberg",
    "s3.region": region,
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": region
}

# Specify database and table names
database_name = "pyiceberg_lambda_blog_database"
table_name = "nyc_yellow_table"

# Load the catalog and get the table
catalog = load_catalog(**catalog_properties)
table = catalog.load_table(f"{database_name}.{table_name}")

You can query the full data from the Iceberg table as an Apache Arrow table and convert it to a Pandas DataFrame.

scan table
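A minimal sketch of the notebook cell shown above (the df variable name matches the one used in the comparison plot later) looks like the following:

# Scan the full table into Arrow, then convert to a Pandas DataFrame
pa_table = table.scan().to_arrow()
df = pa_table.to_pandas()
df.head()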

Working with Snapshots

One of the important features of Iceberg is snapshot-based version control. Snapshots are automatically created whenever data changes occur in the table. You can retrieve data from a specific snapshot, as shown in the following example.

working with snapshots
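The snapshots listing shown above can be produced with PyIceberg's metadata inspection API; a minimal sketch, assuming the inspect API available in recent PyIceberg releases, is:

# List the table's snapshots as an Arrow table and view them in Pandas
snapshots = table.inspect.snapshots()
snapshots.to_pandas()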

# Get data from a specific snapshot ID
snapshot_id = snapshots.to_pandas()["snapshot_id"][3]
snapshot_pa_table = table.scan(snapshot_id=snapshot_id).to_arrow()
snapshot_df = snapshot_pa_table.to_pandas()

You can compare the current data with historical data from any point in time based on snapshots. In this case, you compare the differences in data distribution between the latest table and a snapshot table:

# Compare the distribution of total_amount in the specified snapshot and the latest data.
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 3))
df['total_amount'].hist(bins=30, density=True, label="latest", alpha=0.5)
snapshot_df['total_amount'].hist(bins=30, density=True, label="snapshot", alpha=0.5)
plt.title('Distribution of total_amount')
plt.xlabel('total_amount')
plt.ylabel('Relative frequency')
plt.legend()
plt.show()

matplotlib graph

Tagging snapshots

You can tag specific snapshots with an arbitrary name and query those snapshots by that name later. This is useful for managing snapshots of important events.
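A minimal sketch of creating such a tag with PyIceberg's snapshot management API follows; the tag name checkpointTag matches the one queried in the next cell, and the exact call is an assumption to verify against your PyIceberg version:

# Tag the current snapshot so it can be referenced by name later
tag_name = "checkpointTag"
current_snapshot_id = table.current_snapshot().snapshot_id
table.manage_snapshots().create_tag(snapshot_id=current_snapshot_id, tag_name=tag_name).commit()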

In this example, you query a snapshot by specifying the tag checkpointTag. Here, you use Polars to create a new DataFrame, adding a new column called trip_duration derived from the existing tpep_dropoff_datetime and tpep_pickup_datetime columns:

# Retrieve the tagged snapshot table as a Polars DataFrame
import polars as pl

# Get the snapshot id from the tag name
df = table.inspect.refs().to_pandas()
filtered_df = df[df["name"] == tag_name]
tag_snapshot_id = filtered_df["snapshot_id"].iloc[0]

# Scan the table based on the snapshot id
tag_pa_table = table.scan(snapshot_id=tag_snapshot_id).to_arrow()
tag_df = pl.from_arrow(tag_pa_table)

# Process the data, adding a new column "trip_duration" from the checkpoint snapshot.
def preprocess_data(df):
    df = df.select(["vendorid", "tpep_pickup_datetime", "tpep_dropoff_datetime",
                    "passenger_count", "trip_distance", "fare_amount"])
    df = df.with_columns(
        ((pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime"))
         .dt.total_seconds() // 60).alias("trip_duration"))
    return df

processed_df = preprocess_data(tag_df)
display(processed_df)
print(processed_df["trip_duration"].describe())

processed-df

Create a new table from the processed DataFrame with the trip_duration column. This step illustrates how to prepare data for potential future analysis. By using a tag, you can explicitly specify the snapshot of the data that the processed data refers to, even if the underlying table has been modified.

# Write the processed data to a new Iceberg table
account_id = sts.get_caller_identity()["Account"]

new_table_name = "processed_" + table_name
location = f"s3://pyiceberg-lambda-blog-{account_id}-{region}/{database_name}/{new_table_name}"

pa_new_table = processed_df.to_arrow()
schema = pa_new_table.schema
identifier = f"{database_name}.{new_table_name}"

new_table = catalog.create_table(
    identifier=identifier,
    schema=schema,
    location=location
)

# Show the new table's schema
print(new_table.schema())

# Insert the processed data into the new table
new_table.append(pa_new_table)

Let's query this new table built from the processed data with Athena to demonstrate the Iceberg table's interoperability.

Query the data from Athena

  1. In the Athena query editor, query the table pyiceberg_lambda_blog_database.processed_nyc_yellow_table created from the notebook in the previous section:
SELECT * FROM "pyiceberg_lambda_blog_database"."processed_nyc_yellow_table" limit 10;

query with athena

By completing these steps, you've built a serverless data processing solution using PyIceberg with Lambda and an AWS Glue Iceberg REST endpoint. You've worked with PyIceberg to manage and analyze data using Python, including snapshot management and table operations. In addition, you ran a query using another engine, Athena, which shows the compatibility of the Iceberg table.

Clean up

To clean up the resources used in this post, complete the following steps:

  1. On the Amazon ECR console, navigate to the repository pyiceberg-lambda-repository and delete all images contained in the repository.
  2. On CloudShell, delete the working directory pyiceberg_blog.
  3. On the Amazon S3 console, navigate to the S3 bucket pyiceberg-lambda-blog-<account_id>-<region>, which you created using the CloudFormation template, and empty the bucket.
  4. After you confirm the repository and the bucket are empty, delete the CloudFormation stack pyiceberg-lambda-blog-stack.
  5. Delete the Lambda function pyiceberg-lambda-function that you created using the Docker image.

Conclusion

In this post, we demonstrated how using PyIceberg with the AWS Glue Data Catalog enables efficient, lightweight data workflows while maintaining robust data management capabilities. We showcased how teams can use Iceberg's powerful features with minimal setup and infrastructure dependencies. This approach allows organizations to start working with Iceberg tables quickly, without the complexity of setting up and managing distributed computing resources.

This is particularly valuable for organizations looking to adopt Iceberg's capabilities with a low barrier to entry. The lightweight nature of PyIceberg allows teams to begin working with Iceberg tables immediately, using familiar tools and requiring minimal additional learning. As data needs grow, the same Iceberg tables can be seamlessly accessed by AWS analytics services like Athena and AWS Glue, providing a clear path for future scalability.

To learn more about PyIceberg and AWS analytics services, we encourage you to explore the PyIceberg documentation and What is Apache Iceberg?


About the authors

Sotaro Hikita is a Specialist Solutions Architect focused on analytics with AWS, working with big data technologies and open source software. Outside of work, he always seeks out good food and has recently become enthusiastic about pizza.

Shuhei Fukami is a Specialist Solutions Architect focused on Analytics with AWS. He likes cooking in his spare time and has become obsessed with making pizza these days.
