Organizations are rapidly expanding their digital presence, creating opportunities to serve customers better through web applications. AWS WAF logs play a key role in this expansion by enabling organizations to proactively monitor security, maintain compliance, and strengthen application defenses. AWS WAF log analysis is essential across many industries, including banking, retail, and healthcare, each of which needs to deliver secure digital experiences.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They're using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead. Apache Iceberg combines enterprise reliability with SQL simplicity when working with security data stored in Amazon Simple Storage Service (Amazon S3), enabling organizations to focus on security insights rather than infrastructure management.
Apache Iceberg enhances security analytics through several key capabilities. It integrates seamlessly with various AWS services and analysis tools while supporting concurrent read-write operations for simultaneous log ingestion and analysis. Its time travel feature enables thorough security forensics and incident investigation, and its schema evolution support lets teams adapt to emerging security patterns without disrupting existing workflows. These capabilities make Apache Iceberg an ideal choice for building robust security analytics solutions. However, organizations often struggle when building their own solutions to deliver data to Apache Iceberg tables. Challenges include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Teams must also build resilient error handling, implement retry logic, and manage scaling infrastructure, all while maintaining data consistency and high availability. These challenges take valuable time away from analyzing security data and deriving insights.
To address these challenges, Amazon Data Firehose provides real-time data delivery to Apache Iceberg tables within seconds. Firehose delivers high reliability across multiple Availability Zones while automatically scaling to match throughput requirements. It's fully managed and requires no infrastructure management or custom code development. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency. It also provides built-in data transformation, compression, and encryption capabilities, along with automatic retry mechanisms for reliable data delivery. This makes it an ideal choice for streaming AWS WAF logs directly into a data lake while minimizing operational overhead.
In this post, we demonstrate how you can build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process, from log ingestion to storage, by letting you configure a delivery stream that sends AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.
Solution overview
To implement this solution, you first configure AWS WAF logging to capture web traffic information. This captures detailed information about the traffic analyzed by your web access control lists (ACLs). Each log entry includes the request timestamp, detailed request information, and the rule matches that were triggered. These logs are continuously streamed to Firehose in real time.
Firehose writes these logs to an Apache Iceberg table stored in Amazon S3. When Firehose delivers data to the table, it uses the AWS Glue Data Catalog to store and manage table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.
Finally, security teams can analyze the data in the Apache Iceberg tables using various AWS services, including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker. For this demonstration, we use Athena to run SQL queries against the security logs.
The following diagram illustrates the solution architecture.
The implementation consists of four steps:
- Deploy the base infrastructure using AWS CloudFormation.
- Create an Apache Iceberg table using an AWS Glue notebook.
- Create a Firehose stream to handle the log data.
- Configure AWS WAF logging to send data to the Apache Iceberg table through the Firehose stream.
You can deploy the required resources into your AWS environment in the US East (N. Virginia) AWS Region using a CloudFormation template. The template creates an S3 bucket for storing AWS WAF logs, an AWS Glue database for the Apache Iceberg tables, and the AWS Identity and Access Management (IAM) roles and policies needed for the solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with access to the US East (N. Virginia) Region
- AWS WAF configured with a web ACL in the US East (N. Virginia) Region
If you don't have AWS WAF set up, refer to the AWS WAF Workshop to create a sample web application with AWS WAF.
AWS WAF logs use case-sensitive field names (like httpRequest and webaclId). For successful log ingestion, this solution uses the Apache Iceberg API through an AWS Glue job to create tables; this is a reliable approach that preserves the exact field names from the AWS WAF logs. Although AWS Glue crawlers and Athena DDL statements offer convenient ways to create Apache Iceberg tables, they convert mixed-case column names to lowercase, which can affect AWS WAF log processing. By using an AWS Glue job with the Apache Iceberg API, the case sensitivity of column names is preserved, providing correct mapping between AWS WAF log fields and table columns.
Deploy the CloudFormation stack
Complete the following steps to deploy the solution resources with AWS CloudFormation:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack.
- Choose Next.
- For Stack name, leave the default WAF-Firehose-Iceberg-Stack.
- Under Parameters, specify whether AWS Lake Formation permissions are to be used for the AWS Glue tables.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Next.
- Review the deployment and choose Submit.
The stack takes a few minutes to deploy. After the deployment is complete, you can review the created resources on the Resources tab of the CloudFormation stack.
Create an Apache Iceberg table
Before setting up the Firehose delivery stream, you must create the destination Apache Iceberg table in the Data Catalog. This is done using an AWS Glue job and the Apache Iceberg API, as discussed earlier. Complete the following steps to create an Apache Iceberg table:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Choose the Notebook option under Create job.
- Under Options, select Start fresh.
- For IAM role, choose WAF-Firehose-Iceberg-Stack-GlueServiceRole-*.
- Choose Create notebook.
- Enter the following configuration command in the notebook to configure the Spark session with Apache Iceberg extensions (see the configuration sketch after this list). Be sure to update the spark.sql.catalog.glue_catalog.warehouse setting to the S3 bucket created by the CloudFormation template.
- Enter the following SQL in the AWS Glue notebook to create the Apache Iceberg table (a sample DDL sketch follows this list).
- Navigate to the waf_logs_db database in the Data Catalog to confirm that the firehose_waf_logs table has been created.
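The following is a minimal sketch of the kind of configuration cell the first notebook step refers to, assuming the %%configure magic available in AWS Glue Studio notebooks and the standard Apache Iceberg Spark settings; the warehouse value is a placeholder, so replace it with the S3 bucket created by the CloudFormation template.

```
%%configure
{
  "--datalake-formats": "iceberg",
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-waf-logs-bucket>/"
}
```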
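The DDL sketch below shows one possible table definition for the table-creation step. It covers only a subset of the AWS WAF log fields (the full log format includes additional fields such as headers, labels, and ruleGroupList), and it can be run from the notebook with the %%sql cell magic or spark.sql(). Because Spark preserves the declared column case, the mixed-case field names discussed in the prerequisites are retained.

```sql
-- Simplified AWS WAF log schema; extend with additional fields as needed
CREATE TABLE glue_catalog.waf_logs_db.firehose_waf_logs (
    `timestamp`          bigint,      -- request time in epoch milliseconds
    formatVersion        int,
    webaclId             string,
    terminatingRuleId    string,
    terminatingRuleType  string,
    action               string,      -- ALLOW, BLOCK, or COUNT
    httpSourceName       string,
    httpSourceId         string,
    httpRequest struct<               -- nested request details (subset)
        clientIp:    string,
        country:     string,
        uri:         string,
        args:        string,
        httpVersion: string,
        httpMethod:  string,
        requestId:   string
    >
)
USING iceberg;
```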
Create a Firehose stream
Complete the following steps to create a Firehose stream:
- On the Amazon Data Firehose console, choose Create Firehose stream.
- Choose Direct PUT for Source and Apache Iceberg Tables for Destination.
- For Firehose stream name, enter aws-waf-logs-firehose-iceberg-1. (AWS WAF requires the name of a Firehose stream used for logging to start with aws-waf-logs-.)
- In the Destination settings section, enable Inline parsing for routing information. Because we're sending all records to one table, specify the destination database and table names:
  - For Database expression, enter "waf_logs_db".
  - For Table expression, enter "firehose_waf_logs".
Make sure to include the double quotation marks so the literal values are used for the database and table names. If you don't use double quotation marks, Firehose assumes the entry is a JSON query expression, attempts to parse it when processing your stream, and fails. Firehose can also route to different Apache Iceberg tables based on the content of the data. For more information, refer to Route incoming records to different Iceberg tables.
- For S3 backup bucket, enter the S3 bucket created by the CloudFormation template.
- For S3 backup bucket error output prefix, enter error/events-1/.
- Under Advanced settings, select Enable server-side encryption for source records in Firehose stream.
- For Existing IAM roles, choose the role that begins with WAF-Firehose-Iceberg-stack-FirehoseIAMRole-*, created by the CloudFormation template.
- Choose Create Firehose stream.
Configure AWS WAF logging to the Firehose stream
Complete the following steps to configure AWS WAF to send logs to the Firehose stream:
- On the AWS WAF console, choose Web ACLs in the navigation pane.
- Choose your web ACL.
- On the Logging and metrics tab, choose Enable.
- For Amazon Data Firehose stream, choose the stream aws-waf-logs-firehose-iceberg-1.
- Choose Save.
Query and analyze the logs
You can query the data you've written to your Apache Iceberg tables using different processing engines, such as Apache Spark, Apache Flink, or Trino. In this example, we use Athena to query the AWS WAF log data stored in Apache Iceberg tables. Complete the following steps:
- On the Athena console, choose Settings in the top right corner.
- For Location of query result, enter the s3:// path of the S3 bucket created by the CloudFormation template.
- Enter the AWS account ID for Expected bucket owner and choose Save.
- In the query editor, under Tables and views, choose the options menu next to firehose_waf_logs and choose Preview Table.
You should be able to see the AWS WAF logs in the Apache Iceberg table using Athena.
The following are some additional useful example queries:
- Identify potential attack sources by analyzing blocked IP addresses (see the first query sketch after this list)
- Monitor attack patterns and trends over time (see the second query sketch after this list)
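The first example can be sketched as the following Athena query, written against the simplified schema shown earlier (adjust the field names if your table differs). It surfaces the source IPs with the most blocked requests:

```sql
-- Top 10 source IPs with the most blocked requests
SELECT httpRequest.clientIp AS client_ip,
       httpRequest.country  AS country,
       COUNT(*)             AS blocked_requests
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY httpRequest.clientIp, httpRequest.country
ORDER BY blocked_requests DESC
LIMIT 10;
```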
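The second example can be sketched as follows; it buckets requests by hour, action, and terminating rule so you can track attack patterns over time (the timestamp field in AWS WAF logs is in epoch milliseconds):

```sql
-- Hourly request counts by action and terminating rule
SELECT date_trunc('hour', from_unixtime("timestamp" / 1000)) AS request_hour,
       action,
       terminatingRuleId,
       COUNT(*) AS request_count
FROM waf_logs_db.firehose_waf_logs
GROUP BY 1, 2, 3
ORDER BY request_hour DESC, request_count DESC;
```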
Apache Iceberg table optimization
Although Firehose enables efficient streaming of AWS WAF logs into Apache Iceberg tables, the nature of streaming writes can result in many small files being created, because Firehose delivers data based on its buffering configuration. This can lead to suboptimal query performance. To address this, regular table optimization is recommended.
There are two recommended table optimization approaches:
- Compaction – Data compaction merges small data files to reduce storage overhead and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
- Storage optimization – You can manage storage overhead by removing older, unnecessary snapshots and their associated underlying files. This also includes periodically deleting orphan files to maintain efficient storage utilization and optimal query performance.
These optimizations can be performed using either the Data Catalog or Athena.
Table optimization using the Data Catalog
The Data Catalog provides automatic table optimization features. Within the table optimization feature, you can configure specific optimizers for compaction, snapshot retention, and orphan file deletion. The table optimization schedule can be managed, and its status monitored, from the AWS Glue console.
Table optimization using Athena
Athena supports manual optimization through SQL commands. The OPTIMIZE command rewrites small files into larger files and applies file compaction:
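For example, run against the table from this walkthrough (an optional WHERE clause can scope the rewrite to a subset of the data):

```sql
-- Compact small data files using bin packing
OPTIMIZE waf_logs_db.firehose_waf_logs REWRITE DATA USING BIN_PACK;
```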
The VACUUM command removes old snapshots and cleans up expired data files:
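For example, against the same table; how much history is retained is governed by the table's snapshot retention properties:

```sql
-- Expire old snapshots and remove files no longer referenced
VACUUM waf_logs_db.firehose_waf_logs;
```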
You can monitor the table's optimization status using the following query:
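One way to check the effect of these operations, sketched here using the $files metadata table that Athena exposes for Iceberg tables, is to look at the number of data files and their average size; both should improve after compaction:

```sql
-- Data file count and average file size for the Iceberg table
SELECT COUNT(*)                                      AS data_file_count,
       ROUND(AVG(file_size_in_bytes) / 1048576.0, 2) AS avg_file_size_mb
FROM "waf_logs_db"."firehose_waf_logs$files";
```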
Clean up
To avoid future charges, complete the following steps:
- Empty the S3 bucket.
- Delete the CloudFormation stack.
- Delete the Firehose stream.
- Disable AWS WAF logging.
Conclusion
In this post, we demonstrated how you can build an AWS WAF log analytics pipeline using Firehose to deliver AWS WAF logs to Apache Iceberg tables on Amazon S3. The solution handles large-scale AWS WAF log processing without requiring complex code or infrastructure management. Although this post focused on Apache Iceberg tables as the destination, Data Firehose also integrates seamlessly with Amazon S3 Tables. To optimize your tables for querying, Amazon S3 Tables continually performs automatic maintenance operations, such as compaction, snapshot management, and unreferenced file removal. These operations enhance table performance by compacting smaller objects into fewer, larger files.
To get started with your own implementation, try the solution in your AWS account and explore the following resources for additional features and best practices:
About the Authors
Charishma Makineni is a Senior Technical Account Manager at AWS. She provides strategic technical guidance to independent software vendors (ISVs) to help them build and optimize solutions on AWS. She specializes in big data and analytics technologies, helping organizations optimize their data-driven initiatives on AWS.
Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.