AWS Glue 5.0 supports fine-grained access control (FGAC) based on the policies you define in AWS Lake Formation. FGAC gives you granular control over access to the data in your data lake at the table, column, and row level. This level of control matters for organizations that must comply with evolving data governance and security regulations, especially those handling sensitive or regulated data.
With Lake Formation, you can build, manage, and govern data lakes that are secure and scalable. Lake Formation lets you define fine-grained access control rules through grant and revoke statements, similar to those used with relational database management systems (RDBMS), and those rules are automatically enforced by compatible engines such as Athena, Amazon EMR, and Amazon Redshift Spectrum. With the release of AWS Glue 5.0, the same Lake Formation rules that you set up for use with Athena now also apply to your AWS Glue Spark jobs and interactive sessions through built-in Spark SQL and Spark DataFrames, simplifying the security and governance of your data lakes.
In this post, we show how to enable Lake Formation FGAC in AWS Glue 5.0 jobs and notebooks, giving you granular data security and control through Lake Formation permissions.
How Lake Formation FGAC works with AWS Glue 5.0
With FGAC in AWS Glue 5.0, Lake Formation permissions are enforced natively in Spark, so a job can only read the databases, tables, rows, and columns that have been granted to its IAM role.
When you use AWS Glue 5.0 with Lake Formation, AWS Glue applies granular permission control within each Spark job, enforcing Lake Formation's access controls while the job runs. AWS Glue uses a two-profile mechanism to run jobs: the user profile runs the user-provided script, and the system profile enforces Lake Formation policies. For more information, refer to the AWS Glue documentation.
The diagram illustrates the high-level architecture for accessing data secured by Lake Formation permissions in AWS Glue 5.0.
The workflow comprises the following sequential steps:
- A user calls the
StartJobRun
API on a Lake Formation enabled AWS Glue job. - AWS Glue sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that has no ability to launch tasks, provision executors, or access Amazon S3 or the Data Catalog. It builds a job plan.
- AWS Glue sets up a second driver, called the system driver, and runs it in the system profile with a privileged identity. AWS Glue establishes an encrypted Transport Layer Security (TLS) channel between the two drivers for communication. The user driver uses this channel to send the job plan to the system driver. The system driver doesn't run user-submitted code; it runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It receives the job plan and breaks it down into a sequence of executable stages.
- AWS Glue runs the stages on executors with either the user driver or the system driver. The user code in any stage is run exclusively on user profile executors.
- Stages that read data from Data Catalog tables protected by Lake Formation, or that apply security filters, are delegated to system profile executors.
Enable Lake Formation FGAC in AWS Glue jobs
To enable Lake Formation FGAC for your AWS Glue 5.0 jobs on the AWS Glue console, complete the following steps:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose your job.
- Choose the Job details tab.
- For Glue version, choose Glue 5.0.
- For Job parameters, add the following parameter:
- Key:
--enable-lakeformation-fine-grained-access
- Value:
true
- Choose Save.
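You can also set this parameter programmatically when creating or updating the job. The following boto3 sketch is for illustration only; the job name, IAM role, and script location are placeholders based on the example in this post.

```python
import boto3

glue = boto3.client("glue")

# Illustrative sketch: create an AWS Glue 5.0 job with Lake Formation FGAC enabled.
# The job name, IAM role, and script path are placeholders.
glue.create_job(
    Name="glue5-lf-demo",
    Role="GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-bucket>/scripts/glue5-lf-demo.py",
        "PythonVersion": "3",
    },
    GlueVersion="5.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={
        "--enable-lakeformation-fine-grained-access": "true",
    },
)
```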
To enable Lake Formation FGAC on AWS Glue notebooks, set the same parameter in your session with the %%configure magic:
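A minimal sketch of that magic cell, assuming the parameter is passed the same way as the job parameter above:

```
%%configure
{
  "--enable-lakeformation-fine-grained-access": "true"
}
```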
Example use case
The following diagram illustrates the high-level architecture of the use case described in this post.
The goal of this use case is to apply Lake Formation FGAC to both a standard CSV table and an Apache Iceberg table, and then configure an AWS Glue PySpark job that reads from them.
The implementation consists of the following steps:
- Upload the sample CSV dataset to an Amazon S3 bucket.
- Create a standard table and an Iceberg table in the Data Catalog by running queries in Amazon Athena.
- Configure Lake Formation FGAC for both the CSV table and the Iceberg table, using row-based and column-based data filters.
- Run two AWS Glue jobs with a sample PySpark script that reads from those tables. The Lake Formation FGAC permissions are enforced, and the output is written to Amazon S3.
To demonstrate the behavior, we use a sample dataset that represents product inventory, with the following columns:
- op – The operation on the source record. This shows values I to represent insert operations, U to represent updates, and D to represent deletes.
- product_id – The primary key column in the source inventory table.
- category – The product's category, such as Electronics or Cosmetics.
- product_name – The product name.
- quantity_available – The quantity of the product available in inventory.
- last_update_time – The time when the product record was last updated in the source database.
To demonstrate this, we create AWS resources including an S3 bucket, configure Lake Formation FGAC for the tables, and create AWS Glue jobs that query those tables.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with an IAM role that has the permissions required for the walkthrough. The role needs permissions to do the following:
- Write data to an Amazon S3 bucket.
- Create and run AWS Glue crawlers and jobs.
- Query tables in the Data Catalog.
- Manage Athena workgroups and run queries.
- A Lake Formation data lake administrator already configured in the account, following the guidance in this post. For more information about configuring permissions for a data lake administrator role, refer to the Lake Formation documentation.
This post uses the eu-west-1 Region, but you can use any single Region that supports all of the required services and features.
In the following sections, we walk through the implementation.
Create an S3 bucket
Complete the following steps to create an S3 bucket for the raw input data and the Iceberg table:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- Enter a bucket name (for example, glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), keep the remaining settings at their defaults, and choose Create bucket.
- On the bucket details page, choose Create folder.
- Create two subfolders: raw-csv-input and iceberg-datalake.
- Upload the sample CSV file into the raw-csv-input folder of the bucket.
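If you prefer to script this setup, the following boto3 sketch performs the same steps; the bucket name and the CSV file name are placeholders.

```python
import boto3

s3 = boto3.client("s3")
bucket = "glue5-lf-demo-<account-id>-<region-code>"  # placeholder bucket name

# Create the bucket (outside us-east-1, a LocationConstraint is required)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Create the two folders as zero-byte prefix objects
s3.put_object(Bucket=bucket, Key="raw-csv-input/")
s3.put_object(Bucket=bucket, Key="iceberg-datalake/")

# Upload the sample CSV file (placeholder file name) into the raw-csv-input folder
s3.upload_file("products.csv", bucket, "raw-csv-input/products.csv")
```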
Create tables
Complete the following steps to create the input and output tables in the Data Catalog:
- On the Athena console, open the query editor.
- Run a query to create the raw_csv_input table over the CSV files in the raw-csv-input folder.
- Run a query to create the iceberg_datalake Iceberg table, with the iceberg-datalake folder as its location. Example DDL statements are sketched after this list.
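The exact DDL depends on your bucket name and preferred column types; the following Athena statements are a minimal sketch, assuming the glue5_lf_demo database already exists, the column set described earlier, and the folder layout created in the previous section.

```sql
-- Standard CSV table over the raw input data (replace the bucket name)
CREATE EXTERNAL TABLE glue5_lf_demo.raw_csv_input (
  op string,
  product_id bigint,
  category string,
  product_name string,
  quantity_available bigint,
  last_update_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<your-bucket>/raw-csv-input/';

-- Iceberg table stored in the iceberg-datalake folder
CREATE TABLE glue5_lf_demo.iceberg_datalake (
  op string,
  product_id bigint,
  category string,
  product_name string,
  quantity_available bigint,
  last_update_time timestamp
)
LOCATION 's3://<your-bucket>/iceberg-datalake/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```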
The following screenshot shows the query results.
We used DDL statements to define the tables here. Alternatively, you can use the Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
Next, configure Lake Formation permissions for the raw_csv_input table and the iceberg_datalake table.
Configure Lake Formation permissions
To verify the capability, we define FGAC permissions for the two Data Catalog tables we created.
For the raw_csv_input table, we configure row-level permissions so that, for example, read access is allowed only for the Furniture category. Similarly, for the iceberg_datalake table, we configure a data filter so that rows in the Electronics category can be read only through a specific subset of columns.
To configure Lake Formation permissions for the two tables, follow these steps:
- On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
- Choose Register location.
- For Amazon S3 path, enter the path of your S3 bucket to register the location.
- For IAM role, choose a user-defined role that is not a service-linked role.
- For Permission mode, select Lake Formation.
- Choose Register location.
Grant table permissions on the standard table
The next step is to grant table permissions on the raw_csv_input table to the IAM role used for the AWS Glue job:
- On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- For Principals, choose IAM users and roles.
- Choose the IAM role that will be used on the AWS Glue job.
- For LF-Tags or catalog resources, select Named Data Catalog resources.
- For Databases, choose glue5_lf_demo.
- For Tables, choose raw_csv_input.
- For Data filters, choose Create new.
- In the Create data filter dialog, provide the following information:
- For Data filter name, enter product_furniture.
- For Column-level access, select Access to all columns.
- For Row-level access, choose Filter rows.
- For Row filter expression, enter category='Furniture'.
- Choose Create filter.
- For Data filters, select the filter product_furniture you created.
- For Data filter permissions, select Select and Describe.
- Choose Grant.
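If you automate this configuration, the console steps above roughly correspond to the following boto3 sketch, which uses the Lake Formation CreateDataCellsFilter and GrantPermissions operations; the role name is a placeholder.

```python
import boto3

lf = boto3.client("lakeformation")
account_id = boto3.client("sts").get_caller_identity()["Account"]
glue_job_role_arn = f"arn:aws:iam::{account_id}:role/GlueJobRole"  # placeholder role name

# Create a row-level data filter that exposes only the Furniture category
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "glue5_lf_demo",
        "TableName": "raw_csv_input",
        "Name": "product_furniture",
        "RowFilter": {"FilterExpression": "category='Furniture'"},
        "ColumnWildcard": {"ExcludedColumnNames": []},  # all columns visible
    }
)

# Grant SELECT and DESCRIBE on the data filter to the AWS Glue job role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": glue_job_role_arn},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "glue5_lf_demo",
            "TableName": "raw_csv_input",
            "Name": "product_furniture",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```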
Grant table permissions on the Iceberg table
The next step is to grant table permissions on the iceberg_datalake table to the IAM role used for the AWS Glue job:
- On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- For Principals, choose IAM users and roles.
- Choose the IAM role that will be used on the AWS Glue job.
- For LF-Tags or catalog resources, select Named Data Catalog resources.
- For Databases, choose glue5_lf_demo.
- For Tables, choose iceberg_datalake.
- For Data filters, choose Create new.
- In the Create data filter dialog, provide the following information:
- For Data filter name, enter product_electronics.
- For Column-level access, select Include columns.
- For Included columns, choose category, last_update_time, op, product_name, and quantity_available.
- For Row-level access, choose Filter rows.
- For Row filter expression, enter category='Electronics'.
- Choose Create filter.
- For Data filters, select the filter product_electronics you created.
- For Data filter permissions, select Select and Describe.
- Choose Grant.
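The equivalent column- and row-restricted filter for the Iceberg table might be created as in the following sketch (continuing the earlier boto3 example, where lf and account_id are already defined), followed by a grant_permissions call like the one shown earlier that targets this filter.

```python
# Column- and row-level filter for the Iceberg table
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "glue5_lf_demo",
        "TableName": "iceberg_datalake",
        "Name": "product_electronics",
        "RowFilter": {"FilterExpression": "category='Electronics'"},
        # Only these columns are readable through the filter; product_id is excluded
        "ColumnNames": [
            "category",
            "last_update_time",
            "op",
            "product_name",
            "quantity_available",
        ],
    }
)
```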
Let’s create an AWS Glue PySpark job to process the input data.
Query the standard table through an AWS Glue 5.0 job
Complete the following steps to create an AWS Glue job that reads from the raw_csv_input table:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- For Create job, select Script editor.
- For Engine, select Spark.
- For Options, select Start fresh.
- Choose Create script.
- For the script, use a PySpark script that reads the raw_csv_input table and writes the result to your S3 output path (a minimal example sketch follows these steps). This example script writes its output in Parquet format; change it based on your use case.
- On the Job details tab, for Name, enter glue5-lf-demo.
- For IAM Role, choose the IAM role to run the AWS Glue job. The role needs permissions to run AWS Glue jobs and to read from and write to the S3 bucket.
- For Glue version, choose Glue 5.0.
- For Job parameters, add the following parameter:
- Key:
--enable-lakeformation-fine-grained-access
- Value:
true
- Choose Save, and then choose Run.
- When the job is complete, navigate to the Runs tab and choose the output logs, which redirects you to the CloudWatch console to validate the output.
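The job script itself can be as simple as the following PySpark sketch, which assumes the glue5_lf_demo.raw_csv_input table from this post and a placeholder output path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Data Catalog table; Lake Formation FGAC row filters are applied automatically
df = spark.sql("SELECT * FROM glue5_lf_demo.raw_csv_input")
df.show()

# Write the filtered result as Parquet (replace with your S3 output path)
df.write.mode("overwrite").parquet("s3://<s3_output_path>/raw-csv-output/")
```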
The following screenshot shows the printed table. Only two records were returned because they are Furniture category products.
Query the Iceberg table through an AWS Glue 5.0 job
Complete the following steps to create an AWS Glue job that reads from the iceberg_datalake table:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- For Create job, select Script editor.
- For Engine, select Spark.
- For Options, select Start fresh.
- Choose Create script.
- For the script, use a PySpark script that reads the iceberg_datalake table (a minimal example sketch follows these steps). This example script writes the output in Parquet format; change it based on your use case. Replace the following parameters:
- Replace aws_region with your Region.
- Replace aws_account_id with your AWS account ID.
- Replace warehouse_path with your S3 warehouse path for the Iceberg table.
- Replace <s3_output_path> with your S3 output path.
- On the Job details tab, for Name, enter glue5-lf-demo-iceberg.
- For IAM Role, choose the IAM role to run the AWS Glue job. The role needs permissions to run AWS Glue jobs and to read from and write to the S3 bucket.
- For Glue version, choose Glue 5.0.
- For Job parameters, add the following parameters:
- Key:
--enable-lakeformation-fine-grained-access
- Value:
true
- Key:
--datalake-formats
- Value:
iceberg
- Choose Save, and then choose Run.
- When the job is complete, navigate to the Runs tab and choose the output logs, which redirects you to the CloudWatch console to validate the output.
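The Iceberg job script can follow a similar pattern; the following sketch configures an Iceberg catalog named glue_catalog and assumes the placeholders described in the steps above.

```python
from pyspark.sql import SparkSession

# Replace the placeholders with your warehouse path and output path
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<warehouse_path>/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Lake Formation FGAC row and column filters are enforced on the read
df = spark.sql("SELECT * FROM glue_catalog.glue5_lf_demo.iceberg_datalake")
df.show()

# Write the filtered result as Parquet (replace with your S3 output path)
df.write.mode("overwrite").parquet("s3://<s3_output_path>/iceberg-output/")
```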
The following screenshot shows the printed table. Only two records were returned because they are Electronics category products, and the product_id column is excluded.
You have now verified that data in the raw_csv_input table and the iceberg_datalake table can be retrieved with the configured Lake Formation data cell filters applied.
Clean up
To avoid incurring future charges, clean up the resources you created:
- Delete the AWS Glue jobs glue5-lf-demo and glue5-lf-demo-iceberg.
- Delete the Lake Formation permissions.
- Delete the output files written to the Amazon S3 bucket.
- Delete the bucket you created for the input datasets, which has a name similar to glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.
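If you prefer to clean up programmatically, a boto3 sketch like the following removes the jobs and then empties and deletes the bucket; the bucket name is a placeholder.

```python
import boto3

glue = boto3.client("glue")
s3 = boto3.resource("s3")

# Delete the two demo jobs
for job_name in ["glue5-lf-demo", "glue5-lf-demo-iceberg"]:
    glue.delete_job(JobName=job_name)

# Empty and delete the demo bucket (placeholder name)
bucket = s3.Bucket("glue5-lf-demo-<account-id>-<region-code>")
bucket.objects.all().delete()
bucket.delete()
```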
Conclusion
This post explained how to enable Lake Formation FGAC in AWS Glue jobs and notebooks so that access control is enforced according to Lake Formation grant rules. With this launch, you can implement FGAC in your AWS Glue jobs directly with Spark DataFrames or Spark SQL, without needing AWS Glue DynamicFrames for this purpose. The feature works with standard file formats such as CSV, JSON, and Parquet, as well as with Apache Iceberg.
This capability also simplifies migration and improves portability, because you can move Spark scripts between serverless environments such as AWS Glue and Amazon EMR.
About the Authors
He is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture by designing end-to-end data solutions that include data security, accessibility, governance, and more. He is also a published author. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.
He is a Principal Big Data Architect on the AWS Glue team, and the author of a book. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
He is a Senior Product Manager on the AWS Glue team. He is passionate about helping customers discover insights and make better decisions using AWS Analytics services. In his spare time, he enjoys snowboarding and gardening.
He is a Software Development Engineer on the AWS Glue team. He enjoys solving complex problems at scale and building products that customers love. Outside of work, he enjoys watching basketball and spending time with family and friends.