Wednesday, April 2, 2025

Migrating sensitive data securely between Amazon RDS databases with no-code extract, transform, and load (ETL) jobs in AWS Glue Studio.

Transferring and transforming data between databases is a common requirement for many organisations. Replicating data from a production database into a test environment, with personally identifiable information (PII) masked, enables development, testing, and reporting without disrupting core operations or exposing sensitive customer information. However, manually anonymizing cloned data is a significant burden for security and database teams.

AWS Glue Studio lets you orchestrate this kind of replication and mask sensitive PII without writing code. Its visual editor provides a low-code, graphical interface for building, running, and monitoring extract, transform, and load (ETL) jobs. AWS Glue manages the underlying complexity of resource provisioning, job monitoring, and retries, so you can quickly set up reliable data pipelines without maintaining any infrastructure of your own.

In this post, I walk you through how to replicate data from one database to another while protecting sensitive PII using AWS Glue Studio. You'll see how to set up a multi-account environment that gives AWS Glue secure access to the databases, and how to build an end-to-end ETL job that automatically masks PII during the data transfer, so sensitive data remains protected throughout. With this approach you can connect data sources and targets and safeguard individual privacy without writing any code.

Solution overview

The following diagram illustrates the solution architecture.

AWS Glue is used as the ETL engine to extract data from the source Amazon RDS database. The job transforms the data, masking the columns that contain PII using predefined masking capabilities, and finally inserts the privacy-protected data into the target Amazon RDS database.

The solution uses multiple AWS accounts. Working with multiple accounts helps you isolate and manage workloads and data, and enforce separation of duties. The AWS Glue account shown in the diagram is a dedicated account that hosts and administers all of the AWS Glue resources. The solution works just as well when everything runs within a single AWS account.

Note the following:

  1. The three AWS accounts are assumed to belong to the same organization, although this isn't strictly required for this solution.
  2. This solution suits scenarios where real-time replication isn't required; the job can be run on a schedule or triggered by specific events, as in the sketch that follows.
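For example, if you want the replication to run on a schedule rather than on demand, you could attach a scheduled trigger to the finished job with the AWS SDK for Python (boto3). This is a minimal sketch: the trigger name and cron expression are placeholders, and the job name matches the job created later in this walkthrough.

```python
import boto3

glue = boto3.client("glue")

# Schedule the replication job to run nightly at 02:00 UTC.
# Trigger name and schedule are placeholders; adjust them for your environment.
glue.create_trigger(
    Name="nightly-replication-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "ETL - Replicate customer data"}],
    StartOnCreation=True,
)
```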

Walkthrough

The walkthrough consists of the following high-level steps:

  1. Enable connectivity from the AWS Glue account to the source and target accounts.
  2. Create the AWS Glue components (connections and crawlers).
  3. Create and run the AWS Glue ETL job. The job follows the three classic ETL phases, each of which plays a role in retrieving data from the source, reshaping it to meet requirements, and writing it to the target:

    1. **Extract**: Retrieve the required data from the source, such as a relational database, NoSQL database, flat files, or cloud storage like Amazon S3. This requires an understanding of the source systems, their schemas, and any constraints that apply during retrieval.

    2. **Transform**: Ensure data quality and integrity by applying operations such as filtering, aggregation, sorting, grouping, and, in this solution, masking of PII, according to business requirements. This requires an understanding of the target schema, data relationships, and the transformations needed.

    3. **Load**: Write the transformed data into the target system, for example an Amazon Redshift cluster for analytics workloads or an Amazon RDS or Amazon Aurora database for transactional workloads, while preserving data integrity and accounting for factors such as concurrency, batch size, and data volume.


  4. Verify the results.

Prerequisites

This walkthrough uses Amazon RDS for PostgreSQL version 13.14-R1. The solution also works with other database engines and JDBC driver versions supported by AWS Glue; refer to the AWS Glue documentation for details.

To follow along with this post, you need the following:

  • Three AWS accounts as follows:
    1. A source account that hosts the source Amazon RDS for PostgreSQL database. The database resides in a private subnet and contains the table with the data to be replicated. Note the VPC ID, security group, and private subnets associated with the RDS instance for later use.
    2. A target account that hosts a target Amazon RDS for PostgreSQL database with the same table structure as the source, initially empty. The database resides in a private subnet with no external access. Note the VPC ID, security group ID, and private subnets for later use.
    3. An AWS Glue account that contains a VPC, a private subnet, and a security group. The security group must have a self-referencing inbound rule that allows all TCP ports (0-65535) so that AWS Glue components can communicate with each other.

Verify that the AWS Glue account's security group has this self-referencing inbound rule before you continue.
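If you prefer to create the rule programmatically, the following boto3 sketch adds a self-referencing inbound rule that allows all TCP ports; the security group ID is a placeholder for the AWS Glue account's security group.

```python
import boto3

ec2 = boto3.client("ec2")

glue_sg_id = "sg-0123456789abcdef0"  # placeholder: the AWS Glue account's security group

# Self-referencing rule: allow all TCP traffic (0-65535) from members of the same
# security group, which AWS Glue requires for communication between its components.
ec2.authorize_security_group_ingress(
    GroupId=glue_sg_id,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": glue_sg_id}],
        }
    ],
)
```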

  • Ensure that the three VPC CIDRs do not overlap with each other, as illustrated in the table below.
  Account            VPC CIDR      Private subnet
  Source account     10.2.0.0/16   10.2.10.0/24
  AWS Glue account   10.1.0.0/16   10.1.10.0/24
  Target account     10.3.0.0/16   10.3.10.0/24

The following diagram depicts the complete environment.

To set up this environment, follow the instructions in the accompanying README file.

Database tables

Both the source and target databases contain a customer table with an identical structure. The source table is populated with the following sample data.

The AWS Glue ETL job you create masks sensitive information in specific columns: last_name, email, phone_number, ssn, and notes.

If you want to use the same table structure and data, the corresponding SQL statements are provided with the post.
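The exact SQL isn't reproduced here, but a customer table matching the columns referenced in this post might look like the following sketch, which uses psycopg2 to create and seed it. The id and first_name columns, the data types, the connection details, and the sample values are all assumptions for illustration.

```python
import psycopg2

# Placeholder connection details for the source database.
conn = psycopg2.connect(
    host="sourcedb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="sourcedb", user="postgres", password="example-password",
)

ddl = """
CREATE SCHEMA IF NOT EXISTS cx;
CREATE TABLE IF NOT EXISTS cx.customer (
    id           integer PRIMARY KEY,
    first_name   text,
    last_name    text,      -- masked by the ETL job
    email        text,      -- masked
    phone_number text,      -- masked
    ssn          text,      -- masked
    notes        text       -- masked (free text may contain PII)
);
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(
        "INSERT INTO cx.customer VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
        (1, "Jane", "Doe", "jane.doe@example.com", "+1-555-0100", "123-45-6789",
         "SSN on file: 123-45-6789"),
    )
conn.close()
```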

Enable connectivity from the AWS Glue account to the source and target accounts

When you create an AWS Glue ETL job that accesses JDBC databases, you specify the AWS IAM role, VPC ID, subnet ID, and security groups that AWS Glue uses to connect to those databases. For more details, see the AWS Glue documentation.

In this solution, the role, security groups, and subnet belong to the AWS Glue account. For AWS Glue to connect to the databases, the AWS Glue account's subnet and security group must allow traffic to and from both the source and target databases.

To enable access, first connect the VPCs, for example with VPC peering or AWS Transit Gateway; this post uses VPC peering. Alternatively, you could use Amazon S3 as intermediate storage between the databases, but that approach isn't covered here. For more details, see the VPC peering documentation.

Follow these steps:

  1. Peer the AWS Glue account's VPC with the database VPCs.
  2. Update the subnet route tables.
  3. Update the database security groups.

Peer the AWS Glue account VPC with the database VPCs

In the Amazon VPC console, complete the following steps from the AWS Glue account. These steps assume that the VPCs, subnets, route tables, and security groups described in the prerequisites already exist and that their CIDR blocks don't overlap.

  1. Create two VPC peering connections, one with the source account VPC and one with the target account VPC.
  2. In the source account, accept the VPC peering request.
  3. In the target account, accept the VPC peering request.
  4. Enable DNS resolution on each peering connection so that AWS Glue can resolve your databases' DNS endpoints to their private IP addresses. (A scripted sketch of these steps follows this list.)
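The same steps can be scripted with boto3, as in the following sketch run from the AWS Glue account. The VPC IDs, account ID, and Region are placeholders, and the accept and DNS-resolution calls are shown as comments because they must run from the appropriate account once the connection exists.

```python
import boto3

ec2 = boto3.client("ec2")

# 1. From the AWS Glue account, request a peering connection to the source account's VPC.
#    VPC IDs, the peer account ID, and the Region are placeholders.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0glue0123456789",
    PeerVpcId="vpc-0source0123456",
    PeerOwnerId="111122223333",
    PeerRegion="us-east-1",
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
print(f"Requested peering connection {peering_id}")

# 2. From the source account (using that account's credentials), accept the request:
#    boto3.client("ec2").accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# 3. Once the connection is active, each account enables DNS resolution for its side so
#    that AWS Glue can resolve the RDS endpoints to private IP addresses, for example:
#    ec2.modify_vpc_peering_connection_options(
#        VpcPeeringConnectionId=peering_id,
#        RequesterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
#    )

# Repeat the same sequence for the target account's VPC.
```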

After you complete these steps, both peering connections appear in the list of peering connections in your AWS Glue account.
Note that the source and target account VPCs are not peered with each other; no connectivity is needed between those two accounts.

Update the subnet route tables

This step enables traffic from the AWS Glue account's VPC to reach the VPC subnets associated with the databases in the source and target accounts.

Follow these steps:

  1. In the AWS Glue account, for each VPC peering connection, add a route to the route table of the private subnet, with the database's private subnet as the destination. These routes allow AWS Glue to reach the databases while limiting its access to only the subnets that contain them.
  2. In the source account, add one route for the VPC peering connection with the AWS Glue account to the route tables of the private subnets associated with the database. This route allows return traffic to the AWS Glue account.
  3. Repeat step 2 in the target account.

For more information about updating route tables, see the Amazon VPC documentation. A boto3 sketch of the route added in the AWS Glue account follows.
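In this sketch, the route table ID and peering connection ID are placeholders, and the destination CIDR matches the example addressing shown earlier.

```python
import boto3

ec2 = boto3.client("ec2")

# In the AWS Glue account: route traffic destined for the source database's private subnet
# (10.2.10.0/24 in the example addressing) through the peering connection with that account.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",        # route table of the AWS Glue private subnet
    DestinationCidrBlock="10.2.10.0/24",
    VpcPeeringConnectionId="pcx-0123456789abcdef0",
)

# The source and target accounts add the mirror-image route back to the AWS Glue subnet
# (10.1.10.0/24) through the same peering connection, using their own route tables.
```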

Update the database security groups

This step allows traffic from the AWS Glue account's security group to reach the databases in the source and target security groups.

To update the security groups, follow these steps:

In the source account:

  1. Add an inbound rule to the database's security group with type PostgreSQL (port 5432) and the AWS Glue account's security group as the source. (A boto3 sketch follows this list.)
  2. Repeat step 1 in the target account.
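The following boto3 sketch adds the inbound rule on the source database's security group. Here the AWS Glue VPC CIDR from the example addressing is used as the source for simplicity; referencing the AWS Glue security group directly, as described above, also works. The security group ID is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow PostgreSQL traffic (port 5432) from the AWS Glue account's VPC CIDR
# (10.1.0.0/16 in the example addressing) into the source database's security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0source0123456789",              # placeholder: source database security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "IpRanges": [{"CidrIp": "10.1.0.0/16", "Description": "AWS Glue account VPC"}],
        }
    ],
)
```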

The following diagram illustrates the environment with connectivity enabled from the AWS Glue account to the source and target accounts:

As an aside, AWS Glue jobs can also be created and started programmatically with the AWS SDK for Python (boto3). The following is a minimal sketch; the IAM role name and script location are placeholders, and the job script must already exist in Amazon S3:

```python
import boto3

glue = boto3.client("glue")

etl_job_name = "my_etl_job"

# Create the AWS Glue ETL job. The Role and ScriptLocation values are placeholders
# for your own IAM role and the S3 location of the job script.
glue.create_job(
    Name=etl_job_name,
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_etl_job.py",
        "PythonVersion": "3",
    },
    ExecutionProperty={"MaxConcurrentRuns": 1},
)

# Start a run of the job and keep the run ID for monitoring.
run = glue.start_job_run(JobName=etl_job_name)
print(f"Started {etl_job_name}, run ID: {run['JobRunId']}")
```

In this post, however, you build the job visually in AWS Glue Studio instead of writing code.

Next, create the AWS Glue components that catalog the source and target database schemas in the AWS Glue Data Catalog.

Follow these steps:

  1. Create the AWS Glue connections.
  2. Create the AWS Glue crawlers that populate the Data Catalog with the source and target table metadata.
  3. Run the crawlers.

Create AWS Glue connections

Connections allow AWS Glue to access your databases. Their main benefit is saving time: you can reuse a connection when creating jobs in AWS Glue Studio instead of re-entering the connection details each time, which speeds up job creation.

In the AWS Glue account, create the connections as follows:

  1. On the AWS Glue console, open the connections page from the navigation pane.
  2. Choose Create connection and follow the wizard:
    1. For the data source type, choose PostgreSQL (JDBC).
    2. Enter the JDBC connection details for the source database, along with the AWS Glue account's VPC, subnet, and security group.
    3. For the connection name, enter Source DB connection-Postgresql.
  3. Repeat these steps for the target database, naming the connection Target DB connection-Postgresql.

You now have two connections, one for each of your Amazon RDS databases.
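If you prefer to script this step, the following boto3 sketch creates an equivalent JDBC connection for the source database. The JDBC URL, credentials, subnet, security group, and Availability Zone are placeholders for the AWS Glue account resources described earlier; in practice, consider storing the credentials in AWS Secrets Manager instead of connection properties.

```python
import boto3

glue = boto3.client("glue")

# JDBC connection to the source RDS for PostgreSQL database. All values are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "Source DB connection-Postgresql",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://sourcedb.xxxxxxxx.us-east-1.rds.amazonaws.com:5432/sourcedb",
            "USERNAME": "postgres",
            "PASSWORD": "example-password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0glue0123456789",            # AWS Glue account private subnet
            "SecurityGroupIdList": ["sg-0glue0123456789"],   # AWS Glue account security group
            "AvailabilityZone": "us-east-1a",
        },
    }
)
# Repeat with the target database's URL and the name "Target DB connection-Postgresql".
```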

Create AWS Glue crawlers

AWS Glue crawlers discover and catalog data from your data stores. A crawler connects to a data store, infers the schema, and creates table metadata in the Data Catalog. You can then use that metadata to build ETL jobs.

Create a crawler for each Amazon RDS database in the AWS Glue account by following these steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler and follow the wizard:
    1. For the crawler name, enter Source PostgreSQL database crawler.
    2. For the data source, choose JDBC.
    3. Configure the data source as shown in the following figures:
    4. For the connection, choose Source DB connection-Postgresql.
    5. For the include path, enter the path to your database and schema. In our example, the path is sourcedb/cx/%, where sourcedb is the name of the database and cx is the schema that contains the customer table.
    6. Choose Next and complete the remaining crawler settings, such as the IAM role, as appropriate for your environment.
    7. Because there is no Data Catalog database yet to hold the source database metadata, choose Add database and create a database named sourcedb-postgresql.
  3. Repeat steps 1 and 2 to create a crawler for the target database:
    1. For the crawler name, enter Target PostgreSQL database crawler.
    2. For the connection, choose Target DB connection-Postgresql, and for the include path, enter targetdb/cx/%.
    3. For the Data Catalog database, enter targetdb-postgresql.

You now have two crawlers, one for each Amazon RDS database.
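The equivalent crawler can also be defined with boto3, as in this sketch for the source database. The IAM role name is a placeholder; the connection name, include path, and Data Catalog database match the values used above.

```python
import boto3

glue = boto3.client("glue")

# Crawler that catalogs the source customer table through the JDBC connection.
glue.create_crawler(
    Name="Source PostgreSQL database crawler",
    Role="MyGlueCrawlerRole",                      # placeholder IAM role for the crawler
    DatabaseName="sourcedb-postgresql",            # Data Catalog database to populate
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "Source DB connection-Postgresql",
                "Path": "sourcedb/cx/%",           # database/schema/tables to include
            }
        ]
    },
)
# Repeat with the target connection, the path "targetdb/cx/%", and the
# Data Catalog database "targetdb-postgresql".
```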

Run the crawlers

Next, run the crawlers. When a crawler runs, it connects to its data store and populates the Data Catalog with metadata such as column definitions, data types, and partitions. This saves you from having to define the schemas manually.

From the crawler list, select both Source PostgreSQL database crawler and Target PostgreSQL database crawler, and choose Run.

When they finish, each crawler creates a table entry in the Data Catalog. These entries hold the metadata for the source and target customer tables.
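You can also start the crawlers programmatically and wait for them to finish, as in the following sketch:

```python
import time
import boto3

glue = boto3.client("glue")

crawlers = ["Source PostgreSQL database crawler", "Target PostgreSQL database crawler"]

for name in crawlers:
    glue.start_crawler(Name=name)

# Poll until both crawlers return to the READY state.
while True:
    states = [glue.get_crawler(Name=name)["Crawler"]["State"] for name in crawlers]
    if all(state == "READY" for state in states):
        break
    time.sleep(30)

print("Crawlers finished; the customer tables are now in the Data Catalog.")
```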

Congratulations, you now have all the resources you need to start building AWS Glue ETL jobs!

Create and run the AWS Glue ETL job

The ETL job performs four tasks:

  1. Connects to the source Amazon RDS database and extracts the data to be replicated.
  2. Detects and masks sensitive PII.
  3. Transforms the data so that its schema matches the target table.
  4. Connects to the target Amazon RDS database and inserts the data with PII masked.

Let's dive into AWS Glue Studio to create the AWS Glue ETL job.

  1. Sign in to the AWS Glue console with your AWS Glue account.
  2. In the navigation pane, choose ETL jobs.
  3. Create a new job using the visual editor with a blank canvas.

Task 1 – Extract the source data

Create a node that reads from the Amazon RDS source database:

  1. From the list of sources, choose AWS Glue Data Catalog. This adds a new node to the canvas.
  2. On the node properties panel, choose the sourcedb-postgresql database and the source_cx_customer table from the Data Catalog, as shown below.

Task 2 – Detect and mask PII

To detect and mask sensitive PII, add the Detect Sensitive Data transform node from the list of transforms.

Configure the node's properties as follows:

  1. Choose how the data is scanned. Scanning every row gives thorough PII identification, whereas scanning only a sample of rows locates columns that contain PII faster and at lower cost.

Row-level detection also gives you fine-grained control over specific columns. When you know your data, you can exclude certain columns from analysis and customize which entities to detect in each column, skipping entities you know aren't present. This removes unnecessary detection work, improves job performance, and lets you apply a different action to each column-entity combination.

Because we know our dataset, we take the fine-grained, column-level approach. More details follow below.

  2. Next, select the types of sensitive data to detect. There are three options:

    Detect all available PII patterns
    Select categories of patterns to detect
    Select specific patterns to detect

    Because we already know our data, we select specific patterns: Person's name, Email Address, Credit Card, Social Security Number (SSN), and US Phone, as shown below. Note that some patterns, such as SSN, are specific to the United States and may not identify PII for other countries. Categories exist for several other countries, and you can also use regular expressions in AWS Glue Studio to create detection entities tailored to your requirements.

  3. Next, choose the detection sensitivity. We keep the default value (High).

  4. Next, choose the global action to apply to the detected entities. Choose Redact and enter **** as the redaction text.
  5. Finally, you can specify fine-grained action overrides. They are optional, but in our case we want to exclude certain columns from detection, scan only specific PII entities in certain columns, and define different redaction text for different entity types.

Define the fine-grained actions for each entity as shown below.
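Behind the scenes, AWS Glue Studio generates the detection and masking code for you. As a rough illustration of the effect only (not the code Glue Studio actually generates), a plain PySpark job could redact the same columns with literal replacements and pattern-based scrubbing like this; the sample row and column list are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pii-masking-sketch").getOrCreate()

# Assume df holds the extracted customer rows; here we fabricate one row for illustration.
df = spark.createDataFrame(
    [(1, "Jane", "Doe", "jane.doe@example.com", "+1-555-0100", "123-45-6789",
      "SSN on file: 123-45-6789")],
    ["id", "first_name", "last_name", "email", "phone_number", "ssn", "notes"],
)

# Redact entire sensitive columns, and scrub SSN-shaped patterns from the free-text notes.
masked = (
    df.withColumn("last_name", F.lit("****"))
      .withColumn("email", F.lit("****"))
      .withColumn("phone_number", F.lit("****"))
      .withColumn("ssn", F.lit("****"))
      .withColumn("notes", F.regexp_replace("notes", r"\d{3}-\d{2}-\d{4}", "****"))
)
masked.show(truncate=False)
```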

Task 3 – Data transformation

When the PII detection node runs, it converts the id column to string type and adds a column named DetectedEntities that holds metadata about the entities that were detected. Here's the result:

We don't need to keep this metadata in the target table, so we convert the id column back to an integer and drop the DetectedEntities column. Add a Change Schema transform node to the ETL job, as shown below, to make these adjustments for us.

To complete the transformation, in the transform node change the id column's data type back to int and select the Drop check box for the DetectedEntities field so that the newly added column is dropped.
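In plain PySpark terms, the effect of this transform node is roughly the following sketch; the sample row merely mimics the shape of the PII step's output.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("change-schema-sketch").getOrCreate()

# A row shaped like the output of the PII step: id became a string and a
# DetectedEntities metadata column was added.
masked = spark.createDataFrame(
    [("1", "Jane", "****", "****", "****", "****", "****", '{"ssn": ["SSN"]}')],
    ["id", "first_name", "last_name", "email", "phone_number", "ssn", "notes", "DetectedEntities"],
)

# Convert id back to an integer and drop the metadata column so the schema
# matches the target customer table.
cleaned = masked.withColumn("id", F.col("id").cast("int")).drop("DetectedEntities")
cleaned.printSchema()
```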

Task 4 – Load the data into the target

The final task of the ETL job connects to the target database and loads the data with sensitive PII masked.

  1. From the list of targets, choose AWS Glue Data Catalog. This adds a new node to the canvas.
  2. On the node properties panel, select the targetdb-postgresql database and the target_cx_customer table, as shown below.

Finally, save and run the ETL job:

  1. On the Job details tab, name the job ETL - Replicate customer data.
  2. Choose the IAM role that AWS Glue uses to run the job.
  3. Choose Save, then choose Run.

Monitor the job from the monitoring page in the navigation pane until it completes successfully.

Verify the results

Connect to the target Amazon RDS database and verify that the replicated rows contain the masked PII values, confirming that sensitive data remained protected during the database-to-database transfer, as in the following check:
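For example, a quick check with psycopg2 against the target database (connection details are placeholders) shows whether the sensitive columns now contain the redaction text:

```python
import psycopg2

# Placeholder connection details for the target database.
conn = psycopg2.connect(
    host="targetdb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="targetdb", user="postgres", password="example-password",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT id, last_name, email, phone_number, ssn, notes FROM cx.customer LIMIT 5")
    for row in cur.fetchall():
        print(row)   # masked columns should show the redaction text, for example '****'
conn.close()
```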

And that's it! Using AWS Glue Studio, you can build ETL jobs that replicate data between databases and transform it in transit without writing any code. You can apply the same approach with other masking methods, and combine different data sources and targets in the same job.

Clean up

To clean up the resources you created:

  1. Delete the AWS Glue resources: the ETL job, crawlers, Data Catalog databases, and connections.
  2. Delete the VPC peering connections.
  3. Remove the routes added to the route tables and the inbound rules added to the security groups in all three AWS accounts.
  4. Delete the associated Amazon S3 objects in the AWS Glue account. They are stored in the bucket with aws-glue-assets in its name (aws-glue-assets-<account-id>-<region>), where account-id is your AWS Glue account ID and region is the AWS Region you used.
  5. If you don't want to keep your Amazon RDS databases, delete them to avoid unnecessary charges. If you created them with the provided AWS CloudFormation templates, delete the CloudFormation stacks instead.

Conclusion

You learned how to leverage AWS Glue Studio to build an ETL job that replicates data from one Amazon RDS database to another, automatically detecting and masking personally identifiable information (PII) during transit without requiring any manual coding efforts.

By using AWS Glue for database replication, organizations can replace manual searches for hidden PII and custom masking scripts with centralized, repeatable data sanitization workflows. This improves both security and regulatory compliance while speeding up data provisioning and time to insight.


About the Author

The author is a Senior Solutions Architect in the Financial Services, Fintech group at Amazon Web Services (AWS). She works closely with customers using Blockchain and Crypto technologies on AWS, helping them accelerate their time to value by making the most of their AWS usage. She is based in New York City and, outside of work, enjoys traveling.
