Tuesday, April 1, 2025


In February 2024, we launched the Data Solutions Framework (DSF) on AWS, an open source framework for building data solutions on Amazon Web Services (AWS). DSF is built with the AWS Cloud Development Kit (CDK) and packages infrastructure components into L3 constructs on top of AWS services. L3 constructs implement common architectural patterns, making it possible to create composable building blocks that can be combined to deliver functional solutions.

This post demonstrates how to use the AWS CDK and DSF to build a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, the initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF, you can apply GitOps principles to your analytics workloads and gain the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects such as databases, tables, and shares.
  • You can roll out changes consistently across multiple environments.
  • You can bootstrap data warehouses (table creation, data ingestion, and so on) using code, and simplify the setup of test environments through version control.
  • You can review changes before they are deployed.

Additionally, DSF's Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while following best practices. For example:

  • An AWS Glue connection resource is automatically created and configured, making extract, transform, and load (ETL) jobs to and from Amazon Redshift straightforward. Data engineers can use the connection in their AWS Glue ETL jobs without additional configuration.
  • DSF provides a convenient way to configure an AWS Glue crawler to populate the AWS Glue Data Catalog with Amazon Redshift tables, so they can be easily discovered and referenced when building ETL jobs. The configured AWS Glue crawler uses an IAM role that follows the principle of least privilege.
  • Sharing data between Amazon Redshift data warehouses enables collaboration across different business units without duplicating data. DSF provides convenient methods to set up data sharing on both the producer and consumer sides.

Solution overview

In this example, a data warehouse serves as the serving layer for business intelligence (BI) workloads on top of data lake data. Source data is stored in Amazon S3 buckets and then ingested into a Redshift data warehouse to create materialized views and aggregated data, which are finally exposed to end users through BI queries against a Redshift consumer. The following diagram gives a high-level view of the architecture.

The AWS CDK provides a straightforward way to provision and manage Redshift Serverless. At the lowest level, you can define the CloudFormation resources directly, such as CfnNamespace and CfnWorkgroup, and specify details such as the database name, admin credentials, and compute capacity. You can also use the CDK's built-in support for IAM roles to grant fine-grained access to specific users or services, for example to let Amazon Redshift read data from Amazon S3. DSF builds on top of these resources with L3 constructs that apply such options as best-practice defaults, giving you a robust and repeatable way to provision and manage Amazon Redshift Serverless.
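For comparison, the following is a minimal sketch (not part of the solution in this post) of what provisioning these resources directly with the low-level L1 constructs could look like. The names, capacity value, and credential handling are illustrative placeholders only:

from aws_cdk import Stack
from aws_cdk import aws_redshiftserverless as redshiftserverless
from constructs import Construct

class RawRedshiftServerlessStack(Stack):
    """Illustrative only: the L1 resources that DSF's L3 constructs wrap and configure for you."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        namespace = redshiftserverless.CfnNamespace(self, 'RawNamespace',
            namespace_name='demo-namespace',                      # illustrative name
            db_name='defaultdb',
            admin_username='admin',
            admin_user_password='<USE_SECRETS_MANAGER_INSTEAD>')  # don't hardcode real credentials

        redshiftserverless.CfnWorkgroup(self, 'RawWorkgroup',
            workgroup_name='demo-workgroup',                      # illustrative name
            namespace_name=namespace.namespace_name,
            base_capacity=8,                                      # Redshift Processing Units
            publicly_accessible=False)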

This post uses example code written in Python. DSF also supports TypeScript.

Prerequisites

Because the solution uses the AWS CDK, complete the AWS CDK prerequisite steps before you deploy the solution.
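If you haven't used the AWS CDK in the target account and Region before, the typical setup involves installing the AWS CDK CLI (which requires Node.js) and bootstrapping the environment; the account ID and Region below are placeholders:

npm install -g aws-cdk
cdk bootstrap aws://<ACCOUNT_ID>/<REGION>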

Provision a Redshift Serverless namespace and workgroup

Start by initializing the CDK project and adding DSF as a dependency. You can run this code in your local terminal:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python
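The generated project uses a Python virtual environment. A minimal way to add DSF as a dependency, assuming its Python distribution is published on PyPI as cdklabs.aws-data-solutions-framework, is the following:

# Create and activate the virtual environment, then install the project dependencies and DSF
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install cdklabs.aws-data-solutions-framework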

Open the project folder in your integrated development environment (IDE).

  1. Open the app.py file.
  2. Make sure the env configuration is set for the stack. It determines the target AWS account and Region and relies on the AWS profile used during deployment.
  3. Add a configuration flag in the cdk.context.json file at the root of the project (create the file if it doesn't already exist):
     {
       "@data-solutions-framework-on-aws/removeDataOnDestroy": true
     }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true allows resources that have their removal_policy parameter set to RemovalPolicy.DESTROY to be deleted, along with the data they contain, when the AWS CDK stack is destroyed. This global flag is a safeguard against accidental data deletion.

With the project now configured, you can start adding resources to the stack.

  1. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we define the resources to be deployed.

  2. To start building the end-to-end example, add the following import statements at the top of the file. They let you define resources from both the AWS CDK core library and DSF:
    from aws_cdk import RemovalPolicy, Stack
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

The following DSF-specific constructs are used to build the example:

  • DataLakeStorage – Creates three Amazon S3 buckets, named bronze, silver, and gold, to represent the different data layers.
  • S3DataCopy – Manages the copying of data from one bucket to another.
  • RedshiftServerlessNamespace – Creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – Creates a Redshift Serverless workgroup that contains the compute and network configurations for the data warehouse. It is also the entry point for several convenient functionalities that DSF provides, such as cataloging Redshift tables, running SQL statements as part of the AWS CDK (for example, creating tables, ingesting data, or merging tables), and sharing datasets across different Redshift data warehouses without moving data.
  3. Create the S3 buckets for the bronze, silver, and gold data layers using the DataLakeStorage construct; the code follows the layer descriptions below.


At a high level, the purpose of each layer is as follows:

  • Bronze represents raw data as collected from various source systems. No schema is required.
  • Silver contains cleaned and potentially enriched data. The schema is enforced at this layer.
  • Gold contains data that is further processed and aggregated for a specific business purpose.

The DataLakeStorage construct creates these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • S3 bucket keys are turned on
  • A default S3 lifecycle rule deletes incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(
        self, 'DataLake',
        removal_policy=RemovalPolicy.DESTROY)
  4. After creating the S3 buckets, copy the data into them using the S3DataCopy construct. For this example, the data lands in the silver bucket because it is already cleaned. The source location below is a placeholder; point it at the bucket that hosts your copy of the dataset:
    data_copy = dsf.utils.S3DataCopy(
        self, 'SilverDataCopy',
        # Placeholders: the bucket, prefix, and Region hosting the source dataset
        source_bucket=Bucket.from_bucket_name(self, 'SourceBucket', '<SOURCE_BUCKET_NAME>'),
        source_bucket_prefix='<SOURCE_PREFIX>/',
        source_bucket_region='<SOURCE_BUCKET_REGION>',
        target_bucket=data_lake.silver_bucket,
        target_bucket_prefix='silver/amazon-reviews/')
  5. For Amazon Redshift to ingest the data from Amazon S3, it needs an IAM role with the necessary permissions. This role is associated with the Redshift Serverless namespace created in the next step:
    lake_role = Role(self, 'LakeRole',
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))

    data_lake.silver_bucket.grant_read(lake_role)
  6. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides an L3 construct for each.

    Both constructs follow security best practices, including the following:

    • The default virtual private cloud (VPC) uses private subnets, with public access disabled.
    • Data is encrypted at rest through AWS Key Management Service (AWS KMS) with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. AWS Glue crawlers and AWS Glue ETL jobs can use it to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, the Amazon Redshift Data API, and the data sharing API.

  7. Use the constructs to attach the IAM role created earlier as the default IAM role of the namespace, granting Amazon Redshift access to the data lake for data ingestion:

      namespace = dsf.consumption.RedshiftServerlessNamespace(
          self, 'Namespace',
          db_name='defaultdb',
          name='producer',
          removal_policy=RemovalPolicy.DESTROY,
          default_iam_role=lake_role)

      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(
          self, 'Workgroup',
          name='producer',
          namespace=namespace,
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, use the run_custom_sql method of the RedshiftServerlessWorkgroup construct. This method lets you run arbitrary SQL statements when the resource is being created (such as CREATE TABLE or CREATE MATERIALIZED VIEW) and when it is being deleted (such as DROP TABLE or DROP MATERIALIZED VIEW).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable',
    database_name='defaultdb',
    sql="""
        CREATE TABLE amazon_reviews (
            market character varying(16383) ENCODE lzo,
            customer_id character varying(16383) ENCODE lzo,
            review_id character varying(16383) ENCODE lzo,
            product_id character varying(16383) ENCODE lzo,
            product_parent character varying(16383) ENCODE lzo,
            product_title character varying(16383) ENCODE lzo,
            star_rating integer ENCODE az64,
            helpful_votes integer ENCODE az64,
            total_votes integer ENCODE az64,
            vine character varying(16383) ENCODE lzo,
            verified_purchase character varying(16383) ENCODE lzo,
            review_headline character varying(max) ENCODE lzo,
            review_body character varying(max) ENCODE lzo,
            review_date date ENCODE az64,
            last_12_months integer ENCODE az64
        ) DISTSTYLE AUTO;
    """,
    delete_sql='DROP TABLE amazon_reviews;')

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data',
    'defaultdb',
    'amazon_reviews',
    data_lake.silver_bucket,
    'silver/amazon-reviews/',
    'FORMAT AS PARQUET')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)

Because some resources are created asynchronously through custom resources, we add dependencies between certain resources so the AWS CDK creates them in the right order. The preceding dependencies make sure of the following:

  • The S3 data copy completes before the data load, so all the data is present in the source bucket used by the ingestion.
  • The target table is created in the Redshift namespace before any data is loaded.

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view that pre-aggregates data from the Amazon reviews to speed up BI queries:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name='defaultdb',
    sql="""
        CREATE MATERIALIZED VIEW mv_product_analysis AS
        SELECT review_date, product_title, COUNT(*) AS review_total, SUM(star_rating) AS score
        FROM amazon_reviews
        WHERE market='US'
        GROUP BY 1,2;
    """,
    delete_sql='DROP MATERIALIZED VIEW mv_product_analysis;')

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of the RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource, exposed through the glue_connection property of the workgroup construct. Using this connection, the workgroup construct offers a convenient method to catalog the tables of the associated Redshift Serverless namespace. The following is example code:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This creates a database in the AWS Glue Data Catalog named mv_product_analysis and the associated crawler, with the IAM role and network configuration already set up. By default, the crawler crawls all the tables inside the public schema of the default database defined when the Redshift Serverless namespace was created. To override this behavior, the third parameter of the catalogTables method lets you define a pattern of what to crawl, as sketched below.
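For illustration, overriding the crawl scope could look like the following sketch. The include-path pattern (database/schema/table with % wildcards) is an assumption based on how AWS Glue JDBC crawlers define include paths, so check the DSF documentation for the exact format expected by catalog_tables:

# Hypothetical override: only crawl the public schema of defaultdb
workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis', 'defaultdb/public/%')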

You can run the crawler using the AWS Glue console, or invoke it programmatically using the AWS SDK, AWS CLI, or AWS CDK.
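For example, a minimal sketch of starting the crawler with the AWS SDK for Python (Boto3); the crawler name is a placeholder that you can look up on the AWS Glue console after deployment:

import boto3

glue = boto3.client('glue')
# Replace the placeholder with the name of the crawler created by catalog_tables
glue.start_crawler(Name='<CRAWLER_NAME>')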

Data sharing

DSF supports Redshift data sharing on both the producer and consumer sides, for same-account and cross-account scenarios. Let's create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(
    self, 'Namespace2',
    db_name='defaultdb',
    name='consumer',
    default_iam_role=lake_role,
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(
    self, 'Workgroup2',
    name='consumer',
    namespace=namespace2,
    removal_policy=RemovalPolicy.DESTROY)

For producers

For the producer, complete the following steps:

  1. Create the new data share and populate it with the schema or tables:
     data_share = workgroup.create_share('DataSharing', 'defaultdb', 'defaultdbshare', 'public', ['mv_product_analysis'])

     data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant access to a consumer in the same account:
      share_grant = workgroup.grant_access_to_share('GrantToSameAccount', data_share, namespace2.namespace_id)

      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant access to a different account:
      workgroup.grant_access_to_share('GrantToDifferentAccount',
          tpcdsShare,
          None,
          '<ACCOUNT_ID_OF_CONSUMER>',
          True)

The last parameter of the grant_access_to_share method enables automatic authorization of the cross-account access on the data share. If you omit it, no authorization is performed by default, and a Redshift administrator needs to authorize the cross-account share using the AWS Management Console, AWS CLI, or AWS SDK.
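For reference, a minimal sketch of that manual authorization using Boto3; the data share ARN and consumer account ID are placeholders:

import boto3

redshift = boto3.client('redshift')
# Authorize the consumer account on the data share
redshift.authorize_data_share(
    DataShareArn='<DATA_SHARE_ARN>',
    ConsumerIdentifier='<ACCOUNT_ID_OF_CONSUMER>')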

For consumers

For the same-account scenario, create a new database from the share using the following code:

create_db_from_share = workgroup2.create_database_from_share('CreateDatabaseFromShare',
    'advertising', data_share.data_share_name, data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For the cross-account scenario, the syntax is similar, but you need to specify the producer account ID:

consumerWorkgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare',
    'tpcds',
    <PRODUCER_SHARE_NAME>,
    <PRODUCER_NAMESPACE_ID>,
    <PRODUCER_ACCOUNT_ID>)

To see the complete working example, refer to the accompanying code.

Deploy the resources using the AWS CDK

To deploy the resources, run the following command:

cdk deploy

You can review the resources to be created, as shown in the following screenshot.

Confirm the changes for the deployment to start. It takes a few minutes for the project to be deployed; you can track the progress using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups: one producer and one consumer.
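If you prefer to check from code instead of the console, a quick sketch with Boto3 lists the workgroups and their status:

import boto3

client = boto3.client('redshift-serverless')
# Print the name and status of each Redshift Serverless workgroup in the Region
for wg in client.list_workgroups()['workgroups']:
    print(wg['workgroupName'], wg['status'])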

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using AWS Secrets Manager, as shown in the following screenshot.

Producer QEV2 Login

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

QEv2 Tables

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the advertising database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because the removal policy of the resources is set to RemovalPolicy.DESTROY and the removeDataOnDestroy flag is enabled, running cdk destroy or deleting the stack from the AWS CloudFormation console cleans up the provisioned resources, including the data they contain.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps you achieve consistency across multiple environments. Beyond provisioning infrastructure, DSF also helps you bootstrap Amazon Redshift by creating tables, ingesting data, and more, all through the AWS CDK's resource handling. Changes can be version controlled, reviewed, and even unit tested.

Besides Redshift Serverless, DSF supports other components, such as Spark-based processing, and more. DSF is open source and publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started with DSF by following the framework's documentation.


About the authors


Is a Principal Solutions Architect at Amazon Web Services (AWS). He helps customers build innovative and sustainable solutions on the AWS Cloud.
Is an Analytics Specialist Solutions Architect at AWS, where he helps customers solve analytics, NoSQL, and streaming challenges. He has deep expertise in distributed data processing engines and resource orchestration platforms.
