Friday, October 10, 2025

Seamlessly Integrate Data on Google BigQuery and ClickHouse Cloud with AWS Glue

Migrating from Google Cloud’s BigQuery to ClickHouse Cloud on AWS allows companies to leverage the speed and efficiency of ClickHouse for real-time analytics while benefiting from AWS’s scalable and secure environment. This article provides a comprehensive guide to executing a direct data migration using AWS Glue ETL, highlighting the benefits and best practices for a seamless transition.

AWS Glue ETL allows organizations to discover, prepare, and integrate data at scale without the burden of managing infrastructure. With its built-in connectivity, Glue can seamlessly read data from Google Cloud’s BigQuery and write it to ClickHouse Cloud on AWS, removing the need for custom connectors or complex integration scripts. Beyond connectivity, Glue also provides advanced capabilities such as a visual ETL authoring interface, automated job scheduling, and serverless scaling, allowing teams to design, monitor, and manage their pipelines more efficiently. Together, these features simplify data integration, reduce latency, and deliver significant cost savings, enabling faster and more reliable migrations.

Prerequisites

Before using AWS Glue to integrate data into ClickHouse Cloud, you must first set up the ClickHouse environment on AWS. This includes creating and configuring your ClickHouse Cloud service on AWS, making sure network access and security groups are properly defined, and verifying that the cluster endpoint is reachable (a short verification sketch follows the steps below). Once the ClickHouse environment is ready, you can use the AWS Glue built-in connector to seamlessly write data into ClickHouse Cloud from sources such as Google Cloud BigQuery. You can follow the next section to complete the setup.

  1. Set up ClickHouse Cloud on AWS
    1. Follow the ClickHouse official website to set up the environment (remember to allow remote access in the config file if using ClickHouse OSS)
      https://clickhouse.com/docs/get-started/quick-start
  2. Subscribe to the ClickHouse Glue Marketplace connector
    1. Open Glue Connectors and choose Go to AWS Marketplace
    2. On the list of AWS Glue Marketplace connectors, enter ClickHouse in the search bar. Then choose ClickHouse Connector for AWS Glue
    3. Choose View purchase options at the top right of the view
    4. Review Terms and Conditions and choose Accept Terms
    5. Choose Continue to Configuration once it’s enabled
    6. In the Follow the vendor’s instructions part of the connector instructions shown below, choose the connector activation link at step 3
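
Once the connector is subscribed and the ClickHouse service is running, it is worth confirming that the cluster endpoint accepts connections and that a destination database and table exist for the migrated data. The following is a minimal sketch using the clickhouse-connect Python client; the endpoint, credentials, database, table, and column names are placeholders and should be adjusted to your ClickHouse Cloud service and BigQuery schema.

    # Minimal connectivity check and target-table setup sketch using the
    # clickhouse-connect Python client. The endpoint, credentials, database,
    # table, and columns are placeholders -- replace them with your own values.
    import clickhouse_connect

    client = clickhouse_connect.get_client(
        host="YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud",  # ClickHouse Cloud endpoint
        port=8443,
        username="default",
        password="YOUR_PASSWORD",
        secure=True,  # ClickHouse Cloud requires TLS
    )

    # Confirm the endpoint is reachable before configuring Glue
    print(client.command("SELECT version()"))

    # Create a destination database and table for the migrated data.
    # The columns here are illustrative; match them to your BigQuery schema.
    client.command("CREATE DATABASE IF NOT EXISTS clickhouse_database")
    client.command("""
        CREATE TABLE IF NOT EXISTS clickhouse_database.clickhouse_test_table (
            id UInt64,
            name String,
            created_at DateTime
        )
        ENGINE = MergeTree
        ORDER BY id
    """)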

Configure AWS Glue ETL Job for ClickHouse Integration

AWS Glue enables direct migration by connecting with ClickHouse Cloud on AWS through built-in connectors, allowing for seamless ETL operations. Within the Glue console, users can configure jobs to read data from S3 and write it directly to ClickHouse Cloud. Using the AWS Glue Data Catalog, data in S3 can be indexed for efficient processing, while Glue’s PySpark support allows for complex data transformations, including data type conversions, to support compatibility with ClickHouse’s schema.
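
As an illustration of the in-flight transformations mentioned above, the sketch below casts a few columns of a PySpark DataFrame so they map cleanly onto ClickHouse types. It is a minimal example under assumed column names (event_time, amount, user_id) and would run inside a Glue PySpark job before the write step.

    # Illustrative PySpark type conversions inside a Glue job; the column names
    # (event_time, amount, user_id) are hypothetical examples only.
    from pyspark.sql import functions as F
    from pyspark.sql.types import TimestampType, DecimalType, LongType

    def align_types_for_clickhouse(df):
        """Cast source columns so they map cleanly onto ClickHouse column types."""
        return (
            df
            .withColumn("event_time", F.col("event_time").cast(TimestampType()))  # maps to DateTime
            .withColumn("amount", F.col("amount").cast(DecimalType(18, 2)))       # maps to Decimal(18, 2)
            .withColumn("user_id", F.col("user_id").cast(LongType()))             # maps to Int64
        )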

  1. Open AWS Glue in the AWS Management Console
    1. Navigate to Data Catalog and Connections
    2. Create a new connection
  2. Configure the BigQuery connection in Glue
    1. Prepare a Google Cloud BigQuery environment
    2. Create and store a Google Cloud service account key (JSON format) in AWS Secrets Manager; you can find the details in BigQuery connections (a boto3 sketch for this step is shown after this list)
    3. An example of the JSON key content is as follows:
      {
        "type": "service_account",
        "project_id": "h*********g0",
        "private_key_id": "cc***************81",
        "private_key": "-----BEGIN PRIVATE KEY-----\nMI***zEc=\n-----END PRIVATE KEY-----\n",
        "client_email": "clickhouse-sa@h*********g0.iam.gserviceaccount.com",
        "client_id": "1*********8",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/clickhouse-sa%40h*********g0.iam.gserviceaccount.com",
        "universe_domain": "googleapis.com"
      }

      • type: service_account.
      • project_id: The ID of the GCP project.
      • private_key_id: A unique ID for the private key within the file.
      • private_key: The actual private key.
      • client_email: The email address of the service account.
      • client_id: A unique client ID associated with the service account.
      • auth_uri, token_uri, auth_provider_x509_cert_url, client_x509_cert_url: URLs for authentication and token exchange with Google’s identity and access management systems.
      • universe_domain: The domain name of GCP, googleapis.com
    4. Create Google BigQuery Connection in AWS Glue
    5. Grant the IAM role associated with your AWS Glue job permissions for S3, Secrets Manager, and Glue services, plus AmazonEC2ContainerRegistryReadOnly for accessing connectors purchased from AWS Marketplace (reference doc)
  3. Create ClickHouse connection in AWS Glue
    1. Enter clickhouse-connection as its connection name
    2. Choose Create connection and activate connector
  4. Create a Glue job
    1. On the Connectors view shown below, select clickhouse-connection and choose Create job
    2. Enter bq_to_clickhouse as its job name and configure gc_connector_role as its IAM role
    3. Add the BigQuery connection and clickhouse-connection to the Connections property
    4. Choose the Script tab and Edit script. Then choose Confirm on the Edit script popup view.
    5. Copy and paste the following code into the script editor; the code can also be referenced from the ClickHouse official documentation
    6. The source code is as follows:
      import sys
      from pyspark.sql import SparkSession
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions

      args = getResolvedOptions(sys.argv, ['JOB_NAME'])
      spark = SparkSession.builder.getOrCreate()
      glueContext = GlueContext(spark.sparkContext)
      job = Job(glueContext)
      job.init(args['JOB_NAME'], args)

      # BigQuery read options for the Glue built-in BigQuery connector
      connection_options = {
          "connectionName": "Bigquery connection",
          "parentProject": "YOUR_GCP_PROJECT_ID",
          "query": "SELECT * FROM `YOUR_GCP_PROJECT_ID.bq_test_dataset.bq_test_table`",
          "viewsEnabled": "true",
          "materializationDataset": "bq_test_dataset"
      }

      # ClickHouse Cloud JDBC connection settings
      jdbc_url = "jdbc:clickhouse://YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud:8443/clickhouse_database?ssl=true"
      username = "default"
      password = "YOUR_PASSWORD"
      query = "select * from clickhouse_database.clickhouse_test_table"

      # Add this before writing to test the connection
      try:
          # Read from BigQuery with the Glue connection
          print("Reading data from BigQuery...")
          GoogleBigQuery_node1742453400261 = glueContext.create_dynamic_frame.from_options(
              connection_type="bigquery",
              connection_options=connection_options,
              transformation_ctx="GoogleBigQuery_node1742453400261"
          )

          # Convert to DataFrame
          bq_df = GoogleBigQuery_node1742453400261.toDF()
          print("Show data from BigQuery:")
          bq_df.show()

          # Write BigQuery data to ClickHouse with JDBC
          bq_df.write \
              .format("jdbc") \
              .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
              .option("url", jdbc_url) \
              .option("user", username) \
              .option("password", password) \
              .option("dbtable", "clickhouse_test_table") \
              .mode("append") \
              .save()

          print("Wrote BigQuery data to ClickHouse successfully")

          # Read back from ClickHouse with JDBC
          read_df = (spark.read.format("jdbc")
              .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
              .option("url", jdbc_url)
              .option("user", username)
              .option("password", password)
              .option("query", query)
              .option("ssl", "true")
              .load())

          print("Show data from ClickHouse:")
          read_df.show()

      except Exception as e:
          print(f"ClickHouse connection test failed: {str(e)}")
          raise e
      finally:
          job.commit()

    7. Choose Save and Run at the top right of the current view
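
As referenced in step 2.2 above, the BigQuery service account key must be stored in AWS Secrets Manager so that the Glue BigQuery connection can retrieve it. The following is a minimal boto3 sketch; the secret name, AWS Region, and key file path are assumptions, and the secret content should be the service account key JSON shown earlier.

    # Minimal sketch for storing the BigQuery service account key in AWS Secrets
    # Manager with boto3. The secret name, Region, and key file path are
    # placeholders -- adjust them to your environment.
    import boto3

    secrets_client = boto3.client("secretsmanager", region_name="us-east-1")

    # Load the service account key JSON downloaded from the GCP console
    with open("bigquery-service-account-key.json") as f:
        key_json = f.read()

    # Store it as the secret that the Glue BigQuery connection will reference
    response = secrets_client.create_secret(
        Name="glue/bigquery-service-account-key",
        SecretString=key_json,
    )
    print("Stored secret ARN:", response["ARN"])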

Testing and Validation

Testing is crucial to verify data accuracy and performance in the new environment. After the migration completes, run data integrity checks to confirm record counts and data quality in ClickHouse Cloud (a simple record-count comparison sketch follows the items below). Schema validation is essential, as each data field must align correctly with ClickHouse’s format. Running performance benchmarks, such as sample queries, will help verify that ClickHouse’s setup delivers the desired speed and efficiency gains.

  1. The schema and data in the source BigQuery and destination ClickHouse

  2. AWS Glue output logs
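
As a simple integrity check beyond inspecting the console, the record counts on both sides can be compared programmatically. The sketch below is one possible approach using the google-cloud-bigquery and clickhouse-connect Python clients; the project, dataset, table, endpoint, and credentials are all placeholders.

    # Hedged sketch comparing record counts between the BigQuery source and the
    # ClickHouse destination. All identifiers and credentials are placeholders.
    from google.cloud import bigquery
    import clickhouse_connect

    # Count rows in the BigQuery source table
    bq_client = bigquery.Client(project="YOUR_GCP_PROJECT_ID")
    bq_rows = list(bq_client.query(
        "SELECT COUNT(*) AS cnt FROM `YOUR_GCP_PROJECT_ID.bq_test_dataset.bq_test_table`"
    ).result())
    bq_count = bq_rows[0]["cnt"]

    # Count rows in the ClickHouse destination table
    ch_client = clickhouse_connect.get_client(
        host="YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud",
        port=8443,
        username="default",
        password="YOUR_PASSWORD",
        secure=True,
    )
    ch_count = ch_client.command("SELECT count() FROM clickhouse_database.clickhouse_test_table")

    print(f"BigQuery rows: {bq_count}, ClickHouse rows: {ch_count}")
    assert int(bq_count) == int(ch_count), "Record counts do not match"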

Clean Up

After completing the migration, it’s important to clean up unused resources, such as the BigQuery resources used for the sample data import and the database resources in ClickHouse Cloud, to avoid unnecessary costs. Regarding IAM permissions, adhering to the principle of least privilege is advisable. This involves granting users and roles only the permissions necessary for their tasks and removing permissions when they are no longer required. This approach enhances security by minimizing the potential threat surface. Additionally, reviewing AWS Glue job costs and configurations can help identify optimization opportunities for future migrations. Monitoring overall costs and analyzing usage can reveal areas where code or configuration improvements may lead to cost savings.

Conclusion

AWS Glue ETL offers a robust and user-friendly solution for migrating data from BigQuery to ClickHouse Cloud on AWS. By using Glue’s serverless architecture, organizations can perform data migrations that are efficient, secure, and cost-effective. The direct integration with ClickHouse streamlines data transfer, supporting high performance and flexibility. This migration approach is particularly well suited for companies looking to enhance their real-time analytics capabilities on AWS.


About the Authors

Ray Wang

Ray is a Senior Solutions Architect at AWS. With 12+ years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he passed all 12 AWS certifications to make his technical field not only deep but also wide. He likes to read and watch sci-fi movies in his spare time.

Robert Chung

Robert is a Solutions Architect at AWS with expertise across infrastructure, data, AI, and modernization technologies. He has supported numerous financial services customers in driving cloud-native transformation, advancing data analytics, and accelerating mainframe modernization. His experience also extends to modern AI-DLC practices, enabling enterprises to innovate faster. With this background, Robert is well equipped to address complex enterprise challenges and deliver impactful solutions.

Tomohiro Tanaka

Tomohiro is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He is passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Stanley Chukwuemeke

Stanley is a Senior Partner Solutions Architect at AWS. He works with AWS technology partners to grow their business by creating joint go-to-market solutions using AWS data, analytics, and AI services. He has worked with data for most of his career and is passionate about database modernization and cloud adoption strategy, helping drive business modernization initiatives across industries.
