Monday, March 31, 2025

Generative AI Upgrades Unleash New Capabilities for Apache Spark on AWS Glue

Every month, organizations run tens of millions of Apache Spark jobs on Amazon Web Services (AWS), using scalable big data processing to transform, analyze, and prepare data for actionable insights in analytics and machine learning applications. As these Spark applications age, keeping them current, performant, and secure becomes increasingly challenging. Data practitioners must stay abreast of the latest Spark releases to take advantage of performance improvements, new features, and important bug fixes. Despite their benefits, these upgrades can be costly, complex, and time consuming.

We’re excited to announce the preview of generative AI upgrades for Spark, a new capability that enables data practitioners to quickly upgrade and modernize their Spark applications running on AWS. Starting with Spark jobs in AWS Glue, this feature allows you to upgrade from an older AWS Glue version to AWS Glue version 4.0. It significantly reduces the effort data engineers spend modernizing their Spark applications, allowing them to focus on building new data pipelines and delivering timely, actionable insights.

Understanding the Spark upgrade challenge

Upgrading Spark applications typically requires significant manual effort and deep Spark expertise. Data practitioners must carefully review incremental Spark release notes to understand the nuances and intricacies of the changes, some of which may not be formally documented. To move an application between Spark versions, they then need to update their scripts and configurations, adjusting settings, connections, and library dependencies as needed.

Testing these upgrades involves running the application and identifying and resolving any issues that arise. Each test run may uncover additional problems, leading to multiple cycles of changes. After the upgraded application runs successfully, practitioners must validate the newly produced results against the expected output in production. This process often turns into year-long projects that cost millions of dollars and consume tens of thousands of engineering hours.
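As an illustration of that validation step, the following is a minimal sketch (not part of the service) of how a team might compare the output of the current job against the output of an upgraded candidate run on the same input. The S3 paths and SparkSession setup are assumptions for the example.

# Minimal sketch: compare baseline output (current Spark version) with the output
# of the upgraded job run against the same input. The paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upgrade-output-comparison").getOrCreate()

baseline_df = spark.read.parquet("s3://my-bucket/output/current-version/")
candidate_df = spark.read.parquet("s3://my-bucket/output/upgraded-version/")

# Basic regression checks: identical row counts and no rows unique to either side.
assert baseline_df.count() == candidate_df.count(), "row counts differ"
assert candidate_df.exceptAll(baseline_df).count() == 0, "upgraded job produced unexpected rows"
assert baseline_df.exceptAll(candidate_df).count() == 0, "upgraded job dropped rows"
print("outputs match")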

Introducing generative AI upgrades for Spark

The Spark Upgrades feature uses AI to automate both the identification and the validation of the changes needed to upgrade your AWS Glue Spark jobs. Let’s look at how these capabilities work together to simplify your upgrade journey.

AI-driven upgrade plan generation

When upgrading, the service uses AI to identify the necessary changes across both your PySpark code and Spark configurations. During the preview, Spark Upgrades supports upgrades from Glue 2.0 (Spark 2.4.3, Python 3.7) to Glue 4.0 (Spark 3.3.0, Python 3.10), automatically handling changes that would typically require careful review of public documentation and version migration guides, followed by development, testing, and verification. Spark Upgrades focuses on four key areas of change:

  • Spark SQL API methods and functions
  • Spark DataFrame API methods and operations
  • Python language updates (including module deprecations and syntax changes)
  • Spark SQL and Core configuration settings

Migrating from Spark 2.4.3 to Spark 3.3.0 involves over 100 version-specific changes, which illustrates the complexity of these upgrades. Manual upgrades are difficult for several reasons:

  • Spark offers both imperative and declarative ways to express the same logic, which makes it easy to write applications but hard to identify all the affected code during an upgrade (see the sketch after this list).
  • Spark’s lazy evaluation of transformations on distributed data improves performance, but it makes it harder for users to verify application behavior and detect changes at runtime.
  • Changes to configuration defaults or the introduction of new configurations across versions can alter application behavior in subtle ways, making it difficult for users to detect changes during upgrades.
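To illustrate the first point, the following short sketch (a simplified example, not taken from the job discussed later) shows the same aggregation expressed both declaratively in Spark SQL and imperatively with the DataFrame API. During a manual upgrade, both forms have to be found and reviewed.

# The same aggregation written two ways. An upgrade review must cover both styles.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
reviews = spark.createDataFrame(
    [("US", 5), ("US", 3), ("UK", 4)], ["marketplace", "star_rating"]
)
reviews.createOrReplaceTempView("reviews")

# Declarative (Spark SQL) form
spark.sql(
    "SELECT marketplace, avg(star_rating) AS avg_rating FROM reviews GROUP BY marketplace"
).show()

# Imperative (DataFrame API) form of the same logic
reviews.groupBy("marketplace").agg(avg("star_rating").alias("avg_rating")).show()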

For example, in Spark 3.2 and later, the Spark SQL TRANSFORM operator can't support aliases in its inputs. In Spark 3.1 and earlier, you could write a script transform like SELECT TRANSFORM(column1 AS columnC1, column2 AS columnC2) USING 'category' FROM myTable;.

# Original code (Glue 2.0)
query = """
    SELECT TRANSFORM(merchandise AS product_name, worth AS product_price, quantity AS product_number)
    USING 'cat'
    FROM items
    WHERE items.worth > 5
"""
spark.sql(query)

# Updated code (Glue 4.0)
query = """
    SELECT TRANSFORM(merchandise, worth, quantity)
    USING 'cat' AS (product_name, product_price, product_number)
    FROM items
    WHERE items.worth > 5
"""
spark.sql(query)

As another example, in Spark 3.1, loading and saving timestamps before 1900-01-01 00:00:00Z as INT96 in Parquet files causes errors. In Spark 3.0, this wouldn't fail but could result in timestamp shifts due to calendar rebasing. To restore the old behavior in Spark 3.1, you need to set the Spark SQL configurations spark.sql.legacy.parquet.int96RebaseModeInRead and spark.sql.legacy.parquet.int96RebaseModeInWrite to LEGACY.

spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")

from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

# Timestamps around the 1900-01-01 boundary, which trigger the INT96 rebase behavior
data = [(1, datetime(1899, 12, 31, 23, 59, 59)), (2, datetime(1900, 1, 1, 0, 0, 0))]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("timestamp", TimestampType(), True),
])
df = spark.createDataFrame(data, schema=schema)
df.write.mode("overwrite").parquet("path/to/parquet_file")

Automated validation in your environment

Spark Upgrades validates the upgraded application by running it as an AWS Glue job in your AWS account. The service performs a series of validation runs, up to 10 iterations, reviewing any errors encountered in each run and refining the upgrade plan until it succeeds. These validation runs execute in your development account, using sample datasets supplied through Glue job parameters to simulate your production workloads.
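The sketch below shows one way a job script can read its dataset location from a Glue job parameter, so the same script can point at a small sample dataset during validation runs and at the full dataset in production. The parameter name input_path and the paths it might carry are hypothetical examples, not parameters defined by the service.

# Minimal sketch: resolving a dataset location from a Glue job parameter.
# The "input_path" parameter name and its values are hypothetical.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ["input_path"])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# During validation runs, --input_path can point at a few gigabytes of mock data;
# in production it points at the full, partitioned dataset.
df = spark.read.parquet(args["input_path"])
print(df.count())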

After Spark Upgrades successfully validates the changes, it presents an upgrade plan for you to review. You can then accept and apply the changes to your job in the development account, and later replicate them to your job in the production account. The Spark upgrade plan includes the following:

  • An upgrade summary listing all the code updates applied during the process
  • The final upgraded script to use in place of your current script
  • Logs of the validation runs documenting the issues identified and how they were resolved

You can review all aspects of the plan, including the intermediate validation attempts and error resolutions, before applying the changes to your production job. This approach provides transparency and control over the upgrade process while still delivering the efficiency of AI-driven automation.

Getting started with generative AI Spark upgrades

Let’s walk through the process of upgrading an AWS Glue 2.0 job to AWS Glue 4.0. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select your AWS Glue 2.0 job.
  3. For the result path, enter s3://aws-glue-assets-<account-id>-<region>/scripts/upgraded/ (replace <account-id> with your AWS account ID, for example 123456789012, and <region> with your AWS Region, for example us-east-1).
  4. Choose the option to run the upgrade analysis.
  5. Wait for the analysis to complete. You can track its progress on the corresponding tab.

    While the analysis is in progress, you can review up to 10 intermediate validation attempts on the corresponding tab, along with the upgrade plan as the service refines it with each iteration. Each attempt shows a distinct failure cause, which the service addresses in the next retry through code or configuration changes.
    After a successful analysis, the upgraded script and a summary of the changes are uploaded to Amazon S3.
  6. Review the changes to make sure they meet your requirements, then accept and apply the upgraded script to your job.

Your job has now been upgraded to AWS Glue version 4.0. You can check the updated script and review the modified job configuration on their respective tabs to confirm the changes.
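If you prefer to confirm the result programmatically, the following sketch uses the AWS SDK for Python (Boto3) to inspect the job definition. The job name and Region are placeholder values.

# Optional check: confirm the job definition now targets Glue 4.0.
# "books-reviews-etl" and "us-east-1" are placeholder values.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
job = glue.get_job(JobName="books-reviews-etl")["Job"]

print(job["GlueVersion"])                 # expected to show "4.0" after the upgrade is applied
print(job["Command"]["ScriptLocation"])   # location of the updated script in Amazon S3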

Understanding the upgrade process through an example

We now walk through a production Glue 2.0 job that we upgrade to Glue 4.0 using the Spark Upgrades feature.

This Glue 2.0 job reads data daily from an Amazon S3 bucket, partitioned by date, containing new book reviews from an online marketplace, and then uses SparkSQL to extract insights about customer voting patterns for each review.

Original code (Glue 2.0), before the upgrade:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

from collections import Sequence
from pyspark.sql.types import DecimalType
from pyspark.sql.functions import lit, to_timestamp, col


def is_data_type_sequence(coming_dict):
    return True if isinstance(coming_dict, Sequence) else False


def dataframe_to_dict_list(df):
    return [row.asDict() for row in df.collect()]


books_input_path = (
    "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
)
view_name = "books_temp_view"
static_date = "2010-01-01"

books_source_df = (
    spark.read.option("header", "true")
    .option("recursiveFileLookup", "true")
    .option("path", books_input_path)
    .parquet(books_input_path)
)
books_source_df.createOrReplaceTempView(view_name)

books_with_new_review_dates_df = spark.sql(
    f"""
        SELECT {view_name}.*,
            DATE_ADD(to_date(review_date), "180.8") AS next_review_date,
            CASE
                WHEN DATE_ADD(to_date(review_date), "365") < to_date('{static_date}') THEN 'Yes'
                ELSE 'No'
            END AS Actionable
        FROM {view_name}
    """
)
books_with_new_review_dates_df.createOrReplaceTempView(view_name)

aggregate_books_by_marketplace_df = spark.sql(
    f"SELECT marketplace, count({view_name}.*) as total_count, avg(star_rating) as average_star_ratings, avg(helpful_votes) as average_helpful_votes, avg(total_votes) as average_total_votes FROM {view_name} GROUP BY marketplace"
)
aggregate_books_by_marketplace_df.show()

data = dataframe_to_dict_list(aggregate_books_by_marketplace_df)
if is_data_type_sequence(data):
    print("data is valid")
else:
    raise ValueError("Data is invalid")

aggregated_target_books_df = aggregate_books_by_marketplace_df.withColumn(
    "average_total_votes_decimal", col("average_total_votes").cast(DecimalType(3, -2))
)
aggregated_target_books_df.show()

New code (Glue 4.0), after the upgrade:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from collections.abc import Sequence
from pyspark.sql.types import DecimalType
from pyspark.sql.functions import lit, to_timestamp, col

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.legacy.allowStarWithSingleTableIdentifierInCount", "true")
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
job = Job(glueContext)


def is_data_type_sequence(coming_dict):
    return True if isinstance(coming_dict, Sequence) else False


def dataframe_to_dict_list(df):
    return [row.asDict() for row in df.collect()]


books_input_path = (
    "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
)
view_name = "books_temp_view"
static_date = "2010-01-01"

books_source_df = (
    spark.read.option("header", "true")
    .option("recursiveFileLookup", "true")
    .load(books_input_path)
)
books_source_df.createOrReplaceTempView(view_name)

books_with_new_review_dates_df = spark.sql(
    f"""
        SELECT {view_name}.*,
            DATE_ADD(to_date(review_date), 180) AS next_review_date,
            CASE
                WHEN DATE_ADD(to_date(review_date), 365) < to_date('{static_date}') THEN 'Yes'
                ELSE 'No'
            END AS Actionable
        FROM {view_name}
    """
)
books_with_new_review_dates_df.createOrReplaceTempView(view_name)

aggregate_books_by_marketplace_df = spark.sql(
    f"SELECT marketplace, count({view_name}.*) as total_count, avg(star_rating) as average_star_ratings, avg(helpful_votes) as average_helpful_votes, avg(total_votes) as average_total_votes FROM {view_name} GROUP BY marketplace"
)
aggregate_books_by_marketplace_df.show()

data = dataframe_to_dict_list(aggregate_books_by_marketplace_df)
if is_data_type_sequence(data):
    print("data is valid")
else:
    raise ValueError("Data is invalid")

aggregated_target_books_df = aggregate_books_by_marketplace_df.withColumn(
    "average_total_votes_decimal", col("average_total_votes").cast(DecimalType(3, -2))
)
aggregated_target_books_df.show()

Upgrade summary

  • In Spark 3.2, spark.sql.adaptive.enabled is enabled by default. To restore the behavior before Spark 3.2, you can set spark.sql.adaptive.enabled to false.
  • One change had no matching migration rule in the provided context; it was driven by the error message stating that Sequence could not be imported from the collections module, because Python 3.10 moved Sequence to the collections.abc module.
  • In Spark 3.1, an error occurs when the path option coexists with path parameters passed to DataFrameReader.load(), DataFrameWriter.save(), DataStreamReader.load(), or DataStreamWriter.start(); instead, you need to specify a single path. For example, spark.read.format("csv").option("path", "/tmp").load("/tmp2") fails. In Spark version 3.0 and earlier, the path option was overwritten and added to the overall set of paths. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.pathOptionBehavior.enabled to true.
  • In Spark 3.0, date_add and date_sub accept only int, smallint, and tinyint as the second argument; fractional and non-literal strings are no longer valid. For example, date_add(to_date('1964-05-23'), '12.34') causes an AnalysisException. String literals are still allowed, but Spark raises an AnalysisException if the string content is not a valid integer. In Spark 2.4 and earlier, if the second argument was a fractional or string value, it was coerced to an int value, and the result was the date value '1964-06-04'.
  • In Spark 3.2, count(tblName.*) is blocked to avoid producing ambiguous results, because count(*) and count(tblName.*) output differently if there are any null values. To restore the behavior before Spark 3.2, you can set spark.sql.legacy.allowStarWithSingleTableIdentifierInCount to true.
  • In Spark 3.0, negative scale for decimals isn't allowed by default; for example, the data type of a literal like 1E10BD is DecimalType(11, 0). In Spark 2.4 and earlier, it was DecimalType(2, -9). To restore the behavior before Spark 3.0, you can set spark.sql.legacy.allowNegativeScaleOfDecimal to true.

Comparing the Glue 4.0 (Spark 3.3.0) script with the original Glue 2.0 (Spark 2.4.3) script shows six code and configuration updates, applied across six attempts of the Spark upgrade analysis.

  • Attempt #1 addressed a failure caused by adaptive query execution (spark.sql.adaptive.enabled), which is enabled by default starting with Spark 3.2 to improve query performance. Users can review this configuration change and enable or disable it based on their preference.
  • Attempt #2 resolved an error caused by a Python 3.10 change: the abstract base class Sequence must now be imported from collections.abc.
  • Attempt #3 resolved an error caused by a behavior change introduced in Spark 3.1: the path option can no longer coexist with a path parameter in DataFrameReader operations.
  • Attempt #4 resolved an error caused by the change in the DATE_ADD signature, which accepts only integers as its second argument starting with Spark 3.0.
  • Attempt #5 resolved an error caused by the behavior change in count(tblName.*) starting with Spark 3.2. The previous behavior was restored with the new configuration spark.sql.legacy.allowStarWithSingleTableIdentifierInCount.
  • Attempt #6 successfully ran the script on Glue 4.0 with no new errors. The final attempt resolved an error caused by the prohibition of negative scale in cast(DecimalType(3, -2)) starting with Spark 3.0. The issue was addressed with the new configuration spark.sql.legacy.allowNegativeScaleOfDecimal.

Important considerations for the preview

As you start using automated Spark upgrades during the preview, there are several important aspects to consider for optimal use of the service:

  • The preview supports upgrading PySpark code from AWS Glue version 2.0 to version 4.0. At the time of writing, the service supports PySpark code that doesn't rely on additional library dependencies. You can run automated upgrades for up to 10 jobs concurrently in an AWS account, allowing you to efficiently modernize multiple jobs while maintaining system stability.
  • The service validates the upgrade plan through multiple iterations, each running as an AWS Glue job in your account, so optimizing the job run configuration is important for cost-efficiency. We recommend specifying a run configuration at the start of an analysis to keep the iterations consistent and comparable, including:
    • Using non-production developer accounts and selecting sample mock datasets that represent your production data at a smaller scale for validating Spark upgrades.
    • Using right-sized compute resources, such as G.1X workers, and selecting an appropriate number of workers for processing your sample data.
    • Enabling Glue auto scaling when applicable to adjust resources automatically based on workload.

    For example, if your production job processes large volumes of data with 20 G.2X workers, configure the upgrade job to process a few gigabytes of representative data with 2 G.2X workers and auto scaling enabled for validation (see the sketch after this list).

  • During the preview, we recommend starting with non-production jobs for your upgrades. This approach lets you become familiar with the upgrade workflow and see how the service handles different kinds of Spark code patterns.
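As a concrete illustration of such a run configuration, the following sketch starts a validation-sized run with Boto3 using 2 G.2X workers, auto scaling, and a sample dataset. The job name, argument names, and S3 paths are placeholders, and this is not the Spark Upgrades API itself.

# Minimal sketch: a right-sized run configuration for validation purposes.
# Job name, argument names, and S3 paths are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.start_job_run(
    JobName="books-reviews-etl-dev",
    Arguments={
        "--input_path": "s3://my-dev-bucket/sample/books_reviews/",  # a few GB of mock data
        "--enable-auto-scaling": "true",  # let Glue scale workers based on workload
    },
    WorkerType="G.2X",
    NumberOfWorkers=2,
)
print(response["JobRunId"])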

Your feedback is essential for helping us refine and improve this capability. We encourage you to share your ideas, suggestions, and any issues you encounter through AWS Support or your account team. This feedback will help us prioritize the features that matter most to you during the preview.

Conclusion

Automated Spark upgrades in AWS Glue help you migrate your Spark workloads, simplifying the process of keeping your applications up to date and secure. By using generative AI, the service automatically identifies and validates the required script and configuration changes across Spark versions, streamlining the migration process.

To learn more about this feature in AWS Glue, refer to the AWS Glue documentation.

We would like to express our sincere gratitude to the following individuals who contributed to the successful launch of generative AI upgrades for Apache Spark in AWS Glue: Shuai Zhang, Mukul Prasad, Liyuan Lin, Rishabh Nair, Raghavendhar Thiruvoipadi Vidyasagar, Tina Shao, Chris Kha, Neha Poonia, Xiaoxi Liu, Japson Jeyasekaran, Suthan Phillips, Raja Jaya Chandra Mannem, Yu-Ting Su, Neil Jonkers, Boyko Radulov, Sujatha Rudra, Mohammad Sabeel, Mingmei Yang, Matt Su, Daniel Greenberg, Charlie Sim, McCall Petier, Adam Rohrscheib, Andrew King, Ranu Shah, Aleksei Ivanov, Bernie Wang, Karthik Seshadri, Sriram Ramarathnam, Asterios Katsifodimos, Brody Bowman, Sunny Konoplev, Bijay Bisht, Saroj Yadav, Carlos Orozco, Nitin Bahadur, Kinshuk Pahare, Santosh Chandrachood, and William Vambenepe.


About the Authors

Is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Is a Senior Software Development Engineer on the AWS Glue team, focused on combining generative AI and data integration technologies to deliver solutions that meet customers’ data and analytics needs.

Is a Senior Product Manager at AWS Analytics. He leads generative AI features across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.

Is a Software Development Manager on the AWS Glue team. He is passionate about building customer-centric, highly scalable, and resilient offerings on the AWS Cloud. He enjoys hiking and playing net-based sports.

Is a Software Engineer at AWS Glue. He is interested in designing scalable distributed systems for big data processing, analytics, and management, and is enthusiastic about applying generative AI technologies to deliver innovative experiences to customers. Outside of work, he loves sports and especially enjoys playing tennis.

Is a Software Engineer at AWS Glue who is passionate about designing robust, scalable solutions to complex customer problems. He is enthusiastic about generative AI and exploring new ways to apply cutting-edge AI technologies to build industry-leading solutions.

Is a Senior Software Development Manager for the AWS Glue and Amazon EMR teams. His team builds distributed systems that let customers use intuitive interfaces and AI-driven capabilities to efficiently process data at any scale across data lakes on Amazon S3, and databases and data warehouses in the cloud.
