Monday, August 4, 2025

Automate information lineage in Amazon SageMaker utilizing AWS Glue Crawlers supported information sources

The following era of Amazon SageMaker is the middle for all of your information, analytics, and AI. Bringing collectively extensively adopted Amazon Net Providers (AWS) machine studying (ML) and analytics capabilities, it delivers an built-in expertise for analytics and AI with unified entry to all of your information. From Amazon SageMaker Unified Studio, a single information and AI growth surroundings, you may entry your information and use a collection of highly effective instruments for information processing, SQL analytics, mannequin growth, coaching and inference, and generative AI growth.

With information lineage, now a part of Amazon SageMaker Catalog, you may centralize lineage metadata of your information property in a single place. You may observe the movement of knowledge over time, figuring out a transparent understanding of the place it originated, the way it has modified, and its utilization throughout the enterprise. By offering this stage of transparency, information lineage helps information customers achieve belief that the info is right and compliant for his or her use circumstances. With information lineage captured on the desk, column, and job stage, information producers can conduct influence evaluation of adjustments of their information pipelines and reply to information points when wanted, for instance, when a column within the ensuing dataset is lacking the standard required by the enterprise.

Information lineage is a robust instrument that may rework how organizations perceive and handle their information flows. On this put up, we discover its real-world influence by way of the lens of an ecommerce firm striving to spice up their backside line.

For example this sensible utility, we stroll you thru how you should use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to robotically seize lineage for information property saved in Amazon Easy Storage Service (Amazon S3) and Amazon DynamoDB. Utilizing this workflow, you may seize lineage robotically from extra information sources utilizing AWS Glue crawlers. Discuss with the Information lineage assist matrix within the SageMaker Unified Studio Consumer Information for supported sources. We additionally use SageMaker Unified Studio to navigate these information property and find out about their origin, transformations, and dependencies, due to the lineage metadata captured utilizing the AWS Glue crawlers.

Key options of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you may discover and uncover information property of your group suited to your use case. As you dive into these information property, you may be taught extra about its enterprise context, schema, high quality, and lineage. Whenever you resolve to work with a subset of those property, you may subscribe to them in a self-service trend and begin working with them. For extra element, go to Information discovery, subscription, and consumption within the SageMaker Unified Studio Consumer Information.

SageMaker Studio gives a visible lineage graph that reveals how an information asset has advanced from its supply by way of transformations to its remaining state. This helps information scientists, engineers, and analysts reply key questions equivalent to:

  • The place did this information come from?
  • What transformations has it gone by way of?
  • Which downstream property can be impacted by a change?

With this stage of visibility, groups can carry out sooner influence evaluation, discover the foundation trigger of knowledge high quality points, and guarantee fashions are constructed on trusted information. It additionally helps higher collaboration so customers can confidently use and share information throughout the group. The next screenshot reveals how SageMaker Unified Studio visualizes information lineage, making it simple to hint information movement and perceive dependencies.

  • Column-level lineage – You may develop column-level lineage when obtainable in dataset nodes. This robotically reveals relationships with upstream or downstream dataset nodes if supply column data is obtainable.
  • Column search – If the dataset has greater than 10 columns, the node presents pagination to navigate to columns not initially introduced. To rapidly view a selected column, you may search on the dataset node that lists solely the searched column.
  • Particulars pane – Every lineage node captures and shows the next particulars:
    • Each dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the totally different variations of lineage occasion captured for that node.
    • The job node has a particulars pane to show job particulars with the tabs Job data and Historical past. The small print pane additionally captures queries or expressions run as a part of the job.
  • View dataset nodes solely – If you wish to filter out the job nodes, you may select the open view management icon within the graph viewer and toggle the show dataset nodes solely, which is able to take away all of the job nodes from the graph and allow you to navigate solely the dataset nodes.
  • Model tabs – All lineage nodes in Amazon DataZone information lineage can have versioning, captured as historical past, primarily based on lineage occasions captured. You may view lineage at a specific timestamp that opens a brand new tab on the lineage web page to assist examine or distinction between the totally different timestamps.

You may attempt a few of these options as you discover the info property of this put up. To be taught extra on information lineage in SageMaker, we encourage you to dive deep into the Information lineage in Amazon SageMaker Unified Studio.

Answer overview

Think about a situation the place an ecommerce firm goals to optimize conversion charges and improve buyer expertise by gaining deeper insights into the client journey. They should join the dots between consumer interactions and precise purchases, however with information scattered throughout a number of sources, the place do they start? That is the place information lineage turns into invaluable. To carry out their evaluation, they want information from two main sources:

  • Clickstream information saved in Amazon S3 (in JSON or Parquet format)
  • Transactional order information saved as gadgets in Amazon DynamoDB

To make these datasets discoverable throughout the enterprise, you might want to:

  1. Create a undertaking in SageMaker Unified Studio that can be used to supply and handle the datasets
  2. Allow information lineage seize within the SageMaker Unified Studio undertaking
  3. Arrange the assets for this use case, which incorporates an AWS Glue information supply (arrange in SageMaker Unified Studio) and AWS Glue crawler (arrange in AWS Glue)
  4. Run the AWS Glue crawler to catalog the datasets in AWS Glue Information Catalog
  5. Supply the metadata of the info property into the SageMaker Catalog by working the info supply
  6. Use SageMaker Unified Studio to navigate by way of the lineage of the info property and visualize their origin
  7. Perceive how schema evolution is captured within the information asset’s lineage

Stipulations

To finish the steps on this put up, you want an SageMaker Unified Studio area already deployed in your AWS account. To get began rapidly in a testing surroundings, we propose creating your SageMaker area utilizing the fast setup choice as defined in Create an Amazon SageMaker Unified Studio area – fast setup.

Answer steps

To seize information lineage for AWS Glue tables managed with AWS Glue crawlers utilizing SageMaker Unified Studio, full the steps within the following sections.

Arrange a SageMaker undertaking with SQL functionality

In SageMaker Unified Studio, a undertaking profile defines an uber template for initiatives in your Amazon SageMaker unified area. By organising a undertaking with the precise tooling (undertaking profile), you’ll provision assets you should use to work with information, which could embrace cataloging it in SageMaker, reworking it into new information property, analyzing it to drive enterprise worth, and even use it for ML or AI functions.

To display information lineage successfully, we use SageMaker SQL analytics undertaking profile for a streamlined setup. Though this profile affords complete information analytics capabilities, we focus particularly on two key elements:

  • AWS Glue database – A lakehouse for storing and managing technical metadata
  • Information supply job – Mechanically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complicated guide configurations so we will concentrate on the core ideas of knowledge lineage.

To create a brand new undertaking in your SageMaker area utilizing the SQL analytics undertaking profile, comply with the steps detailed in SQL analytics undertaking profile. Preserve all default configurations when creating the undertaking.

After creating your undertaking in SageMaker Studio, you’ll unlock highly effective information lineage capabilities that make monitoring and understanding your information flows intuitive. Via the info sourcing function, you may simply monitor how information strikes from supply to the AWS Glue database. This visibility turns into significantly priceless when debugging information points—you may rapidly hint information again to its supply, perceive how adjustments influence downstream processes, and determine affected analyses or experiences. Subsequent, populate the AWS Glue database with pattern information to look at these options in motion and display how they will streamline your information operations.

For additional steerage on how you can entry the main points of the brand new SageMaker undertaking, consult with Get undertaking particulars. After you entry the info supply particulars, within the Database identify area, pay attention to the AWS Glue database identify related to the SageMaker undertaking.

Allow information lineage seize within the SageMaker undertaking’s information supply

To allow lineage seize, comply with these steps:

  1. Broaden the Actions menu, then select Edit information supply.
  2. Go to the connections and choose Import information lineage to configure lineage seize from the supply, as proven within the following screenshot.
  3. Make different adjustments to the info supply fields as desired, then select Save.

Enabling lineage will be certain the info supply job will seize lineage within the subsequent run.

Deploy assets for the use case

Observe these steps:

  1. To deploy the assets required for this put up, obtain the AWS CloudFormation template amazon-datazone-examples within the AWS Samples GitHub repository. Deploy it in your AWS account.

For additional steerage on how you can deploy a CloudFormation stack, consult with Create a stack from the CloudFormation console. That you must present a Stack identify and the identify of the AWS GlueDatabaseName related to the undertaking of your SageMaker area, as proven within the following screenshot.

  1. Select Subsequent.

The template will deploy the next assets:

  • A S3 bucket with a pattern file of clickstream information. The bucket identify and site of the file will comply with the trail sample s3://ecomm-analytics--/clickstream///
    /information.json

    . The file will include a pattern report with the next construction:

{     "session_id": "abc123",     "user_id": "u789",     "event_type": "product_view",     "product_id": "prod456",     "timestamp": "2025-06-04T09:23:12Z" }

  • A DynamoDB desk with a pattern merchandise of order information (transactions). The desk can be named OrderTransactionTable. The pattern merchandise can have the next construction:
{     "order_id": "ord789",     "user_id": "u789",     "product_id": "prod456",     "order_total": 79.99,     "order_timestamp": "2025-06-04T09:27:10Z" }

  • An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB desk deployed as a part of the stack and retailer the metadata within the AWS Glue database related to the SageMaker undertaking. You may entry the crawler’s particulars within the AWS console, as proven within the following screenshot.

Run the AWS Glue crawler

The AWS Glue crawler deployed within the earlier step will assist you to seize metadata from the 2 information sources, Amazon S3 and DynamoDB, and retailer it in AWS Glue Information Catalog, particularly within the database related to the SageMaker undertaking. After the metadata is saved, will probably be accessible to SageMaker.

Earlier than working the crawler, you might want to present AWS Lake Formation permissions to the IAM function that the AWS Glue crawler will use to work together together with your information supply and goal AWS Glue database. The next command will grant the permissions wanted for the crawler to retailer metadata into the AWS Glue database of the SageMaker undertaking.

To invoke this command, we advocate utilizing AWS CloudShell on the AWS console as defined in AWS CloudShell Ideas. Replace the , and  placeholders with the precise values to your AWS Area, AWS account ID, and identify of the AWS Glue database related to the SageMaker undertaking.

aws lakeformation grant-permissions    --region     --principal DataLakePrincipalIdentifier=arn:aws:iam:::function/glue-crawler-role     --permissions CREATE_TABLE    --resource '{ "Database": { "Identify": "" } }'   

Subsequent, run the AWS Glue Crawler on the AWS console. After the crawler efficiently finishes, two new tables, clickstream and ordertransactiontable, can be created within the AWS Glue database related to the SageMaker undertaking. Discuss with Viewing crawler outcomes and particulars to be taught extra about AWS Glue crawler outcomes.

Supply metadata from the AWS Glue database into SageMaker

To supply metadata from information property within the AWS Glue database, together with their lineage, into SageMaker, use the info supply that was deployed as a part of the SageMaker undertaking creation.

  1. To run the info supply, go to the info supply particulars web page.
  2. Select Run. (Information sources might be scheduled to run as nicely, nevertheless, for this demonstration we set off a guide run).

After the info supply run is full, metadata from each information property within the AWS Glue database can be imported into the SageMaker area because the undertaking’s stock property. You could find the main points of the info supply run from inside SageMaker Unified Studio, which embrace:

  • The information property from the AWS Glue database that had been ingested into SageMaker.
  • The standing of the info lineage import for every information asset, which incorporates an occasion ID for traceability. This lineage occasion ID can be utilized to debug inconsistencies within the ensuing lineage graph. You should use the GetLineageEvent API to retrieve the uncooked payload of the lineage occasion.

Visualizing the info lineage graph of the info property in SageMaker Unified Studio

With SageMaker Unified Studio, you have got a single place to handle and uncover information property. When accessing an information asset revealed within the SageMaker central catalog or in your undertaking’s personal stock, you may dive into the asset’s metadata, which incorporates its schema, enterprise description, customized metadata kinds, high quality, lineage, and extra. To visualise the lineage graph of every information asset of this put up, comply with these steps:

  1. In SageMaker Studio, navigate to the Property part of the SageMaker undertaking particulars web page and select INVENTORY
  2. Choose the asset that you just wish to discover. You can too entry the asset instantly from the info supply run by deciding on the asset identify.
  3. To view the lineage graph of the info asset as much as its origin, proven within the following screenshots, select the LINEAGE tab.
    • For clickstream desk (Sourced from S3)

    • For order transactions desk (Sourced from DynamoDB)

With lineage, now you can affirm that the info originated from sources equivalent to Amazon S3 and Amazon DynamoDB and perceive the way it has been reworked alongside the way in which. Due to this end-to-end visibility, you may belief the info, make knowledgeable choices, and supply compliance with confidence. The lineage graph captures important metadata that kinds the inspiration of lineage monitoring.

  • This consists of desk schemas, column definitions and their information sorts.
  • Column-level lineage turns into significantly highly effective on this context. Think about your clickstream’s AWS Glue desk powers an Amazon QuickSight dashboard analyzing buyer buy patterns and spot discrepancies in your income experiences. With column lineage, you may immediately hint the supply of these columns.
  • This granular visibility not solely accelerates debugging but in addition proves invaluable throughout schema adjustments, as we present within the following part by altering the supply schema.
  • The crawler particulars equivalent to crawlerRunId (current within the supply identifier of the lineage node) and crawler begin and finish instances can be utilized to debug which crawler runs up to date the desk.

Understanding your information asset’s schema evolution by way of lineage in SageMaker Unified Studio

Think about the order transactions supply in DynamoDB was up to date with new data. As a result of this supply powers an Amazon QuickSight report for the client utilizing the AWS Glue database desk, it’s vital for customers to know what adjustments within the information pipeline up to date the report.

  1. Edit the DynamoDB desk merchandise with extra columns to learn the way lineage graph can be utilized to view historic updates:
{     "order_id": "ord789",     "user_id": "u789",     "product_id": "prod456",     "order_total": 79.99,     "order_timestamp": "2025-06-04T09:27:10Z", 	"customerSegment": "new-customer",     "conversionSource": "primeDayEmailCampaign" }

  1. Enter the OrderTransactionsCrawler Glue crawler once more on the AWS console. After completion, you’ll discover that it up to date the ordertransactiontable AWS Glue desk, as proven within the following screenshot.

  1. Run once more the info supply related to the undertaking in SageMaker Unified Studio to import the newest metadata into the SageMaker Catalog. After completion, you’ll discover the info supply up to date the ordertransactiontable information asset within the SageMaker Catalog, as proven within the following screenshot.

This part explores how lineage might be helpful to trace the updates.

Navigate to the ordertransactiontable information asset in SageMaker Catalog by deciding on it from the info supply run and select the LINEAGE tab, as proven within the following screenshot.

Discover how the brand new columns can be found within the lineage graph. A brand new crawler run ID is current because the supply identifier of the crawler lineage node. The historical past tab reveals a number of crawler runs. You may navigate to examine the state of the system in the course of the first run.

Cleanup

After you’re executed, we advocate to cleansing up the assets created for this put up to keep away from unintended prices:

  1. Delete the stock property that had been cataloged within the SageMaker undertaking’s stock, as defined in Delete an Amazon SageMaker Unified Studio asset.
  2. Delete the SageMaker undertaking that was created as a part of this put up, as defined in Delete a undertaking.
  3. Delete the CloudFormation stack that was deployed as a part of this put up, as defined in Delete a stack from the CloudFormation console.
  4. The S3 bucket created as a part of the CloudFormation stack will stay after its deletion as a result of it accommodates an information file in it. Empty and delete the bucket, as defined in Deleting a normal function bucket.

Conclusion

On this put up, you had been capable of discover the info lineage capabilities of Amazon SageMaker, particularly when working with AWS Glue crawlers. You realized how one can arrange an AWS Glue crawler to deduce metadata from information property in a number of sources equivalent to Amazon S3 and DynamoDB and retailer it the AWS Glue Information Catalog. You additionally imported this metadata, together with information lineage, into Amazon SageMaker by way of the info supply functionality of a SageMaker undertaking. Lastly, you explored the ensuing lineage graph of knowledge property in SageMaker Unified Studio and noticed among the functionalities obtainable to grasp the origin path of them, perceive how columns are reworked, and what influence seems to be like when performing adjustments to any step of the pipeline.We encourage you to now take a look at the capabilities you explored on this put up with your individual information. By following the sample introduced on this put up, many shoppers have been capable of obtain governance of their information lake and lakehouse platforms on prime of Amazon SageMaker with information lineage and extra.


Concerning the authors

Mohit Dawar is a Senior Software program Engineer at Amazon Net Providers (AWS) engaged on Amazon DataZone. Over the previous 3 years, he has led efforts across the core metadata catalog, generative AI–powered metadata curation, and lineage visualization. He enjoys engaged on large-scale distributed methods, experimenting with AI to enhance consumer expertise, and constructing instruments that make information governance really feel easy. Join with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Options Architect for Startups at Amazon Net Providers (AWS) primarily based in Austin, TX, US. He’s captivated with serving to prospects architect fashionable platforms at scale for information, AI, and ML. As a former senior architect in AWS Skilled Providers, he enjoys constructing and sharing options for widespread complicated issues in order that prospects can speed up their cloud journey and undertake finest practices. Join with him on LinkedIn: Jose Romero.

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles