Friday, December 13, 2024

Kaplan, Inc. built modern data pipelines by integrating Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon AppFlow, and Amazon Redshift into a robust data infrastructure.

Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad range of services, supporting our students and partners as their needs change throughout their academic and professional journeys. Our Kaplan culture empowers people to achieve their goals. Committed to fostering a culture of learning, Kaplan is changing the face of education.

Kaplan data engineers power data analytics with Tableau's visualization tools. The infrastructure provides analytics capabilities to many in-house analysts, data scientists, and student-facing frontend developers. The data engineering team is working to modernize its data integration platform so that it is agile, adaptive, and straightforward to use. To achieve this, the team chose the Amazon Web Services (AWS) Cloud and its services. Many types of pipelines must be migrated from the current integration platform to the AWS Cloud, with sources that include Oracle, Microsoft SQL Server, MongoDB, APIs, software-as-a-service (SaaS) applications, and Google Sheets. At the time of writing, more than 250 objects are being extracted from three separate Salesforce environments.

This post describes how Kaplan's data engineering team implemented data integration from Salesforce to Amazon Redshift. The solution uses Amazon Simple Storage Service (Amazon S3) as the data lake, Amazon Redshift as the data warehouse, Amazon MWAA as the orchestrator, and Tableau as the presentation layer.

Solution overview

The high-level data flow starts with source data stored in Amazon S3, which is then integrated into Amazon Redshift using several AWS services. The accompanying diagram illustrates this architecture.

Amazon MWAA is our primary tool for orchestrating data pipelines, and it integrates with other tools for data migration. While looking for a tool to extract data from a SaaS application such as Salesforce and move it to Amazon Redshift, we evaluated several options. After a thorough assessment, we found that Amazon AppFlow met our needs for extracting data from Salesforce. Amazon AppFlow can also migrate data directly from Salesforce to Amazon Redshift. However, our architecture deliberately separates data ingestion from storage, for the following reasons:

  • We wanted to store data in Amazon S3 as our data lake, so that it serves as both an archive and a centralized location for our data infrastructure.
  • In the future, we may need to transform the data before it is loaded into Amazon Redshift. With Amazon S3 as an intermediate step, we can add transformation logic as a separate module without significantly disrupting the overall data flow.
  • Apache Airflow is the central point of our data infrastructure, and other pipelines are built using a variety of tools. Amazon AppFlow is only one part of that infrastructure, and we wanted a consistent approach across different data sources and targets.

We therefore split the pipeline into two parts:

  • Migrate data from Salesforce to Amazon S3 using Amazon AppFlow.
  • Load data from Amazon S3 into Amazon Redshift using Amazon MWAA.

This approach lets us use the strengths of each service while keeping our data infrastructure flexible and scalable. Amazon AppFlow can handle the first part of the pipeline without additional tools, thanks to features such as creating connections between source and target, scheduling data transfers, configuring filters, and choosing between incremental and full loads. After moving the data from Salesforce to an Amazon S3 bucket with Amazon AppFlow, we created a directed acyclic graph (DAG) in Amazon MWAA that runs an Amazon Redshift COPY command against the data stored in Amazon S3 and loads it into Amazon Redshift.
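To make the second part more concrete, here is a minimal sketch of how a COPY statement for one object might be assembled before a DAG task submits it to Amazon Redshift. The function name, table name, bucket path, role ARN, and the choice of Parquet as the file format are illustrative assumptions, not details of Kaplan's implementation.

```python
import re

# Accept plain or schema-qualified identifiers only (assumption: no quoted names).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_sql(table, s3_prefix, iam_role_arn, file_format="PARQUET"):
    """Return a Redshift COPY statement that loads one S3 prefix into one table.

    All argument values are hypothetical; the real bucket layout and file
    format depend on how the Amazon AppFlow flow is configured.
    """
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_prefix.startswith("s3://") or "'" in s3_prefix:
        raise ValueError(f"unexpected S3 location: {s3_prefix!r}")
    if "'" in iam_role_arn:
        raise ValueError("unexpected IAM role ARN")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


if __name__ == "__main__":
    print(
        build_copy_sql(
            "staging.account",
            "s3://example-bucket/salesforce/account/",
            "arn:aws:iam::123456789012:role/example-copy-role",
        )
    )
```

In a DAG, the returned string would typically be handed to whatever task runs SQL against the warehouse; that wiring is omitted here.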

We faced the following challenges with this approach:

  • To load data incrementally, we had to change the filter dates in each Amazon AppFlow flow by hand, which is not elegant. We wanted to automate that date-range change.
  • The two parts of the pipeline were not in sync, because there was no way to know when the first part had finished so that the second part could begin. We wanted to automate this handoff as well.
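As a rough illustration of the first challenge, the sketch below rewrites the date bounds on a flow's filter task so that each run covers a new window. It assumes the task list has the shape returned by the AppFlow `describe_flow` API (a `Filter` task whose `taskProperties` hold `LOWER_BOUND` and `UPPER_BOUND` as epoch milliseconds); the field name `LastModifiedDate` and the helper names are hypothetical.

```python
import copy
from datetime import datetime, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to an epoch-milliseconds string."""
    return str(int(moment.timestamp() * 1000))


def with_filter_window(tasks, field, start, end):
    """Return a copy of `tasks` with the date filter on `field` set to [start, end].

    Assumption: filter tasks look like
    {"taskType": "Filter", "sourceFields": [field], "taskProperties": {...}}.
    Tasks that do not match are left unchanged.
    """
    updated = copy.deepcopy(tasks)
    for task in updated:
        is_filter = task.get("taskType") == "Filter"
        if is_filter and field in task.get("sourceFields", []):
            properties = task.setdefault("taskProperties", {})
            properties["LOWER_BOUND"] = to_epoch_millis(start)
            properties["UPPER_BOUND"] = to_epoch_millis(end)
    return updated


if __name__ == "__main__":
    example_tasks = [
        {
            "taskType": "Filter",
            "sourceFields": ["LastModifiedDate"],
            "taskProperties": {"DATA_TYPE": "datetime"},
        }
    ]
    window_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    window_end = datetime(2024, 1, 2, tzinfo=timezone.utc)
    print(with_filter_window(example_tasks, "LastModifiedDate", window_start, window_end))
```

The updated task list could then be passed back to the flow through an update call, so that no one has to edit dates in the console.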

Implementing the solution

We used Amazon MWAA to automate these steps and resolve the challenges described above. We created a DAG that acts as the control layer for Amazon AppFlow. We developed a custom Airflow operator that calls the Amazon AppFlow APIs to perform functions such as creating, updating, deleting, and starting flows, and we use this operator within the DAG. Amazon AppFlow stores the connection details in an AWS Secrets Manager managed secret with the prefix `appflow`. The cost of storing the secret is included in the charge for Amazon AppFlow. With this in place, a single DAG orchestrates the entire data flow.
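The post does not show the operator itself, so the following is only a plain-Python sketch of the kind of dispatch logic such an operator might wrap. The class name and constructor arguments are hypothetical; `client` is assumed to behave like a boto3 AppFlow client (`boto3.client("appflow")`), whose `create_flow`, `update_flow`, `delete_flow`, and `start_flow` methods all take a `flowName` argument. In a real DAG this class would subclass Airflow's `BaseOperator`.

```python
class AppFlowAction:
    """Run one Amazon AppFlow API action against a named flow (illustrative only)."""

    # Map a short action name to the corresponding client method name.
    ACTIONS = {
        "create": "create_flow",
        "update": "update_flow",
        "delete": "delete_flow",
        "start": "start_flow",
    }

    def __init__(self, client, action, flow_name, **params):
        if action not in self.ACTIONS:
            raise ValueError(f"unsupported action: {action!r}")
        self.client = client          # assumed: a boto3 AppFlow client or a stand-in
        self.action = action
        self.flow_name = flow_name
        self.params = params          # extra keyword arguments for the API call

    def execute(self, context=None):
        """Call the mapped client method and return its response."""
        method = getattr(self.client, self.ACTIONS[self.action])
        return method(flowName=self.flow_name, **self.params)


if __name__ == "__main__":
    class _PrintingClient:
        """Stand-in client so the sketch runs without AWS credentials."""

        def start_flow(self, **kwargs):
            print("start_flow called with", kwargs)
            return {"executionId": "example-execution"}

    AppFlowAction(_PrintingClient(), "start", "example-flow").execute()
```

Passing the client in, rather than creating it inside the class, keeps the sketch testable without network access; an operator would more likely obtain it from an Airflow connection or hook.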

The complete data flow consists of the following steps:

  1. The DAG creates the flow in Amazon AppFlow.
  2. The DAG updates the flow with the new filter dates.
  3. After updating the flow, the DAG starts it.
  4. The DAG waits for the flow to finish by checking its status repeatedly.
  5. A success status confirms that the data has been migrated from Salesforce to Amazon S3.
  6. Once the flow is complete, the DAG runs a COPY command to load the data from Amazon S3 into Amazon Redshift.
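Steps 3 to 6 can be sketched as a start, poll, then load sequence. This is a minimal illustration under stated assumptions, not the team's DAG: `client` is assumed to behave like a boto3 AppFlow client, the flow is assumed to be on-demand (so `start_flow` returns an `executionId`), the newest execution records are assumed to appear in the first page of `describe_flow_execution_records`, and `run_copy` is a hypothetical callable that issues the Redshift COPY.

```python
import time

# Execution statuses that mean the run will not succeed.
_FAILED = {"Error", "Canceled"}


def wait_for_execution(client, flow_name, execution_id,
                       poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll until the given execution succeeds; raise on failure or timeout."""
    for _ in range(max_polls):
        response = client.describe_flow_execution_records(flowName=flow_name)
        for record in response.get("flowExecutions", []):
            if record.get("executionId") != execution_id:
                continue
            status = record.get("executionStatus")
            if status == "Successful":
                return record
            if status in _FAILED:
                raise RuntimeError(f"flow {flow_name} ended with status {status}")
        sleep(poll_seconds)  # still in progress (or not yet listed): wait and retry
    raise TimeoutError(f"flow {flow_name} did not finish after {max_polls} polls")


def run_flow_then_copy(client, flow_name, run_copy,
                       poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Start the flow, wait for success, then trigger the load into Redshift."""
    execution_id = client.start_flow(flowName=flow_name)["executionId"]
    record = wait_for_execution(
        client, flow_name, execution_id, poll_seconds, max_polls, sleep
    )
    run_copy()  # only reached when the data has landed in Amazon S3
    return record
```

Because the COPY is called only after a success status, the second part of the pipeline can never start before the first part has finished, which is the synchronization the steps above describe. In Airflow, the polling loop would more commonly be expressed as a sensor task rather than a blocking loop.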

With this approach, we addressed the challenges described earlier. Our data pipelines are now more robust, easier to understand and use, free of manual intervention, and less error-prone, because everything is controlled from a single point (Amazon MWAA). Amazon AppFlow, Amazon S3, and Amazon Redshift are configured to use encryption to protect the data. We also implemented logging, monitoring, and auditing mechanisms to track data flow and access. The accompanying figure shows a high-level diagram of the final approach.

Conclusion

Kaplan's data engineering team successfully implemented an automated data integration pipeline from Salesforce to Amazon Redshift using AWS services such as Amazon AppFlow, Amazon S3, Amazon Redshift, and Amazon MWAA.

Using a custom Airflow operator, we brought Amazon AppFlow capabilities into a single DAG that orchestrates the entire data flow. This approach resolved the challenges of incremental data loading and of synchronization between the different parts of the pipeline, and it made the data flows more robust, easier to manage, and less error-prone. It cut by half the time needed to build a pipeline for a new object from an existing source and to build a pipeline for a new source. It also simplified retrieving incremental data, which reduced the cost per table by 80-90% compared with a full load of every object each time.

With this modern data integration platform in place, Kaplan is well positioned to give its analysts, data scientists, and student-facing teams timely and reliable data, enabling informed decisions and supporting a culture of continuous learning and growth.

Try Apache Airflow with Amazon MWAA to improve your own data orchestration pipelines.

For in-depth details and code examples of Amazon MWAA, consult the official documentation and tutorials.


About the Authors

A Senior Data Engineer at Kaplan India Pvt Ltd. who builds and manages ETL pipelines on AWS, improves process efficiency, and helps team members refine their techniques.

A leader at Kaplan Inc., he works with Kaplan's data engineers to design and build data repositories on AWS. He is the facilitator for the overall migration effort. He is passionate about building scalable distributed systems that efficiently manage large volumes of data in the cloud. Outside of work, he enjoys traveling with his family and discovering new destinations.

Jimmy is an AWS Solutions Architect with expertise in AI/ML technologies. He is based in Boston and works with enterprise customers as they move to the cloud, helping them build efficient and sustainable platforms. He is passionate about his family, cars, and mixed martial arts.
