Friday, December 27, 2024

How Jumia built a next-generation data platform on AWS with metadata-driven frameworks.

Established in 2012, Jumia is a technology company that has expanded its footprint to 14 African countries, with its primary hub in Lagos, Nigeria. Jumia’s business model revolves around three core components: a marketplace, a logistics service, and a payment service. The logistics service enables a network of local partners to deliver packages, while the payment service simplifies online payment transactions within Jumia’s platform. Jumia is listed on the New York Stock Exchange (NYSE) and has a market capitalization of approximately $554 million.

In this post, we explore the journey taken by Jumia, in collaboration with AWS Professional Services, to migrate its data platform from a legacy Hadoop-based architecture to cloud-native, serverless solutions built on Amazon Web Services (AWS). The primary drivers of the modernization were prohibitively high maintenance costs, the inability to scale compute resources on demand, cumbersome job queuing, limited ability to innovate and adopt newer technologies, the need for advanced automation of infrastructure and applications, and the inability to develop solutions locally.

Solution overview

The modernization challenge centered on developing metadata-driven frameworks that are reusable and scalable, and that address the different phases of the modernization process. These five phases are data orchestration, data migration, data ingestion, data processing, and data maintenance.

To reduce effort and risk, a standardized approach was designed to simplify development workflows and prevent errors caused by divergent methodologies. This made it possible to migrate diverse datasets in a uniform way, regardless of their specific context or application, and keeps data handling consistent, efficient, and straightforward to manage across tasks and teams. Although each use case has autonomy over its own domain from a governance standpoint, a centralized governance model controls the entry point into the shared architectural components. To protect the data, the frameworks encrypt it across the organization and store it securely in Amazon S3, following the principle of least privilege to strengthen overall security while reducing potential vulnerabilities.

The following diagram outlines the frameworks that were created. The new data platform organizes its workloads into distinct use cases. Each use case requires a set of YAML files, one for every phase from data migration to data flow orchestration, and these files serve as the primary input to the system. The output is a set of directed acyclic graphs (DAGs) that execute the corresponding tasks.

Overview

The following sections discuss the objectives of each phase, the implementation details, and the key lessons learned.

Data orchestration

The objective of this phase was to design a metadata-driven framework that orchestrates data workflows across the entire modernization process.

The orchestration framework provides a robust and scalable solution with capabilities such as dynamically creating DAGs, integrating with services outside AWS, defining dependencies based on previous executions, and exposing accessible metadata tracking for each run. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) delivers these capabilities through the Apache Airflow engine while abstracting users from the administrative operations.

The following is the metadata file schema provided as input to the data orchestration phase for a use case that processes data with Apache Spark:

owner: # Use case owner
dags:
  - name: # Use case name
    type: # DAG type (Migration, Ingestion, Transformation, or Maintenance)
    tags:
    notification:
      on_success_callback: true
      on_failure_callback: true
    spark:
      entrypoint: # Spark script entry point
      arguments: # Required arguments for the Spark script
      spark_submit_parameters: # Spark submit parameters

The purpose of these frameworks is to provide reliable and efficient building blocks that development teams can use to streamline their processes while maintaining consistency. The orchestration framework dynamically creates DAG objects in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), driven by YAML configuration files.

The framework is built from layers that cumulatively add distinct functionality, ultimately yielding a complete DAG; a minimal sketch of this pattern follows the list below.

  • DAG generation – DAGs are built from the provided metadata. Data engineers no longer need to write Python code by hand to create DAGs, because they are generated dynamically by this module.
  • Validations – This layer validates YAML files and prevents corrupted files from impacting the creation of other DAGs.
  • Dependencies – This layer manages dependencies between different DAGs, enabling complex interconnected structures to be handled.
  • Notifications – This layer manages the types of notifications and alerts that can be integrated into the workflows.
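The following is a minimal sketch (not Jumia's actual code) of how this dynamic DAG generation pattern can look on Airflow 2: YAML files are read, each validated entry is turned into a DAG object, and the object is registered so the scheduler can discover it. The configuration path, schema keys, and the placeholder task are assumptions for illustration only.

import glob
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Assumed location of the per-use-case YAML files inside the MWAA DAGs folder
CONFIG_GLOB = "/usr/local/airflow/dags/configs/*.yaml"


def build_dag(dag_config: dict, owner: str) -> DAG:
    """Build one DAG object from a validated YAML entry."""
    dag = DAG(
        dag_id=dag_config["name"],
        default_args={"owner": owner},
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        tags=dag_config.get("tags") or [],
    )
    # Real tasks (for example, EMR Serverless job submissions) would be added here
    EmptyOperator(task_id="placeholder", dag=dag)
    return dag


for path in glob.glob(CONFIG_GLOB):
    with open(path) as f:
        # In the full framework, corrupted files are filtered out by the validation layer first
        config = yaml.safe_load(f)
    for dag_config in config.get("dags", []):
        dag = build_dag(dag_config, config.get("owner", "unknown"))
        # Registering the object in globals() lets Airflow's DAG parser discover it
        globals()[dag.dag_id] = dag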

Orchestration

When using Amazon MWAA, keep in mind that, although it is a managed service, it still requires some maintenance from customers, along with a solid understanding of the DAGs and processes involved, in order to tune the deployment and achieve the desired performance. Throughout the engagement, the following parameters were fine-tuned: core.dagbag_import_timeout, core.dag_file_processor_timeout, core.min_serialized_dag_update_interval, core.min_serialized_dag_fetch_interval, scheduler.min_file_process_interval, scheduler.max_dagruns_to_create_per_loop, scheduler.processor_poll_interval, scheduler.dag_dir_list_interval, and celery.worker_autoscale.
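As an illustration of how such overrides can be applied (this is not necessarily how they were applied in the engagement), the Amazon MWAA UpdateEnvironment API accepts Airflow configuration options as key-value pairs. The environment name and values below are placeholders, not the tuned values used by Jumia.

import boto3

mwaa = boto3.client("mwaa")

# Apply Airflow configuration overrides to an existing MWAA environment.
# The environment name and the option values are illustrative placeholders.
mwaa.update_environment(
    Name="data-platform-orchestration",
    AirflowConfigurationOptions={
        "core.dagbag_import_timeout": "120",
        "core.min_serialized_dag_update_interval": "60",
        "scheduler.min_file_process_interval": "120",
        "scheduler.dag_dir_list_interval": "300",
        "celery.worker_autoscale": "10,5",
    },
)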

One of the layers shown in the preceding diagram is the validation layer, which proved crucial for generating DAGs reliably and responsively. Because YAML files are the entry point to the framework, corrupted files must be filtered out before any attempt to build DAG objects; otherwise, a single bad file could disrupt the entire deployment. The module responsible for building DAGs only accepts configurations that comply with the required specification. If a corrupted file is detected, its details are logged so developers can fix the issue.
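A minimal sketch of such a pre-filtering step is shown below, assuming the YAML schema presented earlier; the required keys and the logger name are illustrative assumptions.

import glob
import logging

import yaml

logger = logging.getLogger("dag_validation")

# Minimal contract assumed for each DAG entry in a use-case YAML file
REQUIRED_DAG_KEYS = {"name", "type", "spark"}


def load_valid_configs(pattern: str) -> list:
    """Return only the configurations that are safe to hand to the DAG factory."""
    valid_configs = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                config = yaml.safe_load(f)
            for dag_config in config["dags"]:
                missing = REQUIRED_DAG_KEYS - dag_config.keys()
                if missing:
                    raise ValueError(f"missing keys: {sorted(missing)}")
            valid_configs.append(config)
        except Exception as exc:
            # Record the details so developers can fix the corrupted file
            logger.error("Skipping invalid configuration %s: %s", path, exc)
    return valid_configs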

Data migration

The goal of this phase was to build a metadata-driven framework that migrates data from HDFS to Amazon S3 in the Apache Iceberg storage format, minimizing operational overhead, scaling seamlessly during peak hours, and guaranteeing data integrity and confidentiality throughout.

The following diagram illustrates the architecture.

Migration

In this phase, a metadata-driven framework built in PySpark receives a configuration file as input and executes the migration tasks inside an Amazon EMR Serverless job. The orchestration framework described earlier is used to create a migration DAG, which executes the following tasks in order (a simplified sketch of the Spark side follows the list):

  1. The first task creates the DDL scripts in Iceberg format, using the migration framework inside an Amazon EMR Serverless job.
  2. After the tables are created, the second task transfers the HDFS data to an S3 landing bucket, keeping the source and destination synchronized. This process brings in data from all the different layers of the data lake.
  3. Once that transfer finishes, a third task converts the data to Iceberg format, moving it from the landing bucket to the destination bucket (raw, processed, or analytics), again using the migration framework inside another Amazon EMR Serverless job.
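The following is a simplified, hypothetical sketch of the Spark side of tasks 1 and 3: creating the destination table in Iceberg format and loading the files synchronized into the landing bucket. The catalog configuration, table names, columns, and S3 paths are placeholders; in the actual framework, the DDL is generated from metadata.

from pyspark.sql import SparkSession

# SparkSession for an Amazon EMR Serverless job, configured with an Iceberg
# catalog backed by the AWS Glue Data Catalog (warehouse path is a placeholder).
spark = (
    SparkSession.builder
    .appName("hdfs-to-iceberg-migration")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-data-lake/warehouse/")
    .getOrCreate()
)

# Task 1: create the destination table in Iceberg format (schema is illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.raw.customers (
        customer_id BIGINT,
        updated_at  TIMESTAMP,
        country     STRING
    )
    USING iceberg
    PARTITIONED BY (country)
""")

# Task 3: read the files synchronized from HDFS into the landing bucket and
# rewrite them into the Iceberg table in the destination layer.
landing_df = spark.read.parquet("s3://example-landing-zone/customers/")
landing_df.writeTo("glue_catalog.raw.customers").append()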

For optimal transfer performance, file sizes between 128 MB and 256 MB are recommended, so compacting files at the source can significantly speed up the transfer. Reducing the number of files also streamlines metadata evaluation and integrity checks, accelerating the migration phase.

Data ingestion

The goal of this phase was to build another metadata-driven framework that supports both data ingestion patterns: a batch mode, which is in charge of extracting data from different data sources (such as Oracle or PostgreSQL), and a micro-batch mode, which extracts data from a Kafka cluster and, depending on configurable parameters, can also run as a native stream in real time.

The following diagram illustrates the architecture for the batch and micro-batch/streaming ingestion patterns.

Ingestion

A metadata-driven framework builds the logic required to extract and process data from sources such as Kafka, databases, and external services. It is executed through an ingestion DAG deployed on Amazon MWAA.

Spark Structured Streaming was used to ingest data from Kafka topics. The framework receives a configuration file in YAML format that specifies the topics to read, the required extraction procedures, the reading mode (streaming or micro-batch), and the destination table for the data, among other customizations.
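The following is a hedged sketch of the micro-batch reading mode, assuming a SparkSession already configured with the Iceberg catalog as in the earlier migration sketch. Broker addresses, topic, destination table, checkpoint location, and trigger interval are placeholders that, in the framework, come from the YAML configuration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Catalog configuration (Iceberg + Glue) is assumed to be passed through the
# EMR Serverless job's spark-submit parameters.
spark = SparkSession.builder.appName("kafka-ingestion").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder
    .option("subscribe", "orders")                       # topic from the YAML file
    .option("startingOffsets", "latest")
    .load()
)

parsed = raw.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)

query = (
    parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://example-checkpoints/orders/")  # placeholder
    .trigger(processingTime="1 minute")  # micro-batch mode driven by the configured interval
    .toTable("glue_catalog.raw.orders")  # destination table from the YAML file
)
query.awaitTermination()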

Batch ingestion was implemented with a similar metadata-driven framework, built on PySpark to take advantage of its scalability and flexibility. Following the same approach, the framework receives a YAML configuration that specifies the tables to ingest and their corresponding destinations.
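A minimal sketch of the batch path is shown below, assuming a JDBC source such as PostgreSQL. The connection details, credentials handling, and table names are placeholders that would be resolved from the YAML configuration and a secrets store.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-ingestion").getOrCreate()

# Read one of the tables listed in the YAML configuration from the source database.
source_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/sales")  # placeholder
    .option("dbtable", "public.orders")
    .option("user", "ingest_user")      # in practice, resolved from a secrets store
    .option("password", "REDACTED")
    .option("fetchsize", "10000")
    .load()
)

# Append the extracted data into the destination Iceberg table.
source_df.writeTo("glue_catalog.raw.orders").append()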

One crucial consideration is keeping the ingestion and migration phases synchronized, so that there are no gaps in the data and no duplicate processing. To achieve this, a solution was implemented that captures and stores, in a DynamoDB table, the timestamp of the last historical record migrated for each table. The frameworks read this value on their first execution. For use cases that run Spark Structured Streaming in micro-batch mode, the value stored in DynamoDB is read and assigned to the Kafka startingTimestamp parameter. All subsequent executions give precedence to the metadata stored in the checkpoint folder, which keeps the ingestion aligned with the data migration.
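Below is a hedged sketch of that mechanism. The DynamoDB table name, key schema, and attribute names are assumptions; startingTimestamp is the Kafka source option mentioned above and is only honored by Spark when no checkpoint metadata exists yet.

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingestion-sync").getOrCreate()

# Read the timestamp of the last migrated record for this table (names are placeholders).
dynamodb = boto3.resource("dynamodb")
watermarks = dynamodb.Table("migration-watermarks")
item = watermarks.get_item(Key={"table_name": "raw.orders"}).get("Item", {})
last_migrated_ms = int(item.get("last_migrated_timestamp", 0))

reader = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder
    .option("subscribe", "orders")
)

# Only relevant for the first execution: once a checkpoint folder exists,
# its metadata takes precedence over this option.
if last_migrated_ms:
    reader = reader.option("startingTimestamp", str(last_migrated_ms))

stream_df = reader.load()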

Data processing

To handle updates and deletions efficiently over data stored as objects, Iceberg was adopted throughout the project because of its reliable ACID capabilities on data lake tables. Although every phase uses Iceberg as the table format, the processing phase makes the most extensive use of its capabilities for incremental data processing, building a processing layer with UPSERTs based on Iceberg's MERGE INTO command.
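A simplified sketch of that UPSERT step is shown below, issued from PySpark as Spark SQL. The table names, join key, and the op column used to flag deletions are illustrative assumptions, not Jumia's actual schema.

from pyspark.sql import SparkSession

# Iceberg catalog configuration is assumed to be supplied through the
# EMR Serverless job parameters, as in the earlier sketches.
spark = SparkSession.builder.appName("iceberg-upsert").getOrCreate()

spark.sql("""
    MERGE INTO glue_catalog.processed.customers AS target
    USING glue_catalog.raw.customers_updates AS updates
    ON target.customer_id = updates.customer_id
    WHEN MATCHED AND updates.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")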

The following diagram illustrates the framework.

Processing

The architecture is similar to the one used for ingestion, with the difference that the data flows through Amazon S3. This approach speeds up the delivery of data while maintaining production-grade quality.

Amazon EMR Serverless lets you run scalable, on-demand big data analytics workloads without provisioning or managing servers, and by default it runs with the spark.dynamicAllocation.enabled parameter set to true. This dynamically adjusts the number of executors registered with the application according to the workload. It is very helpful for handling diverse workloads on Iceberg tables, but it also introduces concerns that must be managed carefully: while populating an Iceberg table, EMR Serverless may use many executors to speed up the task, which can exceed Amazon S3's limit on requests per second per prefix and cause throttling. For this reason, it is crucial to apply good data partitioning practices.

Another crucial aspect to consider in these scenarios is the object storage file layout. By default, Iceberg uses the Hive storage layout, but it can be switched to use ObjectStoreLocationProvider. When this property is set, a deterministic hash is generated for each file and appended immediately after write.data.path. This can substantially reduce throttling driven by hot object prefixes while maximizing throughput for Amazon S3-related I/O operations, because the written data is distributed evenly across multiple prefixes.
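As an illustration, the following hedged sketch creates an Iceberg table with the object storage location provider enabled so that data file paths are hashed across many S3 prefixes. The table name, schema, and bucket are placeholders, and the property names assume a recent Iceberg release.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-object-storage-layout").getOrCreate()

# 'write.object-storage.enabled' switches the table to ObjectStoreLocationProvider;
# the hash is appended after the location given in 'write.data.path'.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.processed.events (
        event_id BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES (
        'write.object-storage.enabled' = 'true',
        'write.data.path' = 's3://example-processed-data/events/'
    )
""")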

Data maintenance

When working with data lakes that use the Iceberg table format, regular maintenance tasks are essential to manage metadata files, prevent them from accumulating unnecessarily, and remove unused data files. The objective of this phase was to build another framework capable of performing these maintenance tasks on the tables in the data lake.

The following diagram illustrates the architecture.

Maintenance

Like its counterparts, the framework receives a YAML configuration file specifying the tables and the maintenance tasks to run, along with their parameters. Built on PySpark, it runs inside Amazon EMR Serverless and is orchestrated by the same framework as the others built as part of this solution.

The framework supports the following maintenance tasks (a simplified sketch using Iceberg's Spark procedures follows the list):

  • Expire snapshots – Snapshots can be used for rollback operations and time-travel queries. However, they accumulate over time and eventually degrade performance, so unneeded snapshots should be expired regularly to reclaim storage.
  • Remove old metadata files – Metadata files can accumulate over time, just like snapshots. Removing them regularly is especially important for streaming or micro-batch workloads, which generate metadata with every commit; this was a key consideration behind the overall solution.
  • Compact files – As the volume of data grows, so does the amount of metadata stored in manifest files, and small files degrade query performance. Because the streaming and micro-batch applications write small batches of records into the Iceberg tables, the resulting files are small, so a compaction strategy was essential to improve overall performance.
  • Delete old data – There was a requirement to efficiently purge data older than a specific age threshold, expiring the related snapshots and deleting the redundant metadata files.
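The sketch below shows how such tasks can be expressed with Iceberg's Spark maintenance procedures, issued from PySpark. The catalog, table, retention values, and target file size are placeholders that, in the framework, come from the per-use-case YAML file.

from pyspark.sql import SparkSession

# Iceberg catalog configuration is assumed to come from the EMR Serverless job parameters.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots while keeping a minimum history for time travel and rollback.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'processed.customers',
        older_than => TIMESTAMP '2024-11-01 00:00:00',
        retain_last => 10
    )
""")

# Compact the small files produced by streaming and micro-batch writes (~256 MB target).
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'processed.customers',
        options => map('target-file-size-bytes', '268435456')
    )
""")

# Remove files that are no longer referenced by any table metadata.
spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(
        table => 'processed.customers'
    )
""")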

The maintenance tasks are scheduled with different frequencies depending on the use case and the specific task. The scheduling information for these tasks is defined in the YAML file of each use case.

When this framework was implemented, there was no automatic maintenance solution available for Iceberg tables. At AWS re:Invent 2024, AWS announced new capabilities that automate the maintenance of Iceberg tables, streamlining file compaction, snapshot management, and the removal of unused files.

Conclusion

Implementing standardized frameworks and metadata-driven architectures gives a data platform streamlined process execution, efficient data migration and ingestion, and seamless orchestration, accelerating implementation and development while providing visibility and control across all phases. In addition, using serverless services such as Amazon EMR Serverless and DynamoDB brings the benefits of serverless architectures: automatic scaling, simplified management, broad integration options, improved resilience, and lower costs.

With this architecture, Jumia reduced the cost of its data lake by 50%. Moreover, this approach enables the data and DevOps teams to deploy complete infrastructure and data processing capabilities by defining metadata files together with Spark SQL files, which has reduced the turnaround time to production and lowered failure rates. Furthermore, AWS Glue makes it possible to collaboratively manage and govern datasets across the different storage layers, both within the AWS platform and beyond.

Helder Russa, Head of Data Engineering at Jumia Group.

Take the first step toward streamlining your data migration process with AWS.


About the Authors

Serves as a Senior Customer Delivery Architect at Amazon Web Services. He led the initiative with a strong commitment to using technology for the customer's benefit.

As a Data Architect at Amazon Web Services, she specializes in designing analytical solutions that translate intricate data processes into clear and practical findings, distilling complex data into actionable insights that inform decisions.

As the Head of Data Engineering at Jumia Group, I lead the development of data platforms that support informed decision-making, operational agility, and data-driven insights.

As a Principal Data Engineer at Jumia Group, I am responsible for designing and governing the data architecture, with a focus on the AWS platform and data lakehouse technologies, ensuring robust and adaptable data and analytics solutions.
