Monday, March 31, 2025

How PostNL processes billions of IoT events with Amazon Managed Service for Apache Flink

As the primary mail and parcel delivery company in the Netherlands, PostNL operates three main business units: postal services, parcel delivery, and logistics solutions for e-commerce and cross-border deliveries. With 5,800 retail locations, 11,000 mailboxes, and over 900 automated parcel lockers, the company plays a crucial role in the logistics value chain. Its goal is to be the go-to delivery organization for parcels and mail, making shipping and receiving as easy as possible for both shippers and recipients. With a workforce of approximately 34,000 people, PostNL is a vital part of society's infrastructure. On an average weekday, the company delivers around 1.1 million parcels and 6.9 million letters across Belgium, the Netherlands, and Luxembourg.

In this post, we describe PostNL's legacy stream processing solution, the challenges it faced, and why the company decided to revamp its Internet of Things (IoT) data stream processing architecture. We walk through the steps of the migration, along with the key takeaways and lessons learned.

As a result of this migration, PostNL has established a robust, scalable, and flexible stream processing solution for its IoT platform that can keep up with growing data volumes. Apache Flink is a natural fit for IoT applications: its scalable and fault-tolerant architecture can efficiently process the vast amounts of data generated by connected devices, it scales horizontally as data volumes grow, and its event time semantics make it possible to handle events based on the time they were generated, even when they arrive from intermittently disconnected devices.

PostNL is excited about the potential of Apache Flink, and plans to use Managed Service for Apache Flink for other streaming use cases and to move more business logic upstream into the platform.

What are Apache Flink and Amazon Managed Service for Apache Flink?

Apache Flink is a widely used open source framework for distributed stream and batch processing. It enables users to process large volumes of data in real time, with strong support for stateful computations. The framework provides a unified API for both batch and streaming jobs, so developers can efficiently handle data flows of varying scale. Amazon Managed Service for Apache Flink provides a serverless, fully managed infrastructure for running Apache Flink workloads. Developers can build highly available, fault-tolerant, and scalable Apache Flink applications without needing to become experts in setting up, configuring, or maintaining Apache Flink clusters on Amazon Web Services (AWS).

Real-time IoT data processing at PostNL

PostNL's IoT platform currently tracks over 380,000 assets with Bluetooth Low Energy (BLE) technology in near real time. The IoT platform was designed to provide real-time insights based on telemetry data from the sensors on these devices, such as GPS coordinates and accelerometer readings, enabling use cases like geofencing and bottom state monitoring. These use cases help various internal consumers make logistics processes more efficient, sustainable, and easier to plan.


Because the platform tracks a huge number of assets, each producing its own sensor readings, the IoT platform and the downstream systems receive an enormous influx of raw IoT events. Processing this load repeatedly throughout the IoT ecosystem, including all downstream processes, would be neither cost-effective nor easy to maintain. The IoT platform therefore reduces event cardinality by using stream processing to aggregate data over fixed time windows. These aggregations must be based on the timestamp of the moment the event is emitted by the device, which is the event time. Aggregating based on event time is complex, because messages can be delayed and arrive out of order, a common situation for IoT devices that are prone to brief disconnections.

The following diagram shows the overall flow from the edge to the downstream systems.

PostNL IoT workflow

The workflow comprises the following components:

  1. The edge consists of the IoT BLE devices that act as sources of telemetry data, and the gateway devices that connect these IoT devices to the IoT platform.
  2. The AWS ingestion layer is a collection of services, including AWS IoT Core, that receive IoT detections over MQTT or HTTPS and deliver them to Kinesis Data Streams.
  3. The aggregation application processes the stream of IoT detections, aggregating them over fixed time windows, and forwards the aggregated events to the downstream data pipelines.
  4. Event producers are a collection of stateful services that generate IoT events such as geofencing, availability, bottom state, and in-transit events.
  5. Delivery services and stores, including Kinesis Data Streams, deliver the curated events to consumers.
  6. Consumers, which are internal teams, receive the IoT events and apply business logic based on them to inform their decisions.

The pivotal component of this architecture is the aggregation application. Its original implementation was based on a legacy stream processing technology. For reasons that will become apparent, PostNL decided to redesign this critical component. The rest of this post focuses on the migration of the legacy stream processing application to Managed Service for Apache Flink.

Why PostNL chose Managed Service for Apache Flink

As the number of connected devices keeps growing, so does the need for a robust and elastic infrastructure that can process and aggregate vast amounts of IoT data. After a thorough assessment, PostNL decided to move to Managed Service for Apache Flink, driven by several strategic considerations:

  • Apache Flink's strong real-time data processing capabilities let PostNL efficiently aggregate raw IoT data from diverse sources. The ability to enrich and combine data streams opens the door to deeper insights and better-informed decisions.
  • The managed service scales the application seamlessly as demand increases. As the number of IoT devices keeps growing, PostNL can handle increasing volumes of data without over-provisioning, continuously matching data processing capacity to the needs of the business.
  • With a managed service, the IoT platform team is free to focus on business-specific logic and on building new use cases. The steep learning curve and operational complexity of running Apache Flink at scale would have drained valuable resources and attention from a relatively small team, slowing down adoption.
  • Managed Service for Apache Flink operates on a pay-as-you-go basis, which helps align costs with actual usage. This flexibility is particularly valuable for adjusting costs as data processing demands shift.

The challenge of handling late events

Many stream processing use cases require aggregating events based on the time they were generated, which is known as event time. Any implementation of such logic must deal with late events: events that arrive at the processing system well after other events that were generated around the same time.

Late events are common in IoT because of factors inherent to the environment, such as network latency, device malfunctions, temporary disconnections, and downtime. IoT devices often communicate over connections that introduce latency, and they may experience periodic connectivity disruptions, buffering data and transmitting it in batches upon reconnection. As a result, events can arrive out of order, and some events may be processed several minutes after others that were generated at the same time.

Imagine you want to aggregate the events generated by devices within a given 10-second window. If events can arrive several minutes late, how can you be sure you have received every event that was generated within that 10-second window?

A simple implementation could wait a fixed number of minutes to allow late events to arrive. The drawback of this approach is that you can't calculate the result of the aggregation until several minutes later, which significantly increases the output latency. Alternatively, you could close the window after only a few seconds, but then any event arriving later is dropped.

Neither increasing latency nor dropping events, and thereby compromising the completeness of the data, is a palatable option. The optimal solution is a trade-off, striking the right balance between latency and completeness.

Apache Flink provides event time semantics out of the box. Unlike other stream processing frameworks, Flink offers multiple options for dealing with late events: depending on your requirements, you can drop them, send them to a side output for separate handling, or allow them to update previously emitted results.
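
As an illustration (not code from PostNL's application), the following minimal sketch shows how event time and bounded out-of-orderness watermarks can be attached to a stream of hypothetical (deviceId, deviceTimestamp) detections in the DataStream API:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical detections: (deviceId, timestamp assigned by the device, in epoch millis).
        DataStream<Tuple2<String, Long>> detections = env
                .fromElements(
                        Tuple2.of("device-1", 1_000L),
                        Tuple2.of("device-1", 9_000L),
                        Tuple2.of("device-1", 4_000L)) // arrives out of order
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // Event time semantics: use the device-assigned timestamp and tolerate events
        // arriving up to 2 minutes out of order before the watermark moves past them.
        DataStream<Tuple2<String, Long>> withEventTime = detections.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(2))
                        .withTimestampAssigner((event, recordTimestamp) -> event.f1)
                        // Mark idle source partitions so a silent shard doesn't hold back the watermark.
                        .withIdleness(Duration.ofMinutes(1)));

        withEventTime.print();
        env.execute("event-time-watermarks");
    }
}
```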

A robust stream processing API

Apache Flink provides a rich set of operators and libraries for common data processing tasks, including windowing, joins, filters, and transformations. It also offers more than 40 connectors for different data sources and sinks, including streaming services such as Amazon Kinesis Data Streams and Kinesis Data Firehose, as well as databases, file systems, and object stores like Amazon S3.

A particularly powerful feature of Apache Flink is that it offers APIs at different levels of abstraction, which let PostNL process large volumes of data quickly and efficiently. You can start at a higher level of abstraction, where the APIs present streaming data in more familiar forms such as tables and SQL queries, simplifying development. When your logic becomes more sophisticated, you can switch to a lower level of abstraction, where streams are represented natively, closer to the processing happening inside Apache Flink. And if you need the finest-grained control over how each single event is handled, that option exists as well.

An important insight is that choosing a particular level of abstraction is not an irreversible architectural decision. Within the same application, you can mix different APIs, depending on the level of control you need at each step.
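
For example, the following illustrative sketch (with made-up data and class names) mixes the Table API and the DataStream API in a single job: a SQL query handles the simple filtering, and a KeyedProcessFunction takes over where per-event control is needed.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

public class MixedAbstractionLevelsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Made-up input: (deviceId, measurement) pairs.
        DataStream<Tuple2<String, Double>> raw = env
                .fromElements(Tuple2.of("device-1", 10.0),
                              Tuple2.of("device-2", 12.5),
                              Tuple2.of("device-1", 13.1))
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE));

        // Higher level of abstraction: relational filtering with SQL on the Table API.
        Table table = tEnv.fromDataStream(raw);
        Table filtered = tEnv.sqlQuery("SELECT f0, f1 FROM " + table + " WHERE f1 > 11");

        // Lower level of abstraction: back to the DataStream API, with a
        // KeyedProcessFunction for fine-grained, per-event control.
        DataStream<Row> rows = tEnv.toDataStream(filtered);
        rows.keyBy(new KeySelector<Row, String>() {
                @Override
                public String getKey(Row row) {
                    return row.getFieldAs("f0");
                }
            })
            .process(new KeyedProcessFunction<String, Row, String>() {
                @Override
                public void processElement(Row row, Context ctx, Collector<String> out) {
                    out.collect(ctx.getCurrentKey() + " -> " + row.<Double>getFieldAs("f1"));
                }
            })
            .print();

        env.execute("mixing-flink-api-levels");
    }
}
```

This flexibility mattered for PostNL because, as described later, only the lowest level of abstraction could replicate the legacy behavior.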

Scaling horizontally

To keep up with a huge volume of events and with the growth of the business, PostNL needed a solution that could scale with it. Apache Flink is designed to scale horizontally, distributing processing and application state across multiple processing nodes, so it can scale out as the workload grows.

For this use case, PostNL needed to aggregate similar events over time to reduce their cardinality and make the data flow manageable for the downstream systems. Aggregations go beyond simple transformations that look at a single event at a time: they require keeping state across multiple events over a window of time. This is stateful stream processing, which is exactly the kind of workload Apache Flink was designed for.

Advanced event time semantics

Apache Flink puts event time processing front and center, enabling accurate and consistent handling of data with respect to the time events actually occurred. Its built-in support for event time semantics makes it possible to handle out-of-order and late events, ensuring robust processing in complex event-driven applications. This capability was fundamental for PostNL: as discussed, IoT-generated events can arrive late and out of order, yet the aggregation logic must be based on the moment the measurement was taken by the device, the event time, rather than when it happens to be processed.

Resiliency and guarantees

PostNL requires that no data emitted by a device is lost, even in case of failures or restarts of the processing application. Apache Flink offers strong fault tolerance through a distributed, snapshot-based checkpointing mechanism. In case of failure, Flink can recover the state of the computation and provide exactly-once semantics for its results: the measurement from a device is never missed, nor counted twice, even when the application fails.
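
For reference, this is roughly how checkpointing with exactly-once mode is enabled when running Apache Flink yourself; on Managed Service for Apache Flink, checkpointing is enabled and managed by the service, so the snippet below (with arbitrary example intervals) only illustrates the mechanism:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the application state every 60 seconds with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Leave some breathing room between checkpoints and bound their duration.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        env.getCheckpointConfig().setCheckpointTimeout(120_000);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing-example");
    }
}
```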

Replicating the legacy behavior with the Apache Flink APIs

A fundamental requirement of the migration was to replicate exactly the behavior of the legacy aggregation application, because the downstream systems relying on that behavior could not be changed. This introduced several additional challenges, in particular around windowing semantics and the handling of late events.

As mentioned, in IoT scenarios events can arrive late, sometimes by several minutes. Apache Flink offers two main concepts for handling out-of-order events in event time processing: watermarks and allowed lateness.

Apache Flink provides a range of APIs at different levels of abstraction. After an initial analysis, the higher-level abstractions were ruled out: although they offer comprehensive windowing and time semantics, they didn't provide the fine-grained control PostNL needed to replicate the exact behavior of the legacy system.

Moving down a level of abstraction, the DataStream API also offers windowed aggregations and lets you tailor window behavior, for example with custom triggers, allowed lateness, and side outputs, to handle late events.
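
The following sketch (again with made-up data and parameters, not PostNL's configuration) shows what these DataStream-level options look like: a 10-second event time window that keeps updating its result for moderately late events and routes anything later to a side output instead of silently dropping it.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class WindowWithLateEventsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Side output for events that arrive after the allowed lateness has expired.
        OutputTag<Tuple3<String, Long, Integer>> lateTag =
                new OutputTag<Tuple3<String, Long, Integer>>("late-detections") {};

        // Made-up detections: (deviceId, device timestamp in millis, count of 1).
        SingleOutputStreamOperator<Tuple3<String, Long, Integer>> counts = env
                .fromElements(
                        Tuple3.of("device-1", 1_000L, 1),
                        Tuple3.of("device-1", 4_000L, 1),
                        Tuple3.of("device-1", 12_000L, 1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG, Types.INT))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((event, ts) -> event.f1))
                .keyBy(event -> event.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(10))) // 10-second event time windows
                .allowedLateness(Time.minutes(2))   // re-emit updated results for moderately late events
                .sideOutputLateData(lateTag)        // anything later goes to the side output
                .sum(2);                            // count detections per device and window

        counts.print();
        counts.getSideOutput(lateTag).print();      // late events are observable, not silently dropped

        env.execute("windowing-with-late-events");
    }
}
```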

Unfortunately, the legacy application had been designed to handle late events in its own peculiar way, and this behavior couldn't be easily replicated using the higher-level abstractions Apache Flink provides.

Luckily, Apache Flink offers an even lower level of abstraction: the ProcessFunction API. With this API, you have the finest control over application state and time, allowing you to implement almost any custom time-dependent logic.

PostNL decided to go down this path. A ProcessFunction applied to a keyed stream enables arbitrary stateful processing, with the stream logically partitioned by device. Events from each IoT device are aggregated based on the timestamp provided by the device, the event time, while the results are emitted based on windows calculated on the current processing time.

This fine-grained control ultimately allowed PostNL to reproduce exactly the behavior its downstream systems required.
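
To give a flavor of what this looks like, the following is a heavily simplified sketch of the pattern, not PostNL's actual business logic: a hypothetical KeyedProcessFunction (the keyed variant of ProcessFunction) buckets per-device measurements by the event time reported by the device, while processing-time timers flush the results, keeping output latency bounded regardless of how late events arrive.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (deviceId, device timestamp in millis, measurement). Output: one line per
// device and event-time bucket, emitted on a processing-time schedule.
public class DeviceAggregationFunction
        extends KeyedProcessFunction<String, Tuple3<String, Long, Double>, String> {

    private static final long BUCKET_MS = 10_000;      // 10-second event time buckets
    private static final long EMIT_EVERY_MS = 5_000;   // flush every 5 seconds of processing time

    private transient MapState<Long, Double> sumPerBucket; // bucket start (event time) -> running sum
    private transient ValueState<Long> pendingTimer;        // processing-time timer already registered?

    @Override
    public void open(Configuration parameters) {
        sumPerBucket = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("sumPerBucket", Types.LONG, Types.DOUBLE));
        pendingTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pendingTimer", Types.LONG));
    }

    @Override
    public void processElement(Tuple3<String, Long, Double> event, Context ctx, Collector<String> out)
            throws Exception {
        // Bucket by the device-assigned timestamp (event time), so late events
        // still land in the bucket they belong to.
        long bucketStart = (event.f1 / BUCKET_MS) * BUCKET_MS;
        Double current = sumPerBucket.get(bucketStart);
        sumPerBucket.put(bucketStart, (current == null ? 0.0 : current) + event.f2);

        // Emit on processing time: schedule a flush if none is pending for this device.
        if (pendingTimer.value() == null) {
            long emitAt = ctx.timerService().currentProcessingTime() + EMIT_EVERY_MS;
            ctx.timerService().registerProcessingTimeTimer(emitAt);
            pendingTimer.update(emitAt);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Flush and clear every bucket accumulated so far for this device.
        for (Long bucketStart : sumPerBucket.keys()) {
            out.collect(ctx.getCurrentKey() + " @ " + bucketStart + " -> " + sumPerBucket.get(bucketStart));
        }
        sumPerBucket.clear();
        pendingTimer.clear();
    }
}
```

In a real job, such a function would be attached to the keyed stream of detections, for example with detections.keyBy(e -> e.f0).process(new DeviceAggregationFunction()), and the bucketing and flushing rules would follow the legacy application's behavior rather than the arbitrary constants used here.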

The journey to production readiness

After deciding to migrate to Managed Service for Apache Flink, the journey to production began with defining requirements and goals, followed by careful planning and execution.

Identifying requirements

The first step of the migration was to thoroughly understand the existing system's architecture, performance characteristics, and behavior. The goal was a smooth transition to Managed Service for Apache Flink with minimal disruption to existing workflows.

Understanding Apache Flink

PostNL needed to build hands-on experience with Apache Flink and Managed Service for Apache Flink, exploring its real-time stream processing capabilities, including windowing and aggregation, its scalability, and its options for handling late events.

Different options were evaluated, using the fundamental building blocks Apache Flink provides for time handling and late events. One of the primary requirements was to replicate exactly the behavior of the legacy application. The ability to move to a lower level of abstraction proved decisive: with the fine-grained control offered by the ProcessFunction API, PostNL was able to handle late events precisely the way the legacy application did.

Designing and implementing ProcessFunction

The business logic was implemented with a ProcessFunction to replicate the particular way the legacy application handles late events while minimizing the latency of initial results. PostNL decided to use Java, because it is the primary programming language of Apache Flink. With Apache Flink, you can develop and test your application locally, in your preferred integrated development environment (IDE) and with all the available debugging tools, before deploying it to Managed Service for Apache Flink. The application was built on Java 11 with the Maven compiler. For more details about the development environment, refer to the Managed Service for Apache Flink documentation.
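
As a small illustration of this local development workflow (assuming the flink-runtime-web dependency is on the classpath; class and job names here are made up), a job can be started entirely inside the IDE's JVM, with the Flink web UI available for inspection:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalDevelopmentRun {
    public static void main(String[] args) throws Exception {
        // Runs the job in the IDE's JVM: breakpoints work, and the Flink web UI
        // is served on http://localhost:8081 while the job is running.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        env.fromElements("device-1", "device-2", "device-1")
           .map(id -> "seen " + id)
           .print();

        env.execute("local-development-run");
    }
}
```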

Testing and validation

The following diagram illustrates the architecture used to validate the new application.

Testing architecture

To verify the behavior of the ProcessFunction, including the handling of late events, integration tests were designed to run the legacy application and the new Flink application in parallel (steps 3 and 4). Running them in parallel let PostNL compare the outputs produced by the two applications under identical conditions. Multiple integration test cases send events to the source stream (2), wait until their aggregation windows are complete, and then fetch the aggregated results from the destination stream to compare them (8). The integration tests are automatically triggered by the continuous integration and continuous delivery (CI/CD) pipeline after the infrastructure deployment finishes. During the integration tests, the primary focus was verifying that data consistency and processing accuracy were identical between the legacy application and the Flink application. Comparing the output streams, aggregated data, and processing latencies confirmed that the migration introduced no unexpected discrepancies. Open source automation frameworks were used to write and run the tests.

In addition to the integration tests, there is another validation layer: end-to-end tests. Like the integration tests, the end-to-end tests are automatically triggered by the CI/CD pipeline after the deployment of the whole platform infrastructure. Multiple end-to-end test cases send data to AWS IoT Core and then validate the aggregated results by fetching them from an S3 bucket and comparing them against the expected output.

Deployment

PostNL decided to deploy the new Flink application in shadow mode first. The new application ran alongside the legacy one, consuming the same inputs and sending its output to an Amazon S3 bucket. This allowed the team to compare the results of both applications on real production traffic and to verify the reliability and performance of the new approach.

Performance optimization

During the migration, the PostNL IoT platform team optimized the Flink application for performance, considering factors such as data volume, processing speed, and the handling of late events. A particularly important check was that the application state did not keep growing without bounds. This is a potential pitfall of using a ProcessFunction with its fine-grained control: if state is not cleaned up properly, it can grow unboundedly. Because streaming applications run continuously, ever-growing state degrades performance and eventually exhausts memory or local disk space.
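
One common safeguard, shown here as an illustrative sketch rather than PostNL's actual configuration, is to clear state explicitly when it is no longer needed (as in the onTimer method of the earlier sketch) or to attach a time-to-live to the state descriptor so that Flink expires state for devices that stop reporting:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;

public class StateTtlExample {

    // A descriptor like this would typically be created in a ProcessFunction's open() method.
    static ValueStateDescriptor<Double> boundedStateDescriptor() {
        ValueStateDescriptor<Double> lastMeasurement =
                new ValueStateDescriptor<>("lastMeasurement", Types.DOUBLE);

        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.hours(24)) // expire state not updated for 24 hours
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        lastMeasurement.enableTimeToLive(ttl);
        return lastMeasurement;
    }
}
```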

Through this testing, PostNL found a combination of application parallelism and compute, memory, and storage resources that processes the normal daily workload without lag and absorbs occasional traffic spikes without overprovisioning, striking a balance between performance and cost-effectiveness.

The final switch

After a testing period running in shadow mode, the team was confident that the new application behaved as expected and produced correct, stable results. The PostNL IoT platform then switched the new application into production and decommissioned the legacy application.

Key takeaways

Throughout this experience with Managed Service for Apache Flink, several key takeaways emerged, which become especially relevant as the solution scales to multiple use cases.

  • A deep understanding of event time semantics is crucial in Apache Flink for correctly implementing time-dependent data processing. This understanding ensures that events are processed based on the time they actually occurred.
  • Apache Flink's APIs enable the development of sophisticated, stateful streaming applications that go beyond basic windowing and aggregation. A thorough understanding of the APIs' advanced capabilities is essential to address complex data processing requirements.
  • The power of Apache Flink's lower-level APIs comes with responsibility. Developers should aim for maintainable, sustainable applications, which requires careful design, sensible resource planning, and adherence to software engineering and architecture best practices.
  • Combining event time and processing time to aggregate data brings its own challenges when the higher-level functionality Apache Flink provides out of the box cannot be used. The lowest-level APIs let you implement custom time-dependent logic, but this demands careful design to achieve accurate and timely results, together with thorough testing to prove correctness.

Conclusion

On its journey to adopting Apache Flink, PostNL discovered that Flink's flexible APIs make it possible to implement sophisticated business logic with relative ease. The team learned Apache Flink to solve this first set of challenges, and is now looking to extend its use to more stream processing scenarios.

With Managed Service for Apache Flink, the team can focus on delivering business value and implementing the essential business logic, without the complexity of setting up and managing a scalable Apache Flink cluster.

To learn more about Managed Service for Apache Flink and how to choose the right solution and API for your application, refer to the service documentation. For hands-on guidance on developing, deploying, and operating Apache Flink applications on AWS, see the related AWS resources for Amazon Managed Service for Apache Flink.


About the authors

Çağrı Çakır is the lead software engineer for the PostNL IoT platform, responsible for the architecture that processes billions of events each day. As an AWS Certified Solutions Architect, he specializes in designing and implementing large-scale, real-time event-driven architectures and scalable stream processing solutions. He is passionate about using real-time technologies to optimize operational efficiency and build scalable solutions.

Ozge Kavalci is a Senior Solution Engineer for the PostNL IoT platform who loves designing solutions that integrate with the IoT landscape. As an AWS Certified Solutions Architect, she specializes in designing and implementing highly scalable serverless architectures and real-time stream processing solutions that can handle unpredictable workloads. She is passionate about unlocking the full potential of real-time IoT data.

Amit Singh is a Senior Solutions Architect at Amazon Web Services (AWS) who works with large enterprise customers, engaging in deep architectural discussions to ensure solutions are designed for successful deployment in the cloud. He also builds relationships with senior technical individuals to help them become cloud advocates within their organizations. Outside of work, he enjoys spending time with his family and learning more about everything cloud.

Lorenzo Nicora is a Senior Streaming Solutions Architect at Amazon Web Services (AWS), supporting customers across Europe, the Middle East, and Africa (EMEA). He has been building cloud-based, data-intensive systems for several years, working in consulting and for fintech product companies in the finance sector. He has used open source technologies extensively and contributed to several projects, including Apache Flink.
