Friday, April 4, 2025

Can Amazon Web Services’ (AWS) block storage technology continue to evolve and stay competitive? The answer lies in its history. In 2008, AWS launched Elastic Block Store (EBS), offering persistent storage for EC2 instances. EBS was a simple yet effective solution that let users attach volumes to their instances, providing the data persistence that cloud computing critically needed. Over the years, AWS enhanced EBS with snapshots, new volume types, and larger volume sizes (initially capped at 1 TiB). The addition of Elastic File System (EFS) in 2016 provided shared file-level storage for EC2 instances, expanding the storage options. The Nitro system, introduced in 2017, brought enhanced I/O performance, including local NVMe-based block devices, significantly reducing latency for workloads that need fast access to data. Elastic Volumes, also launched in 2017, simplified storage management by letting users modify a volume’s size, type, and performance without detaching it. AWS’s continued investment in block storage is crucial to maintaining its competitive edge. By understanding this history, we can better appreciate the progress made and anticipate the innovations still to come.


What’s behind the curtain of block storage at AWS? From humble beginnings to a thriving ecosystem, let’s take a step back in time and explore how we got here.

In the early days of AWS, block storage was largely an afterthought. As customers began to ask for persistent disk storage, we had to scramble to deliver it. In 2008, we launched Elastic Block Store (EBS) as a way to provide reliable, low-latency storage for EC2 instances. It was a game-changer.

As the years went by, EBS became an essential part of the AWS ecosystem. We added features like snapshotting, backup and restore, and even SSD-backed volumes. But we knew we had to do more.

In 2015, we made Amazon Elastic Block Store (EBS) volumes dramatically larger and faster, with Provisioned IOPS volumes growing to 16 TiB and 20,000 IOPS. This allowed customers to run far more demanding workloads on a single volume. It was a significant step forward.

The next major milestones came with newer volume generations such as io2 and io2 Block Express, which brought significantly higher IOPS, higher durability, and lower latency, making them a fit for demanding workloads like databases and analytics.

Throughout this journey, we’ve worked tirelessly to ensure that block storage remains a key differentiator for AWS. Today, it’s an integral part of our cloud offerings, enabling customers to build robust, scalable, and highly available applications.

What’s next? We’re not resting on our laurels!

Throughout my career, I’ve built system software, primarily in networking and security, before joining Amazon Web Services (AWS). Thirteen years ago I moved into cloud storage at AWS, a journey that transformed my career and expertise. Although AWS operates at a scale far beyond anything I had worked on before, many of the same techniques remained relevant: breaking problems down into first principles, and iterating through successive refinement.

If you look at AWS today, you’ll find a mature set of core building blocks that evolved over time, but it wasn’t always this way. About two years after EC2’s beta launch, Amazon introduced Elastic Block Store (EBS), offering cloud-based block-level storage that integrates seamlessly with EC2 instances and provides a straightforward way to attach persistent storage to virtual machines. We had a small team of storage experts, supplemented by distributed-systems knowledge and a solid foundation in computer architecture and networks. In hindsight, had we known how much uncertainty lay ahead, we might never have had the enthusiasm to start the endeavour in the first place.

Since joining EBS, I’ve had the privilege of being part of the team that transformed it from a product built on shared hard disk drives (HDDs) into one capable of delivering hundreds of thousands of IOPS (input/output operations per second) to a single EC2 instance. What’s truly remarkable is that EBS today delivers more IOPS to a single instance than it could provide to an entire Availability Zone in its early, HDD-based years. Across its distributed SSD fleet, EBS now performs an astonishing 140 trillion operations daily. The transformation didn’t happen overnight, though, nor was it a uniform, across-the-board achievement. When I joined the team, my initial responsibility was the EBS client, the component that translates instance IO requests into efficient storage operations. Since then, I’ve worked on almost every part of EBS, and I’ve felt privileged to play a role in its development and growth.

While EBS is often categorized as a storage system, it differs from traditional storage solutions in important ways. Our primary focus was provisioning system disks for EC2 instances, replacing the physical disks that previously sat inside server racks in traditional datacenters. Many storage services prioritize durability above all else, deliberately sacrificing performance or availability to safeguard data integrity. EBS customers certainly care about durability, and we provide primitives to achieve it, such as io2 Block Express volumes and volume snapshots, but they also place significant emphasis on the performance and availability of their volumes. The tight integration between EBS and EC2 means that the performance and reliability of EBS volumes directly affect EC2 instances, and ultimately the applications built on top of them. At its core, the EBS story is one of cultivating operational excellence within an enormous distributed system, spanning everything from guest behavior at the top to custom storage hardware at the bottom. In this post I’d like to share some of the lessons learned along the way and some memorable takeaways that may be relevant to your own systems. Measuring and improving performance at this scale remains a nuanced discipline, requiring understanding across many layers.

Queueing theory, briefly

Before we delve into EBS itself, let’s establish a foundation by examining how a computer system interacts with its storage. While the fundamental principles of computer architecture are unchanged, the components and their interactions have evolved significantly over time; the relationship among CPU, bus, and storage remains a cornerstone of computing. The CPU places IO requests in a queue, and they are forwarded to the storage system over the bus. The storage system retrieves data from the CPU’s memory and writes it to a durable medium, or conversely reads data from the medium and transfers it into the CPU’s memory.

Architecture with direct attached disk

Think of it like a bank. You walk in to make a deposit, but first you must get through a line of customers before you reach a teller who can handle your transaction. In an ideal world, customers would arrive at the exact moment a teller is free, and there would be no queue. But the real world isn’t ideal. The real world is asynchronous. It’s entirely likely that several people walk into the bank at the same moment, perhaps because they took the same bus or came from the same training session. When multiple customers enter at once, a backlog forms: some must wait behind others whose transactions the teller has to process first.

Even when the queue drains and the average latency looks reasonable, the experience is vastly different for the first person in line versus the last, who faces a much longer wait. The bank can take several steps to improve the experience: it can add more tellers to process requests in parallel, streamline workflows so each transaction takes less time, create separate queues for latency-insensitive customers, or batch transactions together to manage queue depth. Each option has costs, such as hiring extra tellers for a peak that may never materialize, or acquiring extra real estate for separate lines. And even with abundant resources, queues are still needed to absorb bursts of demand.
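To make that asymmetry concrete, here is a toy calculation, with assumed numbers (a single teller, three-minute transactions, ten customers arriving at once), showing how the average hides the worst-case wait:

```python
# Toy model: one teller, ten customers all arriving at the same moment.
service_min = 3                                      # minutes per transaction (assumed)
customers = 10

waits = [i * service_min for i in range(customers)]  # 0, 3, 6, ..., 27
avg_wait = sum(waits) / len(waits)                   # 13.5 minutes
first_wait, last_wait = waits[0], waits[-1]          # 0 vs. 27 minutes

print(f"average wait {avg_wait} min; first {first_wait}, last {last_wait}")
```

The mean looks tolerable, but the last customer waits twice the average, which is exactly the tail-latency problem storage systems fight.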

Simple diagram of EC2 and EBS queueing from 2012
When an EC2 instance issues an IO to an EBS volume, the request doesn’t go straight to the storage media; it passes through a series of queues along the way, in the guest, in the hypervisor, on the network, and on the storage server. Each queue can absorb a burst of requests, but each one also adds latency, and a deep queue anywhere in the path shows up as a long wait for the customer. Understanding where those queues are, and how deep they get, directly shaped how we improved the system (the diagram above dates from 2012).

In networked storage systems, multiple queues exist throughout the stack: between the operating system kernel and the storage adapter, between the host storage adapter and the storage fabric, between the fabric and the target storage adapter, and between the target adapter and the storage media. In legacy networked storage systems, each of these components often came from a different vendor, each with its own approach to managing queues. You might use a dedicated lossless fabric such as Fibre Channel, or iSCSI or NFS over TCP/IP running through your operating system’s networking stack or a custom driver. Tuning a storage network was its own specialized skill, distinct from tuning an appliance or the storage media.

When EBS first deployed in 2008, the storage landscape consisted mainly of hard disk drives (HDDs), which dictated the latency profile of the service at the time. As an engineer, I’m in awe of the intricate engineering that converges to make a hard drive, yet drives are ultimately mechanical devices governed by the laws of physics, which impose natural limits on their performance. A stack of platters whirs by at high speed, with data encoded in narrow magnetic tracks. Relative to the size of a track (<100 nanometers), a comparatively enormous arm swings back and forth to find the right track to read or write your data. Because of these physical constraints, the IOPS performance of a traditional hard drive has plateaued for years at roughly 120-150 operations per second, with average IO latency of 6-8 milliseconds. One of the biggest challenges with HDDs is that tail latencies can stretch out by many milliseconds due to queueing and command reordering inside the drive.
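Those IOPS figures fall straight out of the mechanics. A back-of-envelope sketch, using assumed but typical numbers for a 7,200 RPM drive (real drives vary):

```python
# Rough service time for one random read on a 7,200 RPM hard drive.
seek_ms     = 4.0                          # average seek time (assumed)
rotation_ms = 0.5 * 60_000 / 7_200         # half a rotation on average, ~4.17 ms
transfer_ms = 0.1                          # reading a small block (assumed)

service_ms  = seek_ms + rotation_ms + transfer_ms   # ~8.3 ms per IO
iops        = 1_000 / service_ms                    # ~120 random IOPS

print(f"~{service_ms:.1f} ms per IO, ~{iops:.0f} IOPS")
```

Seek and rotation dominate, which is why sequential workloads (no seek, minimal rotation) run so much faster than random ones on spinning media.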

Since end-to-end EBS latency was dominated by HDDs and measured in tens of milliseconds, we weren’t too concerned about network latency. Our initial network infrastructure was more than capable of meeting our users’ latency and bandwidth requirements, and an incremental delay of tens of microseconds on the wire had a negligible impact on overall latency.

Compounding the latency problem, hard drive performance is highly variable depending on the other requests in the queue. Requests scattered across different parts of the platter require time-consuming seeks, while a run of requests to adjacent locations completes far faster. This made performance arbitrary and unpredictable. To be cost-effective early on, we placed volumes from many customers on shared drives, spreading each volume’s data across multiple disks. This reduced latency for the most common workloads, but it also exposed customers to each other’s behavior, producing inconsistent performance for many of them.

When unrelated workloads land on the same shared drives, they create the phenomenon known as “noisy neighbors,” where one customer’s activity degrades another’s performance. As AWS grew, we prioritized the customer experience, which meant striving for strong performance isolation to shield each customer’s workload from noisy-neighbor interference.

At the scale of AWS, we consistently face challenges that stem from the sheer size and diversity of our systems and from the imperative of preserving a great customer experience. Surprisingly, small changes can have a large impact when you thoroughly understand the system, because scale amplifies every improvement. We made real gains with better scheduling algorithms and by distributing load more evenly across spindles. But the improvements were modest; our efforts to eliminate noisy neighbors lacked the decisive impact we had hoped for, and customer workloads remained frustratingly unpredictable. We needed something entirely new.

Set ambitious goals for the future, but be willing to get there incrementally.

I joined AWS in 2011, just as solid-state drives (SSDs) were becoming available in larger capacities, making them increasingly attractive to us. In an SSD there is no physical movement; random requests are nearly as fast as sequential ones, with multiple channels connecting the controller directly to the NAND chips. In our bank analogy, replacing an HDD with an SSD is like building a gigantic new branch and hiring an army of tellers who each process transactions far faster than before. Within a year we had begun the switch to SSDs, and we never looked back.

We started with a notable first step: deploying a new storage server architecture built on SSDs and introducing a new EBS volume type called Provisioned IOPS. Launching a new volume type is a significant effort, and it also limited the set of workloads we could initially serve. While this improved EBS, the overall results fell short of our expectations.

Upgrading to SSDs solved most of the problems stemming from the mechanical limitations of HDDs. What surprised us was how little the rest of the system improved: eliminating the drives’ mechanical variability did not, by itself, make noisy neighbors go away. We had to shift our focus from individual components to the broader system, reevaluating how our network and software interacted with the newly faster storage.

We launched in August 2012 with a maximum of 1,000 IOPS, a tenfold increase over standard EBS volumes, and average latency of roughly 2-3 milliseconds, a five-to-tenfold improvement with significantly better outlier control. Our customers were excited about an EBS volume they could begin building mission-critical applications on, but we remained unsatisfied: the performance engineering work on our system was only beginning. To do it well, we first had to measure our system.

If you can’t measure it, you probably can’t manage it.

At that point in EBS’s history (2012), we had only basic telemetry. To fix something, you first have to know what needs fixing, and then prioritize the fixes by weighing effort against reward. We started by building a way to instrument every IO operation at multiple points in each subsystem: our client initiator, network stack, storage durability engine, and operating system. Alongside tracking real customer workloads, we built a suite of canary tests that ran continuously, letting us measure the impact of changes, both positive and negative, under known workload conditions.
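The core of that instrumentation idea is simple: timestamp each IO as it crosses a subsystem boundary, then compute per-hop latencies. Here is a minimal sketch; the stage names (`client_submit`, `network_sent`, and so on) are hypothetical placeholders, not EBS’s actual instrumentation points:

```python
import time

class IOTrace:
    """Record a timestamp at each boundary an IO request crosses."""
    def __init__(self):
        self.marks = []

    def mark(self, stage):
        self.marks.append((stage, time.perf_counter_ns()))

    def segments(self):
        # Latency of each hop, labeled by the stage that ends it, in ms.
        return [(b[0], (b[1] - a[1]) / 1e6)
                for a, b in zip(self.marks, self.marks[1:])]

trace = IOTrace()
trace.mark("client_submit")
# ... network send, replication, and media write would happen here ...
trace.mark("network_sent")
trace.mark("durability_ack")
trace.mark("client_complete")
for stage, ms in trace.segments():
    print(f"{stage:>16}: {ms:.3f} ms")
```

Aggregating these per-hop numbers across millions of IOs is what reveals which queue in the path is actually responsible for the tail.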

With this telemetry in place, we identified the areas that deserved investment first. We knew we needed to reduce the number of queues in the system. The Xen hypervisor had served EC2 well, but as a general-purpose hypervisor its features exceeded what we needed, and we set out to shorten the IO path through it. We also invested in the network software and the durability engine, along with a pile of housekeeping: on-disk data formats, cache-line optimization, and fully adopting an asynchronous programming model.

At AWS, we’ve learned that system performance work spans many layers of the hardware and software stack, yet even skilled engineers tend to specialize in focused areas. While the “full stack engineer” is widely celebrated, it’s often far more valuable to assemble teams of specialists who can collaborate across the entire stack and its sub-specialties.

By this point, we already had separate teams for the storage server and for the client, which let us work on both simultaneously. We also joined forces with EC2 hypervisor engineers to form a cross-disciplinary network performance team. We laid out a plan covering both immediate, tactical fixes and longer-term structural changes.

Divide and conquer

Whiteboard showing how the team removed the control plane from the IO path with Physalia
Removing the control plane from the data path with Physalia.

As an undergraduate, there were courses I loved deeply and others I merely tolerated. Algorithms was taught at both the undergraduate and graduate level at my school, and while I initially found the coursework challenging, my interest grew, and the widely respected CLR textbook remains a reference I still occasionally consult. What wasn’t clear to me until I joined Amazon was that you can organize a company using the same principles you use to design a software system. Different algorithms carry different benefits and tradeoffs, and so do different team structures. Amazon employs a divide-and-conquer strategy, partitioning work into small, self-sufficient teams focused on specific, well-defined components with clear APIs.

That approach works well for retail websites and control plane systems, but it’s less obvious how to apply it to a high-performance data plane, where layers of abstraction can work against efficiency. Within the EBS storage server, we reorganized our monolithic development team into small teams focused on specific areas: data replication, durability, and snapshot hydration. Each team confronted its own unique challenges, breaking the quest for performance improvement into manageable chunks. With the rigorous testing these teams built up over time, they could iterate and commit their changes independently. To turn this into meaningful gains for customers, we started with a planning exercise, sketching the outcomes we wanted before iteratively refining and integrating the distinct components through successive improvements.

The beauty of incremental delivery is that you can make a change, assess its impact, and then make the next one, refining with each modification. If an approach doesn’t yield the desired results, it’s easy to back out and pivot to a new strategy. Our original blueprint from 2013 didn’t lead directly to the EBS of today, but it gave us a direction to start moving in. For instance, we had no idea back then that Amazon would go on to build custom hardware tailored specifically to the needs of EBS.

Constantly question your assumptions!

Challenging our assumptions yielded improvements at every layer of the system.

We started with software virtualization. Until late 2017, EC2 instances ran exclusively on the Xen hypervisor. In Xen, a ring-based queue facilitates interdomain communication between guest operating systems (domains) and the privileged domain zero (dom0), sharing the data needed for IO operations and device emulation. The EBS client ran inside dom0 as a kernel block device. To fulfill an IO request from an instance and get it off the EC2 host, the request had to traverse multiple queues: the instance’s block device queue, the Xen ring, the dom0 kernel block device queue, and finally the EBS client’s network queue. Each of these queues adds latency, and the effects compound, so it paid to examine each one separately.

We started by building “loopback” devices that isolated each queue, letting us measure the Xen ring, the dom0 block device stack, and the network separately. We were taken aback to discover that even slight delays in the dom0 device driver could cascade, with a burst of concurrent IOs slowing the whole system’s performance to a crawl. We had found another noisy neighbor. Somewhat embarrassingly, we had launched EC2 with Xen’s default settings for block device queue depths, settings established years earlier based on the limited storage available to the Cambridge lab that originally developed Xen. The result was a cap of just 64 IO requests in flight per host, not per device, a far cry from what our most demanding workloads required.
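Little’s law makes the cost of that cap easy to estimate: throughput equals concurrency divided by latency. A rough sketch, using assumed, era-appropriate numbers (the 16-volumes-per-host figure is illustrative, not from the original account):

```python
def max_iops(in_flight_cap, service_ms):
    """Little's law: throughput = concurrency / latency."""
    return in_flight_cap / (service_ms / 1000.0)

# 64 requests in flight shared by the whole host, at HDD-era service times.
host_ceiling = max_iops(64, 8.0)   # ~8,000 IOPS for *all* volumes on the host
# If 16 volumes (assumed) share the host evenly, each gets only a fraction.
per_volume = host_ceiling / 16     # ~500 IOPS apiece, before queueing delays grow

print(f"host ceiling {host_ceiling:.0f} IOPS, ~{per_volume:.0f} per volume")
```

The same arithmetic also shows why the cap mattered more as latencies fell: with SSD-class service times the host ceiling should have risen dramatically, but the fixed 64-request limit kept it pinned.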

We knew that tuning software virtualization would only take us so far. By 2013, we were deep into our first hardware offload project, focused on networking.

With this first card, we moved the processing of VPC, our virtualized software-defined network, out of the Xen dom0 kernel and into a dedicated hardware pipeline. Separating data plane packet processing from the hypervisor conserved CPU for customer instances and reduced interference with other work. And because Xen could present the new hardware to the instance as a virtual PCI device, it slotted seamlessly into our existing architecture.

That win in latency and efficiency prompted us to do the same for EBS storage. By offloading more processing to dedicated hardware, we eliminated several queues in the hypervisor, even though we didn’t achieve a complete bypass at first. Even without bypassing interrupt handling entirely, delegating work to the card meant the hypervisor spent far less time servicing each request. This second Nitro card also added hardware acceleration for EBS-encrypted volumes, so encryption no longer cost performance or throughput. And by doing encryption in hardware, the key material stayed isolated from the hypervisor, an additional layer of protection for customer data.

Diagram showing experiments in network tuning to improve throughput and reduce latency
Experimenting with network tuning to improve throughput and reduce latency.

Moving EBS to Nitro was a significant achievement, but it shifted the load, and the bottleneck, onto the network itself. The problem wasn’t obvious at first. We set out to tune our wire protocol with the best available knowledge of TCP optimization parameters and to identify the right congestion control algorithm, all while big shifts were underway around us: AWS was rolling out new datacenter network designs, and our Availability Zones were growing well beyond a single building. Counterintuitively, we discovered that adding a small amount of randomized delay to requests headed to storage servers reduced both average latency and outliers, because it smoothed out bursts on the network. As we scaled the system, many of these tunings stopped paying off, so continuous measurement was necessary to confirm we weren’t regressing.
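Why would adding delay reduce latency? A toy simulation (not AWS’s network; a single bottleneck link with a finite buffer and a fixed retransmit timeout, all numbers assumed) shows the mechanism: a synchronized burst overflows the buffer and pays repeated retransmit timeouts, while jittered arrivals never overflow it.

```python
import heapq, random

def simulate(sends, service_us=10, buf=32, rto_us=5000):
    """One bottleneck queue with a finite buffer; dropped requests retry after an RTO.
    sends = [(send_time_us, create_time_us), ...]; returns sorted end-to-end latencies."""
    pending = list(sends)
    heapq.heapify(pending)
    free_at = 0          # when the link next goes idle
    in_queue = []        # departure times of requests currently buffered
    latencies = []
    while pending:
        t, created = heapq.heappop(pending)
        in_queue = [d for d in in_queue if d > t]      # drop finished entries
        if len(in_queue) >= buf:                       # buffer overflow: drop + retry
            heapq.heappush(pending, (t + rto_us, created))
            continue
        finish = max(t, free_at) + service_us
        free_at = finish
        in_queue.append(finish)
        latencies.append(finish - created)
    return sorted(latencies)

random.seed(7)
burst  = simulate([(0, 0)] * 200)                              # synchronized burst
jitter = simulate([(random.uniform(0, 4000), 0) for _ in range(200)])
print(f"burst worst case: {burst[-1]:.0f} us, jittered worst case: {jitter[-1]:.0f} us")
```

The burst drains in waves of 32 separated by whole retransmit timeouts, so its tail is measured in tens of milliseconds; the jittered schedule finishes everything within the jitter window plus a little queueing.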

By 2014, we recognized we needed a networking protocol beyond TCP, and we began the research that became Scalable Reliable Datagram (SRD). We started with a set of fundamental requirements: a protocol that could recover from and reroute around failures, and one simple enough to offload into hardware. During the investigation we made two crucial observations. First, we didn’t have to design for the general internet; we could optimize specifically for datacenter network designs. Second, for storage, in-flight IO requests can be executed out of order. By dropping TCP’s insistence on strict packet ordering, we could ship distinct requests down disparate network paths and execute them as they arrived; any necessary ordering can be resolved in the client before results are delivered to the instance. We ended up with a protocol that proves useful well beyond storage: run beneath a guest’s TCP stack, SRD can improve that traffic’s throughput and latency as well. And by spraying packets across multiple paths and recovering quickly from loss, SRD keeps queues in intermediate network devices shallow, reducing wait times and congestion while achieving higher utilization.
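The value of relaxed ordering is easy to see in miniature. In this sketch (illustrative numbers, not measurements), one response takes a slow path; in-order delivery holds every later response behind it, while out-of-order delivery completes each request as its response arrives:

```python
def completion_times(arrivals, in_order):
    """arrivals[i] = when response i reaches the client (ms).
    In-order delivery (TCP-like) can't hand over response i until all earlier
    responses are in; out-of-order delivery (SRD-like) hands each one over on arrival."""
    done, high_water = [], 0
    for t in arrivals:
        high_water = max(high_water, t)
        done.append(high_water if in_order else t)
    return done

# Six responses; the third takes a slow network path (times in ms, assumed).
arr = [5, 6, 40, 7, 8, 9]
tcp_like = completion_times(arr, in_order=True)    # head-of-line blocking
srd_like = completion_times(arr, in_order=False)   # complete as they arrive
print(tcp_like)   # [5, 6, 40, 40, 40, 40]
print(srd_like)   # [5, 6, 40, 7, 8, 9]
```

One straggler inflates four completion times under in-order delivery but only its own under out-of-order delivery, which is exactly the tail-latency win the protocol was after.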

Performance work is a broad, multifaceted discipline: continuously challenge your assumptions, measure rigorously, and redirect attention toward the highest-leverage opportunities.

Constraints breed innovation

We weren’t satisfied that only a small number of volumes and customers saw better performance; we wanted every customer to benefit from SSDs. But getting there was a challenge: we operated a huge fleet, hundreds of storage servers collectively serving millions of IOPS for customers on non-provisioned volumes. Some of those very same volumes still exist today. Replacing all of that hardware at once would have been prohibitively expensive.

There was unoccupied space in the chassis, and one spot that wouldn’t disturb the cooling airflow: the narrow gap between the motherboard and the fans. A nice property of SSDs is that they are small and light, but we couldn’t leave them rattling around loose in the case. After exploring options with the help of our materials scientists, we found a heat-resistant, industrial-strength hook-and-loop fastening tape that would hold the SSDs in place for the life of the servers, and that also made them easy to service later.

An SSD in one of our servers
We manually installed a solid-state drive (SSD) in each server.

Armed with this solution and a lot of human effort, over a few months in 2013 EBS installed a single SSD into each of those hundreds of servers. We then made a small change to our software to stage writes on the SSD, completing the write quickly and flushing the data to the slower hard disk asynchronously. And we did this with live customer volumes, as if upgrading from a propeller plane to a jet mid-flight, with no impact on our customers’ experience. What made this possible was that we had designed the system from its inception with maintenance events in mind: we could rehome EBS volumes to new storage servers, then reconfigure or rebuild the vacated servers as needed.
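The write-staging idea can be sketched in a few lines. This is an illustration of the pattern, not the EBS implementation; plain dicts stand in for the fast and slow devices:

```python
import queue, threading

class StagedWriter:
    """Write staging (sketch): acknowledge a write once the fast device has it,
    and flush to the slow device in the background."""
    def __init__(self, fast, slow):
        self.fast, self.slow = fast, slow          # dicts standing in for SSD/HDD
        self.dirty = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, block, data):
        self.fast[block] = data                    # fast path: SSD write
        self.dirty.put(block)                      # schedule the async HDD flush
        return "ack"                               # caller sees only SSD latency

    def _flusher(self):
        while True:
            block = self.dirty.get()
            self.slow[block] = self.fast[block]    # slow path: HDD write
            self.dirty.task_done()

ssd, hdd = {}, {}
w = StagedWriter(ssd, hdd)
w.write("block-7", b"hello")                       # acknowledged immediately
w.dirty.join()                                     # demo only: wait for the flush
print(hdd["block-7"])
```

The key design point is that acknowledgment latency is decoupled from the durable medium’s latency; the real system, of course, also has to handle crash recovery of the staged writes, which this sketch omits.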

This ability to migrate customer volumes to new storage servers has proven useful many times over EBS's history, as we've discovered more efficient data structures for our on-disk format or brought in new hardware to replace the old. There are volumes still active today that date from the early days after EBS's launch in 2008. Over the years, those volumes have lived on many servers across multiple generations of hardware, moving from one to the next without interrupting the workloads running on them.
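Rehoming a live volume resembles a standard live-migration pattern: bulk-copy blocks while the volume keeps serving I/O, track blocks dirtied by concurrent writes, and re-copy the shrinking delta until a brief cutover. A minimal sketch under those assumptions — the names and structure are my own, not EBS internals:

```python
class MigratableVolume:
    """Sketch of transparent volume migration with dirty-block tracking."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block index -> data on the old server
        self.dirty = set()          # blocks written during a migration pass

    def write(self, idx, data):
        # The volume stays online: writes land normally, but we remember
        # which blocks changed so migration can re-copy them.
        self.blocks[idx] = data
        self.dirty.add(idx)

    def migrate_to(self, target):
        # Pass 1: bulk-copy everything while the volume keeps serving I/O.
        for idx, data in list(self.blocks.items()):
            target[idx] = data
        # Pass 2+: re-copy only the blocks dirtied during the previous pass,
        # repeating until the remaining delta is empty and we can cut over.
        while self.dirty:
            delta, self.dirty = self.dirty, set()
            for idx in delta:
                target[idx] = self.blocks[idx]
        return target               # cutover: the new server is authoritative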

Scaling myself

This was a road I hadn't traveled before. Prior to joining Amazon, my experience was rooted in early-stage startups and smaller companies. I had built managed services and distributed systems out of necessity, but I had never worked on anything close to the scale of EBS — even the EBS of 2011 — in either technology or team size. I was used to solving problems on my own, or occasionally with one or two equally motivated engineers.

I enjoy diving deep into problems and working at them until they're solved, but a pivotal moment came when a trusted colleague pointed out that I had become a bottleneck for the team. As an engineer whose understanding of the system had grown along with it, I had involved myself in every escalation and scrutinized every commit and proposed design change across EBS. To succeed, I had to learn how to scale myself — raw momentum and a bias for action weren't going to be enough.

This led to a lot of experimentation, though not in code. I had worked with many different groups of people before, but now I also needed to think about how to help them do their best work. One of my favorite techniques is peer debugging. I remember a session with a small group of engineers in one of our lounge areas, code and a few terminals projected on the wall. As we dug into the problem, one of the engineers exclaimed, "There's no way that's right!" — we had found a bug that had been nagging us for weeks. We had overlooked a code path that failed to keep a critical data structure up to date. It rarely caused problems, but it explained the occasional slow responses to queries, and fixing it noticeably improved the system's stability. We don't use this technique all the time, but its value lies in pooling our collective expertise when we face a particularly thorny problem.

Giving people the room to explore and take calculated risks often produces far better results than anyone expected. I've since spent a substantial part of my career finding ways to remove obstacles while keeping essential guardrails in place, encouraging engineers to venture outside their comfort zones and try new solutions. There's an element of psychology in engineering leadership that I had underestimated. And as I went down this unexpected path, I found real fulfillment in helping others grow.

Conclusion

Looking back at where we started, we knew we could do better, but we weren't sure how much better. We chose to approach the problem incrementally, as a series of small improvements rather than one massive overhaul. This let us deliver value to customers sooner, and it proved crucial to learning how their workloads behaved as we adjusted. With io2 Block Express, we've taken the EBS latency experience from an average of more than 10 milliseconds per I/O operation to consistent sub-millisecond I/O on our highest-performing volumes. And we did it without disrupting the service, while deploying what amounts to an entirely new architecture.

And we're still not done. Our customers will always ask for more, and that challenge is what drives us to keep innovating and improving.
