Large companies go to great lengths to ensure their suppliers stay up, because a significant outage can irreparably harm a vendor’s reputation and prompt customers to seek alternatives with a stronger track record.
Building a dependable web service is a significant technical challenge, but for company leaders it is also a valuable opportunity to drive business growth. Is improving the reliability of existing systems really as unglamorous as many engineers suggest?
At scale, incentives dominate. Major technology companies employ thousands of people and operate numerous online services. Over time, they have developed innovative approaches to ensure that their engineers build reliable software. What follows are human engineering strategies that have been applied successfully, at large scale, within some of the most successful technology companies in history. Regardless of your role within the organization, you’ll be able to apply these principles.
Spin the wheel
The Amazon Web Services (AWS) operational review is a weekly meeting that is open to the whole company. At each meeting, a wheel is spun to randomly select an AWS service for a thorough review. The selected team must answer pointed questions from experienced operational leaders about its dashboards and key performance indicators. Many employees, dozens of managers, and several vice presidents attend.
This gives every team an incentive to maintain a baseline of operational competence. Although the probability of any one team being selected is low (less than 1% at AWS), as a manager or technical lead on that team you do not want to appear uninformed when your turn finally comes in front of a large portion of the organization.
It’s essential to review your reliability metrics regularly. Leaders who take a genuine interest in operational health set the tone for the entire team, fostering a culture of accountability and continuous improvement. You do not need a physical wheel to make the selection random; software options such as Random.org, Wheel Decide, and Randomizer can do the job.
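If you would rather script the selection than use one of those tools, a minimal sketch might look like the following. The service names and the function names `spin_wheel` and `selection_probability` are hypothetical and used only for illustration; this is not how AWS implements its wheel.

```python
import random


def spin_wheel(services, rng=None):
    """Pick one service uniformly at random for this week's review."""
    if not services:
        raise ValueError("need at least one service on the wheel")
    rng = rng or random.Random()
    return rng.choice(services)


def selection_probability(num_services):
    """Chance that any one service is picked in a single spin."""
    if num_services < 1:
        raise ValueError("need at least one service on the wheel")
    return 1.0 / num_services


# Hypothetical service names, assumed for this example only.
services = ["checkout", "search", "auth", "billing"]

# Passing a seeded generator makes a spin reproducible for an audit trail.
picked = spin_wheel(services, random.Random(7))
```

With more than 100 services on the wheel, `selection_probability` drops below 1% per meeting, which is the regime described above: the odds are low, but every team still has to be ready.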
What should you examine in these reviews? Typically, specific aspects of performance or process, with a focus on opportunities for improvement and areas that need more attention. That brings us to the next point.
Outline measurable reliability objectives
You want to guarantee ‘high uptime’ or ‘five nines’, but what does that actually mean to your customers? Latency tolerance for live interactions (such as chat) is much lower than for asynchronous workloads such as training a machine learning model or uploading a video. Your objectives should reflect what your customers care about most.
When you review a team, ask it to define and articulate measurable reliability objectives. Make sure the reasoning behind each goal is understood; every objective should have been chosen deliberately, for a specific purpose. Then use dashboards to track progress against those objectives. With quantifiable targets in place, you can prioritize reliability work based on data.
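To make an availability target concrete, it helps to translate it into the downtime it allows. The sketch below does that arithmetic under stated assumptions: a 30-day measurement window, and the function names `allowed_downtime_minutes` and `budget_remaining`, which are hypothetical. The "budget" framing is a common convention, not something this article prescribes.

```python
def allowed_downtime_minutes(target, period_days=30):
    """Minutes of downtime permitted by an availability target.

    target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for this example.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * period_days * 24 * 60


def budget_remaining(target, downtime_minutes, period_days=30):
    """Fraction of the downtime allowance still unspent (negative if exceeded)."""
    budget = allowed_downtime_minutes(target, period_days)
    return (budget - downtime_minutes) / budget


three_nines = allowed_downtime_minutes(0.999)    # about 43.2 minutes per 30 days
five_nines = allowed_downtime_minutes(0.99999)   # about 0.43 minutes per 30 days
```

Seen this way, ‘five nines’ leaves roughly 26 seconds of downtime in a 30-day window, which is a useful number to put in front of a team before it commits to the target.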
Focusing on detection is a valuable strategy. When you spot an anomaly on a team’s dashboard, ask the team to explain it, and ask whether the on-call engineer was notified about the problem. You want to know that something isn’t working before your customers do.
Embrace chaos
One of the most transformative shifts in cloud reliability is deliberately injecting failure into running systems. Netflix has codified this concept under the banner of “Chaos Engineering,” and the idea is as compelling as the name suggests.
Netflix wanted to motivate its engineers to build robust, resilient software without relying on heavy-handed oversight. Its reasoning: if system failure is the norm rather than the exception, engineers have no viable option but to build fault-tolerant software. It took time to get there, but Netflix now routinely shuts down entire availability zones or specific server clusters in its production environment, as a test that mirrors real-world failures. Services are designed to handle these failures without affecting overall availability.
This approach is complex. But if you are shipping a product that requires extremely high uptime, failure injection in production can be a highly effective way to get something close to a “correctness proof.” If your product needs it, introduce failure injection as early as possible. It will never be simpler or less expensive than it is now.
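The core idea can be sketched in a few lines: wrap a dependency so that it fails some fraction of the time, then check that the caller still returns something useful. Everything here is a hypothetical illustration (`with_fault_injection`, `fetch_recommendations`, `homepage`, and the item names are assumptions); it is not Netflix’s tooling, which operates on real infrastructure rather than in-process function calls.

```python
import random


class InjectedFailure(Exception):
    """Raised by the injector to simulate a dependency outage."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical dependency that normally returns personalized items."""
    return [f"personalized-item-for-{user_id}"]


def homepage(user_id, fetch):
    """Caller that degrades gracefully when its dependency is down."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        # Real code would catch the dependency's actual error types too.
        return ["popular-item"]  # degraded, but still available


rng = random.Random(42)
always_down = with_fault_injection(fetch_recommendations, 1.0, rng)
never_down = with_fault_injection(fetch_recommendations, 0.0, rng)

degraded = homepage("u1", always_down)  # falls back to the generic list
healthy = homepage("u1", never_down)    # returns the personalized list
```

Starting with a low `failure_rate` in a test environment is a gentler way to discover which callers lack a fallback before trying anything similar against production traffic.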
If chaos engineering seems too ambitious, consider mandating simulated outage rehearsals, or “game days,” at least twice a year or alongside major feature releases. During a game day, participants take one of three roles: the first simulates an outage; the second must resolve it without prior knowledge of what was broken; and the third observes and records what happens throughout. After the exercise, conduct a thorough post-incident analysis to identify lessons learned. A game day will expose weaknesses not only in how your software handles outages, but also in how your engineering team responds to them.
Conduct rigorous post-mortems
An organization’s post-mortem process reveals a great deal about its culture. The major technology companies require teams to write thorough post-mortems after significant outages. The report should detail what happened, pinpoint the underlying causes, and propose effective mitigations to prevent a recurrence. The post-mortem should be thorough and held to a high standard, but its purpose must never be to single out individuals for blame. A healthy post-mortem is a constructive learning experience, not a disciplinary one. When an engineer makes a mistake, underlying factors usually contributed to the error. Perhaps you need stricter safety checks or stronger guardrails around critical software systems. Fix the underlying cracks in the foundation.
Building a robust post-mortem process takes dedication, but it is safe to say that the effort will go a long way toward preventing future outages.
Reward reliability work
If engineers believe that innovation alone is the key to career advancement, reliability work will fall by the wayside. All engineers, regardless of their level of experience, should prioritize operational excellence in their work. Recognize reliability work in your performance reviews. Hold your most senior engineers accountable for the stability and reliability of the systems they oversee.
This advice may seem obvious, but it is often overlooked precisely because it is so simple.
Conclusion
In this article, we explored some fundamental tools for instilling reliability in an organization’s culture. Startups and early-stage companies often do not treat reliability as a top priority, and understandably so: a startup must focus on demonstrating product-market fit to survive. Once you have established a loyal customer base, however, sustaining momentum depends on maintaining trust. People earn trust by being consistently reliable. The same holds for the services people depend on online.