Wednesday, April 2, 2025

System resilience should be primarily the responsibility of the operating system rather than of third-party solutions. The core job of an OS is to keep the underlying computing infrastructure stable and responsive, giving applications a solid foundation to run on.

Enterprise Safety

Building efficient recovery options makes the whole ecosystem more robust and adaptable.

Why system resilience should mainly be the job of the OS, not just third-party applications

Last week, an executive from one of the firms involved answered questions from policymakers about the incident. One point that drew particular interest during the discussion was the suggestion that a single solution could prevent similar catastrophic events in the future, and in particular the role that automated recovery could play in mitigating them.

Should automated recovery be the responsibility of the third-party software vendor, since it is their code that needs to be recovered? Or does this incident instead highlight the need for a more robust OS-level auto-recovery mechanism that works together with third-party software?

A system that heals itself

When a machine fails to load essential software during its boot process, the result can be a blue screen of death (BSOD) that leaves the system unusable until the user restarts it or troubleshoots the problem. A BSOD can be triggered when software is installed or updated; in this incident, a corrupted update file was loaded during the boot process, which led to a widely documented global IT meltdown.

Software that needs low-level access, such as security products, typically runs at what is known as ‘kernel mode’. If something fails at this level, a BSOD is one potential outcome. Rebooting the machine then produces the same BSOD again, a loop that takes expert intervention to break. A BSOD can also occur in ‘user mode’, a more restricted environment for software to operate in.

Consider the internal combustion engine of a gasoline-powered car. For the engine to run, a spark is needed to ignite the fuel-air mixture, and that spark is provided by a spark plug. Spark plugs need to be replaced as part of a regular maintenance schedule, and neglecting this can seriously degrade engine performance. So the mechanic lifts the bonnet and installs new spark plugs. But the new plugs fail to produce a spark, and the engine will not start. Roughly, that’s what happened in this incident from a software perspective.

Should it be the responsibility of the spark plug manufacturers, of which there are several, to build an automatic recovery system for such failures? In software terms, is recovery the responsibility of the third-party vendor? Or should the mechanic step in, refit the old, trusted spark plugs, and restart the car in its previous working condition?

Whichever third-party software (or spark plugs) is involved, the recovery process should be the same in every case. The reality is more complex than my comparison suggests, because the spark plugs (software) are being upgraded and modified without the mechanic (the OS) being aware of it. Even so, the analogy should give some insight into the problem.

The case for OS-managed restoration

Consider the following scenario: when third-party software updates its core functionality and installs new or modified files that are needed during the boot process, it registers with the operating system so that the previous file or state is preserved rather than overwritten. If a startup failure then results in a BSOD, the next boot can check whether the previous boot failed and why. The OS could then offer the user the option to restore the changed file or system state, effectively rolling back the changes made since the last successful boot. The same approach could apply to any third-party software that has a kernel-mode component.
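The flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how any real operating system works: the `RecoveryRegistry` class, its method names, and the `sensor.sys` file name are all hypothetical, and the rollback here happens automatically rather than being offered to the user as an option.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class BootFile:
    """A boot-critical third-party file (hypothetical model)."""
    current: bytes
    # Last version that took part in a successful boot, kept while an
    # unverified update is pending; None when nothing is pending.
    previous: Optional[bytes] = None


@dataclass
class RecoveryRegistry:
    """Toy stand-in for an OS-managed registry of boot-critical files."""
    files: Dict[str, BootFile] = field(default_factory=dict)
    last_boot_failed: bool = False

    def register_update(self, name: str, new_content: bytes) -> None:
        """Record an update while preserving the last known-good version."""
        existing = self.files.get(name)
        if existing is None:
            # First install: there is no earlier state to fall back on.
            self.files[name] = BootFile(current=new_content)
            return
        if existing.previous is None:
            # Keep the known-good copy; later unverified updates must not replace it.
            existing.previous = existing.current
        existing.current = new_content

    def roll_back(self) -> None:
        """Restore every file that changed since the last successful boot."""
        for f in self.files.values():
            if f.previous is not None:
                f.current, f.previous = f.previous, None

    def boot(self, loads_ok: Callable[[str, bytes], bool]) -> bool:
        """Simulate one boot; `loads_ok` reports whether a file loads cleanly."""
        if self.last_boot_failed:
            self.roll_back()
        ok = all(loads_ok(n, f.current) for n, f in self.files.items())
        self.last_boot_failed = not ok
        if ok:
            # A clean boot makes the current versions the new known-good state.
            for f in self.files.values():
                f.previous = None
        return ok


# Assumed failure model for the demo: content starting with b"bad" crashes the boot.
def loads_ok(name: str, content: bytes) -> bool:
    return not content.startswith(b"bad")


registry = RecoveryRegistry()
registry.register_update("sensor.sys", b"good-v1")
first = registry.boot(loads_ok)    # clean boot
registry.register_update("sensor.sys", b"bad-v2")
second = registry.boot(loads_ok)   # simulated crash
third = registry.boot(loads_ok)    # rolled back, boots again
print(first, second, third)        # True False True
```

Note that a file failing on its very first install has nothing to roll back to in this sketch, so the boot keeps failing; only a previously working state can be restored.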

There is a precedent for this kind of OS-managed recovery. When a newly installed display driver fails to initialize correctly during boot, the failure is detected and the operating system automatically reverts to a default state, loading a low-resolution driver that works with all displays. This exact scenario does not carry over to cybersecurity products, because they have no default state to fall back on, but there could be a previously working state from before the update.

Building a recovery mechanism into the operating system that serves all third-party software would be a more sustainable approach than relying on each vendor to create its own solution. To keep such a mechanism secure, OS vendors would need to consult and collaborate closely with third-party software vendors so that the capability is robust and cannot be exploited by malicious actors.

I acknowledge the potential drawbacks, but I believe a single, simpler approach would be more effective than relying on hundreds of developers to each build their own recovery method. Ultimately, this effort could do a great deal to improve system resilience and prevent large-scale disruptions like the outage caused by the faulty CrowdStrike update.
