CrowdStrike/Microsoft Outage: The Danger of Single Points of Failure

Key points to remember

According to Jimmy Su, Binance’s Chief Security Officer, the July 19 technology outage, which highlights the global interconnectedness and critical reliance on IT infrastructure, also underscores the need for strong and resilient systems.
Binance’s systems were not interrupted by the disruptions caused by CrowdStrike, and while we must of course remain vigilant, such an incident is unlikely to occur at Binance due to our robust quality assurance and deployment procedures.
The outage highlights the dangers of centralized, single-point-of-failure architectures and suggests that adopting more distributed system designs, such as those based on blockchain, could improve their security and reliability.
The disruptions to critical systems and services around the world resulting from the July 19 technology failure are a stark reminder of the interconnectedness and global nature of our information technology infrastructure, and our close reliance on these systems in critical areas such as healthcare, transportation, security and finance.
Unlike many other financial services around the world, Binance's systems were fortunately unaffected and suffered no downtime as a result of the incident, ensuring uninterrupted service to its users.
Complex systems have always experienced failures and always will, and no technology platform is completely immune to sudden outages. Still, the CrowdStrike debacle highlights two important facets of the global technology ecosystem's architecture: its high centralization and its interconnectedness, a dangerous combination that a more distributed system design could partially mitigate.
CrowdStrike Outage Analysis

Cybersecurity firm CrowdStrike, a provider of software for a wide range of industries, was behind last week’s outages. A glitch occurred in an update to CrowdStrike’s flagship product, Falcon Sensor, knocking out Windows computers and causing widespread technology failures around the world. Binance’s Linux infrastructure was not affected.
At the tactical level, two flaws appear likely. First, the company’s QA team appears not to have regression-tested the update adequately, which left room for a critical error to slip through. Second, CrowdStrike’s deployment did not follow the established principle of first deploying an update to a small subset of users. Doing so would have limited the outage to a small number of machines, which could then have been restored to their pre-update state without major damage.
A similar incident could happen at Binance if a new security rule governing login was deployed, and suddenly no one could log in; but this does not happen because we have a very thorough regression testing process in place and follow staggered deployment procedures. Human errors cannot be avoided, but it is perfectly possible to implement processes to minimize their impact.
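The staggered deployment principle described above can be sketched in a few lines. This is a minimal illustration only, not CrowdStrike's or Binance's actual tooling: the function `staged_rollout`, the callbacks `apply_update`, `is_healthy`, and `rollback`, and the 1% / 10% / 100% stages are all hypothetical names and values chosen for the example.

```python
def staged_rollout(hosts, apply_update, is_healthy, rollback,
                   stages_percent=(1, 10, 100)):
    """Deploy to growing shares of hosts; stop and roll back on a failed check.

    All names and the stage percentages are illustrative assumptions.
    """
    updated = []
    for percent in stages_percent:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            apply_update(host)
            updated.append(host)
        # Verify the newly updated hosts before widening the rollout.
        if not all(is_healthy(host) for host in batch):
            for host in updated:
                rollback(host)
            return {"completed": False, "touched": len(updated)}
    return {"completed": True, "touched": len(updated)}
```

With 1,000 hosts and an update that fails its health check, this sketch touches only the first 10 machines before rolling them back, rather than all 1,000.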
Crypto markets remain active all the time: we therefore absolutely need to design systems that can be updated as soon as necessary without creating risk for our users.
A systemic problem

As we’ve seen, a single software update failure can simultaneously strand planes on tarmacs, delay surgeries, and derail transactions around the world. Could all of this be avoided by changing the way systems are designed?
Many observers in the crypto and Web3 space have rightly pointed out that while traditional industries have been struggling with the aftermath of the CrowdStrike outage, the major blockchain networks have continued to operate as normal. This does not mean that none of the nodes supporting these networks run on Windows software. It is likely that some of them were affected, but because of the networks' distributed nature, this did not impair the blockchains as a whole or their ability to function.
In fact, the last Bitcoin network outage occurred over 4,150 days ago, meaning the network has been operating uninterrupted for over 11 years.
It is precisely because the nodes are independent of each other and interchangeable that it does not matter if 5% or 15% of them go down: the network will remain fully operational. By contrast, the 8.5 million devices affected on July 19 represent only about 1% of the machines running Windows, and it is difficult to imagine the scale of the chaos that an outage of this type on a larger scale could cause.
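A rough back-of-the-envelope calculation shows why losing 5% or 15% of interchangeable nodes matters so little. The sketch below rests on simplifying assumptions that are not in the original text: node failures are independent, any reachable node can serve a client, and a client tries a fixed number of peers. The function name `reach_probability` and the peer count of 8 are illustrative choices, not properties of any particular network.

```python
def reach_probability(fraction_down: float, peers_tried: int) -> float:
    """Chance that at least one of `peers_tried` randomly chosen nodes is up.

    Assumes independent failures and fully interchangeable nodes
    (a simplification made for illustration).
    """
    return 1.0 - fraction_down ** peers_tried


# A single point of failure is the special case of one peer:
single = reach_probability(0.15, 1)       # 0.85
# Trying several independent peers shrinks the failure chance sharply:
distributed = reach_probability(0.15, 8)  # 1 - 0.15**8
```

Under these assumptions, a client that depends on one component is cut off 15% of the time, while a client that can fall back on any of eight peers is cut off with probability 0.15^8, or roughly 2.6 in ten million.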
As long as the majority of the world’s interconnected and interdependent computing systems rely on a centralized, single-point-of-failure architecture, the risk of similar incidents will remain high. Of course, some critical systems will and should be centralized; yet the CrowdStrike outage suggests that shifting the balance between centralized and distributed elements in the global computing architecture could improve the robustness and resilience of networks for everyone, everywhere. And when it comes to building distributed networks, no technology is as effective as blockchain.
Jimmy Su, Binance's Chief Security Officer