The Day the Earth Stood Still (Again)

31 Jul 2024, by Sheldon Dyer

As if in a scene from a post-apocalyptic movie, or one where aliens create global upheaval, nearly everything came to a complete standstill a week ago. It would have felt surreal in this modern age to pop into the local supermarket for groceries just before dinner, only to be turned away at the POS machines whilst an eerie blue glow emanated from them. Then to carry on and stop by a service station to fill up one's petrol tank, only to find that the bowser was not operating. And then, once home, turning on the news to hear that airlines were being grounded! Here at the Micron21 data centre, the news reached us all pretty quickly through the cybersecurity and IT channels, so at least we knew what to expect. Still, it is quite remarkable that a failed update from CrowdStrike could create such a widespread global IT outage.

In the world of cybersecurity there are few names as large as CrowdStrike. The company is a global provider of what is known as Endpoint Protection, specifically Endpoint Detection and Response (EDR). Everyone is familiar with having antivirus and malware protection on their computers; EDRs, aside from being a newer IT buzz term, are more advanced in detecting modern computer threats. Whereas antivirus looks specifically to match a particular virus signature (a string of code) in a file, EDRs also look for abnormal behaviours on a computer, for instance a suspicious file opening unrelated to current work. While there are now several companies selling EDRs, CrowdStrike thought differently and delivered its service from the cloud. As a result, it was able to deliver better performance, keep devices protected even when offline, and keep costs down. Hence, many large companies adopted CrowdStrike to protect their computers and devices.
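To make the distinction concrete, here is a minimal, hypothetical Python sketch contrasting the two approaches. The hashes, process names and behaviour rule are invented for illustration only and are not a description of how CrowdStrike's product actually works.

```python
import hashlib

# Placeholder signature database; a real one holds hashes of known malware samples.
KNOWN_BAD_HASHES = {"0" * 64}

def signature_scan(file_bytes: bytes) -> bool:
    """Classic antivirus approach: flag a file only if it matches a known signature."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES

def behavioural_scan(event: dict) -> bool:
    """EDR-style approach: flag suspicious behaviour even when the file is unknown.
    Example rule: an office application spawning a command shell or script interpreter."""
    office_apps = {"winword.exe", "excel.exe"}
    shells = {"cmd.exe", "powershell.exe"}
    return event["parent_process"] in office_apps and event["child_process"] in shells

# A brand-new malicious macro: no signature match, but the behaviour rule still fires.
event = {"parent_process": "winword.exe", "child_process": "powershell.exe"}
print(signature_scan(b"freshly written malware"))  # False: no known signature
print(behavioural_scan(event))                     # True: abnormal behaviour detected
```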

Where it went horribly wrong was that CrowdStrike released an update without due care and testing. As George Kurtz, their CEO, stated, “A piece of software shouldn’t be able to take everything out.” Unfortunately for his company, it did! In fact, it took out over 8 million devices according to Microsoft.1 Because the software is deeply integrated into the Windows operating system in order to do its job, the update caused Windows to crash with the insidious Blue Screen Of Death (BSOD) and then fail to reboot. The temporary fix was easy in the end: delete the rogue file from the computer. However, CrowdStrike’s reputation was irrevocably damaged on the back of the millions in revenue lost by its corporate customer base.
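For context, the widely circulated workaround was to boot into Safe Mode or the Windows Recovery Environment and delete the faulty channel file from the CrowdStrike driver directory. The Python sketch below captures that logic only as an illustration; the directory and file pattern follow the publicly reported guidance, and in practice the deletion was done by hand from the recovery console rather than by a script running on the crashed machine.

```python
from pathlib import Path

# Directory and file pattern taken from the publicly reported workaround.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_file() -> None:
    """Delete any faulty channel file matching C-00000291*.sys, if present."""
    for channel_file in CROWDSTRIKE_DIR.glob("C-00000291*.sys"):
        print(f"Removing {channel_file}")
        channel_file.unlink()

if __name__ == "__main__":
    remove_faulty_channel_file()
```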

While most people accept a level of bugs in computers as “features”, the apparent lack of a proper release cycle and test procedure is quite culpable. Large software firms should know better and adhere to due process to minimise such risks. Having said that, is CrowdStrike fully to blame here? Large companies employ skilled IT professionals to manage their technology infrastructure. ITIL, the Information Technology Infrastructure Library, represents what is widely regarded as best practice in IT management. Those responsible for managing servers are generally aware of ITIL, or at least of good governance around processes such as Change Management (CM).

Most of the techniques in CM revolve around a structured approach to deploying new software updates. This can include what are known as “Canary” deployments, which test a release on a small population to see if it fails, akin to the canary in the coal mine (no actual IT canaries are harmed!). What is consistent, however, is validating the software before releasing it to all production services. Instead, it appears many large companies have taken shortcuts, abrogating their own responsibility to the vendor by enabling “Auto-Updates” rather than scrutinising each release. In the current geopolitical landscape, it seems quite remiss that essential services are not effectively protected from rogue software. If I were a malicious state actor, I’d be thinking about how to get my programmers into these types of companies; it seems a lot easier than trying to hack devices!
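As a sketch of the canary idea, the hypothetical rollout below pushes an update to a small canary group first, checks that those hosts are still healthy, and only then continues to the rest of the fleet. The host names, threshold and health check are invented for illustration; a real deployment pipeline would hang these hooks off proper monitoring.

```python
import random

def deploy(host: str) -> None:
    print(f"Deploying update to {host}")

def healthy(host: str) -> bool:
    # Stand-in for a real health check (heartbeat, boot status, error rates, ...).
    return random.random() > 0.05

def canary_rollout(fleet: list[str], canary_fraction: float = 0.05) -> None:
    """Deploy to a small canary group first; halt the release if any canary fails."""
    canary_count = max(1, int(len(fleet) * canary_fraction))
    canaries, remainder = fleet[:canary_count], fleet[canary_count:]

    for host in canaries:
        deploy(host)
    if not all(healthy(host) for host in canaries):
        print("Canary group unhealthy: rolling back and halting the release.")
        return

    for host in remainder:
        deploy(host)
    print("Rollout complete.")

canary_rollout([f"host-{n:03d}" for n in range(100)])
```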

Another useful principle to consider is that of “Zero Trust”. As the name suggests, it is a security model that maintains very strict controls on access to systems and networks by trusting no one by default. While this framework is usually discussed in the context of users and systems within a network, it advocates providing the right level of access to the right resources for the right people. In this context, it has to be questioned why end devices are connected directly to the internet. This harks back to the Optus breach, where an internet-facing test environment was not isolated from their customer systems. While most large organisations do deploy updates centrally, these central systems are themselves permanently connected to the internet to receive patches, which they in turn promulgate to the end devices. Internet-facing systems should be cordoned off from the rest of the network, and the passage of data from one environment to the next must be protected and deliberate on the company’s part.
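One way to picture this is an internal update mirror: only the mirror talks to the vendor over the internet, each downloaded package is validated (and ideally canary-tested) there, and end devices only ever accept packages from that vetted internal source. The sketch below is purely illustrative; the host names and validation checks are assumptions, not a description of any particular product.

```python
# Illustrative staged update path under a zero-trust mindset:
# internet-facing mirror -> validation -> isolated internal distribution -> end devices.

APPROVED_SOURCES = {"updates.internal.example.com"}  # hypothetical internal mirror

def validated(package: dict) -> bool:
    # Stand-in for real checks: vendor signature, canary results, change approval.
    return bool(package.get("signature_ok") and package.get("canary_passed"))

def device_accepts(package: dict) -> bool:
    """An end device trusts nothing by default: the package must come from the
    internal mirror and carry evidence that it was validated before distribution."""
    return package.get("source") in APPROVED_SOURCES and validated(package)

good = {"source": "updates.internal.example.com", "signature_ok": True, "canary_passed": True}
bad = {"source": "vendor-cdn.example.net", "signature_ok": True, "canary_passed": False}
print(device_accepts(good))  # True: vetted internal source, validated package
print(device_accepts(bad))   # False: fetched straight from the internet, unvalidated
```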

Will we learn from this?  Hopefully those of us in IT are taking the time to reflect on this outage and are either testing existing controls or putting more stringent measures in place with good governance. Most of the concepts described above are not difficult, and in the case of software updates, it is simply a matter of avoiding shortcuts. We’ll all bounce back from this outage, but heaven help the company that crashes my streaming service!

1 CrowdStrike CEO George Kurtz on the errant software patch, The Wall Street Journal: https://www.wsj.com/tech/crowdstrikes-ceo-george-kurtz-failure-global-tech-outage-microsoft-windows-07d27a4a
