To avoid the challenges of legacy modernization, many companies outsource key functions, including cybersecurity. Unfortunately, these third-party providers are not infallible, as many organizations learned in the recent CrowdStrike debacle. Understanding how this disaster occurred is key to preventing similar outages in the future. A critical look at the events leading up to the July 2024 outage reveals several stages at which the problem could have been avoided. CrowdStrike's response also offers the industry important lessons about humility and how to handle a global disaster of this scale.
The Outage Felt Around the World
On July 19, 2024, a tech outage left airlines unable to fly their planes, news channels unable to broadcast, banks unable to complete transactions, and even health and emergency services unable to respond to people in need. This massive outage, which affected people across many countries and time zones, originated from a single software update issued by CrowdStrike, a cybersecurity firm based in Austin, Texas. The update caused computers running Windows to display the blue screen of death and locked operators out of the operating system entirely. CrowdStrike traced the defect to a single content update, which came as a relief, since the outage was initially feared to be a security breach or, worse, a cyberattack. While a fix was deployed quickly, affected machines often had to be remediated individually, so recovery took significant time.
While CrowdStrike pledged to help each of its clients recover from the incident, the outage left many people shocked that a single update could take down so many different types of systems. Even more troubling, the faulty update came from the very software those organizations rely on to protect access to their products and the personal information of their own customers. CrowdStrike's CEO apologized for the problem but offered little explanation and little acknowledgement of the deep harm the outage caused. Many organizations were relieved to learn the outage was not the result of an attack, but that does not explain how the problem arose in the first place.
The Key Things to Know About CrowdStrike
To understand how a single application update could cause a global outage, you need to be familiar with CrowdStrike. The company is one of the largest cybersecurity firms in the world, and its primary service is detecting and preventing malicious hacks. Many Fortune 500 companies and other organizations use CrowdStrike to protect their Windows devices, and the ubiquity of the service is evident in how many people were affected by the outage. Because of the nature of the service, even organizations that do not use the CrowdStrike platform directly felt the effects: if an organization depended on a digital tool that ran CrowdStrike software, that tool became unavailable during the outage.
The software implicated in the July outage was CrowdStrike's Falcon sensor, an endpoint protection platform that secures endpoint terminals such as mobile devices, laptops, and point-of-sale systems. To monitor these endpoints, CrowdStrike's software has deep access to the operating system. This access, known as kernel-level access, covers the layer where software and hardware interact. Cybersecurity software typically needs this kind of access so it can watch the parts of a system most frequently targeted by hackers. The faulty CrowdStrike update operated at this kernel level, and its bad interactions with the Windows operating system caused a crash. Affected systems could not complete the reboot process and were stuck in an endless loop, producing the blue screen so many people saw.
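To make that failure mode concrete, the toy Python sketch below simulates why a fault in code that loads during boot keeps recurring: the machine crashes before it ever reaches a state where a corrected update could be downloaded. The function names and boot logic here are purely illustrative, not real Windows or CrowdStrike behavior.

```python
def boot(channel_file_is_faulty: bool) -> str:
    # Kernel-mode components load very early in the boot sequence, before
    # networking is available, so a fault here halts the machine before any
    # corrected update could be fetched over the network.
    if channel_file_is_faulty:
        return "BSOD"
    return "desktop"

attempts = 0
while boot(channel_file_is_faulty=True) == "BSOD" and attempts < 5:
    attempts += 1  # each reboot loads the same faulty file, so the crash repeats

print(f"Never reached the desktop after {attempts} reboots; "
      "an automatic fix could not be delivered.")
```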
A Difficult Fix for a Preventable Problem
The looping reboot process is what made the CrowdStrike issue so disastrous. Systems could not stay online long enough to download the fix the company offered, so organizations had to manually boot each affected machine into a recovery mode and remove the faulty file delivered by the update. For organizations with staff experienced in this kind of operation, implementing the fix was not especially difficult. However, many organizations turn to a company like CrowdStrike precisely because they cannot handle such work in-house, and so lacked the staff to make this change with confidence.
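Public remediation guidance described the manual workaround as booting each affected machine into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file. The Python sketch below illustrates only that file-removal step; the path and file pattern follow widely published guidance for this incident, but treat this as a sketch and defer to vendor instructions rather than running it as-is.

```python
from pathlib import Path

# Directory and file pattern reported in public remediation guidance for the
# July 2024 incident; confirm against current vendor instructions before use.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files(dry_run: bool = True) -> list[Path]:
    """List (and optionally delete) channel files matching the faulty pattern."""
    matches = list(CROWDSTRIKE_DIR.glob(FAULTY_PATTERN))
    for file in matches:
        if dry_run:
            print(f"Would delete: {file}")
        else:
            file.unlink()  # actual deletion; only meaningful from a recovery boot
    return matches

if __name__ == "__main__":
    remove_faulty_channel_files(dry_run=True)
```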
The bigger question here, and one that CrowdStrike has yet to address, is how such a problem was introduced in the first place. Typically, updates are tested in a staging or virtual environment before being released to clients, precisely to ensure they do not have unforeseen consequences, exactly like those seen in this situation. Moreover, why was the update deployed on a Friday morning, a business day, rather than on a Sunday, when many clients such as banks would not be operating, or overnight, which would similarly have limited the mass disruption?
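The sketch below illustrates the kind of staged ("canary") rollout the paragraph above argues for: push an update to a small slice of the fleet, check health telemetry, and only then widen the deployment. The cohort sizes, health threshold, update label, and deploy() function are hypothetical and do not describe CrowdStrike's actual pipeline.

```python
import random

def deploy(update_id: str, hosts: list[str]) -> float:
    """Stand-in for pushing an update and reading back health telemetry."""
    return 1.0 if random.random() > 0.01 else 0.0

def staged_rollout(update_id: str, fleet: list[str],
                   stages=(0.01, 0.10, 1.00)) -> bool:
    deployed = 0
    for fraction in stages:
        target = int(len(fleet) * fraction)
        batch = fleet[deployed:target]
        healthy = deploy(update_id, batch)
        deployed = target
        if healthy < 0.99:
            # Halt (and, in a real pipeline, roll back) before the blast radius grows.
            print(f"Stage {fraction:.0%} unhealthy ({healthy:.0%}); halting rollout")
            return False
        print(f"Stage {fraction:.0%} healthy; continuing")
    return True

if __name__ == "__main__":
    staged_rollout("channel-update-291", [f"host-{i}" for i in range(10_000)])
```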