CrowdStrike IT Outage Explained by a Windows Developer

Published 2024-07-21

Dave explains the CrowdStrike IT outage, focusing on its role as a kernel-mode driver. For my book on the spectrum, see: amzn.to/3XLJ8kY

Get the shirt: amzn.to/4bRUgAn

Follow me for updates!
Twitter: @davepl1968
Facebook: fb.com/davepl

Opinions are mine only, not a spokesperson!

All Comments (21)

@Yandarval yesterday

"Agile, ambitious and aggressive": the sarcasm with which this phrase was uttered, wonderful.
@Vladimir_Kv yesterday

The funniest thing is that the CEO of CrowdStrike was the CTO at McAfee... during their worldwide faceplant.
@dorothythompson927 7 hours ago

Hi Dave, I’m also a retired Windows developer. It was fun listening to you talk about all those old system components that used to be part of our daily life experience. I was impressed that I remembered enough to understand what you were talking about! Thanks a lot for your explanation. I confess that I feel kind of angry at the CrowdStrike developers for taking such liberties with the kernel code. Seems kind of arrogant. No doubt someone thought they were being super clever by defining their code as something required to run when the kernel starts up. Imagine if the CrowdStrike developer had just arranged a meeting with a Windows kernel expert at Microsoft to discuss what they were planning to do. A whole lot of suffering could have been avoided.
@mhewett5193 22 hours ago

I am a network systems engineer that had to deal with this for 14 hours that day. This was one of the most informative videos I have ever seen. You helped simplify Windows OS in 15 minutes in a way that hours of reading hasn't. Something about real world scenarios to tag the concept with in my memory really helps. Thanks!
@NealB123 yesterday

Three days ago, no one outside of IT had ever heard of CrowdStrike. Now the entire world knows the name. Reputation destroyed in an instant.
@zug-zug yesterday

While this is technically what crashed the machines, it isn't the worst part. CS Falcon has a way to control the staging of updates across your environment. Businesses that don't want to go out of business have an N-1 or greater staging policy, and only test systems get the latest updates immediately. My work, for example, has a test group at N staging, a small group of noncritical systems at N-1, and the rest of our computers at N-2. This broken update IGNORED our staging policies and went to ALL machines at the same time. CS informed us, after our business was brought down, that this is by design and that some updates bypass policies. So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers' staging policies, which would have prevented this type of widespread damage. Unbelievable.
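The N / N-1 / N-2 staging described in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not CrowdStrike's actual policy engine: the group names, the `STAGING_OFFSETS` table, and the `release_for_group` helper are all hypothetical.

```python
from typing import Dict, Sequence

# Hypothetical policy: how many releases behind the newest one each group runs.
STAGING_OFFSETS: Dict[str, int] = {
    "test": 0,             # N: gets the newest release immediately
    "noncritical": 1,      # N-1
    "everything_else": 2,  # N-2
}


def release_for_group(
    releases: Sequence[str],
    group: str,
    offsets: Dict[str, int] = STAGING_OFFSETS,
) -> str:
    """Pick the release a group should run.

    `releases` is ordered oldest to newest. If there is not enough
    history to honour the offset, fall back to the oldest release.
    """
    if not releases:
        raise ValueError("no releases available")
    offset = offsets[group]
    if offset >= len(releases):
        return releases[0]
    return releases[-1 - offset]


if __name__ == "__main__":
    history = ["v1", "v2", "v3"]  # made-up version labels
    for name in STAGING_OFFSETS:
        print(name, "->", release_for_group(history, name))
```

Under a policy like this, a bad "v3" would reach only the test group first. The complaint in the comment is that the faulty update was delivered outside any such policy.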
@Samararabinovich 7 hours ago

Finally an "Air Crash Investigation" style explanation of what actually happened. I now understand WHAT, WHY and HOW. Thank you, Dave!
@GiaFulford 23 hours ago

Just love it when the deeper technicalities are explained so that most of us can at least get a sense of the problem. No magic, just machinery.
@StarLightDotPhotos yesterday

As a former CrowdStrike employee this is the best explanation I have heard and is 100% accurate.
@elgabacho73 yesterday

I work in IT. CrowdStrike sales has been calling me trying to get us to switch to them. I don't think they'll be calling us for a bit.
@indylmc 22 hours ago

An OS coder that can describe an issue that a non-OS coder can understand .. sheer brilliance. Well done Dave.
@kehlarn6478 15 hours ago

Accurate summary. The source of the zeroed file is either a crash during writeout in the build process (a full disk or stopped VM scenario is likely) or a CDN corruption. Both would have been caught by including a checksum/manifest pair to validate that the payloads were intact. The moment the driver decided to bypass certification and dynamically include contents to speed up the process, they should have known they needed to supplement it with a checksum manifest, but chose not to for unknown reasons. This is sadly a VERY common outcome with CDN-mapped content due to a variety of corruption vectors, and the trust modern software places in network integrity is rather misplaced. Always verify your content is intact, regardless of how small or large it is.
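A minimal sketch of the checksum/manifest idea from the comment above, assuming a manifest that maps each payload name to its SHA-256 digest. The `build_manifest` and `verify_payload` helpers and the file name `channel.bin` are hypothetical, and a real pipeline would also need to sign the manifest itself.

```python
import hashlib
import hmac
from typing import Dict


def build_manifest(payloads: Dict[str, bytes]) -> Dict[str, str]:
    """Publisher side: record the SHA-256 hex digest of every payload."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in payloads.items()}


def verify_payload(name: str, data: bytes, manifest: Dict[str, str]) -> bool:
    """Consumer side: accept a payload only if its digest matches the manifest."""
    expected = manifest.get(name)
    if expected is None:
        return False  # unknown file: refuse rather than trust it
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected)


if __name__ == "__main__":
    good = b"\x01\x02\x03\x04"  # stand-in for real content
    manifest = build_manifest({"channel.bin": good})

    zeroed = bytes(len(good))  # same length, all zero bytes
    print(verify_payload("channel.bin", good, manifest))
    print(verify_payload("channel.bin", zeroed, manifest))
```

A file zeroed out in transit or during writeout no longer hashes to the recorded digest, so the consumer can reject it before any parsing code sees it.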
@tazzybod yesterday

"It's a fair bet that update 291 will never be needed or used again" Dave you're a legend 😂🤣
@dmitripogosian5084 yesterday

Funny how a defence against potential distributed threats created a single point of failure across such heterogeneous deployments.
@sfoldy 14 hours ago

I'm not even a programmer, but between you and Steve Gibson, I feel like an engineer. This is by far the most clear and in-depth explanation of what happened (based on the current knowledge) that I have heard. Thank you!
@MrKvasi 3 hours ago

The company I work at got bought by a bigger one. They required us to install CrowdStrike on all servers. We found a memory leak that CrowdStrike still hasn't fixed after 6 months, so I have refused to install it until they do. I was on vacation when I saw all the URGENT emails from other divisions. Thank you, CrowdStrike, for not fixing your memory leaks; it saved my vacation. =P
@QualityDoggo yesterday

"They have a bug they don't protect against" is the key line. CrowdStrike added kernel drivers, but did not make them robust enough. Kernel code, especially when running such complex functionality, should be able to take more abuse from user code without causing a BugCheck. Very disappointing. Great explanation!
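The robustness point in the comment above can be illustrated with a small user-mode sketch: check every length and count in an untrusted blob before reading from it, and reject malformed input instead of crashing. The record layout here (a 4-byte magic, a 2-byte little-endian entry count, then that many 4-byte entries) is invented for illustration and has nothing to do with CrowdStrike's real file format.

```python
import struct
from typing import List, Optional

MAGIC = b"CFG1"                 # hypothetical file signature
HEADER = struct.Struct("<4sH")  # magic + entry count, 6 bytes, no padding
ENTRY_SIZE = 4                  # each entry is one little-endian uint32


def parse_entries(blob: bytes) -> Optional[List[int]]:
    """Return the entries, or None if the blob is malformed in any way."""
    if len(blob) < HEADER.size:
        return None  # too short to even hold a header
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        return None  # wrong signature, e.g. a file full of zeros
    if len(blob) != HEADER.size + ENTRY_SIZE * count:
        return None  # declared count does not match the actual size
    return list(struct.unpack_from(f"<{count}I", blob, HEADER.size))


if __name__ == "__main__":
    good = MAGIC + struct.pack("<H", 2) + struct.pack("<2I", 7, 9)
    print(parse_entries(good))
    print(parse_entries(bytes(len(good))))  # all-zero input is rejected
```

In kernel code the equivalent checks matter far more, because an unchecked read there ends in a bugcheck rather than a Python exception.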
@alleneng yesterday

For some reason, Dave's explanation was waaay easier to understand than every other video about this.
@PoiutVioe 23 hours ago

Had my college instructors imparted knowledge in such a relaxed and approachable manner, I may have graduated with honors in computer science and maintained employment! Dave, you are a treasure! Thank you!
@hagner75 20 hours ago

You're one of the few actual technical YouTubers. Thanks for explaining it in a bit more depth.