CrowdStrike IT Outage Explained by a Windows Developer

Published 2024-07-21
Dave explains the CrowdStrike IT outage, focusing on its role as a kernel mode driver. For my book on the spectrum, see: amzn.to/3XLJ8kY

Get the shirt: amzn.to/4bRUgAn

Follow me for updates!
Twitter: @davepl1968
Facebook: fb.com/davepl

Opinions are mine only, not a spokesperson!

All Comments (21)
  • @Yandarval
    "Agile, ambitious and aggressive": the sarcasm with which this phrase was uttered, wonderful.
  • @Vladimir_Kv
    The funniest thing is that the CEO of CrowdStrike was the CTO at McAfee... during their worldwide faceplant.
  • Hi Dave, I’m also a retired Windows developer. It was fun listening to you talk about all those old system components that used to be part of our daily life experience. I was impressed that I remembered enough to understand what you were talking about! Thanks a lot for your explanation. I confess that I feel kind of angry at the CrowdStrike developers for taking such liberties with the kernel code. Seems kind of arrogant. No doubt someone thought they were being super clever by defining their code as something required to run when the kernel starts up. Imagine if the CrowdStrike developer had just arranged a meeting with a Windows kernel expert at Microsoft to discuss what they were planning to do. A whole lot of suffering could have been avoided.
  • @mhewett5193
    I am a network systems engineer that had to deal with this for 14 hours that day. This was one of the most informative videos I have ever seen. You helped simplify Windows OS in 15 minutes in a way that hours of reading hasn't. Something about real world scenarios to tag the concept with in my memory really helps. Thanks!
  • @NealB123
    3 days ago no one outside of IT had ever heard of Crowdstrike. Now the entire world knows the name. Reputation destroyed in an instant.
  • @zug-zug
    While this is technically what crashed machines, it isn't the worst part. CS Falcon has a way to control the staging of updates across your environment. Businesses that don't want to go out of business have an N-1 or greater staging policy, and only test systems get the latest updates immediately. My work, for example, has a test group at N staging, a small group of noncritical systems at N-1, and the rest of our computers at N-2. This broken update IGNORED our staging policies and went to ALL machines at the same time. CS informed us after our business was brought down that this is by design and some updates bypass policies. So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers' staging policies, which would have prevented this type of widespread damage. Unbelievable.
  • Finally an "Air Crash Investigation" style explanation of what actually happened. I now understand WHAT, WHY and HOW. Thank you, Dave!
  • @GiaFulford
    Just love it when the deeper technicalities are explained so that most of us can at least get a sense of the problem. No magic, just machinery.
  • As a former CrowdStrike employee this is the best explanation I have heard and is 100% accurate.
  • @elgabacho73
    I work in IT. Crowdstrike sales has been calling me trying to get us to switch to them. I don't think they'll be calling us for a bit.
  • @indylmc
    An OS coder that can describe an issue that a non-OS coder can understand .. sheer brilliance. Well done Dave.
  • @kehlarn6478
    accurate summary. the source of the zeroed file is either a crash during writeout during the build process (full disk/stopped vm scenario likely) or a cdn corruption. both would have been caught by the inclusion of a checksum/manifest pair to validate the payloads were intact. the moment the driver decided to bypass certification and dynamically include contents to speed up the process, they should have known they needed to supplement it with a checksum manifest, but chose not to for unknown reasons. this is sadly a VERY common outcome in cdn mapped content due to a variety of corruption vectors, and the trust modern software has in network integrity is badly misplaced. always verify your content is intact regardless of how small/large
  • @tazzybod
    "It's a fair bet that update 291 will never be needed or used again" Dave you're a legend 😂🤣
  • Funny how defence against potential distributed threats created a single point of failure across such heterogeneous deployments
  • @sfoldy
    I'm not even a programmer, but between you and Steve Gibson, I feel like an engineer. This is by far the most clear and in-depth explanation of what happened (based on the current knowledge) that I have heard. Thank you!
  • @MrKvasi
    The company I work at got bought by a bigger one. They required us to install Crowdstrike on all servers. We found a memory leak that Crowdstrike still hasn't fixed after 6 months, so I have refused to install it until they do. I was on vacation when I saw all the URGENT emails from other divisions. Thank you Crowdstrike for not fixing your memory leaks, it saved my vacation. =P
  • @QualityDoggo
    "They have a bug they don't protect against" is the key line. CrowdStrike added kernel drivers, but did not make them robust enough. Kernel code, especially when running such complex functionality, should be able to take more abuse from user code without causing a BugCheck. Very disappointing. Great explanation!
  • @alleneng
    for some reason dave's explanation was waaay easier to understand than every other video about this
  • @PoiutVioe
    Had my college instructors imparted knowledge in such a relaxed and approachable manner, I may have graduated with honors in computer science and maintained employment! Dave, you are a treasure! Thank you!
  • @hagner75
    You're one of the few actual technical YouTubers. Thanks for explaining it in a bit more depth.
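
The checksum/manifest pairing that @kehlarn6478 describes above can be illustrated with a minimal Python sketch. It is not CrowdStrike's actual update format or pipeline. The file name `content_update.bin` and the helpers `build_manifest` and `verify_payload` are hypothetical, and SHA-256 is an assumed choice of hash. The idea is only that the publisher records a digest per payload, and the consumer refuses any payload whose digest does not match, which would reject a zeroed-out file.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(payloads: dict[str, bytes]) -> dict[str, str]:
    """Publisher side (hypothetical): map each payload name to its digest."""
    return {name: sha256_hex(data) for name, data in payloads.items()}


def verify_payload(name: str, data: bytes, manifest: dict[str, str]) -> bool:
    """Consumer side (hypothetical): accept data only if its digest matches.

    A payload that is missing from the manifest is rejected as well.
    """
    expected = manifest.get(name)
    if expected is None:
        return False
    return sha256_hex(data) == expected


# Hypothetical example: a good payload and a zero-filled copy of the same size.
good = b"\x01\x02example-rules"
manifest = build_manifest({"content_update.bin": good})
zeroed = bytes(len(good))

print(verify_payload("content_update.bin", good, manifest))
print(verify_payload("content_update.bin", zeroed, manifest))
```

In a real deployment the manifest itself would also need to be authenticated (for example, signed), since a digest only helps if the list of expected digests can be trusted.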