since everyone (including "professional C++ programmers") has already written about so I will also write it down (for memory)
This was not first time: proof
Official report is very vague but we can conclude that root cause was bug in config parser leading to dereference of bad (it's unclear if it contained zero or just was uninitialized) pointer, as I assumed
Well, this is not big deal - I made thousands bsod/kernel panics. There usually long path from developers machine to production, right?
Then I read second report - and this is pure madness. They DIDN'T TESTED IT AT ALL!!! Proof is very simple - lets assume we have bug with probability of bsod P. Then for N test machines probability to not catch this bug is (1-P) ^ N. for P = 0.5 (bsod happens or not, he-he) and 8 test machines this probability 0.4%. Given that in this incident probability of bsod was close to 1 - size of their test cluster is exactly 0
Instead they used some kind of spell-checkar Content Validator. I'm almost sure that they were bitten by adherents of totalitarian eBPF sect with their “20,000 lines of code” verifier (which exists solely to drink blood and ruin lives also unable to prevents cases like this)
Well, this is already serious problems but still not disaster - like you can deploy only for very small percent of users, employ fuse to prevent spreading of malicious update (for example by tracking heartbeats from updated machines) and so on. Right? ...
So right recipe for catastrophe from group of rebels:
- never test anything - static validator is enough
- deploy in Friday
- deploy to all users at the same time