Skip to content


Links for 2021-11-23

  • Google Cloud incident was caused by a race condition which triggered 30 minutes before the bugfix deployment was due to complete

    Wow, this was tragic! “A Google engineer discovered this bug on 12 November, which caused us to declare an internal high-priority incident because of the latent risk to production systems. After analyzing the bug, we froze a part of our configuration system to make the likelihood of the race condition even lower. Since the race condition had existed in the fleet for several months already, the team believed that this extra step made the risk even lower. Thus the team believed the lowest-risk path […] was to roll out fixes in a controlled manner as opposed to a same-day emergency patch. […] Gradual rollouts of both patches started on Monday, 15 November, and patch B completed rollout by that evening. On Tuesday, 16 November, as the patch A rollout was within 30 minutes of completing, the race condition did manifest in an unpatched cluster, and the outage started.”

    (tags: cloud outages tragic google race-conditions gclb patching deployment ops)

  • “Risk compensation” is garbage

    Risk compensation does occur in very narrow and specific circumstances, but all the studies purporting to show that it is a widespread, predictable outcome of any safety regulation have failed to replicate. […] Risk compensation and health-and-safety panic are both part of a safety nihilism campaign that serves big business’s deregulatory agenda, and the cruel moralizing of right wing religious maniacs, the traditional turkeys-voting-for-Christmas coalition. But risk compensation is especially salient in these covid days, where it’s being used to fight rapid testing (“encourages risky behavior”).

    (tags: risk-compensation risks safety)

Comments closed