good docs for large-scale graphite use: ‘Tip and tricks of using and scaling graphite. First presented at DevOpsDays Austin Texas 2013-05-01’
On June 3, 2013, trading in SPY exploded at 09:59:59.985, which is 15 milliseconds before the ISM’s Manufacturing number released at 10:00:00. Activity in the eMini (traded in Chicago), exploded at 09:59:59.992, which is 8 milliseconds before the news release, but 7 milliseconds after SPY. Note how SPY and the eMini traded within a millisecond for the Consumer Confidence release last week, but the eMini lagged SPY by about 7 milliseconds for the ISM Manufacturing release. The simultaneous trading on Consumer Confidence is because that number is released at the same time in both NYC and Chicago. The ISM Manufacturing number is probably released on a low latency feed in NYC, and then takes 5-7 milliseconds, due to the speed of light, to reach Chicago. Either the clock used to release the ISM number was 15 milliseconds fast, or someone (correctly) jumped the gun. Update: […] The clock used to release the ISM was indeed, 15 milliseconds fast. This could be from using the default setting of many NTP clients, which allows the clock to drift up to about 16 milliseconds before adjusting time.
Neat, I didn’t realise this was publicly visible. A single corrupted bit infected the S3 gossip network, taking down the whole S3 service in (iirc) one region:
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether [gossip state] had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it. During our post-mortem analysis we’ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we’re taking: (a) we’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we’ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we’ve added additional monitoring and alarming of gossip rates and failures; and, (d) we’re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.This is why you checksum all the things ;)