Baron Schwartz on metrics, percentiles, and aggregation. +1, although as an HN commenter noted, quantile digests are probably the better fix.
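A minimal sketch of one common pitfall when aggregating percentiles: averaging per-host p99 values does not give the p99 of the combined traffic. The host names, latency values, and the nearest-rank `percentile` helper are all made up for illustration and are not taken from the linked article.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all samples
    are less than or equal to it. Integer arithmetic avoids float rounding.
    """
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds.
host_a = [10] * 980 + [20] * 20   # busy, fast host: 1000 requests
host_b = [10] * 50 + [500] * 50   # quiet, slow host: 100 requests

p99_a = percentile(host_a, 99)                 # 20
p99_b = percentile(host_b, 99)                 # 500
average_of_p99s = (p99_a + p99_b) / 2          # 260.0
pooled_p99 = percentile(host_a + host_b, 99)   # 500

print(f"average of per-host p99s: {average_of_p99s} ms")
print(f"p99 of pooled requests:   {pooled_p99} ms")
```

The average of the two p99 values (260 ms) matches neither host and understates the pooled p99 (500 ms). Mergeable summaries such as histograms or quantile digests avoid this because they are combined before the percentile is computed.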
Spotify wrote their own metrics store on Elasticsearch and Cassandra. Sounds very similar to Prometheus.
ELS measures the following things: Success latency and success rate of each machine; Number of outstanding requests between the load balancer and each machine. These are the requests that have been sent out but we haven’t yet received a reply; Fast failures are better than slow failures, so we also measure failure latency for each machine. Since users care a lot about latency, we prefer machines that are expected to answer quicker. ELS therefore converts all the measured metrics into expected latency from the client’s perspective.[…] In short, the formula ensures that slower machines get less traffic and failing machines get much less traffic. Slower and failing machines still get some traffic, because we need to be able to detect when they come back up again.
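A rough sketch of the idea in the excerpt, not Spotify's actual formula (which is elided above). It assumes a simple retry model: a machine with success rate s needs 1/s attempts on average, so each request pays the success latency once plus the failure latency for every expected failed attempt, scaled by the queue of outstanding requests. `MachineStats`, `MIN_SUCCESS_RATE`, and all function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class MachineStats:
    success_latency_ms: float
    failure_latency_ms: float
    success_rate: float  # fraction of requests that succeed, 0.0 to 1.0
    outstanding: int     # requests sent but not yet answered


# Hypothetical floor so a fully failing machine keeps a small, non-zero
# weight and can be noticed when it recovers.
MIN_SUCCESS_RATE = 0.01


def expected_latency_ms(m: MachineStats) -> float:
    """Assumed model: latency as seen by a client that retries until success."""
    s = max(m.success_rate, MIN_SUCCESS_RATE)
    expected_failures = 1 / s - 1
    per_request = m.success_latency_ms + expected_failures * m.failure_latency_ms
    # A new request waits behind the ones already outstanding.
    return per_request * (m.outstanding + 1)


def traffic_weights(machines: dict) -> dict:
    """Share of traffic per machine, inversely proportional to expected latency."""
    inverse = {name: 1 / expected_latency_ms(m) for name, m in machines.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}


def pick_machine(machines: dict, rng=random) -> str:
    """Weighted random choice of the machine that receives the next request."""
    weights = traffic_weights(machines)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


machines = {
    "fast": MachineStats(10, 5, 1.0, 0),
    "slow": MachineStats(50, 5, 1.0, 0),
    "failing": MachineStats(10, 100, 0.1, 0),
}
weights = traffic_weights(machines)
```

Under these assumptions the slow machine gets less traffic than the fast one, and the failing machine gets much less but never zero, which matches the behaviour the excerpt describes.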
Great research from LMAX: XFS and ext4 are the best choices, and they explain why in detail, with reference to the code.