Links for 2014-07-22

  • Metrics-Driven Development

    we believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position of exclusivity as an operations thing and places it more appropriately next to its peers as an engineering process. Provided access to real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.
    It's broken down into the following principles: ‘Instrumentation-as-Code’, ‘Single Source of Truth’, ‘Developers Curate Visualizations and Alerts’, ‘Alert on What You See’, ‘Show me the Graph’, and ‘Don’t Measure Everything (YAGNI)’. We do all of these at Swrve, naturally (a technique I happily stole from Amazon). A minimal sketch of the ‘Instrumentation-as-Code’ idea follows the links below.

    (tags: metrics coding graphite mdd instrumentation yagni alerting monitoring graphs)

  • Auto Scale DynamoDB With Dynamic DynamoDB

    Nicely-packaged auto-scaler for DynamoDB

    (tags: dynamodb autoscaling scalability provisioning aws ec2 cloudformation)
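
A minimal sketch of the ‘Instrumentation-as-Code’ idea from the first link, in Python, assuming a carbon-cache listening on Graphite’s default plaintext port; the host and metric names here are made up for illustration:

    import socket
    import time

    CARBON_HOST = "graphite.example.com"   # assumption: your carbon-cache/relay host
    CARBON_PORT = 2003                     # Graphite's default plaintext-protocol port

    def emit_metric(path, value, timestamp=None):
        # Graphite's plaintext protocol is one line per data point:
        # "<metric.path> <value> <unix timestamp>\n"
        ts = int(timestamp if timestamp is not None else time.time())
        line = "%s %s %d\n" % (path, value, ts)
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=2) as sock:
            sock.sendall(line.encode("ascii"))

    # e.g. emitted right beside the code path being measured
    emit_metric("svc.checkout.request_latency_ms", 87)

In practice you'd batch data points or go through statsd rather than open a connection per metric; the point is that the instrumentation lives in the code that produces the numbers.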


2 Comments

  1. Keith Brady
    Posted July 23, 2014 at 13:21

    I disagree with a couple of the points in the MDD article.

    The most important is that it shouldn’t be the developers producing the dashboards. Usually that results in a 10×10 (I’ve seen worse) matrix of very small graphs recording things that seem important to a developer but aren’t useful in actually diagnosing a problem in an emergency. I’ve even seen dashboards where two sets of dense graphs were separated by a set of tables of raw values. Assuming they’ve been trained in the system or helped design it, SREs (or whatever the name) usually have a much better idea of the critical metrics and where to present them (above the fold is important).

    Another issue that seems either brushed over or possibly wrong is what to alert on. It’s important to be clear that alerting on symptoms is the best option (though not a strict rule). I’ve almost never seen a developer come up with useful symptom-based alerts; they always seem to prefer ones that reflect their internal state assumptions (like asserts in code), e.g. alerting if a backend is problematic rather than alerting on the problem the iffy backend causes.

  2. Posted July 23, 2014 at 16:46

    hey Keith —

    It may be coming from a very devops-y POV. In Amazon, most services are operated by their developers (i.e. the original devops approach); it's very much in the interest of the dev to make sure that the dashboard surfaces the important operational metrics right at the top, so that when they themselves are paged at 3am they don't have to perform chart spelunking.

    This also drives the addition of good, alertable metrics to the code, since they tend to get pissed off when they've been paged at 3am 3 nights running for some internal P99 latency hitting 100ms, when everything's fine at the service level ;) (There's a toy sketch of that symptom-vs-cause distinction at the end of this comment.)

    Good point though. Maybe that only works when the devs are opsing.

    (also: 10×10. jaysus. At least with Amazon or Graphite it quickly becomes obvious that the dashboard is not viable with that many graphs when it OOMs your browser tab!)
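
    To make the symptom-vs-cause distinction concrete, here's a toy Python sketch; the metric names and thresholds are invented purely for illustration, not taken from the linked article:

        # Toy illustration only -- names and numbers are made up.

        # Cause-based: fires on an internal-state assumption (the iffy backend's P99),
        # and pages you at 3am even when users see nothing wrong.
        def cause_based_alert(backend_p99_ms: float) -> bool:
            return backend_p99_ms > 100

        # Symptom-based: fires on what users actually experience at the service edge.
        def symptom_based_alert(error_rate: float, service_p99_ms: float) -> bool:
            return error_rate > 0.01 or service_p99_ms > 500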