Continuous deployment

This is awesome, if a little insane. Continuous Deployment at IMVU: Doing the impossible fifty times a day:

Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.
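
The arithmetic in that quote is worth making concrete. A quick back-of-the-envelope check, using only the figures they give:

```python
# Figures from the quote above: ~15k test cases, suite run ~70 times/day.
test_cases = 15_000
runs_per_day = 70
executions = test_cases * runs_per_day     # ~1.05 million test executions/day

# Even at a literal one-in-a-million flake rate per test execution,
# you still expect roughly one intermittent failure every single day.
flake_rate = 1e-6
expected_failures = executions * flake_rate
print(executions, round(expected_failures, 2))   # 1050000 1.05
```

Which is exactly why "once in a thousand runs" reliability isn't anywhere near good enough at that scale.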

OK, so far, so sensible. But this is where it gets really hairy:

Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, CPU usage, PHP errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
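
The control flow in that quote can be sketched roughly like this. This is my own hypothetical Python sketch; every function name here is invented for illustration, and the real imvu_push is a handful of shell scripts around rsync, symlinks and metric sampling:

```python
import time

CANARY_FRACTION = 0.05  # assumed size of the "small subset" of machines

def push(hosts, sample_metrics, rsync_to, flip_symlink, rollback,
         regressed, sleep=time.sleep):
    baseline = sample_metrics()               # load, CPU, PHP errors/dies
    for h in hosts:
        rsync_to(h)                           # stage the new code everywhere
    canary = hosts[:max(1, int(len(hosts) * CANARY_FRACTION))]
    for h in canary:
        flip_symlink(h)                       # go live for the first few customers
    sleep(60)                                 # one minute of canary traffic
    if regressed(baseline, sample_metrics()):
        rollback(canary)                      # automatic rollback
        return False
    for h in hosts:
        flip_symlink(h)                       # push to 100% of the cluster
    sleep(300)                                # monitor for five more minutes
    if regressed(baseline, sample_metrics()):
        rollback(hosts)
        return False
    return True                               # live and fully pushed
```

The interesting design choice is that rsync (staging) and the symlink flip (going live) are separate steps, so "live on 5% of machines" and "live everywhere" are just different sets of symlink flips over the same staged tree.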

Mental. So what we’ve got here is:

  • phased rollout: automated gradual publishing of a new version to small subsets of the grid.

  • stats-driven: rollout/rollback is controlled by statistical analysis of error rates, again on an automated basis.
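
The post doesn't say which statistical test IMVU actually use, but a minimal version of the "statistically significant regression" check could be a one-sided two-proportion z-test on error rates before and after the flip (my own sketch; the threshold is an assumption):

```python
import math

def error_rate_regressed(base_errors, base_requests,
                         new_errors, new_requests, z_threshold=3.0):
    """One-sided two-proportion z-test: did the error rate go up significantly?"""
    p1 = base_errors / base_requests
    p2 = new_errors / new_requests
    pooled = (base_errors + new_errors) / (base_requests + new_requests)
    se = math.sqrt(pooled * (1 - pooled) *
                   (1 / base_requests + 1 / new_requests))
    if se == 0:
        return False
    z = (p2 - p1) / se
    return z > z_threshold   # significantly worse than the baseline

# 1% baseline error rate vs. 5% on the canary: clearly a regression.
print(error_rate_regressed(100, 10_000, 50, 1_000))   # True
# 1% vs. 1.2% on a small canary sample: within the noise.
print(error_rate_regressed(100, 10_000, 12, 1_000))   # False
```

A high threshold like z > 3 is the kind of choice you'd want here: with pushes happening dozens of times a day, a loose threshold would trigger spurious rollbacks constantly.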

Worth noting some stuff from the comments. MySQL schema changes break this system:

Schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires one-by-one applying it to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master’s standby and failing over. It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that’s a bottleneck to agility. If I started this process today, I’d probably invest some time in testing the limits of distributed key value stores which in theory don’t have any expensive manual processes.
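
The slave-by-slave dance described there looks something like this in outline. A hypothetical sketch only; pool.remove/add, apply_alter and promote are invented names, and in practice this is a slow, manual, heavily reviewed process:

```python
def rolling_alter(slaves, standby, pool, apply_alter, promote):
    """Apply an expensive schema change across a replicated MySQL setup."""
    for slave in slaves:          # one-by-one across the dozen read slaves
        pool.remove(slave)        # pull it out of production traffic
        apply_alter(slave)        # run the expensive ALTER while it's idle
        pool.add(slave)           # put it back into rotation
    apply_alter(standby)          # then the master's standby...
    promote(standby)              # ...and fail the master over to it
```

Every step serialises on the previous one, which is why it's a two-day affair and why there's no cheap rollback: by the end, every machine in the tier has the new schema.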

They use an interesting two-phase approach to publishing the deploy file tree:

We have a fixed queue of 5 copies of the website on each frontend. We rsync with the “next” one and then when every frontend is rsync’d we go back through them all and flip a symlink over.
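
The fixed queue plus symlink flip can be sketched like this, assuming a layout such as slot-0 … slot-4 on each frontend with a "current" symlink pointing at the live one (the paths and names are my assumptions, not IMVU's):

```python
import os

SLOTS = 5  # fixed queue of five copies of the site per frontend

def next_slot(current_target):
    """Given the slot the live symlink points at, pick the next rsync target."""
    n = int(current_target.rsplit("-", 1)[1])
    return f"slot-{(n + 1) % SLOTS}"

def flip(root, target):
    """Atomically repoint root/current at target.

    os.symlink refuses to overwrite an existing link, so the standard
    trick is to create a temporary symlink and rename it into place:
    rename is atomic on POSIX, so readers never see a missing link.
    """
    tmp = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, os.path.join(root, "current"))
```

The two-phase split matters: rsync to the "next" slot is slow and happens everywhere first, so the flip itself is near-instant and near-simultaneous across the cluster, and keeping five old slots around makes rollback another cheap flip.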

All in all, this is very intriguing stuff, and way ahead of most sites. Cool!

(thanks to Chris for the link)


4 Comments

  1. Posted February 12, 2009 at 15:16

    I think it’s great to see how far you can take things.

    For data management a “deploy to layers” approach might work. Deploy the data layer and database, run that in; then deploy the service code, run that in; then deploy the UI. Roll back in reverse order – being able to turn features off in the UI is great if the db/storage is melting in production – graceful degradation! But you need very strong contracts at each layer.

  2. Posted February 12, 2009 at 23:00

    Bill: Ultimately, we don’t need that level of sophistication at all. UI and service go out together (which makes things easier to think about). Schema goes out separately and is applied “by hand”, aka running a script if things are easy or phasing it out if things are hard. (And our test cluster runs the latest schema even if it’s not out in production; that’s cost us a couple of times, but we don’t have a solution for the specifics of our setup.)

  3. Posted February 13, 2009 at 10:47

    hi Timothy! thanks for commenting.

    I think schema changes are ultimately a bitch to deploy automatically; a 5-hour ALTER command just isn’t something that you can run on a production database while it’s live :( you need to do stuff like the hot-spare-switching tricks mentioned in the comments on the original post.

    The NG databases definitely win on this point.

  4. Posted February 23, 2009 at 18:41

    The “a statistically significant regression” rollback idea is fascinating. I wonder how far you could push that from “internal” regression tests to A/B testing of user performance, with the best-performing variant going live to 100% and the loser being axed – then evolving the code from there with another A/B branch.