This is awesome, if a little insane. Continuous Deployment at IMVU: Doing the impossible fifty times a day:
Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.
OK, so far, so sensible. But this is where it gets really hairy:
Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.
Mental. So what we’ve got here is:
phased rollout: automated gradual publishing of a new version to small subsets of the grid.
stats-driven: rollout/rollback is controlled by statistical analysis of error rates, again on an automated basis.
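The stats-driven rollback decision could be sketched like this. To be clear, IMVU's actual implementation is a handful of shell scripts and the post doesn't say which test they use; the two-proportion z-test and the three-sigma threshold here are my assumptions, just to make the "statistically significant regression" step concrete:

```python
import math

def is_regression(baseline_errors, baseline_total,
                  canary_errors, canary_total, z_threshold=3.0):
    """Two-proportion z-test: is the canary's error rate significantly
    worse than the baseline sampled before the push? (Illustrative only.)"""
    p1 = baseline_errors / baseline_total
    p2 = canary_errors / canary_total
    # Pooled error rate under the null hypothesis (no regression).
    p = (baseline_errors + canary_errors) / (baseline_total + canary_total)
    if p in (0.0, 1.0):
        return False  # no variance to test against
    se = math.sqrt(p * (1 - p) * (1 / baseline_total + 1 / canary_total))
    z = (p2 - p1) / se
    return z > z_threshold  # one-sided: only a *worse* canary triggers rollback

def decide(baseline, canary):
    """baseline and canary are (errors, total_requests) tuples."""
    return "rollback" if is_regression(*baseline, *canary) else "continue"
```

For example, a canary serving 40 errors in 1,000 requests against a baseline of 50 in 100,000 would trigger a rollback, while 1 error in 1,000 would not.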
Worth noting some stuff from the comments. MySQL schema changes break this system:
Schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires one-by-one applying it to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master’s standby and failing over. It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that’s a bottleneck to agility. If I started this process today, I’d probably invest some time in testing the limits of distributed key value stores which in theory don’t have any expensive manual processes.
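The rolling alter they describe can be sketched as an orchestration loop. Everything here is hypothetical scaffolding (the `Host` stub, the pool-as-list) invented to illustrate the sequence in the quote: each read slave is pulled from traffic, altered offline, and returned, and only then does the standby get altered and promoted:

```python
class Host:
    """Minimal stand-in for a database host (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.schema_version = 1

    def execute_alter(self):
        # In reality this is an expensive ALTER TABLE that may run for hours.
        self.schema_version += 1

def rolling_alter(slaves, standby, pool):
    """Sketch of the out-of-band process described above: alter each read
    slave one by one while it's out of production traffic, then alter the
    master's standby and fail over to it."""
    for slave in slaves:
        pool.remove(slave)       # pull out of production traffic
        slave.execute_alter()    # run the expensive ALTER while offline
        pool.append(slave)       # return to traffic
    standby.execute_alter()
    return standby               # fail over: the standby is the new master
```

The loop makes the cost visible: with a dozen slaves each taking hours to alter and re-warm, the "two day affair, not something you roll back from lightly" follows directly from the shape of the procedure.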
They use an interesting two-phased approach to publishing of the deploy file tree:
We have a fixed queue of 5 copies of the website on each frontend. We rsync with the “next” one and then when every frontend is rsync’d we go back through them all and flip a symlink over.
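That two-phase scheme (rsync everywhere first, then flip symlinks) could look something like this on a single frontend. The `release-N` directory naming and `current` symlink name are made up; the key point is that the flip itself is an atomic rename, so no request ever sees a half-copied tree:

```python
import os

RELEASES = 5  # fixed queue of copies of the site on each frontend

def next_slot(deploy_root):
    """Pick the slot after the one 'current' points at, wrapping around.
    This is the directory the push script would rsync the new code into."""
    current = os.path.basename(os.path.realpath(os.path.join(deploy_root, "current")))
    slot = (int(current.split("-")[1]) + 1) % RELEASES
    return os.path.join(deploy_root, f"release-{slot}")

def flip(deploy_root, slot_path):
    """Switch the 'current' symlink to the freshly rsync'd copy.
    symlink-then-rename is atomic on POSIX, so the site is always
    serving either the old tree or the new one, never a mix."""
    tmp = os.path.join(deploy_root, "current.tmp")
    os.symlink(slot_path, tmp)
    os.replace(tmp, os.path.join(deploy_root, "current"))
```

Keeping five complete copies also means rsync only transfers deltas against a recent release, and rolling back is just flipping the symlink to the previous slot.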
All in all, this is very intriguing stuff, and way ahead of most sites. Cool!
(thanks to Chris for the link)